prov/psm3: update provider to sync with IEFS 11.6.0.0.231 #9878

Merged
j-xiong merged 1 commit into ofiwg:main from psm3-rel-11.6.0.0
Mar 21, 2024

Contributor

@sjb017 sjb017 commented Mar 12, 2024

Updates:

  • Full support for Intel oneAPI DPC++/C++ compiler
  • Improved default tuning for Intel GPUs

@j-xiong
Contributor

j-xiong commented Mar 12, 2024

The psm3 failures are real.

@j-xiong
Contributor

j-xiong commented Mar 12, 2024

Failed test 1:

13:18:14  - name:   fi_multi_ep -e rdm -v --shared-av -p "psm3"
......
13:18:14    result: Fail
13:18:14    time:   2
13:18:14    server_cmd:  ....../fi_multi_ep -e rdm -v --shared-av -p "psm3"   -s xxx-ib0
13:18:14    server_stdout: |
......
13:18:14      fi_getinfo(): functional/multi_ep.c:317, ret=-61 (No data available)
13:18:14      Creating 3 EPs
......

Failed test 2:

13:18:14  - name:   fi_rma_bw -e rdm -o write -U -p "psm3"
......
13:18:14    result: Fail
13:18:14    time:   4
13:18:14    server_cmd:  ....../fi_rma_bw -e rdm -o write -U -p "psm3"   -s xxx-ib0
13:18:14    server_stdout: |
......
13:18:14    client_cmd:  ....../fi_rma_bw -e rdm -o write -U -p "psm3"   -s yyy-ib0 xxx-ib0
13:18:14    client_stdout: |
13:18:14      fi_getinfo(): common/shared.c:1047, ret=-61 (No data available)

Failed test 3:

13:18:14  - name:   fi_rma_bw -e rdm -o read -p "psm3"
......
13:18:14    result: Fail
13:18:14    time:   3
13:18:14    server_cmd:  ....../fi_rma_bw -e rdm -o read -p "psm3"   -s xxx-ib0
13:18:14    server_stdout: |
......
[error] fabtests:common/shared.c:2905: cq_readerr 13 (Permission denied), provider errno: 8 (PSM Unresolved internal error)
13:18:14    client_cmd:  ....../fi_rma_bw -e rdm -o read -p "psm3"   -s xxx-ib0 yyy-ib0
13:18:14    client_stdout: |
......
[error] fabtests:common/shared.c:2905: cq_readerr 13 (Permission denied), provider errno: 8 (PSM Unresolved internal error)
13:18:14      bash: line 1: 3736688 Segmentation fault      (core dumped) 

Failed test 4:

13:18:14  - name:   fi_rma_bw -e rdm -o read -U -p "psm3"
......
13:18:14    result: Fail
13:18:14    time:   301
13:18:14    server_cmd:  ....../fi_rma_bw -e rdm -o read -U -p "psm3"   -s xxx-ib0

Failed test 5:

13:18:14  - name:   fi_ubertest
13:18:14    timestamp: 20240312-201740+0000
13:18:14    result: Fail [/]
13:18:14    time:   3
13:18:14    server_cmd:  ....../fi_ubertest  -x 
13:18:14    server_stdout: |
......
13:18:14    client_cmd:  ....../fi_ubertest  -u all.test xxx-ib0 
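
The `ret=-61` and `cq_readerr 13` values in the logs above are errno-style codes: libfabric calls generally return a negated error number, and `fi_getinfo()` returns `-FI_ENODATA` when no provider matches the requested hints. A minimal sketch for decoding such values, assuming a Linux host (where `ENODATA` is 61) and using only Python's standard `errno` and `os` modules; `describe_ret` is a hypothetical helper name, not part of libfabric or fabtests:

```python
import errno
import os


def describe_ret(ret: int) -> str:
    """Map an errno-style code to its symbolic name and message.

    Accepts either the negated form returned by API calls (e.g. -61)
    or the positive form reported in CQ error entries (e.g. 13).
    """
    code = -ret if ret < 0 else ret
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# Hypothetical usage with the two codes that appear in the logs.
print(describe_ret(-errno.ENODATA))  # the fi_getinfo() failure
print(describe_ret(errno.EACCES))    # the cq_readerr failure
```

The provider-specific value (`provider errno: 8`) is not a system errno and would need the provider's own error table to interpret.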

@sjb017 sjb017 force-pushed the psm3-rel-11.6.0.0 branch from 241f82a to d70b24a Compare March 13, 2024 18:19
@zachdworkin
Contributor

Please rebase to pick up the Intel CI changes from #9886 so that the CUDA stages pass.

@sjb017 sjb017 force-pushed the psm3-rel-11.6.0.0 branch from d70b24a to 7ef789d Compare March 15, 2024 18:37
@sjb017
Contributor Author

sjb017 commented Mar 15, 2024

Is there an AGS permission I need? When I click on "Details" and log on to Jenkins, I do not have permission to see the log.

@zachdworkin
Contributor

zachdworkin commented Mar 15, 2024

Yes, you will need AGS permission. Here are the logs in the meantime:

PSM3 Failures

  • name: fi_multi_ep -e rdm -v --shared-av -p "psm3"
    timestamp: 20240315-191714+0000
    result: Fail
    time: 2
    server_cmd: /redacted_path/bin/fi_multi_ep -e rdm -v --shared-av -p "psm3" -s node_name
    server_stdout: |
    PSM3_IDENTIFY PSM3 v3.0 built for IEFS OFA DELTA 3_6_0_0
    PSM3_IDENTIFY location /redacted_path/lib/libfabric.so.1
    PSM3_IDENTIFY build date 2024-03-15T11:47:31
    PSM3_IDENTIFY src checksum redacted_checksum
    PSM3_IDENTIFY git checksum
    PSM3_IDENTIFY HAL: verbs (RDMA Verbs)
    PSM3_IDENTIFY Global Rank -1 (-1 total) Local Rank -1 (-1 total)
    PSM3_IDENTIFY CPU Core 0 NUMA 0
    PSM3_IDENTIFY NIC 0 (mlx5_0) Port 1 100000 Mbps NUMA 0 LID=3 GID=0xfe80000000000000:248a0703008ef618 QP=81221
    fi_getinfo(): functional/multi_ep.c:317, ret=-61 (No data available)
    Creating 3 EPs
    client_cmd: /redacted_path/bin/fi_multi_ep -e rdm -v --shared-av -p "psm3" -s node_name node_name
    client_stdout: |
    PSM3_IDENTIFY PSM3 v3.0 built for IEFS OFA DELTA 3_6_0_0
    PSM3_IDENTIFY location /redacted_path/lib/libfabric.so.1
    PSM3_IDENTIFY build date 2024-03-15T11:47:31
    PSM3_IDENTIFY src checksum redacted_checksum
    PSM3_IDENTIFY git checksum
    PSM3_IDENTIFY HAL: verbs (RDMA Verbs)
    PSM3_IDENTIFY Global Rank -1 (-1 total) Local Rank -1 (-1 total)
    PSM3_IDENTIFY CPU Core 0 NUMA 0
    PSM3_IDENTIFY NIC 0 (mlx5_0) Port 1 100000 Mbps NUMA 0 LID=11 GID=0xfe80000000000000:e41d2d0300f2a80c QP=215731
    fi_getinfo(): functional/multi_ep.c:317, ret=-61 (No data available)
    Creating 3 EPs

  • name: fi_ubertest
    timestamp: 20240315-191918+0000
    result: Fail [/]
    time: 3
    server_cmd: /redacted_path/bin/fi_ubertest -x
    server_stdout: |
    Starting test 1:
    PSM3_IDENTIFY PSM3 v3.0 built for IEFS OFA DELTA 3_6_0_0
    PSM3_IDENTIFY location /redacted_path/lib/libfabric.so.1
    PSM3_IDENTIFY build date 2024-03-15T11:47:31
    PSM3_IDENTIFY src checksum redacted_checksum
    PSM3_IDENTIFY git checksum
    PSM3_IDENTIFY HAL: verbs (RDMA Verbs)
    PSM3_IDENTIFY Global Rank -1 (-1 total) Local Rank -1 (-1 total)
    PSM3_IDENTIFY CPU Core 0 NUMA 0 PID redacted
    PSM3_IDENTIFY NIC 0 (mlx5_0) Port 1 100000 Mbps NUMA 0 LID=3 GID=0xfe80000000000000:248a0703008ef618 QP=81253

SHM Failures (dl build: They all timed out)
fi_rma_bw -e rdm -o write -i 5 -p "shm"
fi_rma_bw -e rdm -o write -i 5 -u -p "shm"
fi_rma_bw -e rdm -o read -i 5 -p "shm"
fi_rma_bw -e rdm -o read -i 5 -u -p "shm"
fi_rma_bw -e rdm -o writedata -i 5 -p "shm"
fi_rma_bw -e rdm -o writedata -i 5 -u -p "shm"
fi_rdm_atomic -i 5 -o all -p "shm"
fi_rdm_atomic -i 5 -o all -u -p "shm"
fi_dgram_pingpong -i 5 -p "shm"

UCX Failures (reg build: this is a known race condition)

  • name: fi_rdm_tagged_peek -p "ucx"
    timestamp: 20240315-190801+0000
    result: Fail
    time: 2
    server_cmd: /redacted_path/bin/fi_rdm_tagged_peek -p "ucx" -E
    server_stdout: |
    Sending 10 tagged messages
    Waiting for messages to complete
    [n1:redacted] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
    ==== backtrace (redacted) ====
    0 0x0000000000012cf0 __funlockfile() :0
    1 0x0000000000033210 ucp_ep_destroy_base() ???:0
    2 0x000000000004b3ee ucp_worker_discard_uct_ep_progress() ???:0
    3 0x000000000004b4b5 ucp_worker_destroy() ???:0
    4 0x00000000000c9dba ucx_ep_close() ucx_ep.c:0
    5 0x0000000000404071 fi_close() /redacted_path/reg/include/rdma/fabric.h:632
    6 0x0000000000404071 ft_close_fids() /redacted_path/libfabric/fabtests/common/shared.c:1776
    7 0x0000000000404b5a ft_free_res() /redacted_path/libfabric/fabtests/common/shared.c:1846
    8 0x0000000000401bee main() /redacted_path/libfabric/fabtests/functional/rdm_tagged_peek.c:363
    9 0x0000000000401bee main() /redacted_path/libfabric/fabtests/functional/rdm_tagged_peek.c:364
    10 0x000000000003ad85 __libc_start_main() ???:0
    11 0x000000000040202e _start() ???:0

    client_cmd: /redacted_path/bin/fi_rdm_tagged_peek -p "ucx" -E node_name
    client_stdout: |
    Peek for a bad msg
    Peek w/ claim for a bad msg
    Peek msg 1
    Receive msg 1
    Peek w/ claim msg 2
    Receive claimed msg 2
    Peek & discard msg 3
    Checking to see if msg 3 was discarded
    Peek w/ claim msg 4
    Claim and discard msg 4
    Receive msg 5
    Receive msg 6
    Receive msg 10
    Receive msg 9
    Receive msg 8
    Receive msg 7
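
Summaries like the ones above can be scanned for failing entries with a short script. This is a minimal sketch, assuming only the `name:` / `result:` field layout visible in this thread; the function name `failed_tests` is hypothetical and is not part of fabtests or the CI tooling:

```python
import re

# Matches "name: <value>" or "result: <value>" anywhere in a line, so an
# optional timestamp or bullet prefix in front of the field is ignored.
FIELD = re.compile(r"\b(name|result):\s+(.*\S)")


def failed_tests(log_text: str) -> list[str]:
    """Return the test names whose result field starts with 'Fail'."""
    failures = []
    current = None
    for line in log_text.splitlines():
        match = FIELD.search(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "name":
            current = value
        elif key == "result" and current is not None:
            if value.startswith("Fail"):
                failures.append(current)
            current = None
    return failures


# Lines excerpted from the first failure report in this thread.
sample = """\
13:18:14  - name:   fi_multi_ep -e rdm -v --shared-av -p "psm3"
13:18:14    result: Fail
13:18:14  - name:   fi_ubertest
13:18:14    timestamp: 20240312-201740+0000
13:18:14    result: Fail [/]
"""
for test_name in failed_tests(sample):
    print(test_name)
```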

Updates:
- Full support for Intel oneAPI DPC++/C++ compiler
- Improved default tuning for Intel GPUs

Signed-off-by: Scott Breyer <[email protected]>
@sjb017 sjb017 force-pushed the psm3-rel-11.6.0.0 branch from 7ef789d to eb774a8 Compare March 20, 2024 19:52
@j-xiong j-xiong merged commit acde37d into ofiwg:main Mar 21, 2024
11 checks passed
3 participants