
UCX ERROR Invalid active_width on mlx5_0:1: 16

Posted: Thu Jun 29, 2023 7:43 am
by amihai_silverman1
Hello,
I have compiled vasp.6.4.1 on our HPC cluster.
When I run the NaCl example (from /testsuite/tests), I get an error in the output:

[1688024082.874154] [n017:228578:0] ib_iface.c:947 UCX ERROR Invalid active_width on mlx5_0:1: 16
Abort(1614991) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(143)........:
MPID_Init(1310)..............:
MPIDI_OFI_mpi_init_hook(1901):
create_endpoint(2593)........: OFI endpoint open failed (ofi_init.c:2593:create_endpoint:Input/output error)

I am not sure whether the problem is in the cluster or in my compilation.
I will be grateful for your help.
Attached is the makefile.include which was used for the compilation, and a tar of the running folder with the PBS submit script and the input and output files.
Thanks a lot
Amihai

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

Posted: Thu Jun 29, 2023 9:02 am
by fabien_tran1
Hi,

Can you please provide information about the HPC cluster like the processor type and RAM?

Is the error occurring for all examples that you have tried or only NaCl?

A Google search of "UCX ERROR Invalid active_width" provides only
https://github.com/openucx/ucx/issues/4556
https://forums.developer.nvidia.com/t/u ... 7-9/206236
Have you tried the possible solutions provided there?
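For context, the GitHub issue above traces this message to how UCX validates the active_width value that the verbs layer reports for the port. A small sketch of that decoding (the mapping is assumed from the rdma-core sources discussed in the issue; treat it as illustrative, not authoritative):

```python
# active_width in struct ibv_port_attr is a bitfield, not a lane count.
# Mapping assumed from rdma-core: the 2x value (16) was added for
# HDR-era links, which is reportedly why older UCX builds reject it
# as "invalid" instead of recognizing it.
IB_LINK_WIDTHS = {
    1: "1x",
    2: "4x",
    4: "8x",
    8: "12x",
    16: "2x",  # newer rdma-core only
}

def decode_active_width(value: int) -> str:
    """Translate a raw active_width value into a link-width string."""
    return IB_LINK_WIDTHS.get(value, f"unknown ({value})")

print(decode_active_width(16))  # the value from the error message -> "2x"
```

If this reading applies here, updating UCX (or the OFED stack) to a version that knows about the 2x width would be the direction suggested by the issue.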

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

Posted: Thu Jun 29, 2023 10:05 am
by amihai_silverman1
Hi,
The same error occurs in every run. That is why I tried a simple test.
The cluster has Lenovo compute nodes with Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz processors,
384 GB of RAM in each node, and an InfiniBand network.
I compiled with the Intel oneAPI 2022 compiler.

Regarding the links you provided: in my case the command
ucx_info -d
doesn't show any error. Its output is attached.
Thanks, Amihai

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

Posted: Thu Jun 29, 2023 11:38 am
by fabien_tran1
Is the error also occurring in non-parallel calculation with "mpirun -np 1"? Are you running with OpenMP, i.e., with OMP_NUM_THREADS>1?

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

Posted: Sun Jul 02, 2023 9:23 am
by amihai_silverman1
Yes, the run also failed with "mpirun -np 1".
It also failed with the same error when I put
export OMP_NUM_THREADS=12
mpirun -np 12 ...

I thought it might be a problem with the Intel oneAPI compiler, so I downloaded and installed the latest version and recompiled VASP with it.
That didn't solve the problem; the run fails with the same error.
Thanks, Amihai

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

Posted: Sun Jul 02, 2023 9:38 am
by amihai_silverman1
Please see the output of ucx_info -d. It looks like Device: mlx5_0:1 is OK:
#
# Memory domain: self
# component: self
# register: unlimited, cost: 0 nsec
# remote key: 8 bytes
#
# Transport: self
#
# Device: self
#
# capabilities:
# bandwidth: 6911.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8k
# am_bcopy: <= 8k
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# priority: 0
# device address: 0 bytes
# iface address: 8 bytes
# error handling: none
#
#
# Memory domain: tcp
# component: tcp
#
# Transport: tcp
#
# Device: eno6
#
# capabilities:
# bandwidth: 1131.64 MB/sec
# latency: 5258 nsec
# overhead: 50000 nsec
# am_bcopy: <= 8k
# connection: to iface
# priority: 1
# device address: 4 bytes
# iface address: 2 bytes
# error handling: none
#
# Device: eno2
#
# capabilities:
# bandwidth: 113.16 MB/sec
# latency: 5776 nsec
# overhead: 50000 nsec
# am_bcopy: <= 8k
# connection: to iface
# priority: 1
# device address: 4 bytes
# iface address: 2 bytes
# error handling: none
#
# Device: ib0
#
# capabilities:
# bandwidth: 5571.26 MB/sec
# latency: 5212 nsec
# overhead: 50000 nsec
# am_bcopy: <= 8k
# connection: to iface
# priority: 1
# device address: 4 bytes
# iface address: 2 bytes
# error handling: none
#
#
# Memory domain: ib/mlx5_0
# component: ib
# register: unlimited, cost: 90 nsec
# remote key: 16 bytes
# local memory handle is required for zcopy
#
# Transport: rc
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 600 nsec + 1 * N
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8k
# put_zcopy: <= 1g, up to 3 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4k
# get_bcopy: <= 8k
# get_zcopy: 33..1g, up to 3 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4k
# am_short: <= 123
# am_bcopy: <= 8191
# am_zcopy: <= 8191, up to 2 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 127
# domain: device
# connection: to ep
# priority: 30
# device address: 3 bytes
# ep address: 4 bytes
# error handling: peer failure
#
#
# Transport: rc_mlx5
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 600 nsec + 1 * N
# overhead: 40 nsec
# put_short: <= 220
# put_bcopy: <= 8k
# put_zcopy: <= 1g, up to 8 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4k
# get_bcopy: <= 8k
# get_zcopy: 33..1g, up to 8 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4k
# am_short: <= 235
# am_bcopy: <= 8191
# am_zcopy: <= 8191, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 187
# domain: device
# connection: to ep
# priority: 30
# device address: 3 bytes
# ep address: 4 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: dc_mlx5
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 660 nsec
# overhead: 40 nsec
# put_short: <= 172
# put_bcopy: <= 8k
# put_zcopy: <= 1g, up to 8 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4k
# get_bcopy: <= 8k
# get_zcopy: 33..1g, up to 8 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4k
# am_short: <= 187
# am_bcopy: <= 8191
# am_zcopy: <= 8191, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 139
# domain: device
# connection: to iface
# priority: 30
# device address: 3 bytes
# iface address: 5 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: ud
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 610 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 1 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 3984
# connection: to ep, to iface
# priority: 30
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure
#
#
# Transport: ud_mlx5
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 610 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 132
# connection: to ep, to iface
# priority: 30
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure
#
#
# Memory domain: rdmacm
# component: rdmacm
# supports client-server connection establishment via sockaddr
# < no supported devices found >
#
# Memory domain: sysv
# component: sysv
# allocate: unlimited
# remote key: 32 bytes
#
# Transport: mm
#
# Device: sysv
#
# capabilities:
# bandwidth: 6911.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 92
# am_bcopy: <= 8k
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# priority: 0
# device address: 8 bytes
# iface address: 16 bytes
# error handling: none
#
#
# Memory domain: posix
# component: posix
# allocate: unlimited
# remote key: 37 bytes
#
# Transport: mm
#
# Device: posix
#
# capabilities:
# bandwidth: 6911.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 92
# am_bcopy: <= 8k
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# priority: 0
# device address: 8 bytes
# iface address: 16 bytes
# error handling: none
#
#
# Memory domain: cma
# component: cma
# register: unlimited, cost: 9 nsec
#
# Transport: cma
#
# Device: cma
#
# capabilities:
# bandwidth: 11145.00 MB/sec
# latency: 80 nsec
# overhead: 400 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# priority: 0
# device address: 8 bytes
# iface address: 4 bytes
# error handling: none
#

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

Posted: Sun Jul 02, 2023 9:38 am
by fabien_tran1
Hi,

I will ask my colleagues if they have an idea of what could be the problem. Meanwhile you could try what is mentioned in the last post of
https://forums.developer.nvidia.com/t/u ... 9/206236/4
where it is suggested to add "soft memlock unlimited" and "hard memlock unlimited" in /etc/security/limits.d/rdma.conf. I guess it is not related and won't help, but try it just in case.
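For reference, the suggested change amounts to adding lines like the following to /etc/security/limits.d/rdma.conf (the standard limits.conf syntax is domain, type, item, value; the exact contents on a given distribution may differ):

```
# /etc/security/limits.d/rdma.conf
# Allow all users to lock unlimited memory for RDMA registration.
* soft memlock unlimited
* hard memlock unlimited
```

After the change, a fresh login shell should report "unlimited" from "ulimit -l".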

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

Posted: Sun Jul 02, 2023 9:56 am
by amihai_silverman1
Please see the following; it may help to debug the problem:

$ lspci | grep Mellanox
12:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
$ ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.22.4030
node_guid: 0409:73ff:ffe1:bbd8
sys_image_guid: 0409:73ff:ffe1:bbd8
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: HP_2180110032
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 98
port_lid: 62
port_lmc: 0x00
link_layer: InfiniBand
$ fi_info -l
psm2:
version: 1.7
psm:
version: 1.7
usnic:
version: 1.0
ofi_rxm:
version: 1.0
ofi_rxd:
version: 1.0
verbs:
version: 1.0
UDP:
version: 1.1
sockets:
version: 2.0
tcp:
version: 0.1
ofi_perf_hook:
version: 1.0
ofi_noop_hook:
version: 1.0
shm:
version: 1.0
ofi_mrail:
version: 1.0

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

Posted: Mon Jul 03, 2023 7:12 am
by fabien_tran1
Hi,

For running a calculation, do you load the required Intel modules and set the environment variables correctly?
Does the problem also occur if OpenMP is switched off with "export OMP_NUM_THREADS=1"?
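As a concrete check (the module name below is a typical site convention, not necessarily this cluster's, and /opt/intel/oneapi/setvars.sh is only the default oneAPI install path), the environment could be initialized before the run with something like:

```shell
# Either load the site-provided module (name is cluster-specific) ...
module load intel/oneapi-2022
# ... or source the oneAPI environment script directly (default install path):
source /opt/intel/oneapi/setvars.sh

# Disable OpenMP threading for the test, then run a single MPI rank:
export OMP_NUM_THREADS=1
mpirun -np 1 vasp_std
```

If the error persists even with a single rank and a single thread in a cleanly initialized environment, that would point away from the MPI/OpenMP setup and toward the node's InfiniBand stack.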

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

Posted: Tue Jul 04, 2023 5:58 am
by amihai_silverman1
Hi,
Thank you for your replies. I think the problem is with the compute nodes; I have asked the cluster support team to check it with the cluster integrator.