Dear VASP developers,
I have recently encountered strange problems when running VASP with the KPAR flag switched on. On average, about 70% of jobs hang or crash. Some of them (usually not many) return PMPI_Allreduce errors as shown below:
MPICH ERROR [Rank 16] [job id 9854754.0] [Sun Mar 9 23:26:34 2025] [nid001712] - Abort(5838991) (rank 16 in comm 0): Fatal error in PMPI_Allreduce: Other MPI error, error stack:
PMPI_Allreduce(523).........................: MPI_Allreduce(sbuf=0x7ffc66783650, rbuf=0x91fbcc0, count=2, datatype=MPI_DOUBLE_PRECISION, op=MPI_MAX, comm=comm=0xc4000004) failed
PMPI_Allreduce(508).........................:
MPIR_CRAY_Allreduce(577)....................:
MPIR_Allreduce_impl(352)....................:
MPIR_Allreduce_intra_auto(264)..............:
MPIR_Allreduce_intra_recursive_doubling(192):
MPIC_Sendrecv(341)..........................:
MPIC_Wait(71)...............................:
MPIR_Wait_impl(41)..........................:
MPID_Progress_wait(201).....................:
MPIDI_Progress_test(97).....................:
MPIDI_OFI_handle_cq_error(1067).............: OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Message too long - OK)
The hangs and failures happen at random moments, but often when "final diagonalization" is printed to stdout. I contacted user support, who suggested switching off MPI collectives by setting MPICH_COLL_OPT_OFF=1 and MPICH_SHARED_MEM_COLL_OPT=0, but it didn't help. I even tried switching from cray-mpich to OpenMPI, but the result is the same.
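For reference, the two settings suggested by user support can be applied in the job script roughly as follows. This is only a minimal sketch: the variable names and values are the ones quoted above, while the commented launch line (executable name, srun without options) is an assumption and not my actual job script.

```shell
# Settings suggested by user support (names and values as quoted above).
export MPICH_COLL_OPT_OFF=1
export MPICH_SHARED_MEM_COLL_OPT=0

# Print the values so they are recorded in the job log.
echo "MPICH_COLL_OPT_OFF=${MPICH_COLL_OPT_OFF}"
echo "MPICH_SHARED_MEM_COLL_OPT=${MPICH_SHARED_MEM_COLL_OPT}"

# Hypothetical launch line (assumption: executable name and srun options
# depend on the actual installation and batch setup).
# srun vasp_std
```

Echoing the variables makes it easy to confirm from the job output that the settings were actually in effect for a given run.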
Right now I'm testing a build without -Duse_collective; maybe it will help.
Without KPAR everything works fine.
Modules loaded during compilation:
1) perftools-base/24.03.0
2) ModuleLabel/label (S)
3) lumi-tools/24.05 (S)
4) init-lumi/0.2 (S)
5) LUMI/24.03 (S)
6) craype-x86-milan
7) craype-accel-host
8) libfabric/1.15.2.0
9) craype-network-ofi
10) xpmem/2.8.2-1.0_5.1__g84a27a5.shasta
11) partition/C (S)
12) PrgEnv-gnu/8.5.0
13) craype/2.7.31.11
14) cray-mpich/8.1.29
15) cray-libsci/24.03.0
16) cray-dsmml/0.3.0
17) gcc-native/13.2
18) cpeGNU/24.03
19) Wannier90/3.1.0-cpeGNU-24.03
20) DFTD4/3.4.0-cpeGNU-24.03
21) libxc/7.0.0-cpeGNU-24.03-nofhc
22) cray-fftw/3.3.10.7
23) cray-hdf5-parallel/1.12.2.11
24) VASP/6.4.3-cpeGNU-24.03-sol-vtst-occmat-stopcar-nocollectives
Have you ever encountered such a problem?