VASP MPICH error with KPAR switched on

leszek_nowakowski
Newbie
Posts: 12
Joined: Fri Mar 15, 2024 10:35 am

#1 Post by leszek_nowakowski » Tue Apr 08, 2025 6:27 pm

Dear VASP developers,

I have recently encountered strange problems running VASP with KPAR switched on. On average, about 70% of jobs hang or crash. Some of them (usually not many) return PMPI_Allreduce errors, as shown below:

Code: Select all

MPICH ERROR [Rank 16] [job id 9854754.0] [Sun Mar  9 23:26:34 2025] [nid001712] - Abort(5838991) (rank 16 in comm 0): Fatal error in PMPI_Allreduce: Other MPI error, error stack:
PMPI_Allreduce(523).........................: MPI_Allreduce(sbuf=0x7ffc66783650, rbuf=0x91fbcc0, count=2, datatype=MPI_DOUBLE_PRECISION, op=MPI_MAX, comm=comm=0xc4000004) failed
PMPI_Allreduce(508).........................: 
MPIR_CRAY_Allreduce(577)....................: 
MPIR_Allreduce_impl(352)....................: 
MPIR_Allreduce_intra_auto(264)..............: 
MPIR_Allreduce_intra_recursive_doubling(192): 
MPIC_Sendrecv(341)..........................: 
MPIC_Wait(71)...............................: 
MPIR_Wait_impl(41)..........................: 
MPID_Progress_wait(201).....................: 
MPIDI_Progress_test(97).....................: 
MPIDI_OFI_handle_cq_error(1067).............: OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Message too long - OK)

The hangs and failures happen at random moments, but often when "final diagonalization" is printed to stdout. I contacted user support, and they suggested switching off MPI collectives by setting MPICH_COLL_OPT_OFF=1 and MPICH_SHARED_MEM_COLL_OPT=0, but it didn't help. I even tried switching from cray-mpich to OpenMPI, but the result is the same.
Right now I'm testing a build without -Duse_collective; maybe it will help.
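For reference, the two settings suggested by user support are environment variables, so they have to be exported in the batch script before VASP is launched. A minimal sketch: the variable names and values come from the post above, while the commented launch line and the vasp_std binary name are assumptions about the job setup.

```shell
#!/bin/bash
# Settings suggested by user support (names and values as quoted in the post).
export MPICH_COLL_OPT_OFF=1          # turn off cray-mpich collective optimizations
export MPICH_SHARED_MEM_COLL_OPT=0   # turn off shared-memory collective optimizations

# Assumed launch line; adjust to the actual job script.
# srun vasp_std
```

Exporting the variables in the script, rather than in the login shell, ties them to the job that is actually being tested.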

Without KPAR everything works fine.

Modules loaded during compilation:

Code: Select all

  1) perftools-base/24.03.0       6) craype-x86-milan                      11) partition/C         (S)  16) cray-dsmml/0.3.0              21) libxc/7.0.0-cpeGNU-24.03-nofhc
  2) ModuleLabel/label      (S)   7) craype-accel-host                     12) PrgEnv-gnu/8.5.0         17) gcc-native/13.2               22) cray-fftw/3.3.10.7
  3) lumi-tools/24.05       (S)   8) libfabric/1.15.2.0                    13) craype/2.7.31.11         18) cpeGNU/24.03                  23) cray-hdf5-parallel/1.12.2.11
  4) init-lumi/0.2          (S)   9) craype-network-ofi                    14) cray-mpich/8.1.29        19) Wannier90/3.1.0-cpeGNU-24.03  24) VASP/6.4.3-cpeGNU-24.03-sol-vtst-occmat-stopcar-nocollectives
  5) LUMI/24.03             (S)  10) xpmem/2.8.2-1.0_5.1__g84a27a5.shasta  15) cray-libsci/24.03.0      20) DFTD4/3.4.0-cpeGNU-24.03

Have you ever encountered such a problem?

michael_wolloch
Global Moderator
Posts: 143
Joined: Tue Oct 17, 2023 10:17 am

Re: VASP MPICH error with KPAR switched on

#2 Post by michael_wolloch » Wed Apr 09, 2025 12:46 pm

Hi Leszek Nowakowski,

We are not routinely testing our code using the cray compilers, so it is possible that there is a problem with KPAR and that compiler. I have not done much work on LUMI, but I will try to reproduce your error and hopefully find a solution.

One quick thing that will probably not help, but I wanted to mention:
In your Stdout_vasp file, a bunch of modules seem to get replaced at the beginning of your run. cray-mpich is one of them. Could that be the root of the issue? Have you tried compiling the code with the reloaded versions of the modules, which seem to be less modern ones across the board?

I will get back to you soon with some updates, I hope.
Until then all the best,
Michael


#3 Post by leszek_nowakowski » Wed Apr 09, 2025 3:08 pm

Hi Michael Wolloch,

Thank you for the quick response.
I compiled different versions with the newest cray-mpich (and even with OpenMPI, as mentioned before), but the problem persists. I also thought that compiling with VTST and occupation matrix support might be the issue, so I made a plain VASP installation (with only a small POTCAR I/O improvement patch written by the LUMI team), and the calculations still get stuck.

I have just realized that I attached the wrong set of files: there is no MPICH error at the end of Stdout_vasp. The correct files are below.

Best Regards,
Leszek

#4 Post by michael_wolloch » Fri Apr 11, 2025 10:28 am

Dear Leszek Nowakowski,

I have compiled VASP 6.4.3 myself on LUMI using CCE 17. I activated OpenMP and HDF5, but no other bells and whistles. I loaded these modules:

Code: Select all

module load LUMI/24.03 partition/C
ml load cray-fftw/3.3.10.7 cray-hdf5/1.12.2.11

and ended up with this list of modules:

Code: Select all

Currently Loaded Modules:
  1) perftools-base/24.03.0   4) cray-dsmml/0.3.0      7) PrgEnv-cray/8.5.0      10) init-lumi/0.2    (S)  13) craype-accel-host   16) xpmem/2.8.2-1.0_5.1__g84a27a5.shasta      19) cray-hdf5/1.12.2.11
  2) cce/17.0.1               5) cray-mpich/8.1.29     8) ModuleLabel/label (S)  11) LUMI/24.03       (S)  14) libfabric/1.15.2.0  17) partition/C                          (S)
  3) craype/2.7.31.11         6) cray-libsci/24.03.0   9) lumi-tools/24.05  (S)  12) craype-x86-milan      15) craype-network-ofi  18) cray-fftw/3.3.10.7
  

I could reproduce your issue with KPAR, even though I had a slightly different setup and ran with only one thread per MPI rank!

I also ran the fast testsuite, on 4 MPI ranks and with 4 OpenMP threads per rank. And there all the tests pass, even those with KPAR active. I guess this was expected, since the issue only occurs after considerable time, and the tests are all relatively quick.

I will run the full testsuite over the weekend, just to be sure that there are no other issues, and since we have no cray toolchain in our CI.

Since this might be very hard to diagnose, is there a particular reason why you are using Cray compilers in particular? Have you tried an Intel or AMD toolchain on LUMI, and did you experience the same issues there?
I spoke with a colleague of mine who used cray machines in the past, and he reported having a lot fewer troubles with VASP when he used Intel toolchains, and also a noticeable speedup. This was a couple of years ago, however, so this information might be outdated.

I will update you once I have more to report,
cheers, Michael


#5 Post by leszek_nowakowski » Fri Apr 11, 2025 6:35 pm

Dear Michael,

Thank you very much for the tests. I'm glad you could reproduce this bug.

I use CCE since this is the default environment recommended by LUST, and they have prepared an EasyConfig file: https://lumi-supercomputer.github.io/LU ... cs/v/VASP/

Since, as you wrote, it could be hard to diagnose, I will switch the compiler. I think using Intel compilers on AMD CPUs may not be ideal for optimization, so I will try aocc (minding the issues with OpenMP mentioned in https://www.vasp.at/wiki/index.php/Toolchains ) with the Cray libraries. I will test the installation and let you and the LUMI team know about any issues.

Thank you for your help. I'm looking forward to the news on the full testsuite.

Best Regards,
Leszek


#6 Post by leszek_nowakowski » Sun Apr 13, 2025 12:17 pm

Dear Michael,

Sorry for the double post, but I can't edit the previous one until it is accepted, and I have a very important update.

Actually, when compiling VASP with the eb file, EasyBuild uses the cpeGNU toolchain, so after loading all the necessary modules:

Code: Select all

module load LUMI/24.03 partition/C cpeGNU

it turns out that the compilers are GNU, not Cray:

Code: Select all

> cc --version
gcc-13 (SUSE Linux) 13.2.1 20230912 [revision b96e66fd4ef3e36983969fb8cdd1956f551a074b]
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Code: Select all

> mpif90 --version
GNU Fortran (SUSE Linux) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

So the compiler may not be the issue. Neither is MPI, since builds with both cray-mpich and OpenMPI give errors. So far, between the two of us, we have checked:
Cray compilers + cray-mpich
GNU compilers + OpenMPI
GNU compilers + cray-mpich

And all of these are failing.
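The version checks above can be repeated for several compiler wrappers in one go. A minimal sketch, assuming the wrapper names cc, CC and ftn of the Cray programming environment are the ones the build actually calls (the makefile.include is not shown in the thread); the function name is made up for illustration.

```shell
#!/bin/bash
# Print the first line of `<wrapper> --version`, or a note if it is not on PATH.
wrapper_version() {
    local cmd="$1"
    if ! command -v "$cmd" >/dev/null 2>&1; then
        echo "$cmd: not found"
        return 0
    fi
    echo "$cmd: $("$cmd" --version 2>&1 | head -n 1)"
}

# Example: for w in cc CC ftn; do wrapper_version "$w"; done
```

Running the loop right after the `module load` line shows which compiler each wrapper resolves to in that environment.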


#7 Post by michael_wolloch » Mon Apr 14, 2025 3:37 pm

Dear Leszek,

Thanks for the further testing. Very interesting. I am now trying to reproduce the issue on our machines with the GNU compiler.

The full testsuite ran rather smoothly with CCE 17, only two tests failed with segfaults: C_2x2x2_RPAFORCE and ML_ZrO_ISTART3. I did exclude Wannier90 and LibXC tests.
I will have to check what went wrong with those two, but it was not the MPI error.

I also restarted the failed calculation from the last CONTCAR, and it ran another 21 ionic steps before failing again with the same PMPI_Reduce issue.
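Restarting from the last CONTCAR, as described above, usually means copying it over POSCAR before resubmitting. A minimal sketch; the function name and the POSCAR.orig backup name are made up for illustration.

```shell
#!/bin/bash
# Prepare a restart: use the last written geometry (CONTCAR) as the new POSCAR.
# Usage: restart_from_contcar <run_directory>
restart_from_contcar() {
    local dir="$1"
    # A missing or empty CONTCAR cannot be used as a restart geometry.
    if [ ! -s "$dir/CONTCAR" ]; then
        echo "no usable CONTCAR in $dir" >&2
        return 1
    fi
    # Keep the original input geometry for reference (hypothetical backup name).
    if [ -f "$dir/POSCAR" ]; then
        cp "$dir/POSCAR" "$dir/POSCAR.orig"
    fi
    cp "$dir/CONTCAR" "$dir/POSCAR"
}
```

The size check matters here because a job that aborts early can leave an empty CONTCAR behind.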

Have you observed this error also in more well-behaved calculations that have better electronic convergence, even if they do many ionic steps?

I will report back tomorrow when I have the results from our local machine (also an AMD EPYC), both with 6.4.3 and 6.5.1.

Cheers, Michael


#8 Post by leszek_nowakowski » Mon Apr 14, 2025 8:25 pm

Dear Michael,

Right now I am working with a cobalt spinel slab, and I only recently started using KPAR, so I don't have any experience with other structures calculated with KPAR.

Cheers,
Leszek


#9 Post by michael_wolloch » Wed Apr 16, 2025 8:57 am

Dear Leszek,

I think I solved it, and I am a bit annoyed with myself that it took me so long to realize it.
Most of the confusion was caused by inexperience with LUMI and the Cray programming environments. I did not realize that we were looking at a GNU compile and not a Cray compile until you pointed it out, even though it should have been clear from the compiler flags alone in your provided makefile.include.
Please compile again, lowering the optimization flag from O3 to O2:
So:

Code: Select all

OFLAG       = -O2

instead of:

Code: Select all

OFLAG       = -O3
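The change can also be applied from the command line before rebuilding. A minimal sketch, assuming the makefile.include contains a single line starting with `OFLAG` as shown above; the function name is hypothetical.

```shell
#!/bin/bash
# Lower the optimization level in a makefile.include from -O3 to -O2.
# Usage: lower_oflag <path/to/makefile.include>
lower_oflag() {
    local file="$1"
    # Only touch the line that starts with "OFLAG =" and replace -O3 as a whole
    # token; a .bak copy of the original file is kept next to it.
    sed -i.bak -E '/^OFLAG[[:space:]]*=/ s/-O3([[:space:]]|$)/-O2\1/' "$file"
    # Show the resulting line so the change can be checked by eye.
    grep -E '^OFLAG[[:space:]]*=' "$file"
}
```

Object files compiled with the old flag are not updated by this edit, so a clean rebuild is the safer choice afterwards.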

I managed to run your example without any issues after compiling 6.4.2 with a very similar makefile.include and a very similar toolchain.

Code: Select all

ml load LUMI/24.03 partition/C
ml load PrgEnv-gnu/8.5.0
ml load cray-fftw/3.3.10.7 cray-hdf5/1.12.2.11

Code: Select all

Currently Loaded Modules:
  1) perftools-base/24.03.0       6) craype-x86-milan                      11) partition/C       (S)  16) cray-libsci/24.03.0
  2) ModuleLabel/label      (S)   7) craype-accel-host                     12) gcc-native/13.2        17) PrgEnv-gnu/8.5.0
  3) lumi-tools/24.05       (S)   8) libfabric/1.15.2.0                    13) craype/2.7.31.11       18) cray-fftw/3.3.10.7
  4) init-lumi/0.2          (S)   9) craype-network-ofi                    14) cray-dsmml/0.3.0       19) cray-hdf5/1.12.2.11
  5) LUMI/24.03             (S)  10) xpmem/2.8.2-1.0_5.1__g84a27a5.shasta  15) cray-mpich/8.1.29
  

My makefile.include is attached. We recommend OFLAG = -O2 or lower for all supported toolchains other than NEC (-O3) and nvhpc (-fast).
I also compiled an aocc version (with -O2) and that works as well.
I will still make some effort to get an actual cray compile to work, and will update you on my results.
I will also run a preliminary benchmark on both the gnu and the aocc compile using MPI+OpenMP.

Please update me as well if reducing the OFLAG worked for you.

Cheers, Michael

#10 Post by leszek_nowakowski » Fri Apr 18, 2025 10:39 am

Hello Michael,

I still have the same problem.

First, I used my own makefile.include, changed only OFLAG = -O3 to OFLAG = -O2, and built VASP using EasyBuild. Then I saw that our makefile.include files are a little different, so I used yours (removing only the profiling flags and adding the Wannier, Libxc and vaspsol flags). It didn't work either, so I built VASP again using only the makefile.include you provided, without EasyBuild, loading exactly the same modules you listed, and it still fails with the same error.

What does your Slurm batch script look like? Maybe there are some relevant differences.

Also, when I test a build for this error, I run 10 identical calculations, because the error does not occur in every run. So maybe you were just lucky? As far as I can see, about 30% of calculations pass this test.
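Counting failures over a batch of identical runs can be scripted. A minimal sketch, assuming each run writes its stdout to a separate file and that a failed run contains the MPICH abort message quoted in the first post; the function name and file layout are hypothetical.

```shell
#!/bin/bash
# Count how many job output files contain an MPI abort.
# Usage: summarize_runs <stdout_file>...
summarize_runs() {
    local total=0 failed=0 f
    for f in "$@"; do
        total=$((total + 1))
        # Signature taken from the error message in the first post.
        if grep -q -e 'MPICH ERROR' -e 'Fatal error in PMPI_' "$f"; then
            failed=$((failed + 1))
        fi
    done
    echo "failed $failed of $total runs"
}
```

A job that hangs until the time limit prints no such message and would be counted as passed here, so hung jobs need a separate check, for example on the job state reported by the scheduler.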

Cheers, Leszek

