Strange results from Nvidia compiler: vasp 6.4.2

Problems running VASP: crashes, internal errors, "wrong" results.


sergey_lisenkov1
Newbie
Posts: 24
Joined: Tue Nov 12, 2019 7:55 am

#1 Post by sergey_lisenkov1 » Fri Sep 22, 2023 1:19 pm

Hello all,

We have two compiler suites on our IBM Power9 machine: GNU (11.2.1) and the Nvidia HPC SDK (23.7 and earlier versions). I found that the VASP executable built with the Nvidia compilers crashes during the first ionic step for many well-known structures, while the GNU-compiled executable works fine. This is the CPU version.

nvhpc executable:

POSCAR, INCAR and KPOINTS ok, starting setup
 FFT: planning ... GRIDC
 FFT: planning ... GRID_SOFT
 FFT: planning ... GRID
 WAVECAR not read
 entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
DAV:   1     0.173264571665E+04    0.17326E+04   -0.11788E+05  1920   0.130E+03
DAV:   2    -0.230358811178E+03   -0.19630E+04   -0.19301E+04  2360   0.312E+02
DAV:   3    -0.413055714576E+03   -0.18270E+03   -0.18196E+03  2112   0.105E+02
DAV:   4    -0.418346594421E+03   -0.52909E+01   -0.52754E+01  2304   0.184E+01
DAV:   5    -0.418504365022E+03   -0.15777E+00   -0.15760E+00  2368   0.314E+00
DAV:   6    -0.418510569381E+03   -0.62044E-02   -0.62019E-02  2384   0.609E-01
DAV:   7    -0.418510821741E+03   -0.25236E-03   -0.25230E-03  2104   0.119E-01
DAV:   8    -0.418510833627E+03   -0.11886E-04   -0.11878E-04  1352   0.277E-02
DAV:   9    -0.418510835590E+03   -0.19633E-05   -0.19605E-05  1248   0.114E-02
DAV:  10    -0.418510836385E+03   -0.79468E-06   -0.79410E-06  1248   0.694E-03    0.391E+01
DAV:  11    -0.409127465421E+03    0.93834E+01   -0.47181E+00  2496   0.598E+00    0.341E+01
DAV:  12    -0.600727414656E+04   -0.55981E+04   -0.25396E+04  2232   0.427E+02    0.753E+02
DAV:  13    -0.206374649373E+05   -0.14630E+05   -0.16714E+05  3616   0.617E+02    0.942E+02
DAV:  14    -0.199576022174E+04    0.18642E+05   -0.63744E+04  2680   0.473E+02    0.630E+02
DAV:  15    -0.240481014127E+04   -0.40905E+03   -0.32877E+04  3152   0.372E+02    0.254E+02
DAV:  16    -0.949437102878E+03    0.14554E+04   -0.12875E+04  2992   0.197E+02    0.256E+02
DAV:  17    -0.130140804480E+04   -0.35197E+03   -0.32329E+03  2296   0.226E+02    0.190E+02
DAV:  18    -0.802601320115E+03    0.49881E+03   -0.17270E+03  2664   0.133E+02    0.175E+02
DAV:  19    -0.110661176686E+05   -0.10264E+05   -0.38819E+04  3280   0.446E+02    0.135E+03
DAV:  20    -0.673739963064E+04    0.43287E+04   -0.34403E+04  2800   0.387E+02    0.703E+02
DAV:  21    -0.822818573836E+05   -0.75544E+05   -0.14020E+05  3792   0.129E+03    0.135E+03
 -----------------------------------------------------------------------------
|                                                                             |
|     EEEEEEE  RRRRRR   RRRRRR   OOOOOOO  RRRRRR      ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     EEEEE    RRRRRR   RRRRRR   O     O  RRRRRR       #       #       #      |
|     E        R   R    R   R    O     O  R   R                               |
|     E        R    R   R    R   O     O  R    R      ###     ###     ###     |
|     EEEEEEE  R     R  R     R  OOOOOOO  R     R     ###     ###     ###     |
|                                                                             |
|     ERROR FEXCP: supplied Exchange-correletion table                        |
|      is too small, maximal index : 5237                                     |
|                                                                             |
|       ---->  I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <----       |
|                                                                             |
 -----------------------------------------------------------------------------

GNU executable:

POSCAR, INCAR and KPOINTS ok, starting setup
 FFT: planning ... GRIDC
 FFT: planning ... GRID_SOFT
 FFT: planning ... GRID
 WAVECAR not read
 entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
DAV:   1     0.173243421018E+04    0.17324E+04   -0.11787E+05  1920   0.130E+03
DAV:   2    -0.230370982792E+03   -0.19628E+04   -0.19299E+04  2360   0.312E+02
DAV:   3    -0.413047841306E+03   -0.18268E+03   -0.18194E+03  2112   0.105E+02
DAV:   4    -0.418338752370E+03   -0.52909E+01   -0.52755E+01  2304   0.184E+01
DAV:   5    -0.418496586483E+03   -0.15783E+00   -0.15766E+00  2368   0.314E+00
DAV:   6    -0.418502797412E+03   -0.62109E-02   -0.62085E-02  2384   0.609E-01
DAV:   7    -0.418503050125E+03   -0.25271E-03   -0.25265E-03  2104   0.119E-01
DAV:   8    -0.418503062029E+03   -0.11905E-04   -0.11897E-04  1352   0.277E-02
DAV:   9    -0.418503063997E+03   -0.19678E-05   -0.19648E-05  1248   0.114E-02
DAV:  10    -0.418503064795E+03   -0.79736E-06   -0.79617E-06  1248   0.695E-03    0.391E+01
DAV:  11    -0.409122240809E+03    0.93808E+01   -0.47174E+00  2496   0.598E+00    0.341E+01
DAV:  12    -0.393017308166E+03    0.16105E+02   -0.78579E+01  2288   0.241E+01    0.196E+01
DAV:  13    -0.392867053009E+03    0.15026E+00   -0.99992E+00  2112   0.885E+00    0.118E+01
DAV:  14    -0.392232402868E+03    0.63465E+00   -0.10163E+00  2336   0.289E+00    0.779E+00
DAV:  15    -0.391914142726E+03    0.31826E+00   -0.17821E+00  2128   0.290E+00    0.298E+00
DAV:  16    -0.391920664614E+03   -0.65219E-02   -0.15550E-01  2192   0.123E+00    0.146E+00
DAV:  17    -0.391913112667E+03    0.75519E-02   -0.11686E-01  2144   0.888E-01    0.119E+00
DAV:  18    -0.391901673271E+03    0.11439E-01   -0.41159E-02  2120   0.535E-01    0.471E-01
DAV:  19    -0.391902290698E+03   -0.61743E-03   -0.17945E-02  2256   0.305E-01    0.367E-01
DAV:  20    -0.391901664399E+03    0.62630E-03   -0.87993E-03  2128   0.221E-01    0.341E-01
DAV:  21    -0.391901405796E+03    0.25860E-03   -0.37255E-03  2224   0.169E-01    0.132E-01
DAV:  22    -0.391901429823E+03   -0.24027E-04   -0.12711E-03  2064   0.103E-01    0.990E-02
DAV:  23    -0.391901353266E+03    0.76557E-04   -0.28209E-04  1328   0.549E-02
   1 F= -.39959478E+03 E0= -.39959478E+03  d E =-.399595E+03
 curvature:   0.00 expect dE= 0.000E+00 dE for cont linesearch  0.000E+00
 trial: gam= 0.00000 g(F)=  0.611E-02 g(S)=  0.000E+00 ort = 0.000E+00 (trialstep = 0.100E+01)
 search vector abs. value=  0.611E-02
 reached required accuracy - stopping structural energy minimisation
 writing wavefunctions

I have tried everything I could think of with the Nvidia compilers, including disabling optimization and using different libraries. Nothing helps. What could be the issue?

Thanks,
Sergey
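
To pin down where two builds part ways, the two logs above can be compared step by step. The following is a minimal Python sketch, not part of the original posts: the function names `scf_energies` and `first_divergence` and the 1 eV tolerance are illustrative assumptions, and the sample lines are copied from the outputs above. It reads the `DAV:` lines and reports the first electronic step at which the total energies differ by more than the tolerance.

```python
import re

# Matches electronic-step lines such as "DAV:  12    -0.600727414656E+04 ..."
# and captures the step number N and the total energy E.
SCF_LINE = re.compile(r"^\s*(?:DAV|RMM):\s*(\d+)\s+([-+]?\d*\.\d+E[-+]\d+)")


def scf_energies(text):
    """Return {step: energy} for the first ionic step found in the log text."""
    energies = {}
    for line in text.splitlines():
        match = SCF_LINE.match(line)
        if match:
            step = int(match.group(1))
            if step in energies:  # counter restarted, so a new ionic step began
                break
            energies[step] = float(match.group(2))
    return energies


def first_divergence(log_a, log_b, tol=1.0):
    """Return the first common step where the energies differ by more than tol (eV)."""
    a, b = scf_energies(log_a), scf_energies(log_b)
    for step in sorted(set(a) & set(b)):
        if abs(a[step] - b[step]) > tol:
            return step
    return None


# Three lines from each of the two logs quoted in this post.
nvhpc_log = """\
DAV:  10    -0.418510836385E+03   -0.79468E-06   -0.79410E-06  1248   0.694E-03    0.391E+01
DAV:  11    -0.409127465421E+03    0.93834E+01   -0.47181E+00  2496   0.598E+00    0.341E+01
DAV:  12    -0.600727414656E+04   -0.55981E+04   -0.25396E+04  2232   0.427E+02    0.753E+02
"""
gnu_log = """\
DAV:  10    -0.418503064795E+03   -0.79736E-06   -0.79617E-06  1248   0.695E-03    0.391E+01
DAV:  11    -0.409122240809E+03    0.93808E+01   -0.47174E+00  2496   0.598E+00    0.341E+01
DAV:  12    -0.393017308166E+03    0.16105E+02   -0.78579E+01  2288   0.241E+01    0.196E+01
"""

print(first_divergence(nvhpc_log, gnu_log))  # prints 12
```

For the two logs in this thread, the energies stay within a fraction of an eV of each other through step 11 and then differ by several thousand eV at step 12.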

merzuk.kaltak
Administrator
Posts: 282
Joined: Mon Sep 24, 2018 9:39 am

#2 Post by merzuk.kaltak » Mon Sep 25, 2023 8:36 am

Dear Sergey,

Please provide some input and output files (preferably for a small system) so that we can reproduce your problem.
Also, please let us know which libraries (and versions) you compile and link VASP against.
If possible, please also upload the makefile.include you used.

#3 Post by sergey_lisenkov1 » Tue Oct 03, 2023 6:57 pm

Good evening,

I apologize for the late reply.

Please find attached the set of input files and the makefile.include I used. The test case is not small, but it takes only 5 minutes to reach this error.
I used the LAPACK/BLAS/ScaLAPACK libraries shipped with the Nvidia SDK, and FFTW-3.3.10.

#4 Post by merzuk.kaltak » Wed Oct 04, 2023 9:36 am

Dear Sergey,

Please also upload the input files and the OUTCAR of the successful run.

#5 Post by merzuk.kaltak » Mon Oct 09, 2023 12:43 pm

Dear Sergey,

Inspecting your makefile.include, I found a few points that might be responsible for the issue.
You set a global -O2 optimization flag with the following line:

OFLAG       = -O2 -fast -Mcache_align

The recommended makefile.include files for nvhpc typically have only

OFLAG = -fast

If possible, please use the latter.

Furthermore, it seems you have compiled your own ScaLAPACK and link to it with

# BLAS (mandatory)
BLAS        = -lblas

# LAPACK (mandatory)
LAPACK      = -llapack

# scaLAPACK (mandatory)
SCALAPACK   =  -L$(HOME)/arch/nvhpc/scalapack-2.2.0-spectrum_mpi/lib/ -lscalapack -llapack  -lblas

We usually recommend using the ScaLAPACK shipped with the hpc_sdk suite.

Last but not least, and maybe most important:
it seems you compiled FFTW with hpc_sdk and link your VASP executable to this library with

# FFTW (mandatory)
FFTW_ROOT  ?= $(HOME)/arch/nvhpc/fftw-3.3.10-nv23.7/

If this is correct, then please try compiling FFTW with gcc and linking VASP to it.
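
The three points above can be checked mechanically before rebuilding. The following is a rough Python sketch, not an official VASP tool: the helper names `read_vars` and `review` are made up for illustration, and the path tests (looking for `$(HOME)`, `nvhpc` or `-nv` in a value) are heuristics tuned to the directory names quoted in this post, so they are assumptions that will not generalize to other installations.

```python
import re


def read_vars(text):
    """Collect simple 'NAME = value' style assignments; comments are dropped."""
    variables = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # strip trailing comments
        match = re.match(r"\s*(\w+)\s*[?:+]?=\s*(.*)", line)
        if match:
            variables[match.group(1)] = match.group(2).strip()
    return variables


def review(text):
    """Return notes for the three settings discussed in this post."""
    v = read_vars(text)
    notes = []
    # Point 1: the suggestion is to use only -fast as the global flag.
    if "OFLAG" in v and v["OFLAG"].split() != ["-fast"]:
        notes.append("OFLAG is '%s'; suggested is '-fast'" % v["OFLAG"])
    # Point 2 (heuristic): a library path under $(HOME) suggests a self-built ScaLAPACK.
    if "$(HOME)" in v.get("SCALAPACK", ""):
        notes.append("SCALAPACK points to a self-built library")
    # Point 3 (heuristic): directory names hint that FFTW was built with nvhpc.
    fftw = v.get("FFTW_ROOT", "")
    if "nvhpc" in fftw or "-nv" in fftw:
        notes.append("FFTW_ROOT looks like an nvhpc-compiled FFTW")
    return notes


# The lines quoted from the uploaded makefile.include.
sample = """\
OFLAG       = -O2 -fast -Mcache_align
# scaLAPACK (mandatory)
SCALAPACK   =  -L$(HOME)/arch/nvhpc/scalapack-2.2.0-spectrum_mpi/lib/ -lscalapack -llapack  -lblas
# FFTW (mandatory)
FFTW_ROOT  ?= $(HOME)/arch/nvhpc/fftw-3.3.10-nv23.7/
"""

flagged = review(sample)
for note in flagged:
    print(note)
```

Run on the quoted lines, the sketch prints one note per point. A file with `OFLAG = -fast` and an FFTW_ROOT whose path contains none of those markers (for example a hypothetical `/opt/fftw-3.3.10-gcc`) produces no notes.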

#6 Post by sergey_lisenkov1 » Tue Oct 10, 2023 12:32 pm

Here is the output of a successful run: VASP compiled with GNU 11.2.1 and OpenMPI, with OpenBLAS and FFTW also compiled with GNU.

#7 Post by sergey_lisenkov1 » Tue Oct 10, 2023 1:43 pm

I think your suggestion was correct: the problem was caused by the Nvidia-compiled FFTW. With a GNU-compiled FFTW, the error disappears.

Thanks for your help!

Sergey
