
VASP 6.3.0 loop time is slower than 6.2.1 on specific node architecture for a hybrid calculation

Posted: Mon May 16, 2022 4:48 pm
by guyohad
Dear VASP developers,

I was comparing the loop time of 6.3.0 with 6.2.1 for a hybrid calculation, varying the number of nodes and the number of mpiprocs per node, using OpenMP.
In one set of runs I varied the number of nodes between 1 and 8, keeping the number of mpiprocs per node (=4) and OMP_NUM_THREADS (=6) fixed. In another set I varied the number of mpiprocs per node, keeping the number of nodes (=4) fixed and OMP_NUM_THREADS = 24/(number of mpiprocs per node).

The loop time is identical between the two versions in all cases but one: 4 nodes, 4 mpiprocs per node, OMP_NUM_THREADS=6. For this configuration, 6.3.0 is three times slower than 6.2.1 (loop time 30 seconds vs. 10 seconds). The nodes used have 24 CPUs each.
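
For reference, the problematic case corresponds to a layout like the following (a schematic sketch only; my actual runs go through the scheduler with the attached submit script, and the Intel-MPI-style mpirun options and the executable path below are placeholders for illustration):

Code: Select all

# 4 nodes x 4 mpiprocs per node x 6 OpenMP threads per rank = 24 cores per node, 96 in total
export OMP_NUM_THREADS=6
mpirun -np 16 -ppn 4 /path/to/vasp_std   # -ppn = ranks per node (Intel MPI style); path is a placeholder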

Going over the OUTCARs of all calculations, I noticed that only for the configuration mentioned above the default value chosen for APACO is 10.0 in 6.3.0 vs. 16.0 in 6.2.1. In all other cases APACO was automatically set to 16.0 in both versions. However, setting APACO = 16.0 in the INCAR of 6.3.0 for the problematic case leaves the loop time unchanged.

Both versions were compiled with OpenMP using the exact same makefile.include (attached).
I also attach the input files and the OUTCARs of the two calculations.

I appreciate any help identifying the source of this problem.
Sincerely,
Guy

Re: VASP 6.3.0 loop time is slower than 6.2.1 on specific node architecture for a hybrid calculation

Posted: Wed May 25, 2022 9:48 am
by andreas.singraber
Hello Guy,

thank you very much for your detailed report and sorry it took me a while to reply... I will now have a closer look and report back.

Best,
Andreas

Re: VASP 6.3.0 loop time is slower than 6.2.1 on specific node architecture for a hybrid calculation

Posted: Wed May 25, 2022 3:21 pm
by andreas.singraber
Hello again!

I tried to reproduce your findings but I have not found such a large difference yet... I will continue my tests. However, I already have some preliminary comments and questions:

1.) Your comment about the APACO default settings and the fact that you observe the degraded performance only for a specific MPI/OpenMP combination make me wonder a bit. I can confirm that the default value of APACO in VASP 6.2.1 was 16.0 and that the new default in 6.3.0 is 10.0. So I would assume that all OUTCARs with the default APACO = 16.0 were run with the VASP 6.2.1 executable. Could you maybe check that again?

2.) Hybrid parallelization with MPI/OpenMP can be quite tricky to set up for optimal performance, as documented on our Wiki page. Did you use processor pinning as suggested there? I cannot see any corresponding settings (OMP_PLACES, OMP_PROC_BIND, I_MPI_PIN, ...) in your submit script; are they defined somewhere else? As far as I know, the way OpenMP is handled internally changed from 6.2.1 to 6.3.0, so the pinning may now be more important and may explain the timing differences. I will ask my colleague about the details; see the sketch after this list for the kind of settings I mean.

3.) Could you tell me the actual node hardware you are using? Maybe you could post the output of the commands

Code: Select all

lscpu
and

Code: Select all

numactl --hardware
if they are available?
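
For illustration, the kind of pinning settings I mean would look roughly like this in a submit script (a sketch only, assuming an Intel MPI/OpenMP toolchain and your 4 mpiprocs x 6 threads layout; the executable path is a placeholder):

Code: Select all

export OMP_NUM_THREADS=6      # OpenMP threads per MPI rank
export OMP_PLACES=cores       # one OpenMP place per physical core
export OMP_PROC_BIND=close    # keep the threads of a rank on neighbouring cores
export I_MPI_PIN=1            # enable Intel MPI process pinning
export I_MPI_PIN_DOMAIN=omp   # give each rank a domain of OMP_NUM_THREADS cores
mpirun -np 16 -ppn 4 /path/to/vasp_std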

Thank you!

All the best,
Andreas

Re: VASP 6.3.0 loop time is slower than 6.2.1 on specific node architecture for a hybrid calculation

Posted: Thu May 26, 2022 11:14 am
by guyohad
Dear Andreas,

Thank you very much for your reply. To address your questions:

1.) You are absolutely right. I was using the wrong executable for some of the calculations. Sorry about that. I reran everything with the correct executable, and in this light the problem can be rephrased.
The problem is more general than I previously thought and is not related to a specific node architecture. In 6.2.1 the loop time is pretty much the same when varying the number of mpiprocs while keeping OMP_NUM_THREADS = 24/(number of mpiprocs per node) (~10-15 seconds). In 6.3.0, however, the loop time increases as the number of mpiprocs decreases. I think this is indeed an indication that OpenMP needs to be handled differently to get the same performance as in 6.2.1, as you mentioned. This brings me to your next question.

2.) I tried adding the following lines to my job script:

Code: Select all

export OMP_PLACES=cores
export OMP_PROC_BIND=close
but there was no change in loop time. I will try to look into it more.

3.) Here is the information you requested about the nodes I am using:

Code: Select all

[cfa091][]$lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    1
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2499.975
CPU max MHz:           2900.0000
CPU min MHz:           1200.0000
BogoMIPS:              4394.69
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-2,12-14
NUMA node1 CPU(s):     3-5,15-17
NUMA node2 CPU(s):     6-8,18-20
NUMA node3 CPU(s):     9-11,21-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
and

Code: Select all

[cfa091][~]$numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 12 13 14
node 0 size: 31883 MB
node 0 free: 13818 MB
node 1 cpus: 3 4 5 15 16 17
node 1 size: 32254 MB
node 1 free: 20019 MB
node 2 cpus: 6 7 8 18 19 20
node 2 size: 32254 MB
node 2 free: 21040 MB
node 3 cpus: 9 10 11 21 22 23
node 3 size: 32238 MB
node 3 free: 20668 MB
node distances:
node   0   1   2   3
  0:  10  21  31  31
  1:  21  10  31  31
  2:  31  31  10  21
  3:  31  31  21  10
Best wishes,
Guy

Re: VASP 6.3.0 loop time is slower than 6.2.1 on specific node architecture for a hybrid calculation

Posted: Wed Jun 08, 2022 8:45 pm
by andreas.singraber
Hello Guy,

we had a closer look at this, and it turns out that the change in the internal OpenMP handling is indeed the origin of the performance loss in this particular setup. Here is a little background story: before 6.3.0 there were multiple places in the VASP source code with nested OpenMP regions. For example, there could be an OpenMP-parallelized outer loop over bands, and inside the loop body there could be calls to the FFT library, which itself could spawn new OpenMP threads. In such situations one can easily run into oversubscription of the available CPU cores. To avoid that, some OpenMP implementations disable the spawning of new threads inside a parallel region by default. However, the actual behavior is vendor-specific, which makes nested OpenMP parallelization hard for us to control and maintain.

Hence, it was decided that nested OpenMP regions should be avoided in the future and removed from the existing code. Preferably, threaded work is delegated to libraries, e.g. the FFT library. Along with these code changes, benchmarks were performed which did not show any degradation of performance for our test cases. Unfortunately, it seems your setup triggers a case where the original nested OpenMP code is more performant. We will work on this issue and hopefully have a solution in our next release that avoids nested OpenMP regions but still recovers the pre-6.3.0 performance in your setup.

Here is the good news: until we have a proper solution, there is a very simple code modification that should bring the performance back to a level comparable to 6.2.1. In the file src/fock_dbl.F of VASP 6.3.0, go to lines 1237 and 1287 and remove the first two exclamation marks, so that each line starts with only one "!". Line 1237 should then read:

Code: Select all

!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(N,EXX,NS,FD) REDUCTION(+:EXHF) IF (MOD(NBLK,omp_nthreads)==0)
and line 1287:

Code: Select all

!$OMP END PARALLEL DO
After recompiling VASP, this re-enables the original nested code version and should restore the expected performance. Do not worry, oversubscription will also not occur in your Intel toolchain setup. I hope this fix works for you too; please report back your findings! Thank you very much for bringing this up and making us aware of this performance problem!
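
For convenience, the edit and the rebuild could be scripted roughly as follows (a sketch only: it assumes the unmodified 6.3.0 sources, that the exclamation marks sit at the very beginning of lines 1237 and 1287, and that you build the standard executable with make std; adjust paths and targets to your setup):

Code: Select all

cd /path/to/vasp.6.3.0                      # placeholder for your VASP source directory
cp src/fock_dbl.F src/fock_dbl.F.bak        # keep a backup of the original file
sed -i -e '1237s/^!!//' -e '1287s/^!!//' src/fock_dbl.F   # strip the first two '!' on both lines
make std                                    # rebuild with your usual makefile.include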

All the best,

Andreas Singraber

Re: VASP 6.3.0 loop time is slower than 6.2.1 on specific node architecture for a hybrid calculation

Posted: Thu Jun 09, 2022 6:55 am
by guyohad
Dear Andreas,

Your solution seems to fix the problem!
I highly appreciate your thorough examination of the issue and the detailed explanation.

All the best,
Guy

Re: VASP 6.3.0 loop time is slower than 6.2.1 on specific node architecture for a hybrid calculation

Posted: Wed Aug 09, 2023 11:23 am
by guyohad
Dear Andreas,

Unless I missed something, it looks like this issue has not been fixed in VASP 6.4.2.
Can I assume that the fix you suggested before would also work in 6.4.2, with the relevant lines in fock_dbl.F now being 1239 and 1289?

Thanks a lot!
Guy