Guidance on Optimizing Precompiler Options for Performance (MPI_BLOCK and CACHE_SIZE).

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.


hszhao.cn@gmail.com
Full Member
Posts: 189
Joined: Tue Oct 13, 2020 11:32 pm

Guidance on Optimizing Precompiler Options for Performance (MPI_BLOCK and CACHE_SIZE).

#1 Post by hszhao.cn@gmail.com » Fri Apr 19, 2024 11:46 pm

Dear VASP Forum Members,

I am reaching out to seek your valuable insights and recommendations on correctly setting the precompiler options for optimal performance in VASP simulations. Specifically, I am interested in understanding the best practices for configuring the -DMPI_BLOCK and -DCACHE_SIZE options in relation to system resources such as CPU cache size, the number of processor cores, and overall memory size.

Background Information: Our current computational node includes a Linux-based system with the following characteristics:
- CPU: 2 × AMD EPYC 9554 (64 cores each, 128 cores in total)
- Memory: 24 × 32 GB (4800 MT/s registered ECC)
- Toolchain: Intel OneAPI 2023.2.0

Below are the precompiler options suggested by the default makefile.include.intel:

Code: Select all

# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxIFC\" \
              -DMPI -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dfock_dblbuf
Points of Inquiry:

1. MPI_BLOCK: Given our system specifications, how should we adjust the -DMPI_BLOCK=8000 setting? Is there a rule of thumb or formula for choosing the ideal block size based on the number of cores or the characteristics of the MPI communication?

2. CACHE_SIZE: Can you recommend how to set the -DCACHE_SIZE=4000 option in relation to the CPU's cache size? How does modifying this parameter affect performance, and what should we keep in mind to balance computational efficiency against memory usage?

Thank you in advance for your time and help. I look forward to your valuable suggestions.

Best regards,
Zhao

michael_wolloch
Global Moderator
Posts: 110
Joined: Tue Oct 17, 2023 10:17 am

Re: Guidance on Optimizing Precompiler Options for Performance (MPI_BLOCK and CACHE_SIZE).

#2 Post by michael_wolloch » Mon Apr 22, 2024 12:44 pm

Dear Zhao,

-DCACHE_SIZE is only used by the deprecated in-house FFT routines. Those routines should no longer be used, so this setting is not important.
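
If you want to tidy things up, you can simply drop the flag from CPP_OPTIONS in your makefile.include. A minimal sketch based on the block you posted, assuming you link against an external FFT library (FFTW or MKL) as the stock makefile.include.intel does:

Code: Select all

# Precompiler options with the unused -DCACHE_SIZE removed
CPP_OPTIONS = -DHOST=\"LinuxIFC\" \
              -DMPI -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dfock_dblbuf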

For -DMPI_BLOCK, I have to spend some time looking into the source code to figure out the possible performance benefits of changing it and its interaction with other flags.

I will get back to you with more information as soon as possible,
Michael

michael_wolloch
Global Moderator
Posts: 110
Joined: Tue Oct 17, 2023 10:17 am

Re: Guidance on Optimizing Precompiler Options for Performance (MPI_BLOCK and CACHE_SIZE).

#3 Post by michael_wolloch » Thu May 02, 2024 10:50 am

Dear Zhao,

sorry it took a while, but I have finally gathered a bit more information regarding the precompiler option -DMPI_BLOCK:

The MPI_BLOCK size specifies the block size used when splitting up large arrays for global summations and all-to-all communication in an in-house algorithm. Using this in-house algorithm is no longer recommended in most cases.

If -DUSE_COLLECTIVE is set (which is the case for all currently provided makefile.include files), MPI collectives are used for all-to-all communication and global summations instead of those routines. This is recommended and removes the code's dependence on the MPI_BLOCK size to a very large extent.
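
As a quick sanity check (a shell one-liner, assuming the usual VASP 6 layout with makefile.include in the top-level source directory), you can confirm that the flag is active in your build configuration:

Code: Select all

# Print the line in makefile.include that enables MPI collectives
grep -n -- '-Duse_collective' makefile.include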

-DMPI_INPLACE also comes into play here (it is currently only set for the NEC and OpenACC builds): if this precompiler flag is set, most MPI routines will not split up the arrays for copying and MPI_allreduce, even when -DUSE_COLLECTIVE is not set.

One of the few routines that still uses MPI_BLOCK, even if -DUSE_COLLECTIVE is turned on, is M_SUM_MASTER_D, which sums n double-precision numbers onto the master rank. This routine is used, e.g., in KPAR_SYNC_ALL, which synchronizes orbitals, eigenvalues, and occupancies.

You can turn on profiling, play around with MPI_BLOCK, and check the impact by reviewing the timings of 'kpar_sync_all'.
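
A hypothetical tuning iteration might look like this (a sketch, not a recipe: it assumes your build enables VASP 6's built-in profiler via the -DPROFILING precompiler flag, that its per-routine timings end up in OUTCAR, and that the new block size of 16000 is an arbitrary example value):

Code: Select all

# Change the block size (16000 is an arbitrary example value)
sed -i 's/-DMPI_BLOCK=8000/-DMPI_BLOCK=16000/' makefile.include
# Rebuild the standard executable from scratch
make veryclean && make std
# Rerun the benchmark case and inspect the profiler timings
mpirun -np 128 bin/vasp_std
grep -i 'kpar_sync_all' OUTCAR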

Be advised that the results will depend not only on the setting of -DMPI_BLOCK, but also on your problem size, the number of MPI ranks, the number of nodes, the type of interconnect, and the MPI implementation and version. We thus cannot recommend a specific MPI_BLOCK setting for your hardware, since the interplay between these parameters is substantial. However, the effects of changing it are probably small if -DUSE_COLLECTIVE is set.

To answer your questions in short:
-DCACHE_SIZE is deprecated and irrelevant.
-DMPI_BLOCK is mostly irrelevant. Tuning it for your hardware and calculation is possible, but cumbersome, and I expect no significant performance gains from optimizing it.


Please let me know if this answered your question and if I can lock the topic.
Cheers, Michael

hszhao.cn@gmail.com
Full Member
Posts: 189
Joined: Tue Oct 13, 2020 11:32 pm

Re: Guidance on Optimizing Precompiler Options for Performance (MPI_BLOCK and CACHE_SIZE).

#4 Post by hszhao.cn@gmail.com » Sat May 04, 2024 2:17 am

Dear Michael,

Got it. Thank you very much for your thorough and penetrating analysis.

Regards,
Zhao
