Using VASP 6.2 OpenACC GPU port
Posted: Fri Jan 29, 2021 8:04 pm
by david_keller
After following the advice in https://www.vasp.at/forum/viewtopic.php?f=2&t=18020 I ended up being able to compile and link successfully!
Thanks for all the help!
(BTW - I linked to Intel's MKL BLAS, ScaLAPACK, and FFTW libraries.
Are any of these libraries making use of GPUs via OpenACC in the PGI-compiled versions?)
Now when I try to run the test suite the very first job errors with:
VASP_TESTSUITE_RUN_FAST="Y"
Executed at: 14_54_01/29/21
==================================================================
------------------------------------------------------------------
CASE: andersen_nve
------------------------------------------------------------------
CASE: andersen_nve
entering run_recipe andersen_nve
andersen_nve step STD
------------------------------------------------------------------
andersen_nve step STD
entering run_vasp_g
running on 4 total cores
distrk: each k-point on 2 cores, 2 groups
distr: one band on 1 cores, 2 groups
OpenACC runtime initialized ... 1 GPUs detected
-----------------------------------------------------------------------------
| |
| EEEEEEE RRRRRR RRRRRR OOOOOOO RRRRRR ### ### ### |
| E R R R R O O R R ### ### ### |
| E R R R R O O R R ### ### ### |
| EEEEE RRRRRR RRRRRR O O RRRRRR # # # |
| E R R R R O O R R |
| E R R R R O O R R ### ### ### |
| EEEEEEE R R R R OOOOOOO R R ### ### ### |
| |
| M_init_nccl: Error in ncclCommInitRank |
| |
| ----> I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <---- |
| |
-----------------------------------------------------------------------------
Re: VASP 6.2
Posted: Mon Feb 01, 2021 7:14 pm
by david_keller
Is it possible to build and run an OpenACC version with OpenMP enabled as well?
How can I best make use of a server with 1 GPU and 40 CPUs?
If I were to get 3 more GPUs on my server, how would I best run in terms of number of processes, number of threads, and number of GPUs?
Also, I may have already submitted a similar reply, but I cannot see whether one is waiting in a review queue somewhere.
Again,
Thanks for all your help.
Using VASP 6.2 OpenACC GPU port
Posted: Tue Feb 02, 2021 11:01 am
by mmarsman
Hi David,
The NCCL error message you encounter is probably a consequence of the fact that you start VASP with a number of MPI-ranks that is greater than the number of GPUs you have available.
Unfortunately the current versions of NCCL do not allow MPI-ranks to share a GPU, so you are forced to use one MPI-rank per GPU.
This information still needs to go onto our wiki (I'm currently working on it, sorry for the delay and the inconvenience this caused you!).
In case you have a CPU with 40 cores (threads?) and only one GPU, that is indeed a bit unfortunate.
Getting 3 more GPUs would of course change this somewhat (in addition to adding a whopping amount of compute power GPU-side).
Another option, which you already allude to yourself: you can add OpenMP into the mix as well.
Yes, it is indeed a good idea to build the OpenACC version with OpenMP support.
In that case each MPI-rank may spawn a few threads and you'll get better CPU-usage in those code paths that still remain CPU side.
I will add a "makefile.include" template for OpenACC+OpenMP to the wiki ASAP.
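To give you a rough idea of what that looks like in practice (the executable name and launcher options below are only illustrative, so adjust them to your MPI library): with one GPU you start a single MPI rank and let OpenMP pick up some of the remaining cores, and with four GPUs you start four ranks.

# 1 GPU: one MPI rank (NCCL allows only one rank per GPU), OpenMP threads for the CPU-side work
mpirun -np 1 -x OMP_NUM_THREADS=10 vasp_std

# 4 GPUs: one rank per GPU, each rank again spawning a few OpenMP threads
mpirun -np 4 -x OMP_NUM_THREADS=10 vasp_std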
Another solution to the above would be not to use NCCL and to use MPS instead, so that the MPI ranks can share a GPU.
That is probably still the best option for small calculations, where a single MPI rank has trouble saturating the GPU.
... at the moment however there is a part of the code that breaks without NCCL support (the hybrid functionals).
I'm working on changing that!
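In the meantime, for reference: using MPS is roughly a matter of starting its control daemon before the run and shutting it down afterwards (just a sketch; please check the NVIDIA documentation for the details on your system).

nvidia-cuda-mps-control -d            # start the MPS control daemon
# ... run VASP with more MPI ranks than GPUs ...
echo quit | nvidia-cuda-mps-control   # stop the daemon when done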
Cheers,
Martijn Marsman
Re: VASP 6.2
Posted: Wed Feb 03, 2021 1:57 pm
by david_keller
Thanks Martijn,
I will keep my eye out for a wiki update on the OpenACC+OpenMP build.
Another question on the 6.2 documentation that explains usage.
You Doc says:
The execution statement depends heavily on your system! Our reference system consists of compute nodes with 4 cores per CPU. The example job script given here is for a job occupying a total of 64 cores, so 16 physical nodes.
On our clusters
1 openMPI process per node
#$ -N test
#$ -q narwal.q
#$ -pe orte* 64
mpirun -bynode -np 8 -x OMP_NUM_THREADS=8 vasp
The MPI option -bynode ensures that the VASP processes are started in a round robin fashion, so each of the physical nodes gets 1 running VASP process. If we miss out this option, on each of the first 4 physical nodes 4 VASP processes would be started, leaving the remaining 12 nodes unoccupied.
It would seem that if you are using 16 nodes with 64 processors in total, then the mpirun line should have either '-np 64' if it refers to all processes, or '-np 4' if it refers to processes per node? It would also seem that you would want more than one thread per process to keep the pipelines full?
I am confused...
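For what it is worth, what I would have guessed the documentation intends, with one rank per node and 4 OpenMP threads per rank (16 x 4 = 64 cores), is something like the following (just my guess, not taken from the docs):

# one MPI rank per node (16 nodes), 4 OpenMP threads per rank -> 16 x 4 = 64 cores
mpirun -bynode -np 16 -x OMP_NUM_THREADS=4 vasp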
Re: VASP 6.2
Posted: Wed Feb 03, 2021 4:41 pm
by david_keller
Hi Martijn,
Multiple CPU ranks sharing a GPU would most likely be good. I am trying to do some profiling to see if there is GPU idle time that could be soaked up in this way.
I succeeded in getting an OpenACC and OpenMP version to compile. My include file follows. I had to build an OpenMP-enabled version of FFTW3 first.
Unfortunately, it seems to slow down the throughput of a 1 GPU / 1 CPU run when omp_threads > 1: it slows the run by 15% with omp_threads=2 and by 30% with omp_threads=4.
#module load nvidia-hpc-sdk/20.11
#module load anaconda3/2018.12/b2
# Precompiler options
CPP_OPTIONS= -DHOST=\"LinuxPGI\" \
-DMPI -DMPI_BLOCK=8000 -DMPI_INPLACE -Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dqd_emulate \
-Dfock_dblbuf \
-D_OPENACC \
-DUSENCCL -DUSENCCLP2P -D_OPENMP
CPP = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX) > $*$(SUFFIX)
FC = mpif90 -acc -gpu=cc70,cc80,cuda11.1 -mp
FCL = mpif90 -acc -gpu=cc70,cc80,cuda11.1 -c++libs
FREE = -Mfree
FFLAGS = -Mbackslash -Mlarge_arrays
OFLAG = -fast
DEBUG = -Mfree -O0 -traceback
# Specify your NV HPC-SDK installation, try to set NVROOT automatically
NVROOT =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')
# ...or set NVROOT manually
#NVHPC ?= /opt/nvidia/hpc_sdk
#NVVERSION = 20.9
#NVROOT = $(NVHPC)/Linux_x86_64/$(NVVERSION)
# Use NV HPC-SDK provided BLAS and LAPACK libraries
BLAS = -lblas
LAPACK = -llapack
BLACS =
SCALAPACK = -Mscalapack
CUDA = -cudalib=cublas,cusolver,cufft,nccl -cuda
LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) $(CUDA)
# Software emulation of quadruple precision
QD ?= $(NVROOT)/compilers/extras/qd
LLIBS += -L$(QD)/lib -lqdmod -lqd
INCS += -I$(QD)/include/qd
# Use the FFTs from fftw
#FFTW ?= /opt/gnu/fftw-3.3.6-pl2-GNU-5.4.0
#FFTW = /cm/shared/software/fftw3/3.3.8/b6
FFTW = /g1/ssd/kellerd/vasp_gpu_work/fftw-3.3.9_nv
#LLIBS += -L$(FFTW)/lib -lfftw3
LLIBS += -L$(FFTW)/.libs -L$(FFTW)/threads/.libs -lfftw3 -lfftw3_omp
#INCS += -I$(FFTW)/include
INCS += -I$(FFTW)/mpi -I$(FFTW)/api
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
# Redefine the standard list of O1 and O2 objects
SOURCE_O1 := pade_fit.o
SOURCE_O2 := pead.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = nvfortran
CC_LIB = nvc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1 -Mfixed
FREE_LIB = $(FREE)
OBJECTS_LIB= linpack_double.o getshmem.o
# For the parser library
CXX_PARS = nvc++ --no_warnings
# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin
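For the OpenMP-enabled FFTW mentioned above, a generic build recipe with the NVIDIA HPC-SDK compilers looks roughly like this (illustrative only; the exact configure flags and paths are not reproduced here):

# illustrative FFTW 3.3.9 build with OpenMP support, compiled with the NVIDIA HPC-SDK
tar xf fftw-3.3.9.tar.gz && cd fftw-3.3.9
./configure CC=nvc F77=nvfortran --enable-openmp
make -j

The makefile.include above then points at the FFTW build tree (the .libs directories) rather than at an installed prefix.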
Re: Using VASP 6.2 OpenACC GPU port
Posted: Thu Feb 04, 2021 1:33 pm
by david_keller
Building with the nvidia-hpc-sdk, there is an unsatisfied reference to libatomic.so.
Where can this be found?
Re: Using VASP 6.2 OpenACC GPU port
Posted: Wed Feb 10, 2021 6:58 pm
by mmarsman
Hi David,
So I've finally found some time to work on an OpenACC wiki-page (not finished yet but getting there):
wiki/index.php/OpenACC_GPU_port_of_VASP
I've put in a link to an OpenACC+OpenMP makefile.include as well.
For the latter case I link to Intel's MKL library for the CPU-side FFT, BLAS, LAPACK, and scaLAPACK calls.
Unsurprisingly, this is unbeatable on Intel CPUs, especially where threaded FFTs are concerned.
I do not know what CPU you have in your system but if it's an Intel then use MKL.
Unfortunately I noticed our description of the use of OpenMP (how to place the MPI-ranks and OpenMP threads etc) is outdated and crappy, so I will have to find time to work on that next.
With respect to your question on the unsatisfied "libatomic.so" reference in your build with the NVIDIA HPC-SDK: I have no idea, having never encountered this problem myself. But maybe this issue was solved along the way ... I have lost my way in this forum thread a bit, I fear.
Regarding the performance of VASP on GPUs: putting work onto accelerators involves some overhead in the form of data transfers back and forth and launching of kernels. In practice this means that for small jobs you will probably see that the GPUs may not perform as well as you might be hoping (compared to CPU runs).
Correspondingly GPU idle time will be high for small jobs. The CPU will not be able to parcel out the work to the GPU fast enough to saturate it.
This is not surprising considering the enormous amount of flops these cards represent. You may see it as trying to run a small job on too large a number of CPU cores: if there is not enough work to parallelise over, performance will drop at some point.
Another thing: for now please use the NVIDIA HPC-SDK 20.9! According to our contacts at NVIDIA version 20.11 has certain performance issues (with particular constructs in VASP), and in version 21.1 a bug was introduced that may even lead to wrong results.
I was assured these issues will all be fixed in the next release of the NVIDIA HPC-SDK (v21.2).
(I will put this in the wiki as well.)
Cheers,
Martijn
Re: Using VASP 6.2 OpenACC GPU port
Posted: Thu Feb 18, 2021 3:17 pm
by david_keller
Thanks Martijn!
We compiled and linked using SDK 20.9 with the Intel MKL FFT routines, rather than SDK 20.11 with fftw-3.3.9 (built using the NVIDIA SDK).
Our test run's elapsed time changed dramatically between a run with OpenACC on 1 GPU and a run on 40 CPUs alone:
            20.11              20.09+MKL
            Elap    Maxd       Elap    Maxd
1GPU/1CPU   486     .48e-2     348     .70e-2
40 CPU      184     .46e-2     338     .55e-2
So the elapsed time was slower for the CPU run using 20.9+MKL, but the GPU run became faster.
I do not know how significant it is, but the only result that was off in the more significant digits was the 'maximum difference moved'. Random seeds differ etc., so I presume you should not see identical results, but I do not know how important a maximum distance moved output would be?
This is after NMD=10.
BTW - we are still trying to reproduce some results we have seen where a single GPU runs twice as fast as 40 CPUs.
Dave Keller
Re: Using VASP 6.2 OpenACC GPU port
Posted: Tue Feb 23, 2021 2:44 pm
by david_keller
Hi Martijn,
Would you please clarify a couple of things for me?
Is it true that with 6.2 only a single executable needs to be compiled, which will use OpenMP if OMP_NUM_THREADS > 1 and will be GPU-capable if a GPU is available on the node?
If so, should you want to run NOT using GPUs on a node that has GPUs, is there a way to do so? Our experience is that the code will automatically use one if available.
Thanks for your help,
Dave Keller
LLE HPC
Re: Using VASP 6.2 OpenACC GPU port
Posted: Mon Mar 08, 2021 1:44 pm
by david_keller
I found out that setting CUDA_VISIBLE_DEVICES="" (null) will cause the OpenACC version to NOT look for GPUs.
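For example (the executable name and rank count are just illustrative):

export CUDA_VISIBLE_DEVICES=""    # hide all GPUs from the OpenACC runtime
mpirun -np 40 vasp_std            # CPU-only run on the 40 cores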
Re: Using VASP 6.2 OpenACC GPU port
Posted: Tue Jun 15, 2021 9:40 am
by mmarsman
Hi David,
Sorry for the delay! In answer to your last questions:
Yes, when you build with OpenACC and OpenMP you will end up with an executable that will use OpenMP when OMP_NUM_THREADS > 1 and will be GPU capable if available on the node.
In principle you can then forbid the use of GPUs by making them "invisible" in the manner you describe.
However, there are several instances in the code where OpenMP threading is inactive as soon as you specify -D_OPENACC (i.e., compile with OpenACC support).
So you will lose OpenMP related performance as soon as you compile with OpenACC support.
The most severe example is in the use of real-space projection operators: in the "normal" OpenMP version of the code, the work related to these real-space projectors is distributed over OpenMP threads, but as soon as you request OpenACC support (at compile time) this is no longer the case.
The idea of the OpenACC + OpenMP version is that OpenMP adds some additional CPU performance in those parts that have not been ported to the GPU (yet).
If you want the optimal OpenMP + MPI performance you should compile a dedicated executable *without* OpenACC support.
Cheers,
Martijn