error - shared libraries: libqdmod.so.0 in NVIDIA HPC-SDK container

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.


amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

#1 Post by amihai_silverman1 » Tue Jul 04, 2023 6:32 am

I compiled vasp.6.4.1 in an NVIDIA HPC SDK container on an NVIDIA DGX cluster with A100 GPUs.
The container was downloaded from NVIDIA NGC; its version is nvidia+nvhpc+23.5-devel-cuda_multi-ubuntu22.04.
I used makefile.include.nvhpc_omp_acc.
The compilation inside the container completed successfully, but when I submit a job with Slurm using this container, I get an error:

/usr/local/vasp.6.4.1/bin/vasp_std: error while loading shared libraries: libqdmod.so.0: cannot open shared object file: No such file or directory
task 0: Exited with exit code 127

I observe that the library libqdmod.so.0 exists in the container :
# find /opt/nvidia/hpc_sdk/ -name libqdmod.so
/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/libqdmod.so

and its path is in the $LD_LIBRARY_PATH
# echo $LD_LIBRARY_PATH
/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/nvshmem/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/nccl/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/math_libs/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/cuda/extras/CUPTI/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/cuda/lib64::

I added the library path to the slurm submit script
export LD_LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib:$LD_LIBRARY_PATH
but the run failed with the same error.

Attached are the makefile.include, the slurm submit script and the output.txt error message.
Thank you, Amihai
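
Note: to see which LD_LIBRARY_PATH entry (if any) supplies a given library in the environment that actually runs the job, one can walk the path list. This is only a sketch: the helper name find_in_ld_path is made up for illustration, and it looks only at LD_LIBRARY_PATH, ignoring the loader cache and any RPATH stored in the binary.

```shell
# find_in_ld_path SONAME
# Print the first directory in LD_LIBRARY_PATH that contains SONAME.
# Returns non-zero if no entry contains it.
# (Hypothetical helper, not part of VASP or the HPC SDK.)
find_in_ld_path() {
    soname=$1
    old_ifs=$IFS
    IFS=:
    for dir in $LD_LIBRARY_PATH; do
        # Skip empty entries such as the ones produced by a trailing "::"
        if [ -n "$dir" ] && [ -e "$dir/$soname" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Usage, run in the same job step that starts vasp_std:
#   find_in_ld_path libqdmod.so.0 || echo "libqdmod.so.0 not on LD_LIBRARY_PATH"
```

If this prints nothing inside the Slurm step but prints the qd/lib directory in an interactive shell, the two environments differ.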

jonathan_lahnsteiner2
Global Moderator
Posts: 215
Joined: Fri Jul 01, 2022 2:17 pm

#2 Post by jonathan_lahnsteiner2 » Tue Jul 04, 2023 2:00 pm

Dear amihai_silverman1,

Could you please run the ldd command on your vasp executable?
The ldd command lists the shared libraries the executable was linked against
and shows the path each one resolves to.

Please put the following command in your job script:

Code: Select all

ldd /usr/local/vasp.6.4.1/bin/vasp_std

Maybe this already helps you resolve the issue.
Otherwise, please post the output and we will see how to proceed.

All the best Jonathan
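
Note: with a long dependency list it can help to filter the ldd output down to the unresolved entries. A minimal sketch follows; the function name missing_libs is made up for illustration, and it assumes the usual "name => not found" format that ldd prints for libraries it cannot resolve.

```shell
# missing_libs: read ldd output on stdin and print the names of the
# libraries that the dynamic loader could not resolve.
# (Hypothetical helper, not part of VASP or the HPC SDK.)
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage inside the job step (executable path taken from the post above):
#   ldd /usr/local/vasp.6.4.1/bin/vasp_std | missing_libs
```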

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

#3 Post by amihai_silverman1 » Wed Jul 05, 2023 8:05 am

Thank you for the reply.
Running ldd /usr/local/vasp.6.4.1/bin/vasp_std inside the container gives a list of libraries, all of which exist in the container.
Running ldd on vasp.6.4.1/bin/vasp_std outside the container shows that most of the libraries are missing.
How do I tell the Slurm script to use the libraries inside the container?
I have tried many options, but can't get it right. I will be grateful for your help.
Thanks, Amihai

jonathan_lahnsteiner2
Global Moderator
Posts: 215
Joined: Fri Jul 01, 2022 2:17 pm

#4 Post by jonathan_lahnsteiner2 » Wed Jul 05, 2023 8:35 am

Dear amihai_silverman1,

I am sorry, but with the information you have supplied I am not able to help you.
As already asked, could you please run the ldd command in your
Slurm job script and post the output in the forum? Based on the job script you sent, you should modify it as follows:

Code: Select all

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gpus=1
#SBATCH --qos=basic

export OMPI_ALLOW_RUN_AS_ROOT=1
export LD_LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib:$LD_LIBRARY_PATH

# List the shared libraries as they resolve inside the job's container
srun --container-image=/rg/spatari_prj/amihai/vasp/nvidia+nvhpc+vasp.sqsh --container-mounts=/rg/spatari_prj/amihai/vasp/NaCl:/home/NaCl --container-workdir=/home/NaCl ldd /usr/local/vasp.6.4.1/bin/vasp_std >& output_ldd.txt

# Run VASP itself
srun --container-image=/rg/spatari_prj/amihai/vasp/nvidia+nvhpc+vasp.sqsh --container-mounts=/rg/spatari_prj/amihai/vasp/NaCl:/home/NaCl --container-workdir=/home/NaCl mpirun -np 1 --allow-run-as-root /usr/local/vasp.6.4.1/bin/vasp_std >& output.txt

After running the job script, please post the file output_ldd.txt to the forum. I need this information to guide you further through your problem.

With many thanks and kind regards

Jonathan

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

#5 Post by amihai_silverman1 » Wed Jul 05, 2023 11:26 am

Hi Jonathan,
Attached are the output_ldd.txt and submit script.
I see there:
libqdmod.so.0 => not found
but in the container:

# ls -l /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib
total 1656
-rw-r--r-- 1 root root 562410 May 23 21:20 libqd.a
-rw-r--r-- 1 root root 971 May 23 21:20 libqd.la
lrwxrwxrwx 1 root root 14 May 23 21:47 libqd.so -> libqd.so.0.0.0
lrwxrwxrwx 1 root root 14 May 23 21:47 libqd.so.0 -> libqd.so.0.0.0
-rw-r--r-- 1 root root 313800 May 23 21:20 libqd.so.0.0.0
-rw-r--r-- 1 root root 2982 May 23 21:20 libqd_f_main.a
-rw-r--r-- 1 root root 1020 May 23 21:20 libqd_f_main.la
lrwxrwxrwx 1 root root 21 May 23 21:47 libqd_f_main.so -> libqd_f_main.so.0.0.0
lrwxrwxrwx 1 root root 21 May 23 21:47 libqd_f_main.so.0 -> libqd_f_main.so.0.0.0
-rw-r--r-- 1 root root 9240 May 23 21:20 libqd_f_main.so.0.0.0
-rw-r--r-- 1 root root 579318 May 23 21:20 libqdmod.a
-rw-r--r-- 1 root root 992 May 23 21:20 libqdmod.la
lrwxrwxrwx 1 root root 17 May 23 21:47 libqdmod.so -> libqdmod.so.0.0.0
lrwxrwxrwx 1 root root 17 May 23 21:47 libqdmod.so.0 -> libqdmod.so.0.0.0
-rw-r--r-- 1 root root 200968 May 23 21:20 libqdmod.so.0.0.0

ldd in an interactive bash in the container gives a different result:

# ldd /usr/local/vasp.6.4.1/bin/vasp_std
linux-vdso.so.1 (0x00007ffe07fb3000)
libqdmod.so.0 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/libqdmod.so.0 (0x00007fc0ba200000)
libqd.so.0 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/libqd.so.0 (0x00007fc0b9e00000)
liblapack_lp64.so.0 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/liblapack_lp64.so.0 (0x00007fc0b9000000)
libblas_lp64.so.0 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libblas_lp64.so.0 (0x00007fc0b7800000)
libfftw3.so.3 => /usr/lib/x86_64-linux-gnu/libfftw3.so.3 (0x00007fc0b75e5000)
libfftw3_omp.so.3 => /usr/lib/x86_64-linux-gnu/libfftw3_omp.so.3 (0x00007fc0ba476000)
libmpi_usempif08.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/11.8/hpcx/hpcx-2.14/ompi/lib/libmpi_usempif08.so.40 (0x00007fc0b7200000)
libmpi_usempi_ignore_tkr.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/11.8/hpcx/hpcx-2.14/ompi/lib/libmpi_usempi_ignore_tkr.so.40 (0x00007fc0b6e00000)
libmpi_mpifh.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/11.8/hpcx/hpcx-2.14/ompi/lib/libmpi_mpifh.so.40 (0x00007fc0b6a00000)
libmpi.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/11.8/hpcx/hpcx-2.14/ompi/lib/libmpi.so.40 (0x00007fc0b6600000)
libscalapack_lp64.so.2 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/11.8/hpcx/hpcx-2.14/ompi/lib/libscalapack_lp64.so.2 (0x00007fc0b5c00000)
libnvhpcwrapcufft.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libnvhpcwrapcufft.so (0x00007fc0b5800000)
libcufft.so.10 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/math_libs/11.0/lib64/libcufft.so.10 (0x00007fc0ab800000)
libcusolver.so.10 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/math_libs/11.0/lib64/libcusolver.so.10 (0x00007fc08a800000)
libcudaforwrapnccl.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libcudaforwrapnccl.so (0x00007fc08a400000)
libnccl.so.2 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/nccl/lib/libnccl.so.2 (0x00007fc07a000000)
libcublas.so.11 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/math_libs/11.0/lib64/libcublas.so.11 (0x00007fc074000000)
libcublasLt.so.11 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/math_libs/11.0/lib64/libcublasLt.so.11 (0x00007fc068e00000)
libcudaforwrapblas.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libcudaforwrapblas.so (0x00007fc068a00000)
libcudart.so.11.0 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/cuda/11.0/lib64/libcudart.so.11.0 (0x00007fc068600000)
libcudafor_110.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libcudafor_110.so (0x00007fc063a00000)
libcudafor.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libcudafor.so (0x00007fc063600000)
libacchost.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libacchost.so (0x00007fc063200000)
libaccdevaux110.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libaccdevaux110.so (0x00007fc062e00000)
libacccuda.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libacccuda.so (0x00007fc062a00000)
libcudadevice.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libcudadevice.so (0x00007fc062600000)
libcudafor2.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libcudafor2.so (0x00007fc062200000)
libnvf.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libnvf.so (0x00007fc061a00000)
libnvhpcatm.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libnvhpcatm.so (0x00007fc061600000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc0613d6000)
libnvomp.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libnvomp.so (0x00007fc060200000)
libnvcpumath.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libnvcpumath.so (0x00007fc05fc00000)
libnvc.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib/libnvc.so (0x00007fc05f800000)
libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007fc05f5d8000)
libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc0ba44c000)
libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007fc0ba119000)
libatomic.so.1 => /usr/lib/x86_64-linux-gnu/libatomic.so.1 (0x00007fc0ba442000)
libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc0ba43d000)
libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc0ba436000)
libopen-rte.so.40 => /usr/lib/x86_64-linux-gnu/libopen-rte.so.40 (0x00007fc0ba05c000)
libopen-pal.so.40 => /usr/lib/x86_64-linux-gnu/libopen-pal.so.40 (0x00007fc0b9d4d000)
libutil.so.1 => /usr/lib/x86_64-linux-gnu/libutil.so.1 (0x00007fc0ba431000)
libz.so.1 => /usr/lib/x86_64-linux-gnu/libz.so.1 (0x00007fc0ba040000)
librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x00007fc0ba42a000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc0ba48a000)
libhwloc.so.15 => /usr/lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007fc0b9cf1000)
libevent_core-2.1.so.7 => /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7 (0x00007fc0b9cbc000)
libevent_pthreads-2.1.so.7 => /usr/lib/x86_64-linux-gnu/libevent_pthreads-2.1.so.7 (0x00007fc0ba423000)
libudev.so.1 => /usr/lib/x86_64-linux-gnu/libudev.so.1 (0x00007fc0b9c92000)


Thanks, Amihai

jonathan_lahnsteiner2
Global Moderator
Posts: 215
Joined: Fri Jul 01, 2022 2:17 pm

#6 Post by jonathan_lahnsteiner2 » Wed Jul 05, 2023 11:47 am

Dear amihai_silverman1,

Thank you for submitting the output of the ldd command.
But I fear there is not much I can do for you. The output from your interactive bash
shell shows no issues: libqdmod.so.0 is found, as the following lines indicate:

Code: Select all

libqdmod.so.0 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/libqdmod.so.0 (0x00007fc0ba200000)
libqd.so.0 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/libqd.so.0 (0x00007fc0b9e00000)
Note that both of these files are under the folder:

Code: Select all

 /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras
When you executed the vasp code from a Slurm job script, you got the output:

Code: Select all

error while loading shared libraries: libqdmod.so.0: cannot open shared object file: No such file or directory
And when running the ldd command from within your job script, you got:

Code: Select all

libqdmod.so.0 => not found
libqd.so.0 => not found 
My guess is now that you have access rights to the folder

Code: Select all

 /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras
from your interactive shell, but it seems you do not have access to the same folder from within your job script. That would explain the discrepancy between the output in the job script and in the interactive shell. I recommend talking to your system administrator and showing them the information you have already gathered.
I am sorry that I cannot be of more help.

All the best Jonathan
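
Note: one way to make the discrepancy between the interactive shell and the job step concrete is to compare the two ldd listings after stripping the load addresses, which change from run to run. The sketch below assumes both listings were saved to files. The helper names normalize_ldd and compare_ldd and the file name ldd_interactive.txt are illustrative only; output_ldd.txt is the file written by the job script earlier in this thread.

```shell
# normalize_ldd FILE
# Strip the "(0x...)" load address and leading whitespace from each ldd
# line and sort the result, so two listings can be compared line by line.
normalize_ldd() {
    sed -E 's/[[:space:]]*\(0x[0-9a-f]+\)[[:space:]]*$//; s/^[[:space:]]+//' "$1" | sort
}

# compare_ldd FILE_A FILE_B
# Show entries that differ between two saved ldd listings.
# Lines starting with "<" come from FILE_A, lines with ">" from FILE_B.
compare_ldd() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    normalize_ldd "$1" > "$tmp_a"
    normalize_ldd "$2" > "$tmp_b"
    diff "$tmp_a" "$tmp_b"
    rc=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$rc"
}

# Usage (the first file name is an assumption):
#   compare_ldd ldd_interactive.txt output_ldd.txt
```

Only the libraries that resolve differently in the two environments remain in the output, which is a compact thing to hand to a system administrator.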

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

#7 Post by amihai_silverman1 » Wed Jul 05, 2023 12:43 pm

Thank you
Is there a way to compile vasp using the static libraries libqdmod and libqd?
This may solve the problem since only these two are missing.
Amihai

jonathan_lahnsteiner2
Global Moderator
Posts: 215
Joined: Fri Jul 01, 2022 2:17 pm

#8 Post by jonathan_lahnsteiner2 » Wed Jul 05, 2023 1:32 pm

Dear amihai_silverman1,

I would strongly discourage you from compiling the library yourself, since this would mean compiling
the whole NVIDIA HPC SDK package. I strongly recommend asking your system administrator why you do not have
access to the folder /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras or one of its subfolders when accessing it from your Slurm script.

Another possibility I can think of is to copy the files

Code: Select all

libqdmod.so.0 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/libqdmod.so.0
libqd.so.0 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/libqd.so.0
from the interactive shell to some other location.
Note that the files must keep the same names, so don't rename libqdmod.so.0 to something else when copying.
Then add the directory you copied the files to to LD_LIBRARY_PATH in your Slurm script.
As an example:

Code: Select all

export LD_LIBRARY_PATH=/ABSOLUTE_PATH_WHERE_YOU_COPIED_libqdmod.so_TO/:$LD_LIBRARY_PATH
I hope this is of help.
All the best Jonathan
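
Note: the copy step can be sketched as follows. In the container, libqdmod.so.0 and libqd.so.0 are symbolic links to the .so.0.0.0 files, so cp -L is used here to follow the links and produce regular files that keep the names the loader asks for. The function name stage_qd_libs and the destination directory are placeholders, not something taken from this thread.

```shell
# stage_qd_libs SRC_DIR DEST_DIR
# Copy the two qd runtime libraries from SRC_DIR to DEST_DIR under the
# exact names requested by the executable, dereferencing symlinks.
# (Hypothetical helper, not part of VASP or the HPC SDK.)
stage_qd_libs() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for lib in libqdmod.so.0 libqd.so.0; do
        cp -L "$src/$lib" "$dest/$lib" || return 1
    done
}

# Usage inside the container (the destination path is a placeholder):
#   stage_qd_libs /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib /home/NaCl/qdlibs
#   export LD_LIBRARY_PATH=/home/NaCl/qdlibs:$LD_LIBRARY_PATH
```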

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

#9 Post by amihai_silverman1 » Sun Jul 09, 2023 5:15 am

Hello,
I managed to solve this problem by setting makefile.include to use the static libraries qd and qdmod. I put in makefile.include the lines :

QD ?= $(NVROOT)/compilers/extras/qd
LLIBS += -L/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib -Wl,-Bstatic -lqdmod -lqd -Wl,-Bdynamic
INCS += -I$(QD)/include/qd

Now vasp runs but gives an error.
Even when I run the simple H2O example from the tutorial interactively, I get the following error:

# /usr/local/vasp.6.4.1/bin/vasp_std
running 1 mpi-ranks, with 8 threads/rank, on 1 nodes
distrk: each k-point on 1 cores, 1 groups
distr: one band on 1 cores, 1 groups
OpenACC runtime initialized ... 1 GPUs detected

Code: Select all

 -----------------------------------------------------------------------------
|                     _     ____    _    _    _____     _                     |
|                    | |   |  _ \  | |  | |  / ____|   | |                    |
|                    | |   | |_) | | |  | | | |  __    | |                    |
|                    |_|   |  _ <  | |  | | | | |_ |   |_|                    |
|                     _    | |_) | | |__| | | |__| |    _                     |
|                    (_)   |____/   \____/   \_____|   (_)                    |
|                                                                             |
|     internal error in: mpi.F  at line: 898                                  |
|                                                                             |
|     M_init_nccl: Error in ncclCommInitRank                                  |
|                                                                             |
|     If you are not a developer, you should not encounter this problem.      |
|     Please submit a bug report.                                             |
|                                                                             |
 -----------------------------------------------------------------------------
Warning: ieee_invalid is signaling
Warning: ieee_divide_by_zero is signaling
Warning: ieee_inexact is signaling
1

Tnx, Amihai

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

#10 Post by amihai_silverman1 » Tue Jul 11, 2023 11:20 am

I submitted the last problem to the Bugreports forum.

jonathan_lahnsteiner2
Global Moderator
Posts: 215
Joined: Fri Jul 01, 2022 2:17 pm

#11 Post by jonathan_lahnsteiner2 » Tue Jul 11, 2023 3:56 pm

Dear amihai_silverman1,

The error you are reporting is a result of vasp not being compiled properly.
As suggested in my previous post, you could try copying the files
libqdmod.so.0 and libqd.so.0 to some location that can be accessed from your
Slurm job script.

All the best

Jonathan

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

#12 Post by amihai_silverman1 » Wed Jul 12, 2023 7:40 am

Hi,
Previously I did exactly that, but it didn't work.
The compilation finds the other libraries in /usr/lib/x86_64-linux-gnu, but it doesn't find libqdmod, and I can't understand why:

cp -p /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/* /usr/lib/x86_64-linux-gnu
cp -rp /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/include/qd /usr/include/x86_64-linux-gnu

ls -l /usr/lib/x86_64-linux-gnu/libqd*
-rw-r--r-- 1 root root 971 May 23 21:20 /usr/lib/x86_64-linux-gnu/libqd.la
lrwxrwxrwx 1 root root 14 Feb 20 2022 /usr/lib/x86_64-linux-gnu/libqd.so.0 -> libqd.so.0.0.0
-rw-r--r-- 1 root root 191152 Feb 20 2022 /usr/lib/x86_64-linux-gnu/libqd.so.0.0.0
-rw-r--r-- 1 root root 1020 May 23 21:20 /usr/lib/x86_64-linux-gnu/libqd_f_main.la
lrwxrwxrwx 1 root root 21 Feb 20 2022 /usr/lib/x86_64-linux-gnu/libqd_f_main.so.0 -> libqd_f_main.so.0.0.0
-rw-r--r-- 1 root root 14360 Feb 20 2022 /usr/lib/x86_64-linux-gnu/libqd_f_main.so.0.0.0
-rw-r--r-- 1 root root 992 May 23 21:20 /usr/lib/x86_64-linux-gnu/libqdmod.la
lrwxrwxrwx 1 root root 17 Feb 20 2022 /usr/lib/x86_64-linux-gnu/libqdmod.so.0 -> libqdmod.so.0.0.0
-rw-r--r-- 1 root root 154640 Feb 20 2022 /usr/lib/x86_64-linux-gnu/libqdmod.so.0.0.0

The compilation fails :

mpif90 -acc -gpu=cc60,cc70,cc80,cuda11.0 -mp -c++libs -o vasp c2f_interface.o nccl2for.o simd.o base.o profiling.o string.o tutor.o version.o build_info.o command_line.o vhdf5_base.o incar_reader.o reader_base.o openmp.o openacc_struct.o mpi.o mpi_shmem.o mathtools.o bse_struct.o hamil_struct.o radial_struct.o pseudo_struct.o mgrid_struct.o wave_struct.o nl_struct.o mkpoints_struct.o poscar_struct.o afqmc_struct.o fock_glb.o chi_glb.o smart_allocate.o xml.o extpot_glb.o constant.o ml_ff_c2f_interface.o ml_ff_prec.o ml_ff_string.o ml_ff_tutor.o ml_ff_constant.o ml_ff_mpi_help.o ml_ff_neighbor.o ml_ff_taglist.o ml_ff_struct.o ml_ff_mpi_shmem.o vdwforcefield_glb.o jacobi.o main_mpi.o openacc.o scala.o asa.o lattice.o poscar.o ini.o mgrid.o ml_asa2.o ml_ff_mpi.o ml_ff_helper.o ml_ff_logfile.o ml_ff_math.o ml_ff_iohandle.o ml_ff_memory.o ml_ff_abinitio.o ml_ff_ff2.o ml_ff_ff.o ml_ff_mlff.o setex_struct.o xclib.o vdw_nl.o xclib_grad.o setex.o radial.o pseudo.o gridq.o ebs.o symlib.o mkpoints.o random.o wave.o wave_mpi.o wave_high.o bext.o spinsym.o symmetry.o lattlib.o nonl.o nonlr.o nonl_high.o dfast.o choleski2.o mix.o hamil.o xcgrad.o xcspin.o potex1.o potex2.o constrmag.o cl_shift.o relativistic.o LDApU.o paw_base.o metagga.o egrad.o pawsym.o pawfock.o pawlhf.o diis.o rhfatm.o hyperfine.o fock_ace.o paw.o mkpoints_full.o charge.o Lebedev-Laikov.o stockholder.o dipol.o solvation.o scpc.o pot.o fermi_energy.o tet.o dos.o elf.o hamil_rot.o bfgs.o dynmat.o instanton.o lbfgs.o sd.o cg.o dimer.o bbm.o fire.o lanczos.o neb.o qm.o pyamff_fortran/*.o ml_pyamff.o opt.o chain.o dyna.o fileio.o vhdf5.o sphpro.o us.o core_rel.o aedens.o wavpre.o wavpre_noio.o broyden.o dynbr.o reader.o writer.o xml_writer.o brent.o stufak.o opergrid.o stepver.o fast_aug.o fock_multipole.o fock.o fock_dbl.o fock_frc.o mkpoints_change.o subrot_cluster.o sym_grad.o mymath.o npt_dynamics.o subdftd3.o subdftd4.o internals.o dynconstr.o dimer_heyden.o dvvtrajectory.o vdwforcefield.o nmr.o pead.o 
k-proj.o subrot.o subrot_scf.o paircorrection.o rpa_force.o ml_reader.o ml_interface.o force.o pwlhf.o gw_model.o optreal.o steep.o rmm-diis.o davidson.o david_inner.o root_find.o lcao_bare.o locproj.o electron_common.o electron.o rot.o electron_all.o shm.o pardens.o optics.o constr_cell_relax.o stm.o finite_diff.o elpol.o hamil_lr.o rmm-diis_lr.o subrot_lr.o lr_helper.o hamil_lrf.o elinear_response.o ilinear_response.o linear_optics.o setlocalpp.o wannier.o electron_OEP.o electron_lhf.o twoelectron4o.o gauss_quad.o m_unirnk.o minimax_ini.o minimax_dependence.o minimax_functions1D.o minimax_functions2D.o minimax_struct.o minimax_varpro.o minimax.o umco.o mlwf.o ratpol.o pade_fit.o screened_2e.o wave_cacher.o crpa.o chi_base.o wpot.o local_field.o ump2.o ump2kpar.o fcidump.o ump2no.o bse_te.o bse.o time_propagation.o acfdt.o afqmc.o rpax.o chi.o acfdt_GG.o dmft.o GG_base.o greens_orbital.o lt_mp2.o rnd_orb_mp2.o greens_real_space.o chi_GG.o chi_super.o sydmat.o rmm-diis_mlr.o linear_response_NMR.o wannier_interpol.o wave_interpolate.o linear_response.o auger.o dmatrix.o phonon.o wannier_mats.o elphon.o core_con_mat.o embed.o extpot.o rpa_high.o fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o main.o -Llib -ldmy -Lparser -lparser -cudalib=cublas,cusolver,cufft,nccl -cuda -L/usr/lib/x86_64-linux-gnu -lfftw3 -lfftw3_omp -Mscalapack -llapack -lblas -lqdmod -lqd
/usr/bin/ld: cannot find -lqdmod: No such file or directory
/usr/bin/ld: cannot find -lqd: No such file or directory
pgacclnk: child process exit status 1: /usr/bin/ld
make[2]: *** [makefile:132: vasp] Error 2
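
Note on the linker messages: for -lqdmod, GNU ld searches each -L directory for a file named libqdmod.so or libqdmod.a. The versioned name libqdmod.so.0 is what the run-time loader uses, not the link step. The ls listing above shows neither libqdmod.so nor libqdmod.a (and likewise for libqd) in /usr/lib/x86_64-linux-gnu, which may be why the link fails. The sketch below checks a directory for the names the linker needs; has_link_name is a hypothetical helper.

```shell
# has_link_name DIR NAME
# Succeed if DIR contains a file the linker would accept for -lNAME,
# that is, libNAME.so or libNAME.a.
# (Hypothetical helper, not part of VASP or the HPC SDK.)
has_link_name() {
    dir=$1
    name=$2
    [ -e "$dir/lib$name.so" ] || [ -e "$dir/lib$name.a" ]
}

# Usage:
#   for n in qdmod qd; do
#       has_link_name /usr/lib/x86_64-linux-gnu "$n" || echo "no link-time file for -l$n"
#   done
```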

jonathan_lahnsteiner2
Global Moderator
Posts: 215
Joined: Fri Jul 01, 2022 2:17 pm

#13 Post by jonathan_lahnsteiner2 » Wed Jul 12, 2023 8:04 am

Dear amihai_silverman1,

Did you export LD_LIBRARY_PATH with the path to the NVIDIA files before compiling? I think this is what the
compiler is telling you.
In principle you already had a compiled vasp version in your first post with dynamic linking.
You only had a problem when executing the code because the two files

Code: Select all

/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/libqdmod.so.0
/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/libqd.so.0
were not accessible from your job script.
So I recommended that you talk to your system administrator to get access to this folder from your job script.
The other possibility was to copy the two files

Code: Select all

/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/libqdmod.so.0
/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib/libqd.so.0
to some location you have access to and export this path in your Slurm job script.

Code: Select all

export LD_LIBRARY_PATH=/ABSOLUTE_PATH_WHERE_YOU_COPIED_libqdmod.so_TO/:$LD_LIBRARY_PATH
Could you please try this? Then you would not have to recompile vasp.
I hope this works. If it does not, please contact us again and send the standard output of the Slurm job script
and the output of the ldd command.

Code: Select all

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gpus=1
#SBATCH --qos=basic

export OMPI_ALLOW_RUN_AS_ROOT=1
export LD_LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/extras/qd/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/ABSOLUTE_PATH_WHERE_YOU_COPIED_libqdmod.so_TO/:$LD_LIBRARY_PATH

# List the shared libraries as they resolve inside the job's container
srun --container-image=/rg/spatari_prj/amihai/vasp/nvidia+nvhpc+vasp.sqsh --container-mounts=/rg/spatari_prj/amihai/vasp/NaCl:/home/NaCl --container-workdir=/home/NaCl ldd /usr/local/vasp.6.4.1/bin/vasp_std >& output_ldd.txt

# Run VASP itself
srun --container-image=/rg/spatari_prj/amihai/vasp/nvidia+nvhpc+vasp.sqsh --container-mounts=/rg/spatari_prj/amihai/vasp/NaCl:/home/NaCl --container-workdir=/home/NaCl mpirun -np 1 --allow-run-as-root /usr/local/vasp.6.4.1/bin/vasp_std >& output.txt
All the best Jonathan

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

#14 Post by amihai_silverman1 » Wed Jul 12, 2023 11:57 am

Hi, you are correct.

I started from the beginning: I downloaded vasp.6.4.1 and put it in the container nvidia+nvhpc+23.5-devel-cuda_multi-ubuntu22.04.sqsh as downloaded from NVIDIA.
I installed libfftw3-3, used makefile.include.nvhpc_acc, and added the fftw library path there.

The compilation ran with no errors, but now when I run the H2O example I get:

Code: Select all

/H2O# mpirun  --allow-run-as-root -np 1  /usr/local/vasp.6.4.1/bin/vasp_std
 running    1 mpi-ranks, on    1 nodes
 distrk:  each k-point on    1 cores,    1 groups
 distr:  one band on    1 cores,    1 groups
 OpenACC runtime initialized ...    1 GPUs detected
 -----------------------------------------------------------------------------
|                     _     ____    _    _    _____     _                     |
|                    | |   |  _ \  | |  | |  / ____|   | |                    |
|                    | |   | |_) | | |  | | | |  __    | |                    |
|                    |_|   |  _ <  | |  | | | | |_ |   |_|                    |
|                     _    | |_) | | |__| | | |__| |    _                     |
|                    (_)   |____/   \____/   \_____|   (_)                    |
|                                                                             |
|     internal error in: mpi.F  at line: 898                                  |
|                                                                             |
|     M_init_nccl: Error in ncclCommInitRank                                  |
|                                                                             |
|     If you are not a developer, you should not encounter this problem.      |
|     Please submit a bug report.                                             |
|                                                                             |
 -----------------------------------------------------------------------------

Warning: ieee_inexact is signaling
    1
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[15360,1],0]
Exit code: 1
--------------------------------------------------------------------------

Thank you for your help,
Amihai
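
Note: the VASP message only says that ncclCommInitRank failed. NCCL itself can print more detail when the environment variable NCCL_DEBUG is set to INFO. Whether that pinpoints the cause here is an assumption, not something established in this thread. A small wrapper, with the hypothetical name with_nccl_debug, that sets the variable for a single command:

```shell
# with_nccl_debug COMMAND [ARGS...]
# Run COMMAND with NCCL_DEBUG=INFO set only for that command, so the
# surrounding shell environment stays unchanged.
# (Hypothetical helper, not part of VASP or the HPC SDK.)
with_nccl_debug() {
    env NCCL_DEBUG=INFO "$@"
}

# Usage (command line taken from the post above):
#   with_nccl_debug mpirun --allow-run-as-root -np 1 /usr/local/vasp.6.4.1/bin/vasp_std
```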

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

#15 Post by amihai_silverman1 » Thu Jul 13, 2023 6:41 am

One more comment:
Previously I compiled the same code on a CPU HPC cluster using the Intel oneAPI compilers, and it runs properly.
Since we need more compute power, I am now trying to compile the same code on an NVIDIA DGX cluster inside an HPC SDK container, as recommended in the vasp installation instructions. The compilation completes with no errors, but running an example gives an error.
Could there be some inconsistency between this code and the compilers provided by NVIDIA in the latest HPC SDK version, 23.5-devel-cuda_multi-ubuntu22.04?
