
VASP 4.6 Parallel Hangs at Run Time

Posted: Fri Dec 15, 2006 3:20 pm
by GeorgetownARC
I have successfully compiled VASP for serial and parallel use. I can run serial jobs without any problem, but parallel jobs launch, then immediately hang. I compiled the parallel version with the following components:

Red Hat Enterprise Linux AS 4.0, update 4 (64-bit)
VASP 4.6
Portland pgf90 6.0-8 64-bit
mpich2-1.0.4p1
fftw-3.1.2
GotoBLAS-1.09

I have a job that runs fine using the serial version. When I launch the same job using the parallel version, vasp starts and then hangs (it uses no memory or CPU). There are no error messages in the output file or on the screen; in fact, there are no messages whatsoever, which makes this very hard to debug.

Here is the command that I use to launch MPICH2:

mpdboot -n 6 -f ../mpd.hosts

Here is the command that I use to launch parallel vasp:

mpiexec -machinefile freenodes -n 2 /home/jess/NewVaspSRC/vasp.4.6-parallel/3d-debug/vasp.4.6/vasp < /home/jess/SakuraVASP/POSCAR >/home/jess/SakuraVASP/jess_output

Here is the state of the MPICH2 and vasp programs:

jess 6851 0.0 0.3 87144 7692 ? S 10:14 0:00 python2.3 /opt/mpich2/bin/mpd.py --ncpus=1 -e -d
jess 6863 0.0 0.3 86156 6848 pts/3 S 10:15 0:00 python2.3 /opt/mpich2/bin/mpiexec -machinefile freenodes -n 2 /home/j
jess 6864 0.0 0.3 87148 7700 ? S 10:15 0:00 python2.3 /opt/mpich2/bin/mpd.py --ncpus=1 -e -d
jess 6865 0.0 0.3 87148 7700 ? S 10:15 0:00 python2.3 /opt/mpich2/bin/mpd.py --ncpus=1 -e -d
jess 6866 0.0 0.0 20504 1016 ? S 10:15 0:00 /home/jess/NewVaspSRC/vasp.4.6-parallel/3d-debug/vasp.4.6/vasp
jess 6867 0.0 0.0 20504 1092 ? S 10:15 0:00 /home/jess/NewVaspSRC/vasp.4.6-parallel/3d-debug/vasp.4.6/vasp

The output file is empty, even after I kill the job:

-rw-r--r-- 1 jess users 0 Dec 15 10:15 jess_output

I have tried recompiling the parallel vasp with debugging options, though I still don't get any messages. Here are the debugging settings that I added to vasp's Makefile:

FFLAGS = -Mfree -tp k8-64 -i8 -C -g

# Under the MPI section
CPP = $(CPP_) -DMPI -DHOST=\"LinuxIFC\" -DIFC \
-Dkind8 -DNGZhalf -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc \
-DMPI_BLOCK=500 \
-DRPROMU_DGEMV -DRACCMU_DGEMV -Ddebug

Does anyone know of additional debugging/verbose options that I can set so vasp will display any type of message?

VASP 4.6 Parallel Hangs at Run Time

Posted: Fri Dec 15, 2006 4:15 pm
by tjf
Firstly, you shouldn't try to stream POSCAR onto stdin. POSCAR, INCAR, etc., are picked up from the execution directory, just as in the serial case. I've no idea how mpich2 handles streamed input (stream handling has been an issue for me with various MPI implementations and codes).
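
For example (just a sketch, reusing the paths from your post; adjust as needed), cd into the directory that contains INCAR, POSCAR, KPOINTS and POTCAR and launch without redirecting stdin:

cd /home/jess/SakuraVASP
mpiexec -machinefile freenodes -n 2 /home/jess/NewVaspSRC/vasp.4.6-parallel/3d-debug/vasp.4.6/vasp > jess_output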

I assume you can run an MPI Hello World?
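
If not, here is the sort of minimal Fortran test I would start with (a sketch, not part of VASP; compile it with mpich2's mpif90 wrapper):

      program hello
      ! Minimal MPI check: each rank prints its rank and the total size.
      implicit none
      include 'mpif.h'
      integer :: ierr, rank, nprocs
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      write(*,*) 'Hello from rank', rank, 'of', nprocs
      call MPI_Finalize(ierr)
      end program hello

mpif90 -o hello hello.f90
mpiexec -machinefile freenodes -n 2 ./hello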

VASP 4.6 Parallel Hangs at Run Time

Posted: Sat Dec 16, 2006 3:07 am
by GeorgetownARC
Thank you for this tip. I now get an error message:

[cli_0]: aborting job:
Fatal error in MPI_Cart_sub: Invalid communicator, error stack:
MPI_Cart_sub(198): MPI_Cart_sub(MPI_COMM_NULL, remain_dims=0xa80b90, comm_new=0xcb04e0) failed
MPI_Cart_sub(80).: Null communicator
[cli_1]: aborting job:
Fatal error in MPI_Cart_sub: Invalid communicator, error stack:
MPI_Cart_sub(198): MPI_Cart_sub(MPI_COMM_NULL, remain_dims=0xa80b90, comm_new=0xcb04e0) failed
MPI_Cart_sub(80).: Null communicator

MPICH2 Hello World programs work, but I am double-checking that MPICH2 itself was compiled for 64-bit (or at least works correctly in a 64-bit environment), since it seems that others have hit these same errors. The default path for pgf90 points to the 64-bit compiler, but that doesn't guarantee that MPICH2 was built correctly for 64-bit.
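
For reference, one way to check (assuming MPICH2 is installed under /opt/mpich2, as the mpd.py path above suggests) is to ask the compiler wrapper what it actually calls and to look at a test binary it produces:

/opt/mpich2/bin/mpif90 -show              # prints the underlying pgf90 command and flags
/opt/mpich2/bin/mpif90 -o hello hello.f90
file hello                                # should report an ELF 64-bit x86-64 executable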

Thanks for your help. I'll post what I figure out.

VASP 4.6 Parallel Hangs at Run Time

Posted: Mon Jan 22, 2007 7:02 am
by job
You need to remove "-i8" from FFLAGS, since MPI expects 32-bit integers. Also, if you want to use fftw without "-i8", you need to change the code so that the kind of the integers used to store the fftw plans is large enough, i.e. 64 bits. You can find a patch for that floating around in this forum, courtesy of yours truly.
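
For anyone who would rather patch it by hand: an FFTW3 plan is really a C pointer, so on a 64-bit machine the Fortran variable holding it must be a 64-bit integer even when the default integer kind is 32 bits (i.e. without "-i8"). A minimal sketch of the idea (not the actual patch; link with -lfftw3):

      program fftw_plan_kind
      ! Sketch: store the FFTW3 plan handle in an explicit 64-bit integer.
      implicit none
      integer, parameter :: dp = kind(1.d0)
      integer, parameter :: FFTW_FORWARD = -1, FFTW_ESTIMATE = 64
      integer(kind=8) :: plan            ! holds a C pointer on 64-bit systems
      complex(dp) :: in(16), out(16)
      in = (0.d0, 0.d0)
      call dfftw_plan_dft_1d(plan, 16, in, out, FFTW_FORWARD, FFTW_ESTIMATE)
      call dfftw_execute(plan)
      call dfftw_destroy_plan(plan)
      end program fftw_plan_kind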