parallel (mpi) problem on BG/L

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.


ng
Newbie
Posts: 4
Joined: Mon Oct 08, 2007 3:00 am
License Nr.: 168 (no PAW so far, request for upgrade pending)

parallel (mpi) problem on BG/L

#1 Post by ng » Tue Oct 30, 2007 12:29 am

Hi,
I now have a serial version of VASP working on the Blue Gene architecture (IBM xlf90 compiler). The parallel version runs up to a point, at which I suspect it wants to start farming out work to the slave nodes, and then stalls without crashing until it times out. This may well turn out to be an MPI error, but there are a few things in the output from VASP itself that someone may be able to help with.
With mpirun, the job starts normally and stdout gets as far as
entering main loop
N E dE d eps ncg rms rms(c)
and then (with mpirun verbosity turned up) it prints out
<Oct 30 13:57:16.072298> BRIDGE (Debug): rm_get_job() - calling BGLDB::getJobArgs to get job arguments from job table
<Oct 30 13:57:21.175250> BRIDGE (Debug): rm_get_job() - calling BGLDB::getJobInfo()
repeatedly until the job is killed or times out.

OUTCAR gets as far as
.....
Broyden mixing: mesh for mixing (old mesh)
NGX = 9 NGY = 9 NGZ = 9
(NGX = 40 NGY = 40 NGZ = 40)
gives a total of 729 points
initial charge density was supplied:
charge density of overlapping atoms calculated
number of electron 10.0000000 magnetization 2.0000000
keeping initial charge density in first step


--------------------------------------------------------------------------------------------------------


Maximum index for augmentation-charges 10351 (set IRDMAX)


--------------------------------------------------------------------------------------------------------


First call to EWALD: gamma= 0.645
Maximum number of real-space cells 3x 3x 3
Maximum number of reciprocal cells 3x 3x 3

FEWALD: VPU time********: CPU time 0.01

One question is whether the lack of a number for the VPU time is significant. This also seems to be the case for the serial version, so I am guessing that the standard timing routines don't exist on the BG/L?

By compiling with -Ddebug I can see from the output that the code gets as far as:

orthch done
projections done
wavpre is ok
entering main loop
N E dE d eps ncg rms rms(c)
electron entered
<rho*excgc> = -10.59939013831403
<rho*vxcgc> = -6.68064783212862
xcencgc = -3.91874230618541
LDA: EXC, XCF, CVZERO NAN NAN (NAN,0.000000000000000000E+00)
potlok is ok
setdij is ok
1 0.0000 NAN 0.30E+01E NANR<Oct 30 11:10:46.066242> BRIDGE (Debug): rm_get_job() - calling BGLDB::getJobInfo()
<Oct 30 11:10:46.067373> BRIDGE (Debug): rm_get_job() - calling BGLDB::getJobArgs to get job arguments from job table
<Oct 30 11:10:51.170265> BRIDGE (Debug): rm_get_job() - calling BGLDB::getJobInfo()
<Oct 30 11:10:51.171160> BRIDGE (Debug): rm_get_job() - calling BGLDB::getJobArgs to get job arguments from job table
<Oct 30 11:10:56.274541> BRIDGE (Debug): rm_get_job() - calling BGLDB::getJobInfo()

I assume that the NANs have something to do with the problem, but would appreciate any suggestions for figuring this one out.
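To narrow this down, one thing I am considering is a small NaN check called on the charge-density and exchange-correlation work arrays before and after the xc routines. The following is only a minimal sketch (the helper and where to call it are my own idea, not anything in VASP), and it relies on the Fortran rule that x /= x is true only for NaN, so it should be compiled without aggressive optimization in case the compiler folds the test away:

! Hypothetical helper, not part of VASP: reports whether any element
! of a is NaN.  Plain F90 only, since the F2003 ieee_arithmetic module
! may not be available with blrts_xlf90.
logical function has_nan(a, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n)
  integer :: i
  has_nan = .false.
  do i = 1, n
     if (a(i) /= a(i)) then   ! true only for NaN
        has_nan = .true.
        return
     end if
  end do
end function has_nan

Calling it on the input arrays just before the gradient-corrected xc routine and on the results just after should show whether the NaNs are produced there or already arrive with the data, i.e. from the parallel mapping of the charge density onto the nodes.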

There is also some printout to do with node communication. I am not sure whether it is helpful, but I will paste it here just in case.

node 1 to 1 data 8000
node 1 to 1 data 8000
map done 1 1720 0 8000 0 1720 8000 8000
map done 1 1720 0 8000 0 1720 8000 8000
map done 1 1720 0 8000 0 1720 8000 8000
POSCAR, INCAR and KPOINTS ok, starting setup
grid set up return 1 40 40 40 8400
<Oct 30 13:57:10.967152> BRIDGE (Debug): rm_get_job() - calling BGLDB::getJobArgs to get job arguments from job table
grid set up return 1 40 40 40 8400
grid set up return 1 20 20 20 1100
grid set up return 1 40 40 40 8400
call to genlay
gen_layout 1 1720 8000 8000
call to genind
WARNING: wrap around errors must be expected
gen_index done 1 460
node 1 to 1 data 8000
map done 1 1720 0 8000 0 1720 8000 8000
mapset aug done
node 1 to 1 data 1100
node 2 to 1 data 1100
node 4 to 1 data 1100
node 3 to 1 data 1100
node 2 to 2 data 0
node 1 to 2 data 0
node 4 to 2 data 0
node 3 to 2 data 0
node 1 to 3 data 0
node 4 to 3 data 0
node 2 to 3 data 0
node 3 to 3 data 0
node 2 to 4 data 0
node 1 to 4 data 0
node 4 to 4 data 0
node 3 to 4 data 0
map done 2 1100 0 1100 0 1100 1100 0
map done 3 1100 0 1100 0 1100 1100 0
map done 4 1100 0 1100 0 1100 1100 0
map done 1 1100 0 1100 0 1100 1100 8000
mapset soft done
node 2 to 1 data 0
node 4 to 1 data 0
node 3 to 1 data 0
<Oct 30 13:57:10.968080> BE_MPI (Debug): Job 778 switched to state RUNNING ('R'). Still waiting...
node 1 to 1 data 8400
node 1 to 2 data 0
node 4 to 2 data 0
node 3 to 2 data 0
node 2 to 2 data 8400
node 2 to 3 data 0
node 4 to 3 data 0
node 1 to 3 data 0
node 3 to 3 data 8400
node 1 to 4 data 0
node 3 to 4 data 0
node 2 to 4 data 0
node 4 to 4 data 8400
map done 1 8400 0 8400 0 8400 8400 16000
map done 4 8400 0 8400 0 8400 8400 16000
map done 2 8400 0 8400 0 8400 8400 16000
map done 3 8400 0 8400 0 8400 8400 16000
mapset wave done
node 2 to 1 data 0
node 4 to 1 data 0
node 3 to 1 data 0
node 1 to 1 data 8400
node 1 to 2 data 0
node 4 to 2 data 0
node 3 to 2 data 0
node 2 to 2 data 8400
node 2 to 3 data 0
node 4 to 3 data 0
node 1 to 3 data 0
node 3 to 3 data 8400
node 1 to 4 data 0
node 3 to 4 data 0
node 2 to 4 data 0
node 4 to 4 data 8400
map done 2 8400 0 8400 0 8400 8400 16000
map done 1 8400 0 8400 0 8400 8400 16000
map done 3 8400 0 8400 0 8400 8400 16000
map done 4 8400 0 8400 0 8400 8400 16000
allocation done


Here is the makefile:
.SUFFIXES: .inc .f .F
#-----------------------------------------------------------------------
# Makefile for linux on Blue Fern
#
# =======================
# radial.F must be compiled with -O2
# nonl.F must be compiled with -O
# paw.F compiled with -O1
#
# ZHEEVX is not working properly, so please uncomment the line
# #define USE_ZHEEVX
# in subrot.F and wavpre_noio.F
#
#-----------------------------------------------------------------------

# all CPP processed fortran files have the extension .f
SUFFIX=.f

#-----------------------------------------------------------------------
# fortran compiler and linker
#-----------------------------------------------------------------------
FC=blrts_xlf90
FCL=$(FC)

#-----------------------------------------------------------------------
# C-preprocessor define any of the flags given below
# NGXhalf charge density reduced in X direction
# wNGXhalf gamma point only reduced in X direction
# CACHE_SIZE 5001 for SP3 and Power 3
# 32768 for 550,590,3CT
# 8001 595/397 quad word systems
#-----------------------------------------------------------------------
CPP_ = /usr/bin/cpp -P -C



#-----------------------------------------------------------------------
# general fortran flags, none required
#-----------------------------------------------------------------------

FFLAGS = -g

#-----------------------------------------------------------------------
# optimization:
# optimise for the machine on which the code is compiled
#-----------------------------------------------------------------------
# for blue gene
OFLAG = -O3 -qarch=440d -qtune=440 -qipa -q32 -qfree=f90 -qessl \
-qstrict -qhot
#-----------------------------------------------------------------------

OFLAG_HIGH = $(OFLAG)
OBJ_HIGH = none
OBJ_NOOPT = none
DEBUG =
INCS = -I/bgl/BlueLight/ppcfloor/bglsys/include/
INLINE = $(OFLAG) -Q+dfro1,+dfro2,+dfq1,+dfq2,+fun,+expw,+cpw,+CORLSD,+GCOR,+cpwsp

# just in case of testing the f77 fft routines
FFLAGS_F77= -qautodbl=dblpad -qdpc=e -O3 -qarch=auto

#-----------------------------------------------------------------------
# options for linking
# the following option increases the size of the data frame
#-----------------------------------------------------------------------
#LINK = -Wl,-bD:1000000000 -qipa -v

LINK = -g

#-----------------------------------------------------------------------
# specify 3d-fft to be used with VASP
# fft3dessl is usually fastest on the IBM; however, fft3dfurth comes
# very close and is faster for 2^n grids
#-----------------------------------------------------------------------


# FFT: fftmpi.o with fft3dlib of Juergen Furthmueller
FFT3D = fftmpi.o fftmpi_map.o fft3dlib.o



#-----------------------------------------------------------------------
# fortran linker for mpi:
#-----------------------------------------------------------------------

FC=blrts_xlf90
FCL=$(FC)

#-----------------------------------------------------------------------
# additional options for CPP in parallel version (see also above):
# NGZhalf charge density reduced in Z direction
# wNGZhalf gamma point only reduced in Z direction
# scaLAPACK use scaLAPACK (usually slower on 100 Mbit Net)
#-----------------------------------------------------------------------

CPP = $(CPP_) -DMPI -Dessl -DNGZhalf -DHOST=\"IBML\" -DMPI_BLOCK=500\
-Dkind8 -DCACHE_SIZE=0 -Davoidalloc \
-DRPROMU_DGEMV -DRACCMU_DGEMV \
-DWAVECAR_double -Ddebug \
$*.F >$*$(SUFFIX)




#-----------------------------------------------------------------------
# libraries for mpi
#-----------------------------------------------------------------------

LIB = \
-L../vasp.4.lib_bf -ldmy \
../vasp.4.lib_bf/lapack_double.o \
../vasp.4.lib_bf/linpack_double.o \
-L/opt/ibmmath/lib/ -lesslbg \
-L/bgl/BlueLight/ppcfloor/bglsys/lib/ \
-lmpich.rts -lmsglayer.rts -lrts.rts -ldevices.rts \




#-----------------------------------------------------------------------
# general rules and compile lines
#-----------------------------------------------------------------------
BASIC= symmetry.o symlib.o lattlib.o random.o

SOURCE= base.o mpi.o smart_allocate.o xml.o \
constant.o jacobi.o main_mpi.o scala.o \
asa.o lattice.o poscar.o ini.o setex.o radial.o \
pseudo.o mgrid.o mkpoints.o wave.o wave_mpi.o $(BASIC) \
nonl.o nonlr.o dfast.o choleski2.o \
mix.o charge.o xcgrad.o xcspin.o potex1.o potex2.o \
metagga.o constrmag.o pot.o cl_shift.o force.o dos.o elf.o \
tet.o hamil.o steep.o \
chain.o dyna.o relativistic.o LDApU.o sphpro.o paw.o us.o \
ebs.o wavpre.o wavpre_noio.o broyden.o \
dynbr.o rmm-diis.o reader.o writer.o tutor.o xml_writer.o \
brent.o stufak.o fileio.o opergrid.o stepver.o \
dipol.o xclib.o chgloc.o subrot.o optreal.o davidson.o \
edtest.o electron.o shm.o pardens.o paircorrection.o \
optics.o constr_cell_relax.o stm.o finite_diff.o \
elpol.o setlocalpp.o


vasp: $(SOURCE) $(FFT3D) $(INC) main.o
	rm -f vasp
	$(FCL) -o vasp $(LINK) main.o $(SOURCE) $(FFT3D) $(LIB)
makeparam: $(SOURCE) $(FFT3D) makeparam.o main.F $(INC)
	$(FCL) -o makeparam $(LINK) makeparam.o $(SOURCE) $(FFT3D) $(LIB)
zgemmtest: zgemmtest.o base.o random.o $(INC)
	$(FCL) -o zgemmtest $(LINK) zgemmtest.o random.o base.o $(LIB)
dgemmtest: dgemmtest.o base.o random.o $(INC)
	$(FCL) -o dgemmtest $(LINK) dgemmtest.o random.o base.o $(LIB)
ffttest: base.o smart_allocate.o mpi.o mgrid.o random.o ffttest.o $(FFT3D) $(INC)
	$(FCL) -o ffttest $(LINK) ffttest.o mpi.o mgrid.o random.o smart_allocate.o base.o $(FFT3D) $(LIB)
kpoints: $(SOURCE) $(FFT3D) makekpoints.o main.F $(INC)
	$(FCL) -o kpoints $(LINK) makekpoints.o $(SOURCE) $(FFT3D) $(LIB)

clean:
	-rm -f *.g *.f *.o *.L *.mod ; touch *.F

main.o: main$(SUFFIX)
	$(FC) $(FFLAGS) $(DEBUG) $(INCS) -c main$(SUFFIX)
xcgrad.o: xcgrad$(SUFFIX)
	$(FC) $(FFLAGS) $(INLINE) $(INCS) -c xcgrad$(SUFFIX)
xcspin.o: xcspin$(SUFFIX)
	$(FC) $(FFLAGS) $(INLINE) $(INCS) -c xcspin$(SUFFIX)

makeparam.o: makeparam$(SUFFIX)
	$(FC) $(FFLAGS) $(DEBUG) $(INCS) -c makeparam$(SUFFIX)

makeparam$(SUFFIX): makeparam.F main.F
#
# MIND: I do not have a full dependency list for the include
# and MODULES: here are only the minimal basic dependencies
# if one structure is changed then touch_dep must be called
# with the corresponding name of the structure
#
base.o: base.inc base.F
mgrid.o: mgrid.inc mgrid.F
constant.o: constant.inc constant.F
lattice.o: lattice.inc lattice.F
setex.o: setexm.inc setex.F
pseudo.o: pseudo.inc pseudo.F
poscar.o: poscar.inc poscar.F
mkpoints.o: mkpoints.inc mkpoints.F
wave.o: wave.inc wave.F
nonl.o: nonl.inc nonl.F
nonlr.o: nonlr.inc nonlr.F
fftw3.o: fftw3.f

$(OBJ_HIGH):
	$(CPP)
	$(FC) $(FFLAGS) $(OFLAG_HIGH) $(INCS) -c $*$(SUFFIX)
$(OBJ_NOOPT):
	$(CPP)
	$(FC) $(FFLAGS) $(INCS) -c $*$(SUFFIX)

fft3dlib_f77.o: fft3dlib_f77.F
	$(CPP)
	$(F77) $(FFLAGS_F77) -c $*$(SUFFIX)

.F.o:
	$(CPP)
	$(FC) $(FFLAGS) $(OFLAG) $(INCS) -c $*$(SUFFIX)
.F$(SUFFIX):
	$(CPP)
$(SUFFIX).o:
	$(FC) $(FFLAGS) $(OFLAG) $(INCS) -c $*$(SUFFIX)

# special rules
#-----------------------------------------------------------------------

#$(FC) $(FFLAGS) $(INCS) -qoptimize=2 -O2 -c $*$(SUFFIX)
radial.o: radial.F
	$(CPP)
	$(FC) $(FFLAGS) $(INCS) -qoptimize=2 -c $*$(SUFFIX)

nonl.o: nonl.F
	$(CPP)
	$(FC) $(FFLAGS) $(INCS) -O -c $*$(SUFFIX)

paw.o: paw.F
	$(CPP)
	$(FC) $(FFLAGS) $(INCS) -O1 -c $*$(SUFFIX)

pseudo.o: pseudo.F
	$(CPP)
	$(FC) $(FFLAGS) $(INCS) -O1 -c $*$(SUFFIX)

Thanks!
Last edited by ng on Tue Oct 30, 2007 12:29 am, edited 1 time in total.

admin
Administrator
Posts: 2921
Joined: Tue Aug 03, 2004 8:18 am
License Nr.: 458

parallel (mpi) problem on BG/L

#2 Post by admin » Wed Nov 07, 2007 2:42 pm

This looks more like a problem with your parallelization software or the communication hardware than like a VASP problem.
Last edited by admin on Wed Nov 07, 2007 2:42 pm, edited 1 time in total.

job
Jr. Member
Posts: 55
Joined: Tue Aug 16, 2005 7:44 am

parallel (mpi) problem on BG/L

#3 Post by job » Tue Nov 20, 2007 12:31 pm

The lack of VPU time is probably due to getrusage() not returning all the info that you normally see on a full-featured OS. It shouldn't really affect any results.

If you would still like to see it, you can try my F95 timing routines instead of timing.c in vasp.4.lib:

http://www.fyslab.hut.fi/~job/timing_f95.f90
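For what it's worth, the core of such a replacement is only a few lines. The sketch below is not the contents of the file above, just an illustration of the idea; the routine name vtime and the meaning of its two arguments are assumptions based on what timing.c provides, and only the standard F95 intrinsics cpu_time and system_clock are used, which work even where getrusage() is incomplete:

! Minimal sketch (assumed interface, not the actual timing_f95.f90):
! a stand-in for the getrusage()-based routine in vasp.4.lib/timing.c.
! vputim gets the CPU time used so far, cputim the elapsed wall-clock
! time, both in seconds.
subroutine vtime(vputim, cputim)
  implicit none
  real(8), intent(out) :: vputim, cputim
  real    :: t
  integer :: counts, rate
  call cpu_time(t)                 ! CPU seconds used by this process
  vputim = real(t, 8)
  call system_clock(counts, rate)  ! wall-clock ticks and tick rate
  if (rate > 0) then
     cputim = real(counts, 8)/real(rate, 8)
  else
     cputim = 0.0d0                ! no system clock available
  end if
end subroutine vtime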
Last edited by job on Tue Nov 20, 2007 12:31 pm, edited 1 time in total.

d-farrell2

parallel (mpi) problem on BG/L

#4 Post by d-farrell2 » Tue Nov 11, 2008 9:41 pm

I just wanted to bump this thread to see if anyone ever sorted this issue out. I am seeing very similar behavior in some large runs on a BG/P, but I haven't yet set up the code for debugging (I want to see how reproducible it is first).
Last edited by d-farrell2 on Tue Nov 11, 2008 9:41 pm, edited 1 time in total.
