MLFF training stuck after first ionic step


reach2sayan
Newbie
Posts: 6
Joined: Sun Oct 16, 2022 9:49 pm

MLFF training stuck after first ionic step

#1 Post by reach2sayan » Fri Oct 21, 2022 3:26 pm

Hi,

I am trying to fit an MLFF using VASP. To check that my installation works, I tried an example. I have attached the INCAR, OUTCAR, POSCAR, ML_LOGFILE and the stdout here (also the ICONST, since I want to sample a liquid phase at high T).

If I remove the ML tags and run just an MD, everything is fine. In fact, I zeroed in on the MD hyperparameters (LANGEVIN_GAMMA etc.) by doing just that. However, when I start the ML training, it gets stuck after the first set of electronic steps converges (as you can see from the output files). It stayed like that for about 6 hours before I canceled it.

I wonder what I'm doing wrong. Could the problem be the installation itself? Thank you for your kind help.

Best
Sayan

reach2sayan
Newbie
Posts: 6
Joined: Sun Oct 16, 2022 9:49 pm

Re: MLFF training stuck after first ionic step

#2 Post by reach2sayan » Fri Oct 21, 2022 3:40 pm

Sorry, I see that I should also post the KPOINTS file and the job script. I compiled VASP on SDSC PSC Bridges https://www.psc.edu/resources/bridges-2/

KPOINTS
Si
0 0 0
Gamma
4 4 4
0 0 0

jobscript
#!/bin/bash

#SBATCH -t 48:00:00
#SBATCH -p RM
#SBATCH --nodes 2
#SBATCH --ntasks-per-node=120

ulimit -s unlimited
module load intel intelmpi cuda hdf5 # same ones with which it was compiled
export OMP_NUM_THREADS=1

mpirun vasp.6.3.2/bin/vasp_std > vasp.out

ferenc_karsai
Global Moderator
Posts: 460
Joined: Mon Nov 04, 2019 12:44 pm

Re: MLFF training stuck after first ionic step

#3 Post by ferenc_karsai » Thu Oct 27, 2022 9:40 am

I just ran your calculation and it ran without any problems. I also tried it with 8 and 128 cores, and it ran fine.
So it is most likely a problem with your installation.

Try the following:
-) Compile without scaLAPACK (remove -DscaLAPACK from your CPP_OPTIONS in the makefile.include).
-) Compile without shared memory (remove -Duse_shmem from CPP_OPTIONS).
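
The two compile-time changes listed above could be applied by hand in an editor. As a rough illustration only, here is a small Python sketch that strips both flags from the text of a makefile.include. The helper name `strip_cpp_flags` and the sample makefile fragment are hypothetical, and the sketch assumes the flags appear as whitespace-separated tokens. It only edits text, so VASP still has to be rebuilt afterwards.

```python
import re


def strip_cpp_flags(text, flags=("-DscaLAPACK", "-Duse_shmem")):
    """Return makefile text with the given preprocessor flags removed.

    Hypothetical helper: assumes each flag is a whitespace-delimited token,
    as in a typical CPP_OPTIONS block.
    """
    for flag in flags:
        # Match the flag only as a whole token, plus any trailing blanks,
        # so that longer names starting with the same prefix are untouched.
        pattern = r"(?<!\S)" + re.escape(flag) + r"(?!\S)[ \t]*"
        text = re.sub(pattern, "", text)
    return text


if __name__ == "__main__":
    # Illustrative fragment, not the poster's actual makefile.include.
    sample = (
        "CPP_OPTIONS = -DMPI -DscaLAPACK \\\n"
        "              -Duse_shmem -Dvasp6\n"
    )
    print(strip_cpp_flags(sample))
```

To use it on a real file, one would read makefile.include, pass its contents through the function, write the result to a new file, and compare the two before recompiling.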

You used 240 cores in your calculation.
Don't use so many. It's enough to try it with 8 cores.

I also saw that you have TEBEG=1800 and TEEND=800 in your calculation. Never run cooling runs in on-the-fly machine learning. Always use heating runs. Otherwise the automatic threshold determination can get stuck. This is also explained on our best practices wiki page:
wiki/index.php/Best_practices_for_machi ... rce_fields
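
The TEBEG/TEEND advice above can be checked before submitting a job. The following Python sketch is a hypothetical pre-flight check, not part of VASP: the function names are made up for illustration, and the INCAR parsing is deliberately simplified (it assumes `TAG = value` pairs, `#` or `!` comments, and `;` as a separator). It assumes the documented defaults, TEBEG = 0 and TEEND = TEBEG, when a tag is missing.

```python
import re


def read_incar_tags(text):
    """Parse 'TAG = value' pairs from INCAR text (simplified sketch)."""
    tags = {}
    for line in text.splitlines():
        # Drop everything after a comment character.
        line = re.split(r"[#!]", line, maxsplit=1)[0]
        # Several tags may share one line, separated by ';'.
        for part in line.split(";"):
            if "=" in part:
                key, value = part.split("=", 1)
                tags[key.strip().upper()] = value.strip()
    return tags


def is_cooling_run(incar_text):
    """Return True if the end temperature is below the start temperature."""
    tags = read_incar_tags(incar_text)
    # Assumed defaults: TEBEG = 0, and TEEND = TEBEG when TEEND is unset.
    tebeg = float(tags.get("TEBEG", "0"))
    teend = float(tags.get("TEEND", tebeg))
    return teend < tebeg


if __name__ == "__main__":
    # The settings mentioned in this thread describe a cooling run.
    print(is_cooling_run("TEBEG = 1800\nTEEND = 800\n"))
```

A job script could call such a check and refuse to start on-the-fly training when it reports a cooling run.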

reach2sayan
Newbie
Posts: 6
Joined: Sun Oct 16, 2022 9:49 pm

Re: MLFF training stuck after first ionic step

#4 Post by reach2sayan » Mon Nov 14, 2022 9:16 pm

Hi,

Possibly the issue was with either the libbeef installation or running out of stack size, but it is fixed now. I also fixed the other issue you pointed out (it was just a check to see whether ML training worked).
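
For anyone who wants to rule out the stack-size possibility mentioned above, one rough way is to print the limits a process actually inherits inside the job environment. The Python sketch below uses the standard `resource` module (Unix only); the function name `stack_limits` is made up for illustration. It reports the limits of the Python process itself, so it is only informative when run in the same shell or job script, after the same `ulimit` line, as the VASP run.

```python
import resource


def stack_limits():
    """Return (soft, hard) stack limits of this process as readable strings."""

    def describe(value):
        # RLIM_INFINITY corresponds to 'ulimit -s unlimited'.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        # getrlimit reports bytes; 'ulimit -s' reports KiB.
        return f"{value // 1024} KiB"

    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return describe(soft), describe(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack soft limit: {soft}, hard limit: {hard}")
```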

Best
Sayan

ferenc_karsai
Global Moderator
Posts: 460
Joined: Mon Nov 04, 2019 12:44 pm

Re: MLFF training stuck after first ionic step

#5 Post by ferenc_karsai » Tue Nov 15, 2022 8:21 am

Good to hear everything works now. Thanks for sharing your solution.
