MLFF stages of training
-
- Newbie
- Posts: 10
- Joined: Mon Feb 20, 2023 3:46 pm
MLFF stages of training
Hello,
I have questions about MLFF training in VASP, which I am just beginning to try to use. I read on the VASP MLFF wiki that one can train complex systems in stages - I believe the example given was one in which a surface material was trained, then a molecule was trained, and lastly the system including the surface and molecule as a whole was trained.
I want to look at a system that contains a molecular liquid with two different element types, and to introduce a third element at random positions throughout the liquid.
Should I do the MLFF training for such a system in stages, where I train the two-element molecular liquid first and then introduce the third element into the training? If I do it like this, can I then use the ML_ABN file from the first round of training once I've updated the POSCAR and POTCAR files to include the third element?
Thanks for any help!
-Dylan
-
- Global Moderator
- Posts: 460
- Joined: Mon Nov 04, 2019 12:44 pm
Re: MLFF stages of training
Yes, exactly as you wrote.
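For reference, a minimal sketch of the file handling between the two stages (the directory names are only placeholders; the ML_ABN-to-ML_AB copy is the continuation step described above, and the POSCAR/POTCAR for the second stage have to be updated by hand to contain the third element):
Code: Select all
# Minimal sketch of the staged-continuation file handling.
# Directory names are hypothetical placeholders for the two training runs.
import shutil

stage1 = "01_two_element_liquid"   # placeholder: run with the two-element molecular liquid
stage2 = "02_with_third_element"   # placeholder: run with the third element added

# Reuse the training data collected so far as the starting ML_AB of stage 2.
shutil.copy(f"{stage1}/ML_ABN", f"{stage2}/ML_AB")

# POSCAR and POTCAR in stage 2 must be prepared by hand so that they now
# include the third element; the element order must be consistent between them.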
-
- Newbie
- Posts: 10
- Joined: Mon Feb 20, 2023 3:46 pm
Re: MLFF stages of training
Hello Ferenc,
I have a follow-up on my original post. The system I am looking at is liquid CO2 at 4000 K with Re atoms randomly placed in the liquid.
I attempted training in stages as discussed in this post. I trained the pure liquid CO2 system first, with 48 atoms total, for 60k training steps. Then I introduced 16 Re atoms into the system and copied over the ML_ABN file to train the system with the additional element. I was able to train for 1k steps, but when I try to continue training further I run into a 'BAD TERMINATION' error just before entering the main loop. Google tells me this may be related to memory, but I have not found a solution. I tried increasing as well as decreasing the number of CPUs used, but still have the issue. I'm happy to upload any files that will be useful for you to help me. It seems I can't drag any files into this message post, though, so I'll need instructions there. Thanks in advance!
-
- Global Moderator
- Posts: 460
- Joined: Mon Nov 04, 2019 12:44 pm
Re: MLFF stages of training
I also suspect memory. The first thing is to check whether your VASP is compiled with shared memory. For that, please check if "-Duse_shmem" is among the precompiler flags in the makefile.include file (see wiki/index.php/Shared_memory).
If not, please upload the necessary files (ML_LOGFILE, ML_AB, INCAR, KPOINTS, POTCAR, OUTCAR).
When you write your message, there is an "Attachments" tab next to "Options" where you can upload the files.
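As a small convenience (not an official check), the makefile.include flag mentioned above can also be looked for with a few lines of Python:
Code: Select all
# Convenience sketch only: check whether -Duse_shmem appears in makefile.include.
# Assumes the script is run from the root of the VASP source directory.
from pathlib import Path

flags = Path("makefile.include").read_text()
print("-Duse_shmem found" if "-Duse_shmem" in flags else "-Duse_shmem NOT found")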
-
- Newbie
- Posts: 10
- Joined: Mon Feb 20, 2023 3:46 pm
Re: MLFF stages of training
My VASP build did not have "-Duse_shmem" in the makefile.include while I was having this error. However, today we included this flag and recompiled, but we still get a similar error.
For some reason, I cannot upload the files in the "Attachments" tab. The message just reads 'Invalid file extension', whether it's the INCAR, KPOINTS, etc. The same message occurs if I instead drag the files into the message box.
-
- Global Moderator
- Posts: 460
- Joined: Mon Nov 04, 2019 12:44 pm
Re: MLFF stages of training
So you should check the memory estimation at the beginning of the ML_LOGFILE for both calculations. You should see a significant decrease in the required memory per core for the calculation with "-Duse_shmem". In particular, the entry "CMAT for basis" should be much smaller.
You should also check that the required memory per core fits into your computer's memory.
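If it helps, here is a small sketch for pulling out the "CMAT for basis" entries from the two ML_LOGFILEs so the values with and without "-Duse_shmem" can be compared (the file names are placeholders, and the exact layout of the memory block can differ between VASP versions):
Code: Select all
# Hedged sketch: print the memory-estimation lines that mention "CMAT for basis"
# from two ML_LOGFILEs to compare the runs with and without -Duse_shmem.
# The file names below are placeholders; rename them to match your runs.
for logfile in ("ML_LOGFILE.no_shmem", "ML_LOGFILE.shmem"):
    with open(logfile) as f:
        for line in f:
            if "CMAT for basis" in line:
                print(f"{logfile}: {line.rstrip()}")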
For the attachments, you need to pack all files as a .zip. Only packed files and pictures are permitted as valid files. You can also find this info in the forum guidelines.
-
- Newbie
- Posts: 10
- Joined: Mon Feb 20, 2023 3:46 pm
Re: MLFF stages of training
Okay, I see. I have now attached the files as a .zip. Thank you for clarifying.
The ML_LOGFILE does show reduced memory, especially for the 'CMAT for basis' entry, as you said. The files I attached are from after we compiled with "-Duse_shmem". I also attached the VASP output with the errors from before and after compiling with "-Duse_shmem" for comparison. After compiling with "-Duse_shmem", we get different segmentation fault errors preceding the 'BAD TERMINATION'.
-
- Global Moderator
- Posts: 460
- Joined: Mon Nov 04, 2019 12:44 pm
Re: MLFF stages of training
Sorry for the late reply, but it has been a really busy time.
I've tried to run your calculation, and I think you are running out of memory.
I ran your calculation using machine learning only for refitting (ML_MODE=REFIT) and it works fine. It needs approximately 2 GB of memory per core on 64 cores. The memory for the machine-learning part is written at the beginning of the ML_LOGFILE. If I include your ab-initio settings and try to run continuation runs for on-the-fly learning, I run out of memory on 64 cores with 1 TB of memory.
Unfortunately, the memory requirement for the ab-initio part is not written out, but your cutoff of 910 eV seems very high.
So I think your problem is that the machine learning combined with precise ab-initio parameters runs out of memory.
Please try to reduce the ab-initio parameters so that the calculation fits into memory.
-
- Newbie
- Posts: 10
- Joined: Mon Feb 20, 2023 3:46 pm
Re: MLFF stages of training
Thank you for your previous responses; they helped me to test lowering the cutoff energy. I have recently started looking at using MLFF in VASP again and had an additional question that is relevant to the 'MLFF stages of training' topic of this original post, so I thought I would reply in the same thread.
I tried training a force field in separate sessions, where each subsequent session just continues from the last training session by copying ML_ABN to ML_AB, etc. I noticed that if I plot the training errors from the ML_LOGFILE (the BEEF, ERR, and CTIFOR quantities, as done in the 'Monitoring' section of the 'Best practices for MLFF' VASP manual page) for these continued training sessions, there seems to be a discontinuity at the MD step where the continuation begins. I am confused about why this is there, since I am just continuing the training of the same force field. Is this behavior in the training errors expected when continuing MLFF training sessions? I attached a plot in the .zip file for reference. Thanks in advance.
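For reference, this is roughly how I extract the BEEF quantities for the plot (a sketch only; it assumes the standard BEEF column layout of the ML_LOGFILE, where column 4 is the maximum Bayesian force error and column 6 is the current threshold):
Code: Select all
# Sketch of the monitoring plot: parse BEEF lines from the ML_LOGFILE and plot
# the maximum Bayesian force error together with the current threshold (CTIFOR).
# Header/comment lines start with "# BEEF" and are skipped by startswith("BEEF").
import matplotlib.pyplot as plt

steps, bee_max_force, threshold = [], [], []
with open("ML_LOGFILE") as f:
    for line in f:
        if line.startswith("BEEF"):
            cols = line.split()
            steps.append(int(cols[1]))            # nstep
            bee_max_force.append(float(cols[3]))  # BEE of forces (max), eV/Angst
            threshold.append(float(cols[5]))      # current threshold criterion

plt.plot(steps, bee_max_force, label="bee_max_force")
plt.plot(steps, threshold, label="threshold (CTIFOR)")
plt.xlabel("MD step")
plt.ylabel("eV/Angst")
plt.legend()
plt.savefig("beef_monitoring.png")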
-
- Newbie
- Posts: 4
- Joined: Wed Mar 23, 2022 2:54 pm
Re: MLFF stages of training
Hi,
I am not an expert, but it seems that you didn't change the ML_CTIFOR value in the INCAR when continuing the training. Because of that, CTIFOR was reset to the default value, which is 0.002, explaining the drops at the 5k and 10k steps.
I would suggest setting ML_CTIFOR to the last CTIFOR value of the previous training, which you can find in the BEEF lines of the ML_LOGFILE.
Cheers,
-
- Global Moderator
- Posts: 236
- Joined: Mon Apr 26, 2021 7:40 am
Re: MLFF stages of training
Hello!
Good catch, thanks Thanh-Nam! That is exactly what is happening; see also the comment on our best-practices page. As suggested, you can find the current threshold value for the Bayesian error estimation (CTIFOR) in the ML_LOGFILE, in column 6, labelled "threshold", of the BEEF output, e.g. here:
Code: Select all
# BEEF ########################################################################################################################
# BEEF This line shows the Bayesian error estimations and the current threshold criterion.
# BEEF
# BEEF nstep ............ MD time step or input structure counter
# BEEF bee_energy ....... BEE of energy per atom (eV atom^-1)
# BEEF bee_max_force .... BEE of forces (max) (eV Angst^-1)
# BEEF bee_ave_force .... BEE of forces (average) (kB)
# BEEF threshold ........ Current value of threshold criterion (eV Angst^-1)
# BEEF bee_max_stress ... BEE of stresses (max) (kB)
# BEEF bee_ave_stress ... BEE of stresses (average) (kB)
# BEEF ########################################################################################################################
# BEEF nstep bee_energy bee_max_force bee_ave_force threshold bee_max_stress bee_ave_stress
# BEEF 2 3 4 5 6 7 8
# BEEF ########################################################################################################################
BEEF 0 0.00000000E+00 0.00000000E+00 0.00000000E+00 2.00000000E-03 0.00000000E+00 0.00000000E+00
BEEF 1 1.30768979E-05 6.90944059E-02 3.69707959E-02 2.00000000E-03 1.56460353E+00 1.10511278E+00
.....
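If it is convenient, the last threshold can also be grabbed with a short script (a sketch only, assuming the column layout shown above) and then used as ML_CTIFOR in the INCAR of the continuation run:
Code: Select all
# Sketch: extract the last threshold (column 6) from the BEEF lines of the
# ML_LOGFILE, assuming the column layout shown above. Header lines beginning
# with "# BEEF" are skipped by the startswith("BEEF") test.
last_threshold = None
with open("ML_LOGFILE") as f:
    for line in f:
        if line.startswith("BEEF"):
            last_threshold = float(line.split()[5])

if last_threshold is not None:
    print(f"ML_CTIFOR = {last_threshold:.8E}")  # value for the continuation INCAR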
Best,
Andreas Singraber
-
- Newbie
- Posts: 10
- Joined: Mon Feb 20, 2023 3:46 pm
Re: MLFF stages of training
Okay, I see the issue now. Thank you both for the responses!