ML force field training does not progress

Posted: Tue Aug 13, 2024 8:37 am
by akretschmer
I am trying to train an ML force field for AIMD of graphite. But when the main loop starts, nothing happens, and after 3 days the job aborts because it hits the time limit of the cluster.

EDIT: I am using vasp.6.4.2

INCAR:

SYSTEM = graphene
ENCUT = 550
IBRION = 0
ISIF = 3
NSW = 100
EDIFF = 1e-6
EDIFFG = 1e-5
ISMEAR = 1
SIGMA = 0.2
PREC = Accurate
ALGO = FAST
LREAL  = Auto
LWAVE  = .FALSE.        !write WAVECAR (def T)
LCHARG = .TRUE.        !write CHGCAR (def T)
NCORE = 2

IVDW = 12

ML_LMLFF = .TRUE.
POTIM = 0.7
MDALGO = 3
ISYM = 0
ML_MODE = train
TEBEG = 50
TEEND = 500
I tried the same with other xc functionals, but they all fail the same way. I relaxed the cell beforehand with a static ab initio calculation, which ran perfectly fine. I then just added the last block of tags to the INCAR file.

POSCAR:

Graphite
   1.0000000000000000
     7.4030347517527781   -0.0111052851320072    0.0159637274031600
    -3.7026587113806686    6.4131930112247382    0.0006787546983971
    -0.0000000000000000    0.0000000000000002   13.4433740916905080
   C
    72
Direct
  0.0049278469335001  0.0024455047286062  0.1252358593113222
 -0.0007109408973012 -0.0008817066211251  0.6240798432840473
  0.0057835149373610  0.3352854444422640  0.1248845911781240
 -0.0024288457598892  0.3334761305940993  0.6227157596021295
  0.0056777522820779  0.6705103407953577  0.1247834411600615
 -0.0037804529108982  0.6657771079223536  0.6244463858538861
  0.3396929031149822  0.0032408933870816  0.1252927693876168
  0.3312449051078522 -0.0032995018962406  0.6234079387789914
  0.3391128084622168  0.3339467305997547  0.1265875199713704
  0.3308643961844996  0.3322887469818847  0.6236279494970725
  0.3374444372869301  0.6675729690985164  0.1271281112655019
  0.3312824223503720  0.6659573084352595  0.6234714993320106
  0.6711900550506326  0.0026275769658387  0.1249047189506541
  0.6652487561523698 -0.0011190686765061  0.6226569736164909
  0.6741870084050987  0.3372091342224170  0.1261005097108367
  0.6634041769965733  0.3314381201595334  0.6229545484584281
  0.6713049103122155  0.6687163506957010  0.1245989025324702
  0.6634224418383139  0.6655925929113597  0.6233121542442099
  0.0085656175418687 -0.0002636903059656  0.3687874258165127
  0.0049255024652602  0.0024063734478703  0.8764352481874031
  0.0067971933649898  0.3322612763213928  0.3704382375466629
  0.0056206404654441  0.3371443372841954  0.8773704978551391
  0.0079664359150852  0.6658137344357871  0.3683180423549069
  0.0051548649432684  0.6696964301109145  0.8763691547647932
  0.3407996485864724  0.0008049412391801  0.3675210170564341
  0.3394673665100049  0.0027199387934017  0.8763994323005698
  0.3403238589293950  0.3326906770325044  0.3681057314301978
  0.3391904444225973  0.3359591581637849  0.8747233893362275
  0.3417223323281676  0.6670986666757702  0.3682899691247028
  0.3386844864779097  0.6699644799180259  0.8761319696073786
  0.6730679249343270 -0.0016036249220819  0.3693117932714712
  0.6732955261519441  0.0036141804644993  0.8769228820002187
  0.6742516243213496  0.3328447706408325  0.3681570751986207
  0.6735315505826842  0.3372005113332451  0.8758646131204946
  0.6739319882003703  0.6663518268579599  0.3680627293048768
  0.6709395841566800  0.6700250301374353  0.8761194495350685
  0.2274932196082184  0.1128086522023745  0.1266672800376033
  0.2206087137491038  0.1097578293189827  0.6248329222927036
  0.2278922090841584  0.4445912150011435  0.1267045734700731
  0.2199254165458235  0.4430531523243142  0.6240898179035792
  0.2275382483493668  0.7795187129914795  0.1268633289051972
  0.2190711442435241  0.7757190930410361  0.6245821350676810
  0.5607630741469803  0.1131138601821316  0.1258736027732899
  0.5536509258719490  0.1095526817655896  0.6234479313227643
  0.5627304975370685  0.4469026561266214  0.1270103871239664
  0.5522735812247701  0.4442735697334139  0.6234282328623040
  0.5594230164553252  0.7786408786840243  0.1261711359877175
  0.5539978387859763  0.7778729838502607  0.6236854052256970
  0.8934179450135686  0.1130363555711553  0.1257033626365640
  0.8894571620127282  0.1121126257972904  0.6232254847528574
  0.8960843208236453  0.4477623470817549  0.1246607226163121
  0.8847737148460817  0.4433698583875044  0.6237321301406060
  0.8942406257776696  0.7806786120743753  0.1240870420758763
  0.8871480091699188  0.7787414333359274  0.6250895482944877
  0.1187539064375385  0.2211299307261630  0.3690276325830065
  0.1155861219429878  0.2244436472881211  0.8760486011449945
  0.1179867036361861  0.5547247563743948  0.3693579762748144
  0.1150339138033622  0.5599802825929336  0.8769218147655224
  0.1194028179518858  0.8892487973452687  0.3685166815854137
  0.1160625222523202  0.8924670331506304  0.8754098046236934
  0.4519514524437534  0.2207169334854960  0.3685663516469894
  0.4490730911104343  0.2253180250450681  0.8750536056720808
  0.4523514101548700  0.5554240648100396  0.3685471906471366
  0.4470630976123472  0.5573373674207622  0.8749391262064576
  0.4523750051610730  0.8874116411786455  0.3685958136462503
  0.4501586553917471  0.8924734228393962  0.8761060833236406
  0.7857233558462281  0.2215002121497795  0.3700024566542869
  0.7840160470304917  0.2268659351347516  0.8764746558520714
  0.7865816953110767  0.5553536880688398  0.3681545304547837
  0.7808866041887622  0.5583601258270514  0.8759012408518100
  0.7838530910026997  0.8879276403193670  0.3690368494450083
  0.7816062435294621  0.8912044822654043  0.8774607633464009


The main loop part in the ML_LOGFILE contains just this single line, which tells me that it runs an ab initio step as it should:

--------------------------------------------------------------------------------
STATUS                  0 threshold  2      T      F         0         0


The OSZICAR is likewise empty.

The OUTCAR file ends at the first iteration of the first ionic step:

 ML FREE ENERGIE OF THE ION-ELECTRON SYSTEM (eV)
  ---------------------------------------------------
  free  energy ML TOTEN  =         0.00000000 eV

  ML energy  without entropy=        0.00000000  ML energy(sigma->0) =        0.00000000

      MLFF:  cpu time      0.3316: real time     21.2640


--------------------------------------- Iteration      1(   1)  ---------------------------------------


    POTLOK:  cpu time      1.2962: real time     65.8013
    SETDIJ:  cpu time      0.1713: real time     11.0067


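For anyone reading along, here is a minimal Python sketch for comparing the cpu and real times in OUTCAR timing lines like the ones above. The function name `timing_ratios` and the regular expression are my own assumptions based only on the line layout shown in this excerpt, not a VASP tool. A large real/cpu ratio may hint that the processes spend most of their time waiting (for I/O, communication, or an oversubscribed node), but it is not a diagnosis by itself.

```python
import re

# Assumed layout, taken from the excerpt above:
#   "    POTLOK:  cpu time      1.2962: real time     65.8013"
TIMING = re.compile(r"^\s*(\w+):\s+cpu time\s+([\d.]+):\s+real time\s+([\d.]+)")


def timing_ratios(outcar_text):
    """Return (routine, cpu_s, real_s, real/cpu) for every timing line found."""
    rows = []
    for line in outcar_text.splitlines():
        m = TIMING.match(line)
        if m:
            cpu, real = float(m.group(2)), float(m.group(3))
            ratio = real / cpu if cpu > 0 else float("inf")
            rows.append((m.group(1), cpu, real, ratio))
    return rows


# Timing lines copied from the OUTCAR excerpt in this post.
sample = """
      MLFF:  cpu time      0.3316: real time     21.2640
    POTLOK:  cpu time      1.2962: real time     65.8013
    SETDIJ:  cpu time      0.1713: real time     11.0067
"""

for name, cpu, real, ratio in timing_ratios(sample):
    print(f"{name:8s} cpu={cpu:8.4f} s  real={real:8.4f} s  real/cpu={ratio:6.1f}")
```

To use it on a real run, read the file with `open("OUTCAR").read()` and pass the text to `timing_ratios`.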

What am I doing wrong?

Re: ML force field training does not progress

Posted: Tue Aug 13, 2024 9:48 am
by ferenc_karsai
Please upload all necessary files (POSCAR, POTCAR, KPOINTS, ML_AB, INCAR, OUTCAR, OSZICAR, ML_LOGFILE and stdout) according to the forum guidelines.

Re: ML force field training does not progress

Posted: Tue Aug 13, 2024 10:49 am
by akretschmer
Here are the files. I cannot provide the ML_AB file as it was not generated.
PBE-D3.zip

Re: ML force field training does not progress

Posted: Fri Aug 16, 2024 2:27 pm
by ferenc_karsai
I just tried your example. I think you have too many k-points (163 in the irreducible Brillouin zone), so the calculation takes forever.
Please try a reduced number of k-points (perhaps run a convergence test on a single structure, starting from a coarse mesh and working up).
Also try to parallelize over the k-points. To do that, start the calculation and let it run until NKPTS appears in the OUTCAR file (this should take a few tens of seconds).
Then stop the calculation. Set KPAR in the INCAR file to the NKPTS value you looked up, and then restart your calculation.
This should also noticeably speed up your calculation.
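The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `parse_nkpts` and `kpar_candidates` are hypothetical helper names, the sample header line is modeled on the usual OUTCAR layout rather than copied from a real file, and 128 MPI ranks is an example value. The filter assumes that KPAR has to divide the number of MPI ranks and that the ranks left per k-point group have to be divisible by NCORE, so KPAR = NKPTS is not always a usable choice.

```python
import re


def parse_nkpts(outcar_text):
    """Return NKPTS from OUTCAR text, or None if the tag has not been written yet."""
    m = re.search(r"NKPTS\s*=\s*(\d+)", outcar_text)
    return int(m.group(1)) if m else None


def kpar_candidates(nkpts, nranks, ncore=1):
    """List KPAR values that satisfy the assumed divisibility constraints.

    Assumptions (not checked against a specific VASP build):
      * KPAR must divide the total number of MPI ranks,
      * the ranks per k-point group must be divisible by NCORE,
      * KPAR larger than NKPTS brings no benefit.
    """
    candidates = []
    for kpar in range(1, min(nkpts, nranks) + 1):
        if nranks % kpar == 0 and (nranks // kpar) % ncore == 0:
            candidates.append(kpar)
    return candidates


# Illustrative header line modeled on the OUTCAR layout (not real output).
header = "   k-points           NKPTS =    163   k-points in BZ     NKDIM =    163"

nkpts = parse_nkpts(header)
# 128 ranks is an example value; NCORE = 2 matches the INCAR in this thread.
print("NKPTS:", nkpts)
print("KPAR candidates:", kpar_candidates(nkpts, nranks=128, ncore=2))
```

Among the candidates, values that also divide NKPTS evenly should give the most balanced load, and larger KPAR values need more memory per rank.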

Re: ML force field training does not progress

Posted: Wed Sep 18, 2024 9:12 am
by akretschmer

I have reduced the number of k-points, but the problem persists. When I set the KPAR tag to the value of NKPTS, I get an error and the calculation starts. KPAR = 2 is the only setting that has worked for me so far on the VSC5 (128 cores per node).

I have now tried a single cell with only 4 atoms and still not a single step is completed, so I guess there must be something else wrong. I have uploaded my latest try with the small cell.

Test.zip