Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
Posted: Wed Oct 04, 2023 5:52 am
by dantasqu
I'm having trouble with the 'Refit' mode of VASP's MLFF. I was aware of the known issue 'Incorrect MLFF fast-mode predictions for some triclinic geometries', so I updated to the latest version, but the problem persists.
In my current simulation the OSZICAR file stays empty, and the output file gets only as far as 'initializing machine learning' before hanging indefinitely, no matter how long a runtime I set.
I also tried halving the number of configurations in the 'ML_AB' file, since the current number is quite high, but the output did not change.
I have also read other posts, but many people there seem to have obtained some results after the simulation "stopped", so I wonder whether this is yet another memory allocation problem or something else that I could troubleshoot.
I've shared the necessary files at the link below (they are too large to attach). Any input or guidance on resolving this persistent problem would be greatly appreciated.
Link:
https://drive.google.com/drive/folders/ ... sp=sharing
Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
Posted: Wed Oct 25, 2023 9:15 am
by ferenc_karsai
So I've run your calculation on 64 cores. After a few hours I get the following:
"xxmr2d:out of memory"
This comes from inside the ScaLAPACK routines for the SVD, which redistribute some data internally. For that they allocate helper arrays with malloc. If the size of a helper array (which is unfortunately 1D) exceeds 2**31, the limit of a 4-byte integer, this error message appears. These arrays get smaller the more cores one uses, since the arrays are distributed across the cores and each core only needs to allocate its own part.
So I reran the calculation with 128 cores and it went through fine.
You ran on 40 cores (I saw this in the OUTCAR), which is definitely not enough, but it's strange that you don't get an error.
Please try the calculation with more cores; to be safe, use at least 128.
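To make the argument about core counts concrete, here is a rough sketch of the arithmetic. It assumes that a helper array of some total number of elements is split evenly across the cores and that the per-core share must stay below 2**31; the real ScaLAPACK redistribution buffers are laid out differently, so treat this only as an order-of-magnitude estimate. The function names and the example array size are hypothetical and are not taken from VASP, ScaLAPACK, or the posted files.

```python
INT32_LIMIT = 2**31  # sizes at or above this do not fit a signed 4-byte integer


def local_elements(total_elements, n_cores):
    """Per-core share of an array split evenly over n_cores (ceiling division)."""
    return -(-total_elements // n_cores)


def fits(total_elements, n_cores, limit=INT32_LIMIT):
    """True if the per-core share stays below the 4-byte integer limit."""
    return local_elements(total_elements, n_cores) < limit


def min_cores(total_elements, limit=INT32_LIMIT):
    """Smallest core count for which the per-core share is below the limit."""
    return max(1, -(-total_elements // (limit - 1)))


# Hypothetical array size, NOT taken from the files posted in this thread.
total = 400_000 * 400_000
for cores in (40, 64, 128, 256):
    print(cores, local_elements(total, cores), fits(total, cores))
print("minimum cores:", min_cores(total))
```

Under these assumptions the per-core share shrinks in proportion to the core count, so a run that fails on a small number of cores can pass once enough cores are used, independent of how much total memory is free.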
Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
Posted: Mon Nov 27, 2023 8:42 am
by jelle_lagerweij
This question and answer were very useful for me as well. I got the same problem when refitting a force field for a liquid-phase system. At first I was quite surprised: I was using only 650 GB of the nearly 1500 GB of memory available, yet I still got this error message. Now I understand that the issue is caused by the size of an individual array being allocated, not by the total amount of memory used. I am testing the solution (using more cores) and will see how it works out. However, I must note that it would be nice if this error could be avoided by adjusting the ML algorithm, as the workaround forces me to use more high-memory nodes than I strictly need memory-wise. Using more cores just to have shorter arrays on each core does not feel like an appropriate long-term solution.
Regards,
Jelle
Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
Posted: Tue Dec 12, 2023 3:14 pm
by ferenc_karsai
The clean fix for this will come when ScaLAPACK officially changes from integer4 to integer8. That will solve the problem completely.
Until then there is not much we can do, since we absolutely need the parallel SVD solvers from ScaLAPACK. It is also hard to know in advance when this problem will occur, so writing warnings is not easy either.
Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
Posted: Fri Dec 15, 2023 7:11 pm
by dantasqu
Hey everyone, thank you for the input on the problem. I've been trying to run on 128 cores as suggested, but I still have some problems with it. Could you recommend a compiler and MPI to try? I've tried versions compiled with the module sets below:
FIRST:
module load intel/19.0.4
module load intel-mpi
module load intel-mkl
module load cuda
SECOND:
module load gcc/11.3.0
module load openmpi/4.1.4
module load hdf5/1.12.2
module load intel-oneapi/2021.3
module unload intel-oneapi-mpi/2021.3
Thanks,
Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
Posted: Thu Dec 28, 2023 10:04 am
by ferenc_karsai
I don't think it's a problem with the compilers. It depends rather on the size of your calculation. If your calculation is huge, 128 cores may not be enough either. So my suggestion is to try more cores, maybe 256 or more, until the problem goes away.
If that still does not help, then try this toolchain:
Intel Fortran 22.0.1 with Intel MPI 21.5.0