Initialization of design matrix failed
-
- Newbie
- Posts: 9
- Joined: Sun Jun 26, 2022 6:32 am
Initialization of design matrix failed
Hi VASP team,
I hope this message finds you well. I am currently working on MLFF training from ab initio data.
When I use a previous ML_ABN file (renamed to ML_AB) to restart, VASP says: ERROR, First Initialization of design matrix (FFM%FMAT) failed.
With the same input files, the job starts fine on the same type of GPU at a supercomputer center.
It may be due to a different installation. Why does the design matrix initialization fail on my local GPU? What could be the problem, and what is the solution?
(The files are attached.)
Thank you very much for your time,
Jie
-
- Global Moderator
- Posts: 126
- Joined: Mon May 08, 2023 4:08 pm
Re: Initialization of design matrix failed
Hi Jie,
I'll try to assist you with your problem, but first, I would require a bit more information. Are you using the same version of VASP in both cases? What kind of GPUs are you running on? Could you please also attach the OUTCAR files and standard output of the two runs?
Manuel
VASP developer
-
- Newbie
- Posts: 9
- Joined: Sun Jun 26, 2022 6:32 am
Re: Initialization of design matrix failed
Hi Manuel,
Thank you for your reply.
Both VASP installations are version 6.4.2, and both GPUs are A100s.
OUTCAR1 from the local GPU is attached. The OUTCAR from the supercomputer center is too large to upload, so
I extracted the first 100,000 lines and named it OUTCAR2.
Jie
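For anyone who needs to shrink a large OUTCAR in the same way before uploading, here is a minimal Python sketch of the truncation described above. The function name head_lines and its parameters are made up for illustration; only the file names OUTCAR and OUTCAR2 and the 100,000-line limit come from this post.

```python
from itertools import islice
from pathlib import Path


def head_lines(src, dst, max_lines=100_000):
    """Copy at most `max_lines` leading lines of `src` into `dst`.

    Returns the number of lines actually written.
    """
    written = 0
    # errors="replace" keeps the copy going if the file contains odd bytes.
    with open(src, "r", errors="replace") as fin, open(dst, "w") as fout:
        # islice stops after max_lines without reading the rest of the file.
        for line in islice(fin, max_lines):
            fout.write(line)
            written += 1
    return written


if __name__ == "__main__":
    # Only runs when an OUTCAR is present in the current directory.
    if Path("OUTCAR").exists():
        count = head_lines("OUTCAR", "OUTCAR2")
        print(f"wrote {count} lines to OUTCAR2")
```

Because the file is read line by line, this should work even when the full OUTCAR does not fit comfortably in memory.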
-
- Global Moderator
- Posts: 126
- Joined: Mon May 08, 2023 4:08 pm
Re: Initialization of design matrix failed
Perfect, thanks. I will look into it.
Manuel
VASP developer
-
- Global Moderator
- Posts: 126
- Joined: Mon May 08, 2023 4:08 pm
Re: Initialization of design matrix failed
I consulted with our machine-learning experts. It is likely that you are running out of memory on your local machine. The error message you encounter is generated by a failed allocation statement in the code.
The ML_LOGFILE contains information regarding the memory requirements of the calculation. Could you please also attach this file? How much memory do you have available on your local and on the remote machine?
Manuel
VASP developer
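As a quick way to find the memory information mentioned above, the following Python sketch lists every line of ML_LOGFILE that mentions "memory". The thread does not show how that part of the file is laid out, so the sketch assumes no particular format and does a plain case-insensitive keyword search. The function name memory_lines and the keyword parameter are hypothetical.

```python
from pathlib import Path


def memory_lines(logfile="ML_LOGFILE", keyword="memory"):
    """Return (line_number, text) pairs for lines containing `keyword`.

    The match is case-insensitive. No specific ML_LOGFILE layout is assumed.
    """
    needle = keyword.lower()
    hits = []
    with open(logfile, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line.lower():
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Only runs when an ML_LOGFILE is present in the current directory.
    if Path("ML_LOGFILE").exists():
        for number, text in memory_lines():
            print(f"{number}: {text}")
```

Running it in both the local and the remote run directory and comparing the output may help to see whether the two runs request the same amount of memory.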
-
- Newbie
- Posts: 9
- Joined: Sun Jun 26, 2022 6:32 am
Re: Initialization of design matrix failed
Hi Manuel,
It should not be due to memory. I tried another local machine with less memory, and it runs well.
The local machine has 80 GB, the same as the remote machine.
The file is attached.
Jie
-
- Newbie
- Posts: 9
- Joined: Sun Jun 26, 2022 6:32 am
Re: Initialization of design matrix failed
Hi Manuel,
I am wondering whether this is due to the different versions of the HPC SDK and CUDA.
The local A100 GPU runs with NVHPC 23.1 and CUDA 12.0, while the others (which work) use CUDA 11.x.
Could you check whether this job runs well with NVHPC 23.1 and CUDA 12.0 on VASP 6.4.2? That might
give a clue about where to look.
Thanks,
Jie
-
- Global Moderator
- Posts: 126
- Joined: Mon May 08, 2023 4:08 pm
Re: Initialization of design matrix failed
Unfortunately, the best advice I can give you is to not use the GPU version of VASP to run the machine-learning code. The ML code does not benefit from GPU parallelization and is, in fact, untested when running VASP on GPU. The error you encounter might be directly related to this. Could you please try to run the code on CPU only and see if the error persists?
Last edited by manuel_engel1 on Tue Feb 06, 2024 9:55 am, edited 1 time in total.
Reason: make it clear that only the GPU + ML combination is untested
Manuel
VASP developer
-
- Newbie
- Posts: 9
- Joined: Sun Jun 26, 2022 6:32 am
Re: Initialization of design matrix failed
Hi Manuel,
You mean I can run the CPU version of VASP on multi-node CPU cores for the machine-learning code?
(Is using multi-core CPUs more efficient than the GPU when running ML_MODE = select, refit, and production runs?
Any recommendations for efficiency in each ML_MODE stage?)
I tried the CPU version of VASP on CPU only, and the previous error disappeared. However, the GPU is much faster for pure ab initio calculations.
Sorry, one more question about merging different ML_AB files. The VASP wiki (https://www.vasp.at/wiki/index.php/ML_AB)
recommends: "strongly advise to group structures with the same number of elements and atoms per element in the training data".
Regarding grouping structures, does this mean that the combined ML_AB always uses one modified header, followed by, for example, the 48-atom structures as configurations 1 to 10 and then the 50-atom structures as configurations 11 to 20, giving 20 structures in total (Configuration num. 1 to Configuration num. 20), and that nothing else is necessary?
Thanks a lot for the help,
Jie
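To make the grouping idea in the quoted wiki sentence concrete, here is a small Python sketch that reorders a list of structures so that those with the same elements and the same number of atoms per element end up next to each other. It is only an illustration of the ordering: it does not read or write ML_AB files, the element names "A" and "B" and the structure labels are placeholders, and whether anything beyond this ordering (for example in the header) is required is not answered in this thread.

```python
def composition_key(structure):
    """Sorted tuple of (element, count) pairs, so equal compositions compare equal."""
    return tuple(sorted(structure["composition"].items()))


def group_by_composition(structures):
    """Reorder structures so that equal compositions are contiguous.

    Groups appear in order of first occurrence, and the original order
    is kept inside each group.
    """
    groups = {}
    for structure in structures:
        groups.setdefault(composition_key(structure), []).append(structure)
    return [s for group in groups.values() for s in group]


if __name__ == "__main__":
    # Placeholder data: two 48-atom structures and one 50-atom structure.
    data = [
        {"name": "conf_a", "composition": {"A": 32, "B": 16}},
        {"name": "conf_b", "composition": {"A": 34, "B": 16}},
        {"name": "conf_c", "composition": {"A": 32, "B": 16}},
    ]
    for index, s in enumerate(group_by_composition(data), start=1):
        print(index, s["name"], sum(s["composition"].values()))
```

The key uses the full composition rather than only the total atom count, since two structures with the same total can still differ in atoms per element.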
-
- Global Moderator
- Posts: 126
- Joined: Mon May 08, 2023 4:08 pm
Re: Initialization of design matrix failed
No worries, I hope I can clear things up.
"You mean I can run the CPU version of VASP on multi-node CPU cores for the machine-learning code?"
Yes.
"Is using multi-core CPUs more efficient than the GPU when running ML_MODE = select, refit, and production runs? Any recommendations for efficiency in each ML_MODE stage?"
The ML code does not use GPU parallelization. It is currently a CPU-only code.
"However, the GPU is much faster for pure ab initio calculations."
That is true. Unfortunately, you are currently restricted to CPU with ML in VASP. It also seems that trying to run ML calculations with a GPU involved will produce errors, so I advise against it.
For the additional question regarding the merging of ML_AB files, please open a new topic in the forum.
Manuel
VASP developer