OOM Error GW0-VASP5.4.4
-
- Newbie
- Posts: 21
- Joined: Thu Jun 27, 2024 3:43 pm
OOM Error GW0-VASP5.4.4
Hi,
I am trying to run a single-shot GW calculation after generating WAVEDER and WAVECAR. I used the attached KPOINTS, POTCAR, POSCAR, and INCAR.
After the Fermi level is written and the WAVECAR is read, I receive an out-of-memory (OOM) error.
I use a machine with 128 cores/node and 256 GB/node.
The following trials did not resolve the OOM error:
1. Reducing PREC to Single and Normal.
2. Reducing NOMEGA.
3. Reducing NBANDSGW.
4. Using LSPECTRAL = .FALSE.
5. Using ISYM = -1 (which actually increased the memory demand).
I am using NBANDS = 1024, so increasing the number of cores above 1024 increases NBANDS and gives the same OOM error.
Increasing the number of nodes up to 24 nodes with 1024 cores did not affect "the total amount of memory used by VASP MPI-rank0" and gave the same error.
Note: this study requires a high level of accuracy, which is why I use the high NBANDS and dense KPOINTS obtained from the convergence study.
Do you have any suggestions for resolving this error?
Thank you in advance.
Basant
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: OOM Error GW0-VASP5.4.4
Dear Basant,
You have a large supercell and hence a large number of plane waves in the basis set. There is no distribution of the response function over plane waves in ALGO=GW, so the full response function has to be stored in memory. You should try to reduce the number of MPI ranks per node so that each MPI rank has more memory available.
You could also try to reduce the basis set for the response function ENCUTGW, but one should be careful as the default is usually the optimal choice.
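For illustration, here is a minimal sketch of what "fewer MPI ranks per node" can look like, assuming a Slurm-type scheduler (the launcher flags, node count, and binary name vasp_std are placeholders to adapt to your system):

```
# Under-populate the nodes: use only 32 of the 128 cores per node,
# so each MPI rank sees roughly 256 GB / 32 = 8 GB instead of 2 GB
srun --nodes=8 --ntasks-per-node=32 vasp_std
```

The total number of ranks drops, so the wall time per iteration increases, but each rank has about four times more memory available.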
-
- Newbie
- Posts: 21
- Joined: Thu Jun 27, 2024 3:43 pm
Re: OOM Error GW0-VASP5.4.4
Thank you for your reply.
I tried:
1. Reducing KPOINTS: decreased the required memory per MPI rank, but the same error persisted.
2. Reducing the vacuum slab: decreased the required memory per MPI rank, but the same error persisted.
3. Reducing the number of MPI ranks per node: increased the required memory per MPI rank, but the same error persisted.
4. Reducing KPOINTS and the vacuum slab while increasing the number of nodes to 32 (8 TB in total) with fewer MPI ranks (512): the job ran without error. However, it sat for 3 hours in the "calculate exact exchange contribution" step without writing anything further. I terminated the job, and I wonder whether it would have produced output if I had kept it running.
Best,
Basant
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: OOM Error GW0-VASP5.4.4
Considering the size of the job you are trying to run, I would assume that it did not freeze but is simply taking a very long time to perform the calculation.
I would suggest that you reduce the size of the job as much as possible, i.e., single k-point, fewer bands, reduce ENCUT, ENCUTGW, and NOMEGA and make sure that this calculation runs through. Then you should gradually increase the parameters of your calculation.
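For concreteness, a minimal sketch of such a scaled-down GW-step INCAR could look like the following (every number below is an illustrative placeholder, not a converged or recommended value; combine it with a single k-point in KPOINTS as suggested above):

```
ALGO     = GW0      # GW step, as used in this thread
NBANDS   = 256      # far fewer bands than the production run
ENCUT    = 300      # reduced plane-wave cutoff (eV)
ENCUTGW  = 200      # reduced response-function cutoff (eV)
NOMEGA   = 32       # fewer frequency points
NBANDSGW = 64       # fewer quasiparticle states updated
```

Once such a test runs through, increase the parameters step by step back toward the converged values.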
-
- Newbie
- Posts: 21
- Joined: Thu Jun 27, 2024 3:43 pm
Re: OOM Error GW0-VASP5.4.4
Hi,
I tried reducing KPOINTS, ENCUT, NBANDS, NOMEGA, NBANDSGW, and the vacuum slab.
The total amount of memory used by VASP MPI-rank0 is now 167144 kBytes, while I am providing 2 TB per MPI rank!
However, the run has been frozen at the same step, "calculate exact exchange contribution", for 12 hours.
Please let me know if VASP cannot handle GW0 for 28 atoms and whether I need to use different software.
The KPOINTS, ENCUT, NBANDS, NOMEGA, NBANDSGW, and vacuum slab I used are the smallest I can choose before the calculation loses its accuracy completely, and 28 atoms is the smallest cell I can use. I attached the new input files.
Thank you in advance.
Best,
Basant
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: OOM Error GW0-VASP5.4.4
Could you please provide the input/output files for this reduced calculation you are trying to perform?
The problem is that you have quite a lot of vacuum, which increases the required basis set dramatically. One should keep in mind that ALGO=GW scales as N^4 with the system size.
In your OUTCAR I see that you have around 3*10^5 plane waves in the basis set, so there would be around 10^5 plane waves in the response-function basis set. Each rank has to store at least one full matrix \chi(G,G'), which amounts to 160 GB, and that should be multiplied by the number of frequencies. So I would say that this system is probably too large for the standard GW algorithm, but it mainly depends on the resources you have available. Also, I would recommend that you first test this calculation without the vacuum and see if you can get converged results for the bulk system, and only then proceed with the surface calculation.
In VASP6 we also have a low-scaling GW algorithm which scales as N^3 with the system size, so this approach is potentially much faster, but it also requires a lot of memory.
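As a back-of-the-envelope check on that 160 GB estimate (assuming one complex double-precision number, i.e. 16 bytes, per matrix element):

N_G ≈ 10^5  →  size of \chi(G,G') ≈ N_G^2 * 16 bytes = 1.6*10^11 bytes ≈ 160 GB per frequency point,

which is then multiplied by the number of frequency points (NOMEGA) held in memory.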
-
- Newbie
- Posts: 21
- Joined: Thu Jun 27, 2024 3:43 pm
Re: OOM Error GW0-VASP5.4.4
Hi Alex,
Thank you for your reply. Here is what I found about the problems I faced:
1. I have a very large system that requires the use of ALGO = Normal instead of ALGO = Exact for the first step of the GW calculation.
2. When I run the first step of the GW calculation with NELM = 1, it gives a positive total energy and the system appears to be falsely metallic.
3. The second step of the GW calculation takes a very long time to converge for metals, and different tags would be needed.
This was solved as follows:
1. I used NELM = 100 instead of NELM = 1 in the first step of the GW calculation, which allowed the system to converge and gave a reasonable band gap of 1.5 eV, similar to the experimental results.
2. The second step of the GW calculation gave another OOM error after reading the WAVEDER; this was solved by using 50 nodes with 560 cores, providing more than 20 GB per core (MPI rank).
However, I have another question:
The job is still running, and CHI and KRAMKRO are calculated several times. I wonder if this is correct and whether I need to adjust anything.
Thank you,
Basant
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: OOM Error GW0-VASP5.4.4
Regarding "I used NELM=100 instead of NELM=1 in the first step of the GW calculation, which allowed the system to converge and gave a reasonable band gap of 1.5 eV, similar to the experimental results": this first step is a DFT calculation, which is required to find the orbitals and orbital derivatives for the subsequent GW calculation. In this DFT calculation, full SCF convergence has to be achieved, i.e., a single iteration (NELM=1) is not enough. See the guide on how to set up a GW calculation in VASP on our wiki.
Regarding "The job is still running, and CHI and KRAMKRO are calculated several times. I wonder if this is correct and whether I need to adjust anything": yes, it is correct. Note also that at this point of your calculation you are still calculating the first q-point, and you have 25 k-points in the IBZ.
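For reference, a minimal sketch of that preceding DFT step, consistent with what was discussed in this thread (all values are placeholders; ALGO = Exact is the usual recommendation when the cell is small enough to afford it):

```
# Preparatory DFT step: must reach full SCF convergence before the GW step
ALGO    = Normal    # Exact is preferable when the system size allows it
NELM    = 100       # enough electronic steps to converge, not NELM = 1
EDIFF   = 1E-8      # tight SCF convergence criterion
NBANDS  = 1024      # many empty bands for the subsequent GW step
LOPTICS = .TRUE.    # write WAVEDER, needed by the GW step
```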