3.47 Large-scale, massively parallel: Memory use, sparsity, communication, etc.
In one way or another, most options available with FHI-aims concern physical algorithms or numerical choices, including those affecting the accuracy and/or efficiency of a given task. As much as possible, FHI-aims attempts to use the exact same code and settings to describe any given system, and on any kind of computer hardware. Usually, this guarantees efficient code throughout, and improvements for one class of systems immediately benefit all others.
However, for very large tasks (meaning here several hundred up to thousands of atoms, depending on the system) and/or on parallel architectures with possibly (tens of) thousands of processors, memory and communication constraints may come into play that require specific workarounds not needed (or beneficial) for normal systems. On a practical level, FHI-aims attempts to be as memory-parallel as possible without loss of efficiency. However, if any such modification could adversely affect the performance for normal systems and computer architectures, we recommend switching it on separately, only when needed.
On massively parallel machines, and for very large problems, the following options are particularly important:
• If KS_method parallel is used, keyword collect_eigenvectors .false. switches off the collection of the full eigenvectors on each CPU, saving both communication time and a significant amount of memory. This is the default where possible, but the default is currently switched silently between .false. (efficient) and .true. (inefficient), depending on the details of the calculation.
• Keyword use_local_index ensures that integration grid batches on the same CPU are always close together, requiring only a subset of the packed Hamiltonian matrix as working space for integrals on each CPU. Use load_balancing to improve the load balancing in this case and avoid a negative impact on efficiency. This is now set by default for systems containing 100 or more atoms.
• The Kerker preconditioner (part of the density mixing step) is switched on by default for all periodic calculations, but does not scale well to large processor counts and with system size. If the density mixing step costs a significant amount of time (see FHI-aims timings, printed at the end of each s.c.f. step in the output), consider switching off the Kerker preconditioner. For systems with a band gap, it is often not needed and the much cheaper default Pulay mixer will work as well.
• In the cluster case, keyword use_density_matrix_hf should be used for system sizes of a few hundred atoms and above. (This is the default for all periodic systems anyway.)
• For the cluster case, and if KS_method parallel is used, keyword packed_matrix_format may save a significant amount of memory in the construction of the overlap, Hamiltonian, and density matrices. (This is the default for all periodic systems anyway.)
• For the cluster case with more than 200 atoms, the non-periodic Ewald method can be used to accelerate the calculation, see section 3.7.1.
• Keyword distributed_spline_storage avoids storing the complete multipolar decomposition of the charge density on each processor. Instead, the multipole components are stored only for those atoms within reach of the grid points handled locally.
• For beyond-GGA methods: Keyword prodbas_nb can be used to enhance the distribution of memory-intensive arrays, possibly sacrificing some performance.
Finally: There is a keyword, use_alltoall, that allows switching the communication behaviour of FHI-aims for very large runs (recommended for more than 1000 tasks) when parallel linear algebra (KS_method parallel) is used. Note that testing should be done on your hardware as to whether setting use_alltoall to .true. confers any benefit, and note that it comes with increased memory demand. A combined, purely illustrative example of the settings discussed above is sketched below.
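As an illustration only, a control.in fragment for a large-scale parallel run might combine several of these keywords as follows. This is a sketch, not a recommendation; every setting should be adapted to the actual system, requested functionality, and hardware:

  # purely illustrative large-scale settings
  KS_method                   parallel
  collect_eigenvectors        .false.
  use_local_index             if_scalapack
  load_balancing              if_scalapack
  distributed_spline_storage  .true.
  use_alltoall                .true.     # only for very large task counts; test on your hardware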
Tags for general section of control.in:
Tag: collect_eigenvectors(control.in)
Usage: collect_eigenvectors flag
Purpose: When KS_method parallel is used, allows switching off the collection of all eigenvectors to each CPU, saving memory and computation time.
Restriction: Local copies of the eigenvectors are needed for some
post-processing functionality.
flag is a logical string, either .false. or
.true. Default: .false. wherever possible.
For memory efficiency reasons, this keyword should be set to .false. whenever possible; otherwise, especially the largest and most demanding calculations will run out of memory. When .true., every MPI task keeps a complete and up-to-date copy of every eigenvector for the k-point assigned to it. When .false., this information is communicated only when unavoidable. In principle, every operation in FHI-aims should be implemented so as to keep this keyword set to .false., and it is only human time that keeps us from doing it.
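As a minimal illustration, a memory-critical parallel run that does not need local eigenvector copies for post-processing might set:

  KS_method            parallel
  collect_eigenvectors .false.   # do not replicate all eigenvectors on every MPI task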
Tag: distributed_spline_storage(control.in)
Usage: distributed_spline_storage flag
Purpose: Request to store the multipolar decomposition of the density only for
the atoms needed on a given processor for the corresponding parts of the
Hartree potential.
flag is a logical string, either .false. or
.true. Default: .false.
This flag is .false. by default because the complete splined density might be needed for some postprocessing or output option.
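If no such postprocessing or output is required, an illustrative setting for a very large system would be:

  distributed_spline_storage .true.   # keep multipole splines only for atoms needed locally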
Tag: force_mpi_virtual_topo(control.in)
Usage: force_mpi_virtual_topo flag
Purpose: Auxiliary option to try to force the MPI library to respect
the topology of the nodes used (several tasks within each
shared-memory node vs. slower communication between different
nodes)
flag is a logical string, either .false. or
.true. Default: .false.
If requested, the (deduced) topology of the nodes is cached in the default communicator. In principle, the communication layer should then obtain more information on the network topology and organize the communication pattern more efficiently. However, nothing is guaranteed: the MPI standard does not force the library to respect the cached information, and in all decent MPI library implementations, this information should already be provided by the system.
In short: Perhaps try this if a truly strange communication pattern is observed, but probably, there will be no effect.
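Should such a pattern be suspected, the keyword can be tried as follows (illustrative only; usually without visible effect):

  force_mpi_virtual_topo .true.   # hint the deduced node topology to the MPI layer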
Tag: load_balancing(control.in)
Usage: load_balancing flag
Purpose: Using the keyword use_local_index has a negative impact
on the distribution of work across CPUs for real-space grid operations. When
load balancing is enabled via this keyword, this performance hit can be
avoided by explicit reassignment of grid point batches to processors according
to timings of test runs.
flag is a logical string, either .false. or
.true. (or if_scalapack, see below). Default: .false.
This feature, which was implemented by Rainer Johanni, eliminates the negative impact on performance incurred by the use_local_index keyword. It should always be used whenever memory becomes a bottleneck for a calculation. For more information about what this keyword does, please see the documentation for the use_local_index keyword, as the two keywords are closely intertwined.
Load balancing requires that KS_method parallel be used: that is, there is more than one CPU for cluster calculations or more CPUs than k-points for periodic calculations. If the if_scalapack option is supplied, load balancing will be turned on when KS_method parallel is being used and will be turned off when KS_method serial is being used.
When this keyword is set to .true. or if_scalapack, the use_local_index keyword will be set to the same value by default.
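In practice, a single line is therefore often sufficient (illustrative sketch; if_scalapack ties the setting to runs that actually use KS_method parallel):

  load_balancing if_scalapack   # also sets use_local_index if_scalapack by default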
Tag: packed_matrix_format(control.in)
Usage: packed_matrix_format type
Purpose: Allows the use of a packed-matrix format for the Hamiltonian, overlap, and density matrices of the real-space basis functions.
type is a string that indicates the type of packing
used. Default: none for the cluster case, index
for periodic geometries.
The following options exist for type:
• none - no packing is used
• index - matrices are packed by strictly eliminating all near-zero elements of all three matrices. Elements are eliminated if both the overlap matrix element and the initial Hamiltonian matrix element are smaller than a threshold set by keyword packed_matrix_threshold.
Packing the overlap, Hamiltonian, and density matrices reduces the size of these arrays during matrix integration, at the expense of a small overhead for sorting intermediate results correctly during the integration. From a technical point of view, packing only makes sense if the full (unpacked) Hamiltonian is not required later anyway, during the eigenvalue solution.
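For a large cluster-type calculation with parallel linear algebra, an illustrative choice would be:

  KS_method            parallel
  packed_matrix_format index     # pack overlap, Hamiltonian, and density matrices (default for periodic systems)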
Tag: packed_matrix_threshold(control.in)
Usage: packed_matrix_threshold tolerance
Purpose: Tolerance value below which the elements of the overlap /
Hamiltonian matrices are eliminated from the
packed_matrix_format.
tolerance : A small non-negative real value. Default: 0.0.
The default has been changed to 0.0 from the earlier small finite threshold. In our experience, this change does not introduce any significant computational or memory overhead. More importantly, it fixes several problems we noticed in the past, including SCF instabilities and incorrect results for quantities depending on the real-space momentum matrix or the dipole matrix.
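Should memory during integration nevertheless remain critical, a small finite tolerance can be tried; the value below is purely illustrative and results should be checked against a run with the default of 0.0:

  packed_matrix_format    index
  packed_matrix_threshold 1.e-13   # illustrative value, not a recommendation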
Tag: prune_basis_once(control.in)
Usage: prune_basis_once flag
Purpose: Stores the indices of the non-zero basis functions for
each integration batch in memory
flag is a logical string, either .false. or
.true. Default: .true.
All operations for the integrations and the electron density update scale as O(N), but verifying which basis functions are non-zero for each integration batch requires checking each basis function, and is thus formally an O(N²) operation with a small prefactor. This step can be avoided by checking for the non-zero basis functions once, and then storing their indices in memory for each batch of integration points. For very large systems and restricted memory, it may be worth switching this feature off to save some memory; otherwise, it should always be enabled.
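In that situation, the setting would simply be (illustrative; expect some additional CPU time per s.c.f. iteration):

  prune_basis_once .false.   # re-determine non-zero basis functions for each batch instead of storing their indices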
Tag: store_EV_to_disk_in_relaxation(control.in)
Usage: store_EV_to_disk_in_relaxation flag
Purpose: During relaxation, eigenvectors from a previous geometry
can be stored to disk instead of in memory in case the next step
is reverted.
flag is a logical string, either .false. or
.true. Default: .false.
During relaxation (see relax_geometry), geometry steps can be rejected if the total energy increased unexpectedly. In order to revert to a previous step, it is necessary to access the Kohn-Sham eigenvectors used to initialize that step, which must hence be stored for that purpose. In normal calculations, eigenvector storage is not a problem, but their size grows as O(N²) with system size. For very large systems, storing them in memory can become a bottleneck, which is circumvented here by the option to store them to disk.
This keyword is not related to the restart functionality of the restart keyword, since it is an old, not a current, set of Kohn-Sham eigenvectors that may be needed when reverting a geometry step.
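An illustrative fragment for a very large relaxation (the relaxation itself is controlled by relax_geometry, documented elsewhere):

  store_EV_to_disk_in_relaxation .true.   # keep the previous step's eigenvectors on disk, not in memory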
Tag: use_2d_corr(control.in)
Usage: use_2d_corr flag
Purpose: Allows switching the two-dimensional distribution of data structures for correlated methods on or off.
flag is a logical string, either .false. or
.true. Default: .true..
Only relevant for correlated beyond-hybrid methods.
Tag: use_alltoall(control.in)
Usage: use_alltoall flag
Purpose: Allows switching the type of some communication calls in FHI-aims, depending on the number of tasks.
flag is a logical string, either .false. or
.true. Default: .false.
Only relevant for KS_method parallel.
When running on a system with many CPUs, it may be faster to use "all-to-all" communication (mpi_alltoallv) than to perform n_tasks individual sendrecv calls. This should be tested on your particular hardware and MPI library, though, as it may offer no benefit or even degrade performance. It was, for example, noticeably more performant on the BlueGene/P with 10^5 MPI tasks at a time. For much fewer tasks, the effect is usually not relevant.
Also note that "all-to-all" communication costs a significant amount of memory, and this memory pressure will already be felt at much lower MPI task counts (e.g., on Intel architectures with hundreds of CPU cores at a time).
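After testing on the given machine, an illustrative setting for a very large parallel run would be:

  KS_method    parallel
  use_alltoall .true.    # use mpi_alltoallv instead of n_tasks individual sendrecv calls; costs extra memory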
Tag: use_mpi_in_place(control.in)
Usage: use_mpi_in_place flag
Purpose: Allows some collective communication calls to be handled
using the “MPI_IN_PLACE” flag of the MPI specification.
flag is a logical string, either .false. or
.true. Default: .true.
Only relevant for parallel runs.
When running on a system with many CPUs, it can be a bit more memory efficient to use “MPI_IN_PLACE” communication. Then, instead of separate send- and receive buffers, a single buffer is used and information is updated “in place”.
However, not all MPI implementations seem to implement this feature correctly. Thus, this choice can sometimes lead to problems or even errors in the results, depending on the MPI library and version that was used.
The code checks for specific problems related to “MPI_IN_PLACE” and switches to use_mpi_in_place .false. if problems are encountered.
If anyone identifies further problems related to the use of “MPI_IN_PLACE”, please let us know.
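If such a problem is suspected with a particular MPI library, the feature can also be switched off explicitly (illustrative):

  use_mpi_in_place .false.   # use separate send and receive buffers instead of MPI_IN_PLACE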
Tag: use_local_index(control.in)
Usage: use_local_index flag
Purpose: Reduces work space size for Hamiltonian / overlap matrix
during integrals by storing only those parts that are touched by
any grid points assigned to the present CPU.
Restriction: Supported for standard LDA , GGA, and hybrid functionals using
KS_method parallel and packed matrices, but not for some
non-standard options.
flag is a logical string, either .false. or
.true. (or if_scalapack, see below). Default: .false. below 100 atoms, .true. for 100 and above.
Originally, FHI-aims stored its real-space Hamiltonian/overlap matrices in a non-distributed fashion, i.e. every CPU has the full copies of the real-space Hamiltonian/overlap matrices. While this makes the math easier internally, it will lead to a memory bottleneck for large calculations: as the number of atoms in a calculation increases, the sizes of the real-space matrices increase, and eventually the real-space matrices are too large to fit in memory. Adding more CPUs to a calculation does not fix this problem, since each CPU has the full copies of the matrices.
For very large systems with many CPUs, it thus becomes necessary to spread the real-space matrices across CPUs. This is done using a method known as “domain decomposition” (or, informally, local indexing), where we assign batches of integration grid points close to one another to the same CPU. Each CPU then stores only the portions of the real-space Hamiltonian/overlap matrices which have non-zero support on the integration points which it possesses. The real-space matrices are thus distributed across CPUs, considerably decreasing their sizes. Should a calculation still suffer from memory issues associated with the sizes of the real-space matrices, we can increase the number of CPUs to reduce the memory overhead on each individual CPU.
While the domain decomposition method will spread the real-space matrices across CPUs and eliminates the memory bottleneck previously mentioned, eventually we will need to solve the Kohn-Sham eigenvalue problem. To solve the Kohn-Sham eigenvalue problem, we need to generate the Hamiltonian entering into the eigensolver from the real-space Hamiltonian, which is now distributed across CPUs. The subsequent merge of all results into the BLACS infrastructure used by the eigensolvers supported by FHI-aims then becomes more difficult, and some performance overhead results from the altered load on each CPU. This option is switched on by default for systems containing 100 or more atoms, but not all current functionality supports it. If any conflicting keywords are found at runtime, local indexing is not set.
To overcome the performance overhead associated with the use_local_index keyword, it is *strongly* recommended that the user also try the load_balancing keyword, which enables load balancing. Load balancing will eliminate the overhead associated with the use_local_index keyword, but it is not enabled for all non-standard functionality in FHI-aims, which is why we do not enable it by default.
Domain decomposition requires KS_method parallel: that is, there is more than one CPU for cluster calculations or more CPUs than k-points for periodic calculations. If the if_scalapack option is supplied, domain decomposition will be turned on when KS_method parallel is being used and will be turned off when KS_method serial is being used.
Domain decomposition also requires prune_basis_once and a parallel grid_partitioning_method. These are enabled by default in FHI-aims calculations, so you shouldn't need to set them yourself.
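A combination that is often appropriate for large systems is sketched below (illustrative; if_scalapack restricts both keywords to runs that actually use KS_method parallel):

  use_local_index if_scalapack
  load_balancing  if_scalapack   # strongly recommended together with use_local_index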
Tag: walltime(control.in)
Usage: walltime seconds
Purpose: Can limit the wall clock time spent by FHI-aims explicitly,
e.g., to obtain the correct final output before a queuing system
shuts down a calculation.
seconds is the integer requested wall clock time in
seconds. Default: no limit.
In order to reach the postprocessing phase and write information required at the end of a calculation in an organized manner, FHI-aims can force a stop before a certain amount of real time (wall clock time) is exceeded. In order to achieve this safely, FHI-aims uses an internal estimate of the duration of a single s.c.f. iteration within the calculation, and stops if it estimates that the next iteration will take more time than what remains available.
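For example, for a 24-hour queue slot one might specify a slightly shorter limit so that postprocessing and final output can still complete (the numeric value is purely illustrative):

  walltime 84600   # 23.5 hours in seconds, leaving a safety margin before a 24 h queue limit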