G.4 Running FHI-aims with GPU Acceleration
Compiling the FHI-aims executable with GPU acceleration support does not automatically turn on GPU acceleration. To use GPU acceleration when running FHI-aims, the user specifies which tasks should be GPU-accelerated, each independently, using the control.in keywords below:
Tag: use_gpu (control.in)
Usage: use_gpu flag
Purpose: Use GPU acceleration methods that are considered stable. flag is optional. It can be either .true. or .false.. When not present, .true. is assumed. This keyword currently enables gpu_density, gpu_hamiltonian, gpu_forces, and elsi_elpa_gpu. These keywords can also be used individually to turn on GPU acceleration in a specific part of FHI-aims. The GPU port of the ELPA eigensolver does not yet support HIP, so this keyword should not be used for calculations that combine HIP with the ELPA eigensolver; use gpu_density, gpu_hamiltonian, and gpu_forces individually instead.
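For example, a minimal control.in fragment (a sketch; all physical and numerical settings are omitted) that enables the stable GPU-accelerated code paths is:

  # enable all stable GPU-accelerated code paths
  use_gpu .true.

Since the flag is optional and defaults to .true., the single line "use_gpu" has the same effect.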
Tag: gpu_density (control.in)
Usage: gpu_density flag
Purpose: Use GPU acceleration when updating the charge density via density matrices. flag is optional. It can be either .true. or .false.. When not present, .true. is assumed. This keyword has no effect when the orbital-based density update is used.
Tag: gpu_hamiltonian (control.in)
Usage: gpu_hamiltonian flag
Purpose: Use GPU acceleration when integrating the Hamiltonian matrix. flag is optional. It can be either .true. or .false.. When not present, .true. is assumed.
Tag: gpu_forces (control.in)
Usage: gpu_forces flag
Purpose: Use GPU acceleration when calculating the Pulay forces and analytical stress tensor. flag is optional. It can be either .true. or .false.. When not present, .true. is assumed.
Tag: gpu_riv (control.in)
Usage: gpu_riv flag
Purpose: Use GPU acceleration for the three-center integrals in the RI-V method. flag is optional. It can be either .true. or .false.. When not present, .true. is assumed.
GPU acceleration of the ELPA eigensolver is controlled by elsi_elpa_gpu. See also Section 3.9. This keyword is not yet supported for HIP builds.
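As a hypothetical control.in fragment for a HIP build that uses the ELPA eigensolver, where use_gpu should be avoided, the individual keywords can be set instead:

  # HIP build with ELPA: enable only the GPU-accelerated tasks supported under HIP
  gpu_density .true.
  gpu_hamiltonian .true.
  gpu_forces .true.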
One important keyword when running GPU-accelerated calculations is points_in_batch, which sets the targeted number of integration points per batch. This parameter is a trade-off: increasing the number of points per batch increases the work done by the GPU per batch, and thus the efficiency of the GPU, but it also increases the number of basis elements interacting with a batch, which increases the memory usage. Due to technical details, some of this additional work is unnecessary, as it does not contribute appreciably to the integrals being evaluated.
The default value for points_in_batch, based on early CPU-only benchmarks, was set to 100. We have found that increasing this value to 200 is a better choice for our test architecture (Kepler Tesla GPUs) when using GPU acceleration, and we have set 200 as the default value whenever GPU acceleration is enabled for any task involving the batch integration scheme. The user should also experiment with this parameter on their own hardware, particularly if they are using a different GPU architecture.
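For example, to override the batch size explicitly in control.in (200 is the GPU default mentioned above and should be treated only as a starting point for benchmarking on other hardware):

  # targeted number of integration points per batch
  points_in_batch 200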
All GPU keywords may be set independently. We have found that the charge density update shows a significantly higher GPU-accelerated speed-up than the Hamiltonian integration (cf. Figure G.1). If the user’s architecture uses fast CPUs but slow GPUs, enabling GPU acceleration may actually slow down the calculation.
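On such a machine, one might therefore enable only the density update and leave the other tasks on the CPU (a sketch; whether this helps depends entirely on the hardware and should be benchmarked):

  # enable GPU acceleration only for the density-matrix-based density update
  gpu_density .true.
  gpu_hamiltonian .false.
  gpu_forces .false.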
G.4.1 Memory Usage with GPU Acceleration
One potential limitation of the current implementation of GPU acceleration in FHI-aims is GPU memory usage. All MPI ranks are assigned to one of the available GPUs, implying that a GPU will generally have more than one MPI rank assigned to it. All MPI ranks offload their work during the compute-intensive cuBLAS calls onto the assigned GPU. This creates two bottlenecks: not only do MPI ranks need to “wait their turn” behind other MPI ranks before the GPU processes their current batch, but each MPI rank also takes up a portion of the GPU’s memory. If a calculation runs out of memory when using GPU acceleration, some possible solutions are:
• Read Section 3.47, “Large-scale, massively parallel: Memory use, sparsity, communication, etc.” of the manual. In particular, consider setting use_local_index to .true.. While there will be a time cost associated with enabling this keyword, the memory savings can be considerable.
• Use fewer MPI ranks per node. With fewer MPI ranks per node, fewer MPI ranks will be bound to the GPUs on each node, reducing the overall GPU memory usage. A sketch of both adjustments is given after this list.
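As a sketch, and assuming a Slurm-managed machine (the executable name aims.x and the rank count below are placeholders), the two adjustments might look as follows:

  # control.in: distribute index arrays across ranks to reduce per-rank memory
  use_local_index .true.

  # job script (Slurm shown only as an example): request fewer MPI ranks per
  # node, e.g. one rank per GPU instead of one rank per CPU core
  srun --ntasks-per-node=4 aims.x > aims.out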
Summary
• Use keywords in control.in to enable GPU acceleration.
• Optimize points_in_batch for the architecture used.
• Test each GPU acceleration keyword individually to make sure there is a speed-up compared to the CPU-only version for the architecture used.
• Try use_local_index .true. if your calculation runs out of memory. (This advice applies to all calculations, not just GPU-accelerated ones.)