G.1 Introduction

G.1.1 Overview of GPU Acceleration Philosophy in FHI-aims

FHI-aims uses a batch integration scheme[128] in which the real-space integration points are broken up into spatially localized batches of points. Each batch of points is assigned to an MPI rank, and each MPI rank processes its assigned batches sequentially. After all batches have been processed, the MPI ranks communicate the final results to one another.

The batch integration scheme is at the heart of FHI-aims’ O(N) scaling with the number of atoms for most of the steps of the SCF cycle. (An important exception is the solution of the Kohn-Sham equations, which is handled by the ELSI infrastructure.) Only basis elements that touch an integration point contribute to the quantity being calculated for a given batch. As basis elements have finite spatial extent, for a sufficiently large non-periodic system or unit cell of a periodic system, the number of basis elements needed for a given fixed-size batch will saturate. Adding more atoms to the system, i.e. increasing the size of a non-periodic system or using a larger unit cell for a periodic system, will increase the number of batches linearly, but not the work done per batch, leading to linear scaling with the number of atoms.
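Schematically, if the total time per such SCF operation is written as the number of batches times the average work per batch, the argument above reads

  T_total ≈ n_batches × w_batch ,   with   n_batches ∝ N_atoms   and   w_batch → const. for large systems,

so that T_total = O(N_atoms) once the work per batch has saturated. This is a schematic cost model meant only to restate the reasoning above, not a measured fit.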

The batch integration scheme in FHI-aims lends itself naturally to GPU acceleration. The details vary based on the task being accelerated, but the general strategy is:

  1. The MPI rank sets up a batch.

  2. The MPI rank communicates batch details to its assigned GPU.

  3. The GPU performs work on the batch.

  4. If the MPI rank needs to process the batch further, the GPU communicates the results back to its assigned MPI rank.

  5. After all batches have been processed, the GPU communicates its final results back to its assigned MPI rank.

As each MPI rank processes its batches independently of the other MPI ranks, no significant extra effort is needed to use GPU acceleration in an MPI environment. The batches are small enough that they fit into memory on an NVIDIA GPU. Because the batches are statistically similar in size, the memory usage per batch is independent of system size; the GPU will not run out of memory as the system size increases for a fixed number of MPI ranks. Furthermore, most of the computation time for tasks using the batch integration scheme is spent in a small number of BLAS/LAPACK subroutine calls at the end of the batch processing. These subroutine calls can easily be replaced by cuBLAS (https://developer.nvidia.com/cublas) calls.

The pseudocode for this process is:

  do i_batch = 1, n_batches
    set_up_batch_on_cpu                     ! step 1: assemble the batch data
    copy_batch_information_to_gpu           ! step 2: host-to-device transfer
    call cuBLAS_Function()                  ! step 3: GPU performs the numerical work
    if gpu_data_needed_on_cpu               ! step 4: only if further CPU processing is needed
      copy_partial_gpu_data_back_to_cpu
      cpu_performs_work_on_partial_gpu_data
    end if
  end do

  copy_gpu_final_data_back_to_cpu           ! step 5: final device-to-host transfer
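To make the cuBLAS replacement mentioned above concrete, the following CUDA C sketch shows what steps 2–4 might look like for a single batch whose heavy numerical work is a DGEMM. It is an illustration only, not FHI-aims source code; the routine name, matrix names, and dimensions are placeholders.

  /* Illustrative sketch (not FHI-aims source): one batch's host DGEMM replaced
   * by cuBLAS.  The caller is assumed to have created the handle once with
   * cublasCreate() before the batch loop.                                   */
  #include <cuda_runtime.h>
  #include <cublas_v2.h>

  /* C := A * B for one batch, with A (m x k), B (k x n), C (m x n),
   * all stored column-major as in Fortran/BLAS.                             */
  void batch_dgemm_on_gpu(cublasHandle_t handle,
                          const double *h_A, const double *h_B, double *h_C,
                          int m, int n, int k)
  {
      double *d_A, *d_B, *d_C;
      const double one = 1.0, zero = 0.0;

      cudaMalloc((void **)&d_A, sizeof(double) * m * k);
      cudaMalloc((void **)&d_B, sizeof(double) * k * n);
      cudaMalloc((void **)&d_C, sizeof(double) * m * n);

      /* Step 2 of the scheme: copy the batch data to the GPU. */
      cublasSetMatrix(m, k, sizeof(double), h_A, m, d_A, m);
      cublasSetMatrix(k, n, sizeof(double), h_B, k, d_B, k);

      /* Step 3: the GPU performs the work formerly done by the host DGEMM. */
      cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                  &one, d_A, m, d_B, k, &zero, d_C, m);

      /* Step 4: copy the result back only if the CPU needs to process it further. */
      cublasGetMatrix(m, n, sizeof(double), d_C, m, h_C, m);

      cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
  }

In the actual batch loop, the cuBLAS handle and the device buffers would be created once and reused across batches, with only the final accumulated results transferred back after the last batch (step 5); allocating and freeing device memory inside the loop, as in this sketch, would add unnecessary overhead.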

G.1.2 Current State of GPU Acceleration in FHI-aims

The steps needed to use GPU acceleration in FHI-aims are:

  1. Make sure the prerequisites are installed.

  2. Compile FHI-aims with GPU support, using either CMake or a Makefile (a schematic example follows this list).

  3. Add the GPU acceleration keywords to control.in.

  4. Run FHI-aims as normal.
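For orientation only, a CMake-based configuration might look roughly like the following. The option name shown here (USE_CUDA) is an assumption for illustration; the authoritative list of build variables and GPU keywords is given in the CMake and keyword sections of this manual.

  # Sketch of an initial cache file for a GPU-enabled build.
  # The FHI-aims-specific option name (USE_CUDA) is an assumption; consult
  # the CMake chapter of this manual for the actual variables.
  set(USE_CUDA ON CACHE BOOL "Compile the GPU-accelerated code paths")
  set(CMAKE_CUDA_FLAGS "-arch=sm_70" CACHE STRING "CUDA compiler flags for the target GPU")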

It cannot be stressed enough that users should consult the documentation for their particular architecture, which may require additional steps beyond those listed here to use GPU acceleration.

The GPU acceleration code is considered stable and suitable for production calculations. An example scaling plot of timings for the first SCF step of 128 atoms of GaAs on the Titan Cray XK7 is shown in Figure G.1. We generally find that the charge density update shows the largest speed-up from GPU acceleration. Larger speed-ups are observed as the basis set size is increased. If a non-periodic system or the unit cell of a periodic system is too small (say, a primitive cell of GaAs running on 32 MPI ranks), a slow-down may actually be observed.

Figure G.1: Example scaling plot for GPU acceleration. The solid lines are CPU-only calculations, and the dotted lines are GPU-accelerated calculations. At present, there is no GPU acceleration in the Hartree multipole summation, so both CPU-only and GPU-accelerated calculations have the same timings for this task.

The tasks that are natively GPU-accelerated in FHI-aims are:

  • Integration of the Hamiltonian matrix

  • Charge density update via density matrices

  • Pulay forces

  • Stress tensor

  • RI-V 3-center integration

In the future, we plan to add native GPU acceleration for the following tasks:

  • Hartree multipole summation

  • Construction of the Fock matrix (for Hartree-Fock, hybrid-functional, and beyond)