MarkusBattarbee1 KonstantinosPapadakis1 UrsGanse1
JaroHokkanen2 LeoKotipalo1 YannPfau-Kempf1 MarkkuAlho1 and MinnaPalmroth1,41 University of Helsinki, Department of Physics, P.O. Box 4, 00014 University of Helsinki, Finland
2 CSC – IT CENTER FOR SCIENCE LTD. P.O. Box 405 FI-02101 Espoo, Finland
4 Finnish Meteorological Institute, Space and Earth Observation Centre, PO Box 503, 00101 Helsinki, Finlandmarkus.battarbee@helsinki.fi
Abstract
Vlasiator is a space plasma simulation code which models near-Earth ion-kinetic dynamics in three spatial and three velocity dimensions. It is highly parallelized, modeling the Vlasov equation directly through the distribution function, discretized on a Cartesian grid, instead of the more common particle-in-cell approach. Modeling near-Earth space, plasma properties span several orders of magnitude in temperature, density, and magnetic field strength. In order to fit the required six-dimensional grids in memory, Vlasiator utilizes a sparse block-based velocity mesh, where chunks of velocity space are added or deleted based on the advection requirements of the Vlasov solver. In addition, the spatial mesh is adaptively refined through cell-based octree refinement. In this paper, we describe the design choices of porting Vlasiator to heterogeneous CPU/GPU architectures. We detail the memory management, algorithmic changes, and kernel construction as well as our unified codebase approach, resulting in portability to both NVIDIA and AMD hardware (CUDA and HIP languages, respectively). In particular, we showcase a highly parallel block adjustment approach allowing efficient re-ordering of a sparse velocity mesh. We detail pitfalls we have overcome and lay out a plan for optimization to facilitate future exascale simulations using multi-node GPU supercomputing.
1 Introduction
Vlasiator is a hybrid-Vlasov space plasma simulation code, specifically designed to model kinetic plasma dynamics of the near-Earth environment [Palmroth etal., 2018, Ganse etal., 2023]. Vlasiator has been run on some of the largest supercomputers in the world, as due to the high fidelity of its model, simulations are computationally costly, typically exceeding millions of CPU-hours for a few tens of minutes of simulated physical time. Vlasiator, in its CPU iteration, utilizes vector instructions (AVX2, AVX512), OpenMP threading, as well as the message passing interface [MPI, Walker and Dongarra, 1996] to run on distributed hardware, supporting several hundreds of nodes and tens of thousands of MPI tasks. The field of supercomputing has seen a transition from nodes utilizing standard central processing units (CPUs) to ones including a variety of general purpose graphics processing units (GPGPUs, or GPUs for short) [Asaduzzaman etal., 2021] for acceleration of mathematical operations and access to high-bandwidth memory [HBM, Zhu etal., 2018].
Porting a supercomputer simulation code to heterogeneous GPU/CPU hardware has a two-fold goal. A first objective is to be able to run simulations on a hardware previously not supported by the codebase, yet offering tempting performance potential. A second, yet in many ways more important objective is to refactor and optimize the code so the new hardware can be exploited to its fullest, facilitating faster, longer, or more detailed simulations than would be possible with previous-generation CPU hardware. Altering the simulation code to support GPU hardware can be achieved through various approaches discussed later in this manuscript. In this paper, we focus on two approaches, which are the Compute Unified Device Architecture [CUDA, Luebke, 2008] by NVIDIA corporation, and the C++ Heterogeneous-Compute Interface for Portability (HIP) by Advanced Micro Devices, Inc.(AMD) which acts as an interface to the lower-level ROCm software stack.
In this paper, we describe the architectural choices and resultant changes implemented in the Vlasiator simulation code during its first stages of porting to support GPU architectures via the CUDA and HIP/ROCm interfaces.
1.1 Vlasiator
Hybrid ion-kinetic plasma simulations such as Vlasiator model the evolution of ions, assuming electrons are a massless charge-neutralizing fluid. Vlasiator simulates collisionless plasma physics through evolving the ion (usually proton) velocity distribution function (VDF) directly instead of relying on macroparticle sampling as used by particle-in-cell [PIC, Nishikawa etal., 2021]) approaches. The evolution of the VDF is evaluated through solving the collisionless Boltzmann equation, also known as Vlasov equation [Vlasov, 1961],
(1) |
where is the phase-space density for ion species (with charge and mass ), and and are the electric and magnetic fields, respectively. Using a Vlasov approach has many benefits such as a representation of VDFs free of sampling noise and the capability to simulate correct scale separation between the magnetic obstacle of the Earth and ion-kinetic effects, but is numerically expensive to compute. Vlasiator simulates the global magnetospheric environment of Earth with the simulation domain containing inflowing solar wind up to a distance of several tens of Earth radii (), the bow shock and foreshock regions, the magnetosheath and the cusps, as well as the flanks and the magnetotail to a distance of several tens of .
Vlasiator utilizes explicit solvers with a Strang splitting [Strang, 1968, Einkemmer and Ostermann, 2014] approach, decomposing evolution of (1) into six Cartesian shear operations; three in velocity space (referred to as acceleration), and three in spatial space (referred to as translation). The Lorentz force effect on the plasma distribution (first and third terms of equation 1) can be (in the non-relativistic case) described as gyration in a frame of reference where the motional electric field cancels out the existing electric field. Using the SLICE-3D [Zerroukat and Allen, 2012] algorithm, this is deconstructed into three Cartesian direction aligned shear transformations.
1.2 Discretization
Vlasiator discretizes ion VDFs on Cartesian grids, storing and propagating phase-space average values in units of . The simulation domain consists of a 2D or 3D spatial Cartesian mesh, which can be either uniform, or use cell-based mesh refinement with an octree approach. For 2D simulations, the third dimension consists usually of a thickness of a single cell with periodic boundary conditions. Non-unitary thicknesses (so-called 2.9D simulations with a cylindrical inner boundary) are also supported, such as those used for a study of dayside reconnection with multiple X-lines [Pfau-Kempf etal., 2020]. When simulating in three spatial dimensions, the octree approach is used to refine low-resolution cubic cells into 8 children each with half the dimensions of the parent. Within a neighbourhood of 3 cells in each direction, refinement level can change by only one level, to prevent too drastic resolution boundaries. Refinement can be repeated several times, with current simulations utilizing up to 4 different resolution levels. Details of the AMR approach used in Vlasiator are presented in Papadakis etal. [2022], Ganse etal. [2023] and Kotipalo etal. [2024]. Typical Vlasiator 2D (or 5D; 2D+3V=5D) simulation spatial resolutions have been between and , slightly above the ion inertial length in the solar wind, but capable of modelling ion-kinetic effects as shown in Pfau-Kempf etal. [2018]. In 6D runs, the highest spatial resolution in regions of interest has been which is above the ion inertial length, but those simulations have been shown to still include significant ion-kinetic dynamics [Grandin etal., 2023, Palmroth etal., 2023].
The electric and magnetic fields are solved on a Cartesian uniform Yee-lattice [Yee, 1966] grid, with resolution matching that of the highest refinement level of the Vlasov grid. This approach ensures the field solver encounters no resolution boundaries, which could potentially lead to reflection or refraction of propagating electromagnetic wave modes [Frank and Reich, 2004]. Fields are propagated using Maxwell-Ampère’s law
(2) |
and Faraday’s law
(3) |
under the Darwin approzimation (in effect, neglecting the displacement current which is the last term of Eq.2), with closure provided by Ohm’s law
(4) |
with the Hall and electron pressure gradient terms included, and assuming infinite conductivity () as well as adiabatic electrons () with no inertia (); neglecting the first and last terms on the right-hand side. The discretized solver approach described in Londrillo and DelZanna [2004] conserves magnetic field divergence to machine precision, thus obviating the need for any divergence cleaning step for the magnetic field provided initial conditions are divergence-free.
Each spatial cell on the spatial mesh contains a local representation of phase-space, also discretized on a Cartesian velocity mesh. To conserve memory and computational resources, Vlasiator utilizes a sparse velocity space approach, only storing and propagating those regions of velocity space which contain non-negligible phase-space densities. This is performed by declaring the maximum allowable velocity space extent and dividing it into cubic blocks of or, as implemented recently, velocity space cells each. Internally this is declared with notation of WID=4
or WID=8
, with values of WID3=64 and WID3=512, respectively. Typical maximal velocity space extents have been in each direction with 67 blocks per direction, resulting in a velocity space resolution of , or with 6D simulations, with 50 blocks per direction, leading to a velocity space resolution of . This reduction of velocity resolution hand-in-hand with spatial resolution follows the findings of Pfau-Kempf etal. [2018]. Active blocks are added to the velocity space or removed from it as necessary, as described in von Alfthan etal. [2014]. We ephasize that the term sparse is used in this manuscript to discuss this mechanism for partitioning and selectively storing and propagating portions of potential velocity space, and thus it is not directly related to e.g.sparse matrix algorithms. An example of a Vlasiator simulation domain, showcasing spatial mesh refinement and differing spatial cell computational weights due to varying block counts, is shown in Figure 1.
2 Vlasiator data structures
Vlasiator utilizes three different in-house developed grid libraries. The data objects associated with each grid are presented in Figure 2. The first, the Distributed Cartesian Cell-Refinable Grid [DCCRG, Honkonen etal., 2013], houses the particle phase-space data on an octree-cell-refinable spatial mesh. The mesh discretizes the spatial domain in up to three dimensions with each spatial cell containing VDFs for all velocities at a given spatial location. Each spatial cell is represented by a C++ class SpatialCell object as well as additional metadata such as information on the cell resolution and neighbourhood. The DCCRG grid is distributed among MPI tasks based on cell weights, where each cell has its weight determined as a function of how complex a velocity space it contains. This load distribution is performed by the Zoltan library [Boman etal., 2012], which supports several different domain decomposition algorithms such as Recursive Coordinate Bisection (RCB) or Recursive Inertial Bisection (RIB). The grid object of each MPI task contains all spatial cells which have been assigned to the task in question, as well as copies of any neighbouring cells residing on remote tasks, and implements an interface for updating local copies of remote (ghost) cells. Due to the use of Strang splitting, the chosen MPI decomposition ensures that all velocity-space acceleration operations (using the first and third terms on the left-hand side of Equation 1) use only local information. This also implies that any adjustments to the local sparse velocity space grid are performed using local data. An exception to this rule is that cells will also need to ensure they have buffer data so that during spatial advection (solving the first and second terms on the left-hand side of Equation 1) phase-space density has a target to flow into.
Each SpatialCell object contains, among other things, local average information such as plasma moments and largest locally allowed time propagation steps, as well as a vector of particle populations. Vlasiator is usually run with one species of ions, namely protons, and with electrons providing a charge-neutralizing fluid. However, multiple ion populations such as helium as described in Battarbee etal. [2020] are also supported. The specialized eVlasiator model [Battarbee etal., 2021] also maintains two particle populations (electrons and ions) but populates only the electrons. Each of the particle population (declared at the start of the simulation) keeps track of population-specific velocity moments (density, bulk velocity, pressure tensor) and contains descriptors for the VDF contained in that spatial location. As Vlasiator propagates phase-space density , the value contained in each spatial cell is not directly impacted by the cell size, but rather, always represents the spatial average over the local cell volume.
2.1 Sparse block representation in memory
The VDF at a given spatial location is represented by two C++ classes: a Velocity mesh (Vmesh) and a Velocity Block Container (VBC). The Vmesh links memory buffer positions (local address offset i.e.the local block identifier, LID) with physical velocity coordinates (global block identifier, GID). It contains a hash map (globalToLocalMap) with GIDs as keys and LIDs as values. The complementary data structure is a vector (localToGlobalMap) with size equal to that of the map, and where each index (LID) stores the corresponding GID. The VBC, on the other hand, contains the actual VDF phase-space data (Block data container, vector) as WID3 floating point phase-space density values for each block and LID, as well as pre-calculated metadata (Block parameters, vector) for each WID3-sized block in question. LIDs are thus simply indices into WID3-staggered memory space, whereas GIDs represent a position in velocity space, with values starting from 0 for the smallest velocity value in all Cartesian directions, incrementing first along the axis, then along the axis, and finally along the axis. When evaluating 6-dimensional phase-space neighbours of any given block, spatial neighbours will have the same GID, whereas velocity neighbours will have a GID differing by , , or where is the maximum possible number of blocks in Cartesian direction .
2.2 Block adjustment requirements
A naïve estimation of memory consumption for a 6D Vlasiator simulation with fully populated phase-space grids is phase space cells, a memory requirement of about 14 petabytes [Ganse etal., 2023, Kotipalo etal., 2024]. A significant contributor to Vlasiator’s success is the sparse velocity mesh, reducing the computational domain to include only those regions of phase-space where non-negligible density is found, yielding a memory reduction of over two orders of magnitude.This strength comes with the cost of having to constantly adjust and update the structure of phase-space stored in each cell. The acceleration solver evaluates future velocity space extents after performing each shear operation, flagging content blocks which are no longer needed and those which need to be created. The translation solver on the other hand requires that neighbouring spatial cells contain the required data structures for phase-space density to flow into. These requirements are described as the ghost halo regions, as indicated in Figure 3. After running the acceleration solver, each content block is flagged as being either above the sparsity threshold (i.e.having significant content), or below it (i.e.not having significant content). Then, the halo requirements are evaluated for all content blocks, both for velocity and spatial neighbours. Finally, required halo blocks are created, and non-content blocks not required by any halos are deleted.
Managing velocity space sparsity via blocks allows Vlasiator to use standard data units of size WID3, useful for vectorization and GPU approaches. Vlasiator has now been amended to support blocks of size (WID3=512) in addition to the longstanding (WID3=64) block size. This results in much less overhead from creating and deleting blocks since a smaller number of blocks is required to describe any given phase-space distribution function, assuming the resolution remains the same. At the same time, the halo requirements will, on average, lead to a larger memory footprint and more actual phase-space cells to be propagated. This is a trade-off, but one which is expected to prove beneficial on GPU architectures. The implications on halos of changing the value of WID for a simple configuration, whilst maintaining constant maximum velocity space extents, is exemplified in Figure 4.
2.3 Other grid objects in memory
In addition to the DCCRG grid containing spatial cells with particle species VDF data, each MPI task also owns several FSgrid objects which house uniform Cartesian rectangular domains of scalars or vectors for field solver use. The total spatial domain is decomposed uniformly over all MPI tasks, with addition ghost buffer regions around local data, which can be updated from the up to 26 neighbouring tasks. The division into several different FSgrid objects is to facilitate ghost communication of only the necessary data for each step of the field solver operation. The Vlasiator field solver, consisting of much smaller data domains than the Vlasov solver, has yet to be ported to heterogeneous GPU architectures.
A final data object of Vlasiator is the ionospheric grid [Ganse etal., 2024], which is contained on every task in its entirety and does not contain any MPI interface. The ionospheric grid exists only when simulating a magnetospheric run with an ionospheric inner boundary, and its solvers have yet to be ported to GPU architectures.
2.4 Heterogeneous memory decomposition of data structures
As Figure 2 introduced in Section 2 shows, the data structures of Vlasiator are non-trivial when considering heterogeneous memory architectures such as those encountered in modern GPU-powered supercomputing nodes [Zheng etal., 2016, Wang etal., 2014]. Computational activities performed by the CPU can usually only access host-side memory, and parallel computation kernels utilizing GPU hardware can only efficiently access device-side memory. Current pre-exascale and exascale supercomputers are built upon node architectures which house discrete GPUs controlled by standard CPUs, each with discrete memory domains.GPUs are built using high-bandwidth memory [HBM, Zhu etal., 2018], allowing rapid availability of data for GPU computations with speeds of up to 3.2 TB/s on MI250x and 2 TB/S on NVIDIA A100. A bottleneck, however, is the connection between GPU and CPU hardware, currently limited tospeeds of tens to a few hundreds of GB/s per direction (depending on NVLink or PCI Express implementation),necessitating planned and controlled movement of computation-critical data between the two memory subsystems. Upcoming hardware such as the Nvidia Grace Hopper GH200 will upgrade this to a 900 GB/s bidirectional NVLink-C2C connection as well as a transparent unified memory domain doing away with the need for data migration. GPU HBM bandwidth will still remain superior at 3.0 TB/s. Accelerated processing unit (APU) chips such as the AMD Instinct MI300A will contain a single memory domain [Vijayaraghavan etal., 2017], which may simplify memory management on future exascale architectures whilst simultaneously providing massive bandwidth (5.3 TB/s on the MI300A).
GPU application programming interfaces (APIs) provide access methods allowing various ways of copying data between host and device memory regions. A user-friendly implementation (supported on modern architectures such as NVIDIA Pascal and later and AMD MI200+, which still have physically distinct memory domains) is that of Managed Memory or Unified Memory (UM), where a single virtual memory address can point, as necessary, to either host or device memory. The UM subsystem will utilize a paging mechanism, and can even allow oversubscription of device memory, paging (migrating) memory to and from the device as necessary. Whenever the CPU or GPU needs to access data residing in UM, the paging system will be queried for the locality of the data. If the data is not in the required region, a page fault occurs, triggering migration of the data to the correct domain. To reduce page faulting (which can harm performance), users can prefetch data manually to the desired domains. One noteworthy aspect of prefetching and page faulting is that the UM system only migrates data in units of the page size, which can lead to unintended migration of data units residing within the same page as an intentionally migrated data unit. However, the benefit of unified memory is that an unit of data which is accessed from both host and device is ensured to always contain the correct value, irrespective of whether the user has declared prefetches. Should a data construct contain separate host-only and device-only memory buffers, it is the responsibility of the user to perform manual memory copies in order to maintain correct values for host and device activities, respectively. Additionally, automatic oversubscription of device memory is not possible if separate buffers are used. When porting a large simulation software such as Vlasiator, this reliability of UM was deemed a high priority, and thus, many Vlasiator data structures were migrated to UM instead of maintaining pure host and device copies of data buffers.
Of the data containers described above and showcased in Figure 2, only the lowest level of data containers, i.e.the hashmaps and vectors (dark and light blue in the figure) reside in unified memory. This includes both the data buffers and also metadata such as the container sizes. For FSgrid data, both unified memory and dual host-and-device memory versions have been implemented. To facilitate access to Vlasov data structures within the Vmesh and VBC objects from both host and device environments, Vlasiator stores discrete host and device copies of the actual Vmesh and VBC objects, which both point to the same hash maps and vectors residing in UM.
2.5 Data structure re-ordering for Vlasov solver use
The Vlasiator Vlasov solver approach, introduced in Section 1.1,is to decompose the Vlasov equation into six one-dimensional Cartesian shear or advection operations. Each advection operation takes a number of sampling points in the relevant direction, using them to fit a polynomial which describes the variation of phase-space density along the direction of propagation. This polynomial is then advected with relevant speed and direction and integrated over the target phase-space cell dimensions.
For each advection thus performed by our semi-Lagrangian approach, we require a varying number of control points depending on the polynomial reconstruction order. In order to limit the number of ghost cells required, we usually constrain translation to use parabolic polynomials for each cell, whereas acceleration utilizes quadratic polynomials. This results in each advection calculation requiring both the local phase-space density value and values of a number of neighbours in the direction of advection. For this reason, the first step of each Vlasov solver cycle is to take the existing Vlasov data structure andre-arrange it so that memory access operations will occur on contiguous and squential data regions.
For spatial translation, the solver will act on one global block identifier (GID) at a time, collecting the data along the translated direction, utilizing the pencil formalism described in Ganse etal. [2023]. For each spatial cell, the GID being translated will be linked to the local block identifier (LID) with the globalToLocalMap
hashmap object inside the Vmesh. The block data from each memory address given by the LID is then copied into a temporary device-only buffer, providing the solver with aligned block data along the direction of propagation.The original data buffers pointed to by each LID are then emptied (written to zero), and the semi-Lagrangian propagator is then able to write its output into the original data locations.
For velocity-space acceleration, phase-space densities will be advected along one Cartesian velocity direction at a time. The GID associated with each data block is such that values increase in the order . In order to ensure the block data is aligned along the direction of advection , we compute a transposed where . is defined by the order and by the order . In order to build the ordered column data structures, we utilize a sorting algorithm on the -values, thus gaining access to the correctly sorted LIDs as well. The data from these addresses is then copied to a temporary buffer, which can be fed to the Vlasov solver. In the meanwhile, the contents of the velocity blocks are reset, so that the Vlasov solver can write the results of the solver operation into the cell data contents.
3 Porting to GPUs: assessing different approaches
In this section we will describe the approaches considered for porting of Vlasiator to GPU-accelerated architectures along with lessons learned. The observant reader may note that the basic Vlasiator data structure applied to VDFs, being a block of size WID3, is beneficial for performing parallel calculations. Historically Vlasiator has utilized a value of WID3=64, which indicates that two Nvidia 32-element warps or a single AMD 64-element wavefront can perform all necessary evolution calculations for a whole block of data. This stems from an early finite volume method version of Vlasiator, which included a CUDA interface, as presented in Sandroos etal. [2013]. Sadly, when Vlasiator transitioned to the current Semi-Lagrangian solver [Palmroth etal., 2018], the old CUDA support was deprecated. The benefit of an internal block-based data structure which is easily applicable to SIMT approaches, however, has remained a hallmark of Vlasiator ever since, and has been utilized via single instruction multiple data (SIMD) vector operations on CPUs with great success.
3.1 Directive-based approaches
Vlasiator, utilizing existing CPU-based computational facilities to a great degree, is built to perform calculations both in parallel distributed over MPI to several tasks, as well as running several OpenMP threads in parallel, possibly also benefiting from hyperthreading. As such, OpenMP directives are implemented in several parts of the Vlasiator codebase, and directives were an interesting direction for trialling GPU porting. During the autumns of 2019 and 2020, the Vlasiator development team participated in Hackathons where OpenACC directive-based GPU programming was tutored and implemented. However, our testing suggested that OpenACC was not going to provide us with sufficient throughput, as mere pragma-based offloading of selected code sections would not easily result in sufficient re-factoring or provide us with the granularity of memory management foreseen to be required. Vlasiator, implementing a 6-dimensional advection equation, requires each point of VDF data to be available for each solver step, save for certain ghost cell information, meaning VDF data must remain on the device. Additionally, compiler support of pragma-based approaches during our initial porting efforts was lacking. As such, we did not pursue OpenACC approaches further, nor did we trial OpenMP 5.0 GPU-based directives.
3.2 Abstraction layers
A useful approach in porting existing software to new architectures is to utilize a portable abstraction layer, which can be externally maintained and updated to allow interfacing with new hardware and programming languages without the end-user needing to change their code. Abstraction can be built into the grid library itself [such as in AMReX, Zhang etal., 2019] or provided through an external portability layer, which in turn may depend on libraries for memory management. Examples of such portability layers include RAJA (Beckingsale etal. 2019 with array abstractions via CHAI and memory management via UMPIRE Beckingsale etal. [2020]), alpaka (Matthes etal. 2017 with memory management via LLAMA) and KOKKOS (Trott etal. 2022 consisting of both core and tool packages). Industry giants are also attempting to tackle the issues of vendor lock-in: for example the Intel oneAPI [Costanzo etal., 2021, Christgau and Steinke, 2020] implements specialized APIs as well as the portable SYCL language [Alpay etal., 2022]. An interesting note is also that HIP which is often viewed as the de facto language on AMD hardware, is in fact an abstraction layer which translates commands to the ROCm interface used by AMD hardware, whilst also supporting translation to CUDA on Nvidia hardware. Larger adoptance of HIP remains to be seen, though efforts of supporting HIP on top of oneAPI do exist [Videau, 2022].
Whilst utilization of abstraction layers brings with it many benefits, it at the same time detaches the developer from the underlying programming language. Algorithms are restricted to constructions supported by the abstraction layer, which may or may not be useful for the task at hand. Due to the inherent complexity of Vlasiator’s memory layout, consisting of a sparse ever-evolving velocity mesh in each spatial cell, and the need to maintain all data in device memory for every solver step, pre-existing abstraction layers were not included in the roadmap of Vlasiator GPU development. After finding directive-based approaches to be too limited, it would have been counterproductive to run into the same issue with portability layers. Abstraction layers are a possible future avenue for fixed-size grids such as the ionospheric and field solver grids, though.
4 Building semi-Lagrangian Vlasov solvers for CUDA
With the knowledge of CUDA being a well-established and well-supported GPU programming language with powerful development and profiling tools, and knowing that AMD hardware was to be supported with near-analogous API calls via HIP, Vlasiator was decided to be first ported to CUDA, dubbed Cudasiator. Once CUDA development and profiling had sufficiently progressed and HIP support was included, the Cudasiator moniker was dropped in favour of a standard vlasiator_gpu identification.
4.1 Unified memory
The Vlasiator simulation code includes several different solvers and routines which act on the modelled particle VDFs. Initialization routines set up the simulation and boundary cell management can re-write cell contents. VDFs are reduced to obtain plasma velocity moments for use by the field solver. Block values are evaluated to find maximum allowed timesteps due to spatial translation. Output routines can process the VDFs to extract partial moments or precipitating energy spectra. Due to all of these accessors, the VDF is stored in unified memory, which can be paged to host or device as necessary. Using unified memory may incur a performance penalty, but reduces the risk of one code section acting on data which might be out-of-date (having been altered by another part of the code) or trying to access inaccessible memory addresses. During early development, small subsections of the solvers were ported one-by-one, allowing verification of correct code behaviour without having to convert all functions to GPU kernels at once. As data movement between the host and the device is much slower than the HBM of GPU hardware, and would cause a significant performance bottleneck, the final state of the Vlasiator code has all VDF-accessing routines run through GPU kernels.
One feature of unified memory on current NVIDIA and AMD architectures is the possibility of page faulting data to the required location. Unified memory is managed as pages, large units of data (e.g.4 kilobytes), which reside either in host or device memory. When accessing such managed memory, a lookup to the UM page table is performed, and assuming the data is already in the correct place, it is read from the address indicated by the page. If the memory resides on the host when requested by the device, or vice versa, a page fault occurs, and the page lookup table access triggers a memory copy, migrating the page to the requested memory domain. The lookup table is then updated to reflect this new status. Page faults are what allow unified memory to be accessed from both host and device, but repeated page faults cause significant slowdowns as computation must wait for the page transfer to complete.
Memory can be prefetched by the user if one knows that the data resides in the incorrect location, however, this incurs an overhead of reading the page lookup table and returning only once the potential transfer of each page associated with the buffer in question has been requested. To completely avoid page faults, a synchronization call requesting for this transfer to complete is also required. The combination of these lookup table accesses and GPU API calls will also incur an overhead, so prefetching should only be performed when the software is sure the data resides in the incorrect memory regime. The current GPU version of Vlasiator contains a number of small counters which reside in unified memory, and the page faulting of these back-and-forth is quite detrimental to performance, but manual prefetching is even worse due to the overhead of GPU API calls. If manual prefetching is performed, the GPU driver may be unable to optimize whether a unit of data should be paged back and forth, or if it should simply be copied into a temporary buffer, maintaining the page location where it is.
Of course if a unit of data is not actually required on both host and device, use of unified memory should be avoided. Pure device memory access does not pass through page lookups, whereas host memory usually does. To further improve host performance, in particular when copying data between host and device buffers, the use of pinned memory is crucial. Pinned memory is declared outside the paging system, bypassing the need to go via host page lookups when communicating with the GPU device. In Vlasiator, we use pure device buffers and pinned host memory for any single-location temporary buffers which allow it.
4.2 Re-using device and unified memory buffers
In addition to managing paging of the unified memory subsystem, care must be taken in managing allocations and deallocations. Within the current CUDA and HIP/ROCm ecosystems, managed or unified memory can only be allocated or deallocated when no GPU kernels or memory transfer operations are active. Thus, these are blocking operations which can have a significant detrimental effect on performance. Thus, the use of buffers should be designed as to minimise allocations and deallocations, re-using buffers where necessary. This is of particular importance to Vlasiator, where required buffer sizes change constantly as the velocity meshes grow and shrink. Vlasiator thus utilizes a method of maintaining slightly larger buffers than required, re-using them for subsequent kernels, and whenever a re-allocation takes place, making the new allocation larger by a factor of e.g. 30% in order to postpone the need of future re-allocations. Allocations of pure device memory (or host memory) do not block the whole device, as they can be performed asynchronously in one GPU stream, and thus they are not as crucial to avoid.
4.3 CPU threads and GPU streams
Vlasiator on CPUs is massively parallel, utilizing MPI for communication between different tasks and supercomputer nodes, OpenMP threading for parallel processing on several CPU cores, and vectorized instructions to process several units of data with a single instruction (SIMD). GPU hardware is also capable of running several operations in parallel, both as GPU warps or wavefronts (SIMT, single instruction multiple threads), but also as several different kernels running in parallel. Vlasiator utilizes this capability in a heterogeneous manner by using several CPU threads to prepare data and queue operations, with the GPU performing the heavy lifting. Each CPU OpenMP thread is given an unique GPU stream to which it can queue operations with each stream acting independent of other streams. Should there be a data dependency between different streams, this can be managed by synchronizing the streams in question, or usually, the whole device, before continuing with enqueueing new GPU kernels or operations.
As each GPU stream will contain e.g.the different sub-kernels required by the acceleration Vlasov solver, and any operations queued to that stream will be performed in order, it makes sense to ensure that each stream has its own re-writeable buffers to act on. Thus, on the host side, each CPU thread (controlling a designated GPU stream) maintains its own temporary buffers. These buffers are re-used when that thread (and, by extension, stream) moves on to the next unit of data to be processed, such as the next spatial cell. Due to the use of these per-thread buffers, the translation solver is also launched in several streams, each using the buffer associated with that thread/stream.
4.4 A Sparse velocity mesh on GPU devices
In order to be able to access and edit the sparse velocity meshes used in Vlasiator from on-device kernels, a hashmap supporting on-device operations was required. During the porting of Vlasiator to heterogeneous GPU architectures, we performed porting bit-by-bit, requiring thus a hashmap which could be accessed from both host and device, in order to provide correct functionality even before all operations had been completed. For this purpose, we developed Hashinator, a portable hashmap library using unified memory and supporting both CUDA and HIP/ROCm. In addition, we made available a portable unified memory vector class called Splitvector, on which Hashinator is built. These tools, along with their portability features and excellent performance, are introduced in Papadakis etal. [2024a] and available at Papadakis etal. [2024b].
Utilizing the Hashinator and Splitvector libraries, the Vmesh and VBC contents (hashmaps and vectors in dark and pale blue in Figure 2) were converted to unified memory constructs, capable of being accessed from both host and device. Though these intermediate objects themselves could also reside in unified memory, we found that it was preferred to maintain host and device copies of Vmeshes and VBCs, which contain host and device copies of splitvectors and hashmaps, which in turn contain unified memory arrays and counters. This helps to reduce page faulting between host and device, but requires the device copy of the objects to be updated whenever a re-allocation occurs in order to update unified memory pointers. A common occurrence when launching kernels is to evaluate the size of the velocity mesh, which would usually mean a call to the size()
function of the globalToLocalMap
hashmap or the localToGlobalMap
splitvector. We found that carefully caching this value as a host-side integer and updating it whenever the velocity mesh size is adjusted resulted in reduced page faulting.
4.5 GPU thread parallelism
As Vlasiator already utilizes memory stored in units of WID3
floating-point values, acting on 32 (CUDA warp, NVIDIA) or 64 (ROCm wavefront, AMD) elements in one pass might have been relatively straightforward. However, the acceleration solver in particular acts only on one WID
WID
slice at a time, in order to ensure thread-safe storage of phase-space density values. This small slice-by-slice approach is sub-optimal on GPU hardware if the standard value of WID=4
is retained. A future task will be to refactor the memory access patterns to allow a whole WID3
block to be processed at once, but for now, a transition to WID=8
means that each WID
WID
slice has 64 entries and can be efficiently processed on GPU hardware. When a GPU kernel performs operations on a globalToLocalMap
hashmap, we utilize the warp accessors of Hashinator, doing lookups of a whole warp length at once, further improving thread-parallelism.
4.6 Porting the Vlasov solvers
The actual process of porting Vlasiator’s Vlasov solvers to GPU kernels is relatively straightforward. Care must be taken to minimize memory transfers between host and device. Re-use of existing buffers is preferred over allocations and de-allocations, which can improve or hinder memory management challenges.
The acceleration solver has to evaluate the contents of the local velocity mesh and re-assemble it into contiguous memory domains dependant on the Cartesian direction in which the shear in question is taking place. One subsection of this involves converting the velocity mesh GIDs to direction-dependent values (see Section 2.5). The new -values are then sorted in order to achieve contiguous input data to the remapping algorithm.Notably, this is a place where existing sorting libraries such as CUB::sort on CUDA and hipCUB::sort on HIP can be utilized. When evaluating extents of contiguous columns of blocks to be passed to the acceleration solver, thread-parallel GPU search operations provide performance improvements.
The translation solver has a relatively simple flow, allowing merging the preparatory, advection, and storage sections into one large kernel which can be called in parallel utilizing several temporary buffers. The acceleration solver, on the other hand, involves much more transposition and restructuring of data, resulting in a number of small kernels called one after another, which impacts performance by accruing API overhead and sometimes requires passing counter values between the host and the device. After preparatory work, the accelerated cell velocity space must be adjusted to accommodate the new positions in velocity space, which requires calls to the block adjustment algorithms. Furthermore, each spatial cell is accelerated independently of all others, meaning that though they can be processed in parallel, each one requires its own kernel calls in order. Each CPU thread is able to process one cell at a time in its own GPU stream, but extracting acceptable performance from modern GPU hardware may require even further re-factoring of the acceleration solver ordering.
5 Block adjustment algorithms
At the heart of Vlasiator’s operations, in addition to the SLICE-3D algorithm and the semi-Lagrangian solvers, is the block adjustment algorithm used to maintain the minimum sufficient velocity domain for the problem at hand. A requirement for block adjustment is that in the final state, the velocity mesh should contain the correct GID values, each associated with an unique LID, and the LID-values should be consecutive, i.e. the velocity mesh should be contiguous without any empty spaces left by deleted blocks. In CPU iterations of Vlasiator, the block adjustment algorithm was built using lists of those blocks where content was deemed significant enough, and the use of a std::unordered_set
, with a serial algorithm to adjust contents.
5.1 Serial CPU block adjustment
The serial CPU block adjustment algorithm of Vlasiator is described as a flowchart in Figure 5. In section a), for each spatial cell, all existing velocity blocks are analyzed. If any of the velocity cells within the block have , the block is considered a content block. The loop over the block contents can end early once the first value above the threshold is found. Blocks are thus categorized into two complementary lists (vectors, black lettering) as either being classified as content blocks (green) or non-content blocks (red). Section a) must be completed for all spatial cells before moving on.
Next, in section b), the spatial and velocity halo requirements are fulfilled. For each cell, all its spatial neighbours are identified. For all these spatial neighbour cells, their lists of content cells are looped over, with all contained GIDs added to a required content std::unordered_set
(white lettering). For the local cell in question, any local blocks with content are also looped over, and all these blocks and their 26 velocity neighbours are also added to this set. The set contents are indicated as blocks with content (green), blocks with no content (purple), and non-existing blocks (blue, yet to be added).
The third phase, section c), involves pruning non-required blocks from the velocity mesh. The list of local blocks without content is looped over in reverse order (that is, starting from the largest GID i.e.from towards the end of the allocated buffer). Any blocks which are not found in the required content set (dark red) are deleted from the mesh. If the block was the last entry, the mesh size is simply reduced by one block. However, if the block was not the last entry in the mesh, it is replaced with the contents of the last existing block, and then the mesh is reduced in size by one block.
In the final phase, section d), a loop is performed over the required content set. Each GID found in the required content set is compared against the current state of the velocity mesh, appending the velocity mesh with any blocks which do not yet exist (blue). This results in the final state, where the velocity mesh contains all required GID values, each associated with an unique LID forming a consecutive list. This serial approach is straightforward, simple, and results in a relatively low amount of data transfer, but is sequential in nature and does not parallelize well on GPU hardware.
5.2 Parallel GPU block adjustment
The objective of developing the parallel block adjustment algorithm described in this section was to be able to launch GPU kernels which could act in parallel on sections of the same velocity mesh, adding, deleting, adjusting, and moving blocks so that the end result is a valid velocity mesh and so that the kernels do not need to atomically access counters such as taking the next entry from a vector. The algorithm developed for this purpose is displayed in Figure6, and it is run for all spatial cells separately.
In section a), for each spatial cell, all existing velocity blocks (within the localToGlobalMap
list or vector, black lettering) are analyzed. The original count is stored. A kernel examines all blocks in parallel, and for each block an in-kernel thread-parallel reduction analyzes all contained phase-space cells at once. If any phase-space cell fulfills , the block is considered a content block. Blocks are thus categorized into two complementary maps (white lettering) as either content blocks (green) or non-content blocks (red). The maps store the block GID as the key and the LID as the value. After the parallel insertions, the entries from the content map are extracted into a contiguous content list (vector, black lettering). Section a) must be completed for all spatial cells before moving on.
Following the gathering of content lists, section b) ensures all halo requirements are fulfilled. For each cell, all its spatial neighbours are identified. For all these spatial neighbour cells (managed in parallel), their content lists are looped over in parallel, and the GIDs are evaluated against the local content maps. Any neighbour content GIDs found in the no-content map are deleted from there, and instead placed in the content map (purple). If the neighbour GID does not yet exist in either map, it is added to the content map with an invalid LID identifier flagging it as a cell yet to be added (blue). For the local cell, all content blocks and their 26 velocity neighbours are also looped over in parallel, and values are added to the content map and deleted from the no-content map as before. As listed, all of these operations can be performed in parallel. The content map elements are indicated in the figure as blocks with local content (green), blocks with no local content (purple), and non-existing blocks (blue, yet to be added). After all halo requirements have been evaluated, the two resultant maps are evaluated in order to calculate the size of the velocity mesh after all required operations will be complete. The size of the no-content map provides the number of blocks to remove . The content map is passed to a parallel stream compaction routine, extracting all entries flagged as yet-to-be-added GIDs into a list of blocks to add, and the size of this list provides .
Next, knowing that , the generated content and no-content maps can be evaluated in order to prepare lists of GIDs for parallel processing. Using as an extraction criterion, checking if map entries have a valid value (LID) less than it, we use parallel stream compaction to extract three lists of blocks. Any blocks with content residing beyond the final size of the mesh are designated blocks to rescue. Blocks without content residing beyond the final size of the mesh are to be simply deleted. Blocks without content within the resultant domain of the mesh are to be removed and replaced with either added or rescued blocks.
In the final step, shown in section d) the four constructed lists (blocks to be added, blocks to be deleted, blocks to be replaced, and blocks to be rescued) are used in a parallel operation altering the actual velocity mesh associated with the spatial cell in question. If , the size of the mesh is increased, in order to have sufficient space for all parallel operations. This size adjustment is done on-device if possible in order to avoid page faulting of counters. If the capacity of the map is insufficient, it is performed on host instead. Then, a GPU kernel is launched, where each parallel instance operates on one value or pair of values from the generated lists. Depending on the relative sizes of the four lists, the operation can involve outright deletion of a block, addition and emptying of a new block, replacing an existing block with a new empty block, or replacing an existing block with a block to be rescued. After all these parallel operations have been completed, the contents of the velocity mesh are in their final state. Then, if , the size of the mesh is decreased to match the contents.
6 Data reduction methods
The Vlasov solvers of Vlasiator act as a 6-dimensional advection equation, but the Maxwell-Vlasov system also requires input from electromagnetic fields which are propagated by the field solvers. Those field solvers take plasma velocity moments as input, necessitating an efficient way to reduce VDF contents.
6.1 Time step constraints
Although Vlasiator’s semi-Lagrangian solvers do not place a CFL condition as such, a similar limitation is in place due to a limited thickness of the spatial ghost cell layer. Thus, the dynamic simulation time step may be constrained by the largest advection velocity found in the simulation. To this end, a timestep limit reduction kernel is required, which finds the smallest value of , , or in the simulation, which, with a safety margin, gives the largest allowed spatial advection time step.
6.2 Plasma moments
Reduction of plasma velocity moments [see, e.g., Palmroth etal., 2018] for the simulation domain is performed by gathering device-side pointers to the velocity mesh and block container structures for all local spatial cells, preparing output buffers, and then calling a massively parallel reduction kernel over all local spatial cells. Earlier versions of Vlasiator called a reduction kernel on one cell at a time, but the overhead from repeated kernel calls and back-and-forth transfer of minuscule units of data was prohibitively slow. Kernel merging with no other changes was found to provide a speed improvement of order .
6.3 Output data reducers
A single full state of the Vlasiator simulation, as seen in a simulation checkpoint file, can be several terabytes. In order to facilitate data analysis with more reasonable file sizes, output files with reduced data are generated at regular intervals. The most commonly used values stored in these files are values such as electric and magnetic field vectors and the aforementioned plasma moments. However, more detailed data reducers operating over the velocity space of each spatial cell, such as measures of non-Maxwellianity or sub-region analysis (particle precipitation, moments of supra-thermal sub-populations) are also available. Vlasiator implements a flexible ARCH-lambda approach for data reduction, where the calculations to be performed in order to extract the necessary information are prepared as a C++ lambda function, which is passed over to an interface which then either passes it to CPU-based loops, or if GPU hardware is being used, constructs a kernel to which it is passed. The lambda approach as such does not accrue noteworthy overhead, but if the reduction kernel is called for each spatial cell in turn, API overhead as described in Section 6.2 can become prohibitive.
7 Unified cross-platform codebase
A common issue with large software packages which are developed by several people in parallel for several different platforms is that of code stagnation and cross-compilation issues. When developing Vlasiator support for GPU architectures, a certain amount of code duplication was unavoidable. However, a strong object-oriented approach allows selective compilation of only those source code files which contain objects and routines to be used by the target architecture. The Vlasiator GPU version is thus developed as one unified codebase, with the compilation target (CPU, CUDA, or HIP/ROCm) chosen by environment variables at compile time.
Early during development planning, a joint CUDA/HIP development approach was chosen. Initially, Vlasiator was developed as a version called Cudasiator, with direct CUDA instructions, libraries and classes. The initial plan was to use automatic conversion scripts (such as hipify-perl.sh
, included as part of HIP/ROCm) during compile time to convert the source code to a suitable form. This approach has a significant limitation, in that any code development performed on the HIP/ROCm platform would then need to be reverse-engineered back into CUDA code before being committed to the code repository.
Joint support for AMD hardware with the HIP language was subsequently achieved by converting all Vlasiator GPU API calls to a dedicated macro naming scheme. Depending on compile-time instructions, these macros are then converted by the compiler preprocessor into either CUDA or HIP calls. For example, the Vlasiator codebase will contain a call to gpuStreamSynchronize()
, which will be converted to either cudaStreamSynchronize()
or hipStreamSynchronize()
, as required by the architecture. All these macro definitions are collected into two header files, and the one to be included in compilation is defined via compile-time environment selection. As CUDA and HIP/ROCm are mostly interoperable, this process is relatively straightforward. Special cases include specialized kernels which must know the size of a warp or wavefront, which is also declared as a macro GPUTHREADS
which resolves to either 32 or 64.
We note that this approach is suitable also for situations where the CUDA and HIP interfaces differ, such as in-kernel ballot operations. A call to gpuKernelBallot(mask, input)
will be pre-processed into either __ballot_sync(mask, input)
(CUDA) or __ballot(input)
(HIP), ensuring correct compilation and operation. However, this has the limitation that the user code must always use a full mask, somewhat limiting the capabilities of CUDA operation. At the same time, advanced CUDA features such as cooperative groups are also not supported as they have no direct HIP counterpart [Tsai etal., 2021].
8 Performance results
Efficient use of GPU hardware requires often considerable refactoring of code and development of new algorithms. Expecting a freshly ported code to match or exceed the performance of a CPU code which has been developed and optimized over several years or even a decade is not necessarily realistic. Still, GPU hardware provides access to high-bandwidth memory and tens of thousands of compute cores per device, so some level of performance improvement might be expected.
To showcase the capabilities of Vlasiator, we present here simulations performed on the Mahti supercomputer at CSC in Kajaani, Finland. A single Mahti node contains two AMD Rome 7H12 CPUs with 64 cores each. The CPUs are based on AMD Zen 2 architecture, supporting the AVX2 vector instruction set, and run at 2.6 GHz base frequency (max boost up to 3.3 GHz). GPU nodes are equipped with four Nvidia Ampere A100 GPUs with 40 GiB of memory each. For the purposes of this comparison, we thus select 32 cores from one CPU, with no multithreading active, which matches the capabilities assigned to each A100 GPU. The CPU tests utilize CPU instructions only, with AVX2 vectorization, whereas the GPU tests operate on one A100 device from 32 concurrent CPU threads and 32 CUDA streams. The GPU versions were compiled with NVCC in an environment with gcc version 10.4.0 and CUDA version 12.1.1. Compilation of HIP code for CUDA [as shown in, e.g., Tsai etal., 2021] was excluded from this test. The CPU versions were compiled with gcc version 11.2.0 and utilize the jemalloc version 5.2.1 memory manager [Evans, 2006].
Furthermore, we include some results from simulations performed on the LUMI-G supercomputer at CSC in Kajaani, Finland. A single LUMI-G node contains one AMD Trento 7A53 64-Core Processor and four AMD Instinct MI250x GPUs each with 128 GiB of memory. Each MI250x GPU consists of two independent Graphics Compute Dies (GCDs), visible to the user as separate GPUs despite residing on the same hardware card. The recommended configuration is to use 6 CPU threads per GCD, leaving two threads per GCD for system operations. The HIP/ROCm version was compiled with the hipcc
compiler wrapper, HIP version 5.6.3, AMD clang version 16.0.0, and ROCm version 5.6.1.
8.1 Simulation setup
The simulation test case consists of a plasma shear flow interface, with two counter-propagating domains of magnetized plasma. The simulation domain is compact and 2-dimensional, spanning 32 cells in the -direction and 28 cells in the -direction with cell edges of extent 500 km. In the left () half of the simulation domain, plasma has a density of and a bulk flow velocity of . In the right () half of the simulation domain, plasma has a density of and a bulk flow velocity of . The magnetic field is uniform with a value of , resulting in an ion gyroperiod of . The domains are set to have a pressure balance, with a uniform pressure of , resulting in temperatures of in the left half and in the right half of the simulation domain. The central transition follows a tanh
profile with a transition width of . In addition, a small velocity perturbation of amplitude is superposed, with a wavelength spanning the simulation box in the -direction. In each spatial cell, the velocity space is constructed with maximal extents of in each direction and a velocity resolution of in all of , , and . Figure 7 shows the simulation states at initialization and after s of simulation, showing how a Kelvin-Helmholtz instability [Tarvus etal., 2024] develops causing uneven plasma mixing at the interface. The simulations are run for a total of s which corresponds to almost 129 gyroperiods, with the KH instability growth saturating after about 50 gyroperiods [as shown in Figure 4 of Tarvus etal., 2024]. This simulation time results in 2337 timesteps when using a value of WID=4
, and 2494 timesteps when using a value of WID=8
. This difference stems from WID=8
resulting in a stricter CFL translation constraint due to larger velocity space extents, which in turn arise from the large ghost halo domains, as exemplified in Figure 4. We note that this simulation usefully has a stable computational cost (average block counts remain roughly constant) throughout whilst still involving two different plasma domains and showcasing the relevant dynamics.
8.2 Performance comparison on NVIDIA hardware
Both CPU and CUDA versions of Vlasiator were used to run the aforementioned test simulation. For both architectures, variants with both WID=4
and WID=8
were compiled and simulated. The WID=4
versions resulted in approximately 52M blocks in the simulation in total, so roughly 80K blocks per spatial cell. This is greater than usually seen in Vlasiator simulations, but a realistic count in e.g.hot magnetosheath portions of 2D-3V magnetospheric simulations. By comparison, the WID=8
versions resulted in approximately 8.8M blocks in the simulation in total, so roughly 12.7K blocks per spatial cell. The ratio is not quite due to the larger ghost halo domains, as Figure 4 would suggest.
The bar charts in panel a) of Figure 8 show that for CPU variants, most solver portions perform roughly at similar efficiency for both WID=4
and WID=8
. A notable exception is block adjustment, where WID=8
takes the lead due to much smaller block lists to manage. The GPU version with WID=4
performs overall somewhat weaker than the CPU versions, due to slightly less efficient block adjustment performance. The total time spent in velocity space updates is larger than CPU variants by almost 80%, which is partly due to the included block adjustment routines, and partly due to insufficient optimization of the small auxiliary kernels of acceleration. With WID=8
, however, the GPU version takes a strong lead by utilizing GPU thread parallelism to its fullest. Total propagation time is improved over CPU variants by a factor of three, and both GPU variants show a factor of improvement also in spatial translation.
As an additional point of comparison, the simulation was attempted with the GPU version which had the CPU-based serial block adjustment (see Section 5.1) implemented in place of the new parallel GPU block adjustment algorithm (see Section 5.2). This version did however perform the computations on-device in order to reduce page faulting. Nevertheless, this serial approach was much slower than any other version, as shown in panel b) of Figure 8.
These results showcase that with careful design and algorithmic improvements, the capabilities of modern GPU hardware can be harnessed to enhance performance of even complicated simulation simulation codes such as Vlasiator with ever-changing meshes. However, the test case presented here involves complex velocity meshes with large block counts and a relatively small 2-D spatial grid. A similar performance improvement should not be expected in all cases. Preliminary testing with several MPI tasks (not shown) did not yet result in acceptable performance, which we attribute to a lack of GPUdirect RDMA implementation, forcing all MPI communication to go via the host. This hurdle will be tackled in future updates, as efficient multi-node GPU implementations have been shown to be possible [Pekkilä etal., 2022].
8.2.1 Roofline analysis on NVIDIA hardware
In order to gauge the expected future performance improvements which can be coaxed out of Vlasiator’s GPU version, we used NVIDIA’s performance analysis tools to, amongst other things, complete roofline analysis of the Vlasiator spatial translation kernel, as shown in Figure 9. Roofline analysis evaluates the achieved memory transfer rates and the number of arithmetic operations performed by the kernel in relation to the maximum available memory bandwidth and compute capability. The -axis of a roofline plot is the arithmetic intensity, or ratio of arithmetic operations to bytes accessed. The left-hand side of the plot shows a slope, indicating maximum possible performance as limited by available memory bandwidth. On the right-hand side of a roofline plot operations are limited by arithmetic compute capacity. The Vlasiator translation kernel contains both double and single precision floating point operations and has an arithmetic intensity of roughly 3, placing it in the memory-bound regime. The -axis indicates actual performance in floating point operations per second (FLOP/s) with the Vlasiator translation kernel achieving on the order of 250 GFLOP/s.
We note that this roofline analysis was somewhat artificial due to constraints placed by the NVIDIA CUDA toolchain. Instead of using the usual per-thread per-stream temporary buffers as mentioned in Section 4.3, this test was performed with only a single thread and stream. The temporary buffer of this thread/stream was ensured to be sufficiently large to allow the whole translation task to fit in memory at once. Even so, analysis of this profiling indicated that the kernel experienced stalling during memory reads, explaining why the achieved performance is roughly an order of magnitude below the maximum achievable value. A significant portion of these stalls were long scoreboard stalls when loading VDFs from global accelerator memory. Future optimizations to memory access patterns and pre-caching may improve the performance of this section.
8.2.2 Overall performance on CUDA and NVIDIA hardware
The performance of the acceleration portion of Vlasiator’s solvers on GPU hardware is less straightforward to evaluate, as it consists of several small kernels to be operated in succession. Additionally, the acceleration process updates spatial space for one cell at a time (per thread and stream), so filling the whole device with one kernel invocation is not possible, barring a comprehensive re-write of the whole acceleration framework. This may be a task for a future code update.
In addition to per-kernel performance, total Vlasiator GPU performance was found to be impacted considerably by page faulting and GPU API calls (see Section 2.4) due to several small kernel invocations. This is exemplified in Figure 10, which shows a timeline over one simulation timestep for the GPU WID=8
run as visualized via NVIDIA Nsight Systems, for the previously presented KHI simulation. We utilize the phiprof [von Alfthan, 2023] profiling library, which involves user-defined nested timer regions building up to a profile tree. When compiled on a GPU-compatible platform, phiprof can leverage NVTX (NVIDIA) or ROC-TX (AMD) tracing hooks, allowing the same nested timer ranges to show up in GPU-specific profiling tools.
As an example, the first section of block adjustment (“Com…” within the “re-adjust blocks” timer range) in Figure 10 consists of computing content lists for all spatial cells, a task which has already been optimized to minimize page faults. Before this optimization, page faulting of small counters resulted in spending roughly the amount of time in this range. A future optimization task is to consolidate content list updates into a few large kernel launches which is expected to provide an up to performance improvement, based on the results of Section 6.2.
8.3 Performance on HIP/ROCm and AMD hardware
To evaluate the performance of Vlasiator on AMD hardware, and to validate the usefulness of the unified codebase approach of Vlasiator, we also ran the same Kelvin-Helmholtz simulation on LUMI-G utilizing HIP/ROCm and AMD hardware. To speed up access to unified memory allocations on LUMI-G, we pre-loaded a custom memory manager (https://github.com/sfantao/vlasiator-mempool). Similar to Figure 10, Figure 11 shows a Vlasiator timeline over one simulation time step, showing the different solver sections at work. Comparing the two timelines confirms that overall, the two timelines appear similar, with semi-Lagrangian acceleration taking the longest, and block adjustment and translation taking a quarter to a third of the overall time. On LUMI-G, the translation kernel appears somewhat slower relative to acceleration. This was initially attributed to the low amount of hardware threads resulting in less temporary buffers available for the translation kernel, constraining parallel launch parameters, but increasing the buffer sizes did not appear to alleviate the issue.
This slight relative change in translation versus acceleration is less surprising than the overall poor performance on the tested hardware, as shown in Figure 12. Despite driving a powerful accelerator with several CPU cores and hardware threads, LUMI-G performance was roughly 3.7% of NVIDIA performance with WID=8
and 5.9% with WID=4
. A single AMD MI250x GCD is not expected to match the performance of a single NVIDIA A100 card, but they should be able to achieve the same magnitude of performance. Even the optimized straightforward reduction kernels used for velocity moments of the VDFs (see Section 6.2) showed a penalty in performance. This may be partially due to device drivers in need of updating (which is the responsibility of system administrators, not end users) or, for example, slower response time in the unified memory paging subsystem (see Section 4.1). A positive note is that running several MPI tasks was able to benefit from the increased throughput, with the LUMI-G system supporting direct network access from GPU memory (GPUDirect RDMA). The apparent strong scaling, going from 1 MPI task to 8 tasks, is not representative of overall scaling as a single MPI task will be able to skip all network communication entirely. Additionally, the presented simulation case was chosen to have large block counts, which results in heavier-than-usual MPI communication patterns. Multi-node GPU performance analysis will be the topic of a subsequent study.
9 Conclusions and outlook
In this paper, we have presented evidence of how porting a complicated modern simulation software to support heterogeneous CPU-GPU architectures whilst maintaining portability and updating a sparse velocity mesh can be challenging. Through algorithmic re-design and memory management optimization, a significant portion of the potential of GPU hardware can be harnessed, as is shown by the results in Figure 8. The use of unified memory (see Section 4.1) has been very helpful during the porting process, but places certain constraints on performance due to the overhead of page faulting. A future task will be to use a managed memory pool library [such as UMPIRE Beckingsale etal., 2020] for mitigating blocking operations such as (de)allocations (see Section 4.2).
The impressive performance achieved in e.g.moments reductions when using a single massively parallel kernel showcase that transitioning from several concurrent or consecutive small kernel launches to few large launches is crucial to achieving optimal performance on systems with NVIDIA hardware. This task has been mostly completed for the Vlasiator spatial translation solver, but applying the same formalism to the acceleration solver will be a major undertaking.
Our results show that CUDA and HIP/ROCm are indeed sufficiently interchangeable. Building a unified codebase which is translated to the language in question using preprocessor macros is quite feasible and we would recommended such an approach. The underwhelming performance achieved on LUMI-G and AMD hardware is yet to be explained, though extensive use of unified memory is one potential culprit. This investigation is thus ongoing.
Porting modern HPC simulation softwares to support GPU hardware is not a question of IF, but rather a question of WHEN and HOW. We conclude that due to the extent of required algorithmic changes, the task should be initiated as early as possible, and we advise use of industry-standard methods with good compiler and profiling support. A report on further GPU performance improvements of Vlasiator, and in particular, use of Vlasiator in multi-GPU multi-node environtments, is left as topic of a subsequent study.
Data Availability Statement
The Vlasiator simulation code is distributed under the GPL-2 open source license at https://github.com/fmihpc/vlasiator [Pfau-Kempf etal., 2022] with the GPU branch pre-release available at Battarbee etal. [2024].Data handling and visualisation is supported through the Analysator python package https://github.com/fmihpc/analysator as well as a plugin for the VisIt visualisation system [Childs etal., 2012].Configuration files defining simulation parameters for runs presented in this paper are available from the authors upon request.
Acknowledgements
This work builds upon the results achieved through the European Research Council Consolidator Grant 682068-PRESTISSIMO. The majority of the results presented in this paper were achieved with funding provided by the Research Council of Finland project ICT_SUNVAC, grant number 335554. Further support was granted by the EuroHPC “Plasma-PEPSC” Centre of Excellence (Grant number 4100455) and the Research Council of Finland matching funding (grant number 359806).The authors gratefully also acknowledge the Research Council of Finland grant numbers339327, 309937, 3189131, and 339756). The Finnish Centre of Excellence in Research of Sustainable Space, funded through the Academy of Finland grant 352846, supports Vlasiator development and science as well.
The authors would like to thank the LUMI Porting Program 2023 (https://lumi-supercomputer.eu/porting-program-2023-5/, with support from both LUMI system administrators and AMD / Advanced Micro Devices, Inc, with particular thans to Samuel Antao at AMD for assisting in compiling, running, and managing memory on the LUMI-G supercomputer.Several GPU porting hackathons organized by CSC – IT Center for Science Ltd have also been instrumental in guiding Vlasiator’s GPU development. The simulations for this publication were run on the “Mahti” and “LUMI” Machines at the CSC site in Kajaani, Finland.
References
- Alpay etal. [2022]Aksel Alpay, Bálint Soproni, Holger Wünsche, and Vincent Heuveline.Exploring the possibility of a hipsycl-based implementation of oneapi.In International Workshop on OpenCL, pages 1–12, 2022.
- Asaduzzaman etal. [2021]Abu Asaduzzaman, Alec Trent, SOsborne, CAldershof, and FadiN Sibai.Impact of cuda and opencl on parallel and distributed computing.In 2021 8th International Conference on Electrical and Electronics Engineering (ICEEE), pages 238–242. IEEE, 2021.
- Battarbee etal. [2020]Markus Battarbee, Xochitl Blanco-Cano, Lucile Turc, P.Kajdič, Andreas Johlander, Vertti Tarvus, Stephen Fuselier, Karlheinz Trattner, Markku Alho, Thiago Brito, Urs Ganse, Yann Pfau-Kempf, Mojtaba Akhavan-Tafti, T.Karlsson, Savvas Raptis, Maxime Dubart, Maxime Grandin, Jonas Suni, and Minna Palmroth.Helium in the earth’s foreshock: a global vlasiator survey.Annales Geophysicae, 38(5):1081–1099, 2020.10.5194/angeo-38-1081-2020.
- Battarbee etal. [2021]Markus Battarbee, Thiago Brito, Markku Alho, Yann Pfau-Kempf, Maxime Grandin, Urs Ganse, Konstantinos Papadakis, Andreas Johlander, Lucile Turc, Maxime Dubart, and Minna Palmroth.Vlasov simulation of electrons in the context of hybrid global models: an eVlasiator approach.Annales Geophysicae, 39(1):85–103, January 2021.10.5194/angeo-39-85-2021.
- Battarbee etal. [2024]Markus Battarbee, Kostis Papadakis, Urs Ganse, Jaro Hokkanen, Samuel Antao, Leo Kotipalo, Yann Pfau-Kempf, and Markku Alho.fmihpc/vlasiator: Vlasiator GPU pre-release v1, 6 2024.URL https://doi.org/10.5281/zenodo.11472029.
- Beckingsale etal. [2019]D.A. Beckingsale, J.Burmark, R.Hornung, H.Jones, W.Killian, A.J. Kunen, O.Pearce, P.Robinson, B.S. Ryujin, and T.R.W. Scogland.RAJA: Portable performance for large-scale scientific applications.In IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), volume 4,37, page 1370. The Open Journal, May 2019.10.1109/P3HPC49587.2019.00012.URL https://conferences.computer.org/sc19w/2019/#!/toc/14.
- Beckingsale etal. [2020]D.A. Beckingsale, M.J. McFadden, J.P.S. Dahm, R.Pankajakshan, and R.D. Hornung.Umpire: Application-focused management and coordination of complex hierarchical memory.IBM Journal of Research and Development, 64(3/4):00:1–00:10, 2020.10.1147/JRD.2019.2954403.
- Boman etal. [2012]E.G. Boman, U.V. Catalyurek, C.Chevalier, and K.D. Devine.The Zoltan and Isorropia parallel toolkits for combinatorial scientific computing: Partitioning, ordering, and coloring.Scientific Programming, 20(2):129–150, 2012.10.3233/SPR-2012-0342.
- Childs etal. [2012]Hank Childs, Eric Brugger, Brad Whitlock, Jeremy Meredith, Sean Ahern, David Pugmire, Kathleen Biagas, Mark Miller, Cyrus Harrison, GuntherH. Weber, Hari Krishnan, Thomas Fogal, Allen Sanderson, Christoph Garth, E.Wes Bethel, David Camp, Oliver Rübel, Marc Durant, JeanM. Favre, and Paul Navrátil.Visit: An end-user tool for visualizing and analyzing very large data.In High Performance Visualization–Enabling Extreme-Scale Scientific Insight, pages 357–372. , October 2012.10.1201/b12985.
- Christgau and Steinke [2020]Steffen Christgau and Thomas Steinke.Porting a legacy cuda stencil code to oneapi.In 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 359–367. IEEE, 2020.
- Costanzo etal. [2021]Manuel Costanzo, Enzo Rucci, CarlosGarcía Sanchez, and Marcelo Naiouf.Early experiences migrating cuda codes to oneapi.arXiv preprint arXiv:2105.13489, 2021.
- Einkemmer and Ostermann [2014]Lukas Einkemmer and Alexander Ostermann.Convergence analysis of strang splitting for vlasov-type equations.SIAM Journal on Numerical Analysis, 52(1):140–155, January 2014.10.1137/130918599.URL https://doi.org/10.1137%2F130918599.
- Evans [2006]J.Evans.A scalable concurrent malloc(3) implementation for FreeBSD.In proceedings of BSDCan - The Technical BSD Conference, April 2006.URL https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf.Referenced on May 22nd 2024.
- Frank and Reich [2004]J.Frank and S.Reich.On Spurious Reflections, Nonuniform Grids and Finite Difference Discretizations of Wave Equations.CWI, Amsterdam, The Netherlands, 2004.
- Ganse etal. [2023]Urs Ganse, Tuomas Koskela, Markus Battarbee, Yann Pfau-Kempf, Konstantinos Papadakis, Markku Alho, Maarja Bussov, Giulia Cozzani, Maxime Dubart, Harriet George, Evgeny Gordeev, Maxime Grandin, Konstantinos Horaites, Jonas Suni, Vertti Tarvus, FasilTesema Kebede, Lucile Turc, Hongyang Zhou, and Minna Palmroth.Enabling technology for global 3D + 3V hybrid-Vlasov simulations of near-Earth space.Physics of Plasmas, 30(4):042902, 04 2023.ISSN 1070-664X.10.1063/5.0134387.URL https://doi.org/10.1063/5.0134387.
- Ganse etal. [2024]Urs Ganse, Yann Pfau-Kempf, Hongyang Zhou, Liisa Juusola, Abiyot Workayehu, Fasil Kebede, Konstantinos Papadakis, Maxime Grandin, Markku Alho, Markus Battarbee, Maxime Dubart, Venla Koikkalainen, Leo Kotipalo, Arnaud Lalagüe, Jonas Suni, Vertti Tarvus, Konstantinos Horaites, and Minna Palmroth.The vlasiator ionosphere – coupling a magnetospheric hybrid-vlasov simulation with a height-integrated ionosphere model.submitted to Annales Geophysicae, 2024.
- Grandin etal. [2023]Maxime Grandin, Thijs Luttikhuis, Markus Battarbee, Giulia Cozzani, Hongyang Zhou, Lucile Turc, Yann Pfau-Kempf, Harriet George, Konstantinos Horaites, Evgeny Gordeev, Urs Ganse, Konstantinos Papadakis, Markku Alho, Fasil Tesema, Jonas Suni, Maxime Dubart, Vertti Tarvus, and Minna Palmroth.First 3D hybrid-Vlasov global simulation of auroral proton precipitation and comparison with satellite observations.Journal of Space Weather and Space Climate, 13:20, 2023.ISSN 2115-7251.10.1051/SWSC/2023017.
- Honkonen etal. [2013]Ilja Honkonen, Sebastian von Alfthan, Arto Sandroos, Pekka Janhunen, and Minna Palmroth.Parallel grid library for rapid and flexible simulation development.Computer Physics Communications, 184(4):1297 – 1309, 2013.ISSN 0010-4655.10.1016/j.cpc.2012.12.017.
- Kotipalo etal. [2024]L.Kotipalo, M.Battarbee, Y.Pfau-Kempf, and M.Palmroth.Adaptive mesh refinement in vlasiator.Submitted to Jour. Nal., 2024.
- Londrillo and DelZanna [2004]PLondrillo and Luca DelZanna.On the divergence-free condition in godunov-type schemes for ideal magnetohydrodynamics: the upwind constrained transport method.Journal of Computational Physics, 195(1):17–48, March 2004.10.1016/j.jcp.2003.09.016.URL http://www.sciencedirect.com/science/article/pii/S0021999103005102.
- Luebke [2008]David Luebke.Cuda: Scalable parallel programming for high-performance scientific computing.In 2008 5th IEEE international symposium on biomedical imaging: from nano to macro, pages 836–838. IEEE, 2008.
- Matthes etal. [2017]A.Matthes, R.Widera, E.Zenker, B.Worpitz, A.Huebl, and M.Bussmann.Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the alpaka library.In ISC High Performance Workshops 2017, LNCS 10524, pp. 496-514, 2017, Jun 2017.URL https://arxiv.org/abs/1706.10086.
- Nishikawa etal. [2021]Kenichi Nishikawa, Ioana Duţan, Christoph Köhn, and Yosuke Mizuno.PIC methods in astrophysics: Simulations of relativistic jets and kinetic physics in astrophysical systems.Living Reviews in Computational Astrophysics, 7(1):1, July 2021.ISSN 2365-0524.10.1007/s41115-021-00012-0.
- Palmroth etal. [2018]Minna Palmroth, Urs Ganse, Yann Pfau-Kempf, Markus Battarbee, Lucile Turc, Thiago Brito, Maxime Grandin, Sanni Hoilijoki, Arto Sandroos, and Sebastian von Alfthan.Vlasov methods in space physics and astrophysics.Living Reviews in Computational Astrophysics, 4(1):1, August 2018.ISSN 2365-0524.10.1007/s41115-018-0003-2.
- Palmroth etal. [2023]Minna Palmroth, TuijaI. Pulkkinen, Urs Ganse, Yann Pfau-Kempf, Tuomas Koskela, Ivan Zaitsev, Markku Alho, Giulia Cozzani, Lucile Turc, Markus Battarbee, Maxime Dubart, Harriet George, Evgeniy Gordeev, Maxime Grandin, Konstantinos Horaites, Adnane Osmane, Konstantinos Papadakis, Jonas Suni, Vertti Tarvus, Hongyang Zhou, and Rumi Nakamura.Magnetotail plasma eruptions driven by magnetic reconnection and kinetic instabilities.Nature Geoscience 2023 16:7, 16(7):570–576, 6 2023.ISSN 1752-0908.10.1038/s41561-023-01206-2.
- Papadakis etal. [2022]K.Papadakis, Y.Pfau-Kempf, U.Ganse, M.Battarbee, M.Alho, M.Grandin, M.Dubart, L.Turc, H.Zhou, K.Horaites, I.Zaitsev, G.Cozzani, M.Bussov, E.Gordeev, F.Tesema, H.George, J.Suni, V.Tarvus, and M.Palmroth.Spatial filtering in a 6d hybrid-vlasov scheme for alleviating amr artifacts: a case study with vlasiator, versions 5.0, 5.1, 5.2.1.Geoscientific Instrumntation, Methods and Data Systems, 2022:1–18, 2022.10.5194/gmd-15-7903-2022.
- Papadakis etal. [2024a]K.Papadakis, M.Battarbee, U.Ganse, Y.Pfau-Kempf, and M.Palmroth.Hashinator: A portable hybrid hashmap designed for heterogeneous high performance computing.Accepted for publication in Frontiers in Computer Science Software, 2024a.10.3389/fcomp.2024.1407365.
- Papadakis etal. [2024b]Kostis Papadakis, Markus Battarbee, and René Widera.fmihpc/hashinator: v1.0.1 hashinator stable, 2024b.URL https://zenodo.org/doi/10.5281/zenodo.11396297.
- Pekkilä etal. [2022]Johannes Pekkilä, MiikkaS. Väisälä, MaaritJ. Käpylä, Matthias Rheinhardt, and Oskar Lappi.Scalable communication for high-order stencil computations using cuda-aware mpi.Parallel Computing, 111:102904, 2022.ISSN 0167-8191.https://doi.org/10.1016/j.parco.2022.102904.URL https://www.sciencedirect.com/science/article/pii/S0167819122000102.
- Pfau-Kempf etal. [2018]Yann Pfau-Kempf, Markus Battarbee, Urs Ganse, Sanni Hoilijoki, Lucile Turc, Sebastian von Alfthan, Rami Vainio, and Minna Palmroth.On the importance of spatial and velocity resolution in the hybrid-vlasov modeling of collisionless shocks.Frontiers in Physics, 2018.10.3389/fphy.2018.00044.
- Pfau-Kempf etal. [2020]Yann Pfau-Kempf, Minna Palmroth, Andreas Johlander, Lucile Turc, Markku Alho, Markus Battarbee, Maxime Dubart, Maxime Grandin, and Urs Ganse.Hybrid-vlasov modeling of three-dimensional dayside magnetopause reconnection.Physics of Plasmas, 27(9):092903, 2020.10.1063/5.0020685.
- Pfau-Kempf etal. [2022]Yann Pfau-Kempf, Sebastian von Alfthan, Urs Ganse, Arto Sandroos, Markus Battarbee, Tuomas Koskela, OttoAkseli Hannuksela, Ilja Honkonen, Kostis Papadakis, Leo Kotipalo, Hongyang Zhou, Maxime Grandin, Dimitry Pokhotelov, and Markku Alho.fmihpc/vlasiator: Vlasiator, 6 2022.URL https://doi.org/10.5281/zenodo.3640593.
- Sandroos etal. [2013]A.Sandroos, I.Honkonen, S.von Alfthan, and M.Palmroth.Multi-gpu simulations of vlasov’s equation using vlasiator.Parallel Computing, 39(8):306–318, 2013.ISSN 0167-8191.https://doi.org/10.1016/j.parco.2013.05.001.URL https://www.sciencedirect.com/science/article/pii/S0167819113000574.
- Strang [1968]Gilbert Strang.On the construction and comparison of difference schemes.SIAM Journal on Numerical Analysis, 5(3):506–517, September 1968.10.1137/0705041.URL https://doi.org/10.1137%2F0705041.
- Tarvus etal. [2024]V.Tarvus, L.Turc, H.Zhou, T.Nakamura, A.Settino, K.Blasl, G.Cozzani, U.Ganse, Y.Pfau-Kempf, M.Alho, M.Battarbee, M.Bussov, M.Dubart, E.Gordeev, F.Kebede, K.Papadakis, J.Suni, I.Zaitsev, and M.Palmroth.Hybrid-vlasov modelling of ion velocity distribution functions associated with the kelvin-helmholtz instability with a density and temperature asymmetry.Under revision at the astrophysical journal, 2024.
- Trott etal. [2022]ChristianR. Trott, Damien Lebrun-Grandié, Daniel Arndt, Jan Ciesko, Vinh Dang, Nathan Ellingwood, Rahulkumar Gayatri, Evan Harvey, DaisyS. Hollman, Dan Ibanez, Nevin Liber, Jonathan Madsen, Jeff Miles, David Poliakoff, Amy Powell, Sivasankaran Rajamanickam, Mikael Simberg, Dan Sunderland, Bruno Turcksin, and Jeremiah Wilke.Kokkos 3: Programming model extensions for the exascale era.IEEE Transactions on Parallel and Distributed Systems, 33(4):805–817, 2022.10.1109/TPDS.2021.3097283.
- Tsai etal. [2021]YuhsiangM. Tsai, Terry Cojean, Tobias Ribizel, and Hartwig Anzt.Preparing ginkgo for amd gpus – a testimonial on porting cuda code to hip.In Bartosz Balis, Dora B.Heras, Laura Antonelli, Andrea Bracciali, Thomas Gruber, Jin Hyun-Wook, Michael Kuhn, StephenL. Scott, Didem Unat, and Roman Wyrzykowski, editors, Euro-Par 2020: Parallel Processing Workshops, pages 109–121, Cham, 2021. Springer International Publishing.ISBN 978-3-030-71593-9.
- Videau [2022]Brice Videau.Bringing HIP to oneAPI, 2022.Accessed on December 1. 2023.
- Vijayaraghavan etal. [2017]Thiruvengadam Vijayaraghavan, Yasuko Eckert, GabrielH. Loh, MichaelJ. Schulte, Mike Ignatowski, BradfordM. Beckmann, WilliamC. Brantley, JosephL. Greathouse, Wei Huang, Arun Karunanithi, Onur Kayiran, Mitesh Meswani, Indrani Paul, Matthew Poremba, Steven Raasch, StevenK. Reinhardt, Greg Sadowski, and Vilas Sridharan.Design and analysis of an apu for exascale computing.In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 85–96, 2017.10.1109/HPCA.2017.42.
- Vlasov [1961]AnatolyA. Vlasov.Many-particle Theory and Its Application to Plasma.Gordon & Breach Science Publishers Ltd, 1961.ISBN 9780677203300.
- von Alfthan etal. [2014]S.von Alfthan, D.Pokhotelov, Y.Kempf, S.Hoilijoki, I.Honkonen, A.Sandroos, and M.Palmroth.Vlasiator: First global hybrid-vlasov simulations of earth’s foreshock and magnetosheath.Journal of Atmospheric and Solar-Terrestrial Physics, 120(0):24 – 35, 2014.ISSN 1364-6826.10.1016/j.jastp.2014.08.012.
- von Alfthan [2023]Sebastian von Alfthan.Phiprof – parallel hierarchical profiler.https://github.com/fmihpc/phiprof, 2023.
- Walker and Dongarra [1996]DavidW Walker and JackJ Dongarra.Mpi: a standard message passing interface.Supercomputer, 12:56–68, 1996.
- Wang etal. [2014]Kaibo Wang, Xiaoning Ding, Rubao Lee, Shinpei Kato, and Xiaodong Zhang.Gdm: device memory management for gpgpu computing.In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’14, page 533–545, New York, NY, USA, 2014. Association for Computing Machinery.ISBN 9781450327893.10.1145/2591971.2592002.URL https://doi.org/10.1145/2591971.2592002.
- Yee [1966]Kane Yee.Numerical solution of initial boundary value problems involving maxwell’s equations in isotropic media.IEEE Transactions on Antennas and Propagation, 14(3):302–307, May 1966.ISSN 1558-2221.10.1109/TAP.1966.1138693.
- Zerroukat and Allen [2012]M.Zerroukat and T.Allen.A three-dimensional monotone and conservative semi-lagrangian scheme (slice-3d) for transport problems.Quarterly Journal of the Royal Meteorological Society, 138(667):1640–1651, 2012.10.1002/qj.1902.
- Zhang etal. [2019]Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, CyChan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, and Michael Zingale.AMReX: a framework for block-structured adaptive mesh refinement.Journal of Open Source Software, 4(37):1370, May 2019.10.21105/joss.01370.URL https://doi.org/10.21105/joss.01370.
- Zheng etal. [2016]Tianhao Zheng, David Nellans, Arslan Zulfiqar, Mark Stephenson, and StephenW. Keckler.Towards high performance paged memory for gpus.In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 345–357, 2016.10.1109/HPCA.2016.7446077.
- Zhu etal. [2018]Maohua Zhu, Youwei Zhuo, Chao Wang, Wenguang Chen, and Yuan Xie.Performance evaluation and optimization of hbm-enabled gpu for data-intensive applications.IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(5):831–840, 2018.10.1109/TVLSI.2018.2791442.