
A Discussion with CliMA’s Lead Software Developer Simon Byrne on his Team’s Parallel Input/Output Work

CliMA’s software team, led by Simon Byrne, added a software interface to ClimaCore.jl for saving and loading data from distributed simulations. We caught up with Dr. Byrne near the Gong inside the CliMA conference room; an edited version of our interview is reproduced below.

Leilani Rivera-Dotson: Why did your team implement this interface?

Dr. Byrne: There are two main reasons we need to be able to save and load data: first, to save the quantities of interest, such as temperature and precipitation, for analysis and other post-processing tasks; and second, to capture the state of the model so that we can reproduce or resume the simulation from a particular point in time. The latter is important if there is a hardware fault, or some other anomaly that we need to investigate; files saved for this purpose are commonly called restart files.

Our team had two key aims to aid usability:

  • The first was to make the restart file fully self-contained: it should be possible to load the data without having access to the input scripts or files used to generate it. This makes analysis much easier and less error-prone, but it does mean that we need to store all the mesh and other spatial information inside the file.
  • The second was to be able to save and reload the data on different numbers of processors: for example, running a simulation on a large number of nodes, but then loading a snapshot of the data locally on a laptop. This is feasible because we partition the mesh along a space-filling curve, so we can store the data in the same global order regardless of the number of processors (see the sketch after the figure below).
[Figure: Space-filling curve]
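To make this concrete, here is a minimal sketch, not CliMA’s actual implementation, of why a space-filling curve decouples the data layout from the processor count: elements are ordered along a Z-order (Morton) curve, giving one fixed global order, and each rank simply owns a contiguous slice of that order.

```julia
# A minimal sketch (not the ClimaCore implementation) of space-filling-curve
# partitioning: elements get a fixed global order along a Z-order (Morton)
# curve, and each rank owns a contiguous slice of that order.

# Interleave the bits of (i, j) to get the Morton index of a 2D element.
function morton(i::UInt32, j::UInt32)
    code = UInt64(0)
    for b in 0:31
        code |= (UInt64(i >> b) & 1) << (2b)      # bits of i at even positions
        code |= (UInt64(j >> b) & 1) << (2b + 1)  # bits of j at odd positions
    end
    return code
end

# Fixed global element order along the curve for an n × n grid of elements.
curve_order(n) = sort!([(i, j) for i in 0:n-1 for j in 0:n-1],
                       by = e -> morton(UInt32(e[1]), UInt32(e[2])))

# Contiguous slice of the global order owned by `rank` out of `nranks`.
function owned_range(nelems, nranks, rank)
    lo = div(rank * nelems, nranks) + 1
    hi = div((rank + 1) * nelems, nranks)
    return lo:hi
end

order = curve_order(4)   # 16 elements in one fixed global order
owned_range(16, 8, 0)    # rank 0 of 8 owns elements 1:2 of that order
owned_range(16, 1, 0)    # a single process (e.g. a laptop) owns 1:16
```

Because the global order never changes, a file written in that order by many ranks can be read back as one contiguous array by a single process.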

Leilani Rivera-Dotson: How did your team approach the implementation?

Dr. Byrne: Data files and formats are tricky from a software point of view, as they are snapshots from a particular point in time. Unlike the code itself, which we modify and improve over time, adding new features and refining interfaces, the data from a particular run will remain unchanged. We need to make sure that future versions of our software stay backward compatible with files written today.

Another consideration is that the data files might outlive the lifetime of the software itself: how might a future user who is unable to run our code be able to access our data?

Leilani Rivera-Dotson: How did your team solve these problems?

Dr. Byrne: These are difficult things to guarantee, but there are several steps we took to make this feasible.

First, we chose to use a pre-existing container format rather than inventing our own. We chose HDF5 for the initial version, as it is very widely used in scientific computing and supports distributed MPI I/O, but there are other candidates, such as Zarr and ADIOS2, that we aim to support in the future.
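As an illustration of what distributed writing looks like at this level, here is a sketch using HDF5.jl and MPI.jl directly; it requires an MPI-enabled HDF5 build, and the file and dataset names are hypothetical, not the ClimaCore.jl interface. Every rank opens the same file with an MPI communicator and writes its contiguous slice of each global dataset.

```julia
# Sketch of distributed MPI I/O with HDF5.jl and MPI.jl; requires an
# MPI-enabled HDF5 library. File/dataset names here are hypothetical.
using MPI, HDF5

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

nglobal = 1024                           # global length of the field
nlocal = div(nglobal, nranks)            # assume it divides evenly, for brevity
offset = rank * nlocal
localdata = fill(Float64(rank), nlocal)  # this rank's slice of the field

# All ranks open the same file, passing the MPI communicator.
h5open("restart.h5", "w", comm) do file
    # Dataset creation is collective: every rank makes the same call.
    dset = create_dataset(file, "temperature", Float64, (nglobal,))
    # Each rank then writes its own contiguous hyperslab of the dataset.
    dset[offset+1:offset+nlocal] = localdata
end

MPI.Finalize()
```

Run under `mpiexec`, each rank contributes its part of the global array to a single shared file.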

Second, we developed a written schema describing how the data is structured inside the container. Besides being an important reference, it will make it easier both to support new container formats and for other users to read the data we generate.
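To give a flavor of what such a schema specifies, here is a hypothetical sketch of the kind of layout it pins down; the group names below are illustrative, not a reproduction of the actual schema document.

```julia
# Hypothetical sketch of a schema's layout: everything a reader needs to
# reconstruct the fields is stored in the file itself. Names are illustrative.
using HDF5

h5open("restart.h5", "w") do file
    create_group(file, "domains")          # the geometry the fields live on
    create_group(file, "meshes")           # how a domain is divided into elements
    create_group(file, "topologies")       # element connectivity and ordering
    create_group(file, "spaces")           # the discretization built on a topology
    fields = create_group(file, "fields")  # the saved variables themselves
    fields["temperature"] = zeros(16)      # each field refers back to its space
end
```

Writing this structure down once means a reader in any language with an HDF5 library can navigate the file without running our code.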

Third, we only save what we need: a lot of what changes over time consists of purely internal data structures that can be recomputed from the other information, so they don’t need to be saved in the file.
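As an illustrative sketch of the pattern (with made-up names): only the prognostic state is persisted, and derived quantities are rebuilt on load.

```julia
# Illustrative pattern (made-up names): persist only the prognostic state;
# rebuild derived/internal quantities on load instead of storing them.
using HDF5

struct State
    temperature::Vector{Float64}  # prognostic: saved in the file
    pressure::Vector{Float64}     # prognostic: saved in the file
end

# Derived quantity: cheap to recompute, so it is never written to disk.
density(s::State; R = 287.0) = s.pressure ./ (R .* s.temperature)

function load_state(path)
    h5open(path, "r") do file
        # Read only what was saved ...
        s = State(read(file, "fields/temperature"),
                  read(file, "fields/pressure"))
        # ... and recompute internal caches from it.
        return s, density(s)
    end
end
```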

Finally, we store the version of the software that wrote the file, so that we can add code to support older files if we make changes to the format in the future.
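In HDF5 this can be as simple as a file-level attribute; a minimal sketch, where the attribute name and version numbers are hypothetical:

```julia
# Minimal sketch of version stamping; attribute name and versions are
# hypothetical, not the actual ClimaCore.jl format.
using HDF5

# When writing: stamp the file with the version of the writing software.
h5open("restart.h5", "cw") do file
    write_attribute(file, "software_version", "1.2.3")
end

# When reading: branch on the recorded version to keep older files readable.
h5open("restart.h5", "r") do file
    v = VersionNumber(read_attribute(file, "software_version"))
    if v < v"1.2.0"
        # apply compatibility shims for the older file layout here
    end
end
```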

Leilani Rivera-Dotson: Any final thoughts for our readers?

Dr. Byrne: High-resolution climate simulations generate astronomical amounts of data, so it is critical to ensure that the data formats we use are both efficient and readily accessible by future scientists.