How to Choose When and Where to Acquire Data to Calibrate Climate Models

Climate models depend on dynamics across a huge range of spatial and temporal scales. Resolving all scales that matter for climate–from the scales of cloud droplets to planetary circulations–will be impossible for the foreseeable future. Therefore, it remains critical to link what is unresolvable to variables resolved on the model grid scale. Parameterization schemes are a tool to bridge such scales; they provide simplified representations of the smallest scales by introducing new empirical parameters. An important source of uncertainty in climate projections comes from uncertainty about these parameters, in addition to uncertainties about the structure of the parameterization schemes themselves. For example, Suzuki et al. (2013) and Zhao et al. (2016) showed how different choices of parameters affecting cloud microphysics can drastically alter the response of climate models to increases in atmospheric greenhouse gas concentrations.

One form of data: high-resolution simulations

Pacific Ocean map of locations sampled — Figure 1: Current sites of high-resolution cloud-resolving simulations in our library, focusing on low clouds over the East Pacific (Shen et al., 2022; Fig. 1).

Data produced computationally, in high-resolution simulations over limited areas and time periods, can be used to calibrate parameterizations of processes that are in principle computable, but for which current computational capacities do not suffice to resolve them globally. Such processes include, for example, the turbulent and convective dynamics of clouds, for which we are continuing to expand a library of high-resolution simulations (Figure 1). This raises the question when and where to conduct such high-resolution simulations, for them to be maximally informative about the parameterized processes. (Similar questions of experimental design also arise in the choice of sites for observational field campaigns or in the design of new satellite missions.).

Not all data sites are created equal

In Dunbar et al. (2022), we describe an algorithm that automates the optimal choice of sites for acquiring new data for calibration of parameterization schemes. We demonstrate the algorithm in a relatively simple proof-of-concept with an idealized general circulation model (GCM), developed by Frierson et al. (2006) and O’Gorman & Schneider (2008). The GCM has a convection scheme with two key parameters: a reference relative humidity and a convective relaxation time. To demonstrate the algorithm, we consider calibrating the two convection parameters by using the zonal- and time-mean model output statistics of the relative humidity in the mid-troposphere, the precipitation rate, and a measure of intense precipitation (Figure 2). We ask the question: suppose we can acquire data at only one latitude (the GCM is statistically zonally symmetric, so variations along longitude do not matter), which latitude is most informative about the convection parameters?

precipitation and latitude graph — Figure 2: Mean (black) and two standard deviations (gray) of three model-generated climate statistics: the relative humidity, the precipitation rate, and the probability of exceeding the 90th percentile of the daily precipitation in a control simulation (which is 10% in the control). Each colored disc represents a realization of model-generated data at one of four sites (vertical lines). (From Dunbar et al. 2022, Fig. 3.)

In Figure 2, four representative data samples for 4 latitudes are given in different colors. The resulting learnt joint distributions from calibrating convection parameters for each latitude are plotted in Figure 3. At the optimal latitude (-19°, colored yellow), around the subtropical precipitation minimum, the parameter distribution is well constrained; at the worst latitude (-75°, colored dark green) near the south pole, it is largely unconstrained (close to the prior distribution). As our aim is to reduce the uncertainty of calibrated parameters, the extent to which the posterior uncertainty of the constrained distributions is reduced from the prior indicates the information provided by data at a site.

timescale and humidity graphs — Figure 3: Shaded joint distribution of convection parameters when calibrating with data acquired from each of the four sites. The different shading levels (dark to light) contain 50%, 75%, and 99% of the total mass of the distribution. (From Dunbar et al. 2022, Fig. 7.)

A framework to automate decision making

To automate site selection for data acquisition so as to optimize the information gain, or uncertainty reduction, we devise an algorithm that maximizes an entropy-based measure of the utility of different sites. The algorithm builds on the calibrate-emulate-sample (CES) algorithm, which accelerates uncertainty quantification in complex computer models (open source code). The resulting algorithm is parallelizable and very efficient with respect to the number of model evaluations required to obtain optimal acquisition sites.

log(u) and latitude graph — Figure 4. Logarithm of utility of latitudes as produced from the automated algorithm. Colored discs show the scores of the four representative latitudes (Dunbar et al. 2022, Fig. 5).

The utility the algorithm assigns to any given location (or time period) is based on a Bayesian analysis of the parameter learning problem for each site. The utility reflects the global model’s sensitivity to new data at each site, a good indicator of information gain. The site-dependent sensitivity to parameters can be calculated from the global model at any site without actually acquiring any data. As data acquisition may involve running a high-resolution simulation, being able to choose sites prior to this is advantageous. Applying the algorithm to the idealized GCM, we find that the utility function (Figure 4) indeed identifies the correct optimal latitude to acquire data (-19°, yellow disc). In our paper, we also show that the optimal site-season pairs for data acquisition can be determined similarly in a seasonally varying configuration of the GCM.

Beyond physical intuition

This simple experiment with an idealized GCM already indicates that making decisions guided only by physical intuition, in data limited problems, may not always lead to optimal choices. One intelligent guess, for example, would be that the optimal site to learn parameters for convection lies near the ITCZ, where convective tendencies are largest (Figure 2, 3, 4; latitude -3°, brown disc). However, this is not necessarily the case. In the example in Figure 2, it is in fact the region around the subtropical precipitation minimum (latitude -19°, yellow disc) that is most sensitive to changes in model parameters and hence is the most useful site for additional data acquisition–presumably because the intense convective precipitation events occurring there on occasion are particularly informative about parameters in the convection scheme

Accelerating the science workflow

looped stages of scientific knowledge discovery — Figure 5. The scientific knowledge discovery loop: Models can be used to target data acquisition, whether this is experimental data, observations, or simulations; in turn, the acquired data can be used to revise and improve models. The whole loop can now be automated and iterated rapidly.

Our combined works (Dunbar et al. 2021, Dunbar et al. 2022, Howland et al. 2021, Shen et al. 2022) take steps toward automation of both (1) targeting data acquisition to optimize reduction of uncertainty in model parameters and (2) calibrating models and quantifying their uncertainties with the so-acquired data. These two stages form the scientific discovery loop (Figure 5). With a fully automated workflow in place, the loop can be iterated rapidly, perhaps thousands or tens of thousands of times in the case of learning about parameterizations in a climate model from high-resolution simulations. This can lead to rapid improvements in reducing parametric uncertainties in future climate models. Similar automated workflows are now becoming possible in many fields of science and engineering, as outlined in a recent report by the National Academies.