GriddingMachine: A new database and software for sharing global datasets

Researchers are spending way too much time finding, reading, and processing public data. The ever-increasing volume of data, the variety of data formats, and the differing data layouts all add to the time spent handling data before any scientific analysis can begin. While the intention of sharing data is to facilitate their broad use and promote research, this growing fragmentation makes the data harder to find and access. Taking my personal experience as an example, I spent months identifying, downloading, and standardizing the global datasets we use with the CliMA Land model, which came in a plethora of formats (e.g., NetCDF, GeoTIFF, CSV, and binary) that required different programming languages and packages to read. Ordinarily, each researcher would need to repeat this tedious work again and again.

Figure 1. Pathway to assemble and distribute the GriddingMachine database (figure from Wang et al. (2022)).

Enter GriddingMachine. GriddingMachine aims to minimize the effort involved in reusing data by

  • Collecting data from various sources,
  • Processing the data to a uniform format (NetCDF),
  • Storing the reprocessed data on public servers, and
  • Providing APIs to automatically download, manage, and read the data in multiple programming languages.

Each dataset is labeled with a unique tag that describes

  • Type of the dataset (e.g., leaf area index, biomass, etc.),
  • Spatial resolution (e.g., 5X means 5 grid cells per degree, i.e., a 0.2° × 0.2° grid),
  • Temporal resolution (e.g., 1Y means 1 year, 1M means 1 month),
  • Year of the data, and
  • Version of the data (from different publications).
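To make the tag layout concrete, here is a minimal Python sketch that splits a tag into the parts listed above. It is not part of GriddingMachine: `DatasetTag`, `TAG_PATTERN`, and `parse_tag` are hypothetical names, and the pattern assumes every tag follows the same layout as the example tag LAI_MODIS_2X_8D_2020_V1 (label, spatial resolution, temporal resolution, year, version). Tags that deviate from that layout would need a looser pattern.

```python
import re
from typing import NamedTuple


class DatasetTag(NamedTuple):
    """Hypothetical container for the parts of a dataset tag."""
    label: str             # dataset type (and source), e.g. "LAI_MODIS"
    cells_per_degree: int  # the N in "NX"
    resolution_deg: float  # grid cell size in degrees, 1 / N
    temporal: str          # e.g. "8D", "1M", "1Y"
    year: int
    version: str           # e.g. "V1"


# Assumed layout: <label>_<N>X_<temporal>_<year>_<version>
TAG_PATTERN = re.compile(
    r"^(?P<label>[A-Za-z0-9_]+?)"
    r"_(?P<n>\d+)X"
    r"_(?P<temporal>\d+[A-Z])"
    r"_(?P<year>\d{4})"
    r"_(?P<version>V\d+)$"
)


def parse_tag(tag: str) -> DatasetTag:
    """Split a tag into its parts; raise ValueError if the layout differs."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"tag does not follow the assumed layout: {tag!r}")
    n = int(match["n"])
    if n == 0:
        raise ValueError("spatial resolution must be at least 1X")
    return DatasetTag(
        label=match["label"],
        cells_per_degree=n,
        resolution_deg=1.0 / n,  # NX means N cells per degree
        temporal=match["temporal"],
        year=int(match["year"]),
        version=match["version"],
    )
```

For example, `parse_tag("LAI_MODIS_2X_8D_2020_V1")` gives the label "LAI_MODIS", a 0.5° grid, the temporal code "8D", the year 2020, and the version "V1".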

Users can simply look up available dataset tags (and suggest new datasets) through our GitHub repository.

With the unique tag, one can query the data directly via the function query_collection. For example, query_collection("LAI_MODIS_2X_8D_2020_V1") retrieves the global MODIS leaf area index for the year 2020; this automatically downloads the reprocessed dataset from the public server that hosts it. Alternatively, one can query data for a single site using its latitude and longitude without downloading the dataset. For example, request_LUT("LAI_MODIS_2X_8D_2020_V1", 34.1478, -118.1445) returns the 8-day MODIS leaf area index in 2020 at Pasadena, CA, USA. See our paper (Wang et al., 2022) and the online documentation for more information about how to use GriddingMachine. It supports Julia, MATLAB, Octave, Python, and R.
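To give a feel for what a site query involves, the Python sketch below maps a latitude and longitude to the grid cell that contains it. This is an illustration only, not GriddingMachine's implementation: `grid_index` is a hypothetical helper, and it assumes a regular global grid whose rows start at 90°S and whose columns start at 180°W. The actual ordering of the stored arrays may differ.

```python
import math
from typing import Tuple


def grid_index(lat: float, lon: float, cells_per_degree: int) -> Tuple[int, int]:
    """Return the zero-based (row, col) of the cell containing (lat, lon).

    Assumption (not from GriddingMachine): rows count northward from 90°S
    and columns count eastward from 180°W.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must be within [-90, 90]")
    if not -180.0 <= lon <= 180.0:
        raise ValueError("longitude must be within [-180, 180]")
    if cells_per_degree < 1:
        raise ValueError("cells_per_degree must be at least 1")

    n_rows = 180 * cells_per_degree
    n_cols = 360 * cells_per_degree

    # Clamp so that lat = 90 and lon = 180 fall into the last cell
    # instead of one past the end of the grid.
    row = min(math.floor((lat + 90.0) * cells_per_degree), n_rows - 1)
    col = min(math.floor((lon + 180.0) * cells_per_degree), n_cols - 1)
    return row, col


# Pasadena, CA on a 2X (0.5°) grid
print(grid_index(34.1478, -118.1445, 2))  # (248, 123)
```

Under these assumptions, only the values stored at that one cell need to be returned for a site query, which is why no full download is required on the user's side.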

GriddingMachine makes it much easier to reuse public datasets, because users are shielded from the details of finding, formatting, and reading the data. We welcome contributions of additional gridded data to our collection.