Build an intake-ESM catalog from CESM timeseries files#

In this example, we will cover how to build a data catalog from Community Earth System Model (CESM) output. One of the requirements for using intake-esm is having a catalog which is comprised of two pieces:

  • A Comma Separated Value (CSV) file with relevant metadata (ex. file path, variable, stream, etc.)

  • A JSON file describing the contents of the CSV file, including how to combine compatible datasets into a single dataset.

Typically, these pieces are constructed “manually” using information within the file path, on a very ad-hoc basis. Also, these catalogs are typically only created for “larger”, community datasets, not neccessarily used within smaller model runs/daily workflows. A package (currently a prototype), called ecgtools works to solve the issues of generating these intake-esm catalogs. Ecgtools stands for Earth System Model (ESM) Catalog Generation tools. The current built-in catalog generation tools supported are:

  • CMIP6 models

  • CESM “history” files

  • CESM “timeseries” files

This example provides an overview of using ecgtools for parsing CESM timeseries file model output, and reading in the data using Intake-ESM. In this example, we use sample CESM data within the test directory for Intake-ESM.

Installing ecgtools#

You can install ecgtools through PyPI or conda-forge. Examples of the syntax are provided below:

Using the pip package manager:

$ python -m pip install ecgtools

Using the conda package manager that comes with the Anaconda/Miniconda distribution:

$ conda install ecgtools --channel conda-forge

Import packages#

The only parts of ecgtools we need are the Builder object and the parse_cesm_history parser from the CESM parsers! We import pathlib to take a look at the files we are parsing.

import pathlib

import intake
from ecgtools import Builder
from ecgtools.parsers.cesm import parse_cesm_timeseries

Understanding the directory structure#

The first step to setting up the Builder object is determining where your files are stored. As mentioned previously, we have a sample dataset of CESM2 model output, which is stored in test directory /tests/sample_data directory of this repository.

Taking a look at that directory, we see that there is a single case g.e11_LENS.GECOIAF.T62_g16.009

root_path = pathlib.Path('../../../tests/sample_data/cesm/').absolute()

sorted(root_path.rglob('*'))
[PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/v2022.9.18/docs/source/how-to/../../../tests/sample_data/cesm/g.e11_LENS.GECOIAF.T62_g16.009.pop.h.ECOSYS_XKW.024901-031612.nc'),
 PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/v2022.9.18/docs/source/how-to/../../../tests/sample_data/cesm/g.e11_LENS.GECOIAF.T62_g16.009.pop.h.ecosys.nday1.CaCO3_form_zint.02490101-03161231.nc'),
 PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/v2022.9.18/docs/source/how-to/../../../tests/sample_data/cesm/g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.O2.024901-031612.nc')]

Now that we understand the directory structure, let’s make the catalog.

Build the catalog#

Let’s start by setting the builder object up.

cat_builder = Builder(
    # Directory with the output
    root_path=root_path,
    # Depth of 1 since we are sending it to the case output directory
    depth=1,
    # Exclude the timeseries and restart directories
    exclude_patterns=["*/tseries/*", "*/rest/*"],
    # Number of jobs to execute - should be equal to # threads you are using
    njobs=5,
)

cat_builder
Builder(root_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/v2022.9.18/docs/source/how-to/../../../tests/sample_data/cesm'), extension='.nc', depth=1, exclude_patterns=['*/tseries/*', '*/rest/*'], njobs=5)

We are good to go! Let’s build the catalog by calling .build() on the object, passing in the parse_cesm_history parser, which is a built-in parser for CESM history files.

cat_builder = cat_builder.build(parse_cesm_timeseries)
[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done   1 out of   1 | elapsed:    1.6s finished
[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/conda/v2022.9.18/lib/python3.10/site-packages/ecgtools/parsers/cesm.py:240: UserWarning: Using the default frequency definitions
  warnings.warn('Using the default frequency definitions')
/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/conda/v2022.9.18/lib/python3.10/site-packages/ecgtools/parsers/cesm.py:240: UserWarning: Using the default frequency definitions
  warnings.warn('Using the default frequency definitions')
/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/conda/v2022.9.18/lib/python3.10/site-packages/ecgtools/parsers/cesm.py:240: UserWarning: Using the default frequency definitions
  warnings.warn('Using the default frequency definitions')
[Parallel(n_jobs=5)]: Done   3 out of   3 | elapsed:    2.1s remaining:    0.0s
[Parallel(n_jobs=5)]: Done   3 out of   3 | elapsed:    2.1s finished

Inspect the built catalog#

Now that the catalog is built, we can inspect the dataframe which is used to create the catalog by calling .df on the builder object:

cat_builder.df
component stream case member_id variable start_time end_time time_range long_name units vertical_levels frequency path
0 ocn pop.h g.e11_LENS.GECOIAF.T62_g16.009 009 ECOSYS_XKW 0249-01 0316-12 024901-031612 XKW for ecosys fluxes cm/s 1 month_1 /home/docs/checkouts/readthedocs.org/user_buil...
1 ocn pop.h.ecosys.nday1 g.e11_LENS.GECOIAF.T62_g16.009 009 CaCO3_form_zint 0249-01-01 0316-12-31 02490101-03161231 CaCO3 Formation Vertical Integral mmol/m^3 cm/s 1 day_1 /home/docs/checkouts/readthedocs.org/user_buil...
2 ocn pop.h g.e11_LENS.GECOIAF.T62_g16.009 009 O2 0249-01 0316-12 024901-031612 Dissolved Oxygen mmol/m^3 1 month_1 /home/docs/checkouts/readthedocs.org/user_buil...

The resultant dataframe includes the:

  • Component

  • Stream

  • Case

  • Date

  • Frequency

  • Variables

  • Path

We can also check to see which files were not parsed by calling .invalid_assets

cat_builder.invalid_assets

This is empty, as expected!

Save the catalog#

Now that we have our data catalog, we can save it, by specifying the path to the comma separated values file (csv) or compressed csv (csv.gz).

cat_builder.save(
    '/tmp/cesm_sample_data.csv',
    # Column name including filepath
    path_column_name='path',
    # Column name including variables
    variable_column_name='variable',
    # Data file format - could be netcdf or zarr (in this case, netcdf)
    data_format="netcdf",
    # Which attributes to groupby when reading in variables using intake-esm
    groupby_attrs=["component", "stream", "case"],
    # Aggregations which are fed into xarray when reading in data using intake
    aggregations=[
        {'type': 'union', 'attribute_name': 'variable'},
        {
            "type": "join_existing",
            "attribute_name": "time_range",
            "options": {"dim": "time", "coords": "minimal", "compat": "override"},
        },
    ],
)
Saved catalog location: /tmp/cesm_sample_data.json and /tmp/cesm_sample_data.csv

Use the catalog to read in data#

data_catalog = intake.open_esm_datastore("/tmp/cesm_sample_data.json")
data_catalog

catalog with 2 dataset(s) from 3 asset(s):

unique
component 1
stream 2
case 1
member_id 1
variable 3
start_time 2
end_time 2
time_range 2
long_name 3
units 3
vertical_levels 1
frequency 2
path 3
derived_variable 0
dsets = data_catalog.to_dataset_dict()
dsets
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.stream.case'
100.00% [2/2 00:00<00:00]
{'ocn.pop.h.ecosys.nday1.g.e11_LENS.GECOIAF.T62_g16.009': <xarray.Dataset>
 Dimensions:          (nlat: 5, nlon: 5, time: 12)
 Coordinates:
     ULONG            (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
     TLONG            (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
     ULAT             (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
     TLAT             (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
   * time             (time) object 0249-01-02 00:00:00 ... 0249-01-13 00:00:00
 Dimensions without coordinates: nlat, nlon
 Data variables:
     CaCO3_form_zint  (time, nlat, nlon) float32 dask.array<chunksize=(12, 5, 5), meta=np.ndarray>
 Attributes: (12/16)
     intake_esm_vars:                   ['CaCO3_form_zint']
     intake_esm_attrs:component:        ocn
     intake_esm_attrs:stream:           pop.h.ecosys.nday1
     intake_esm_attrs:case:             g.e11_LENS.GECOIAF.T62_g16.009
     intake_esm_attrs:member_id:        9
     intake_esm_attrs:variable:         CaCO3_form_zint
     ...                                ...
     intake_esm_attrs:units:            mmol/m^3 cm/s
     intake_esm_attrs:vertical_levels:  1
     intake_esm_attrs:frequency:        day_1
     intake_esm_attrs:path:             /home/docs/checkouts/readthedocs.org/u...
     intake_esm_attrs:_data_format_:    netcdf
     intake_esm_dataset_key:            ocn.pop.h.ecosys.nday1.g.e11_LENS.GECO...,
 'ocn.pop.h.g.e11_LENS.GECOIAF.T62_g16.009': <xarray.Dataset>
 Dimensions:     (nlat: 5, nlon: 5, time: 12, sigma: 5)
 Coordinates:
     TLAT        (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
     ULONG       (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
     ULAT        (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
     TLONG       (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
   * time        (time) object 0249-02-01 00:00:00 ... 0250-01-01 00:00:00
   * sigma       (sigma) float64 23.4 23.45 23.5 23.55 23.6
 Dimensions without coordinates: nlat, nlon
 Data variables:
     ECOSYS_XKW  (time, nlat, nlon) float32 dask.array<chunksize=(12, 5, 5), meta=np.ndarray>
     O2          (time, sigma, nlat, nlon) float32 dask.array<chunksize=(12, 5, 5, 5), meta=np.ndarray>
 Attributes:
     intake_esm_attrs:component:        ocn
     intake_esm_attrs:stream:           pop.h
     intake_esm_attrs:case:             g.e11_LENS.GECOIAF.T62_g16.009
     intake_esm_attrs:member_id:        9
     intake_esm_attrs:start_time:       0249-01
     intake_esm_attrs:end_time:         0316-12
     intake_esm_attrs:time_range:       024901-031612
     intake_esm_attrs:vertical_levels:  1
     intake_esm_attrs:frequency:        month_1
     intake_esm_attrs:_data_format_:    netcdf
     intake_esm_dataset_key:            ocn.pop.h.g.e11_LENS.GECOIAF.T62_g16.009}

Let’s plot a quick figure from the dataset!

dsets['ocn.pop.h.ecosys.nday1.g.e11_LENS.GECOIAF.T62_g16.009'].CaCO3_form_zint.isel(time=0).plot();
../_images/51d191e26ee1f7109b59c3c7151ae450b25e11dc849984ec67a73f19903764d5.png

Conclusion#

Having the ability to easily create intake-esm catalogs from model outputs can be a powerful tool in your analysis toolkit. These data can be read in relatively quickly, easing the ability to quickly take a look at model output or even share your data with others! For more updates on ecgtools, be sure to follow the ecgtools repository on Github! Have an idea for another helpful parser? Submit an issue!

import intake_esm  # just to display version information
intake_esm.show_versions()
INSTALLED VERSIONS
------------------

cftime: 1.6.1
dask: 2022.6.1
fastprogress: 1.0.3
fsspec: 2022.8.2
gcsfs: 2022.8.2
intake: 0.6.6
intake_esm: 2022.9.18.post0+dirty
netCDF4: 1.6.1
pandas: 1.4.4
requests: 2.28.1
s3fs: 2022.8.2
xarray: 2022.6.0
zarr: 2.12.0