Build an intake-ESM catalog from CESM timeseries files
In this example, we will cover how to build a data catalog from Community Earth System Model (CESM) output. One of the requirements for using intake-esm is a catalog, which consists of two pieces:
A Comma Separated Value (CSV) file with relevant metadata (e.g. file path, variable, stream, etc.)
A JSON file describing the contents of the CSV file, including how to combine compatible datasets into a single dataset.
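To make these two pieces concrete, here is a minimal hand-written sketch of such a catalog. The paths, column names, and values are toy examples; the JSON keys follow the ESM catalog specification that intake-esm reads, but a real catalog would be generated rather than typed out by hand:

```python
import json

# The CSV piece: one row of metadata per data file (toy values, not real output)
with open('catalog.csv', 'w') as f:
    f.write('component,stream,case,variable,path\n')
    f.write('ocn,pop.h,my_case,O2,/data/my_case.pop.h.O2.024901-031612.nc\n')

# The JSON piece: describes the CSV columns and how to combine compatible datasets
descriptor = {
    'esmcat_version': '0.1.0',
    'id': 'toy_catalog',
    'description': 'A hand-built toy intake-esm catalog',
    'catalog_file': 'catalog.csv',
    'attributes': [{'column_name': c} for c in ('component', 'stream', 'case', 'variable')],
    'assets': {'column_name': 'path', 'format': 'netcdf'},
    'aggregation_control': {
        'variable_column_name': 'variable',
        'groupby_attrs': ['component', 'stream', 'case'],
        'aggregations': [{'type': 'union', 'attribute_name': 'variable'}],
    },
}
with open('catalog.json', 'w') as f:
    json.dump(descriptor, f, indent=2)
```

Writing these files by hand quickly becomes tedious and error-prone, which is exactly the problem described next.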
Typically, these pieces are constructed “manually” from information in the file paths, on a very ad-hoc basis. Such catalogs are also typically created only for “larger” community datasets, not necessarily for smaller model runs or daily workflows. A package (currently a prototype) called ecgtools aims to solve the problem of generating these intake-esm catalogs. Ecgtools stands for Earth System Model (ESM) Catalog Generation tools. The currently supported built-in catalog generation tools are:
CMIP6 models
CESM “history” files
CESM “timeseries” files
This example provides an overview of using ecgtools for parsing CESM timeseries file model output, and reading in the data using Intake-ESM. In this example, we use sample CESM data within the test directory for Intake-ESM.
Installing ecgtools
You can install ecgtools from either PyPI or conda-forge.
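For example (the package name on both PyPI and conda-forge is ecgtools):

```shell
# Install from PyPI
python -m pip install ecgtools

# Or install from conda-forge
conda install -c conda-forge ecgtools
```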
Import packages
The only parts of ecgtools we need are the Builder object and the parse_cesm_timeseries parser from the CESM parsers! We import pathlib to take a look at the files we are parsing.
import pathlib
import dask
import intake
from ecgtools import Builder
from ecgtools.parsers.cesm import parse_cesm_timeseries
Understanding the directory structure
The first step in setting up the Builder object is determining where your files are stored. As mentioned previously, we have a sample dataset of CESM2 model output, which is stored in the /tests/sample_data directory of this repository.
Taking a look at that directory, we see that there is a single case: g.e11_LENS.GECOIAF.T62_g16.009.
root_path = pathlib.Path('../../../tests/sample_data/cesm/').absolute()
sorted(root_path.rglob('*'))
[PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/latest/docs/source/how-to/../../../tests/sample_data/cesm/g.e11_LENS.GECOIAF.T62_g16.009.pop.h.ECOSYS_XKW.024901-031612.nc'),
PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/latest/docs/source/how-to/../../../tests/sample_data/cesm/g.e11_LENS.GECOIAF.T62_g16.009.pop.h.ecosys.nday1.CaCO3_form_zint.02490101-03161231.nc'),
PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/latest/docs/source/how-to/../../../tests/sample_data/cesm/g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.O2.024901-031612.nc')]
Now that we understand the directory structure, let’s make the catalog.
Build the catalog
Let’s start by setting the builder object up.
cat_builder = Builder(
# Directory with the output
paths=['../../../tests/sample_data/cesm/'],
# Depth of 1 since we are sending it to the case output directory
depth=1,
# Exclude the timeseries and restart directories
exclude_patterns=["*/tseries/*", "*/rest/*"],
# Number of parallel jobs; n_jobs=-1 uses all available cores
joblib_parallel_kwargs={'n_jobs': -1},
)
cat_builder
Builder(paths=['../../../tests/sample_data/cesm/'], storage_options={}, depth=1, exclude_patterns=['*/tseries/*', '*/rest/*'], include_patterns=[], joblib_parallel_kwargs={'n_jobs': -1})
We are good to go! Let’s build the catalog by calling .build() on the object, passing in the parse_cesm_timeseries parser, which is a built-in parser for CESM timeseries files.
cat_builder = cat_builder.build(parsing_func=parse_cesm_timeseries)
/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/conda/latest/lib/python3.11/site-packages/ecgtools/parsers/cesm.py:238: UserWarning: Using the default frequency definitions
warnings.warn('Using the default frequency definitions')
/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/conda/latest/lib/python3.11/site-packages/ecgtools/parsers/cesm.py:238: UserWarning: Using the default frequency definitions
warnings.warn('Using the default frequency definitions')
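Builder is not limited to the built-in parsers: a parsing function is essentially a callable that takes a file path and returns a dictionary of column values for that file (files that cannot be parsed end up in the invalid assets). As a rough illustration of what a timeseries parser does with filenames like the ones above, here is a toy sketch. The filename pattern, the regular expression, and the returned columns are my own simplification, not ecgtools internals:

```python
import pathlib
import re


def parse_filename_sketch(file: str) -> dict:
    """Toy parser: pull metadata out of a CESM timeseries filename.

    Assumes names shaped like <case>.<stream>.<variable>.<time_range>.nc,
    e.g. g.e11_LENS.GECOIAF.T62_g16.009.pop.h.O2.024901-031612.nc
    (this naming assumption and the column set are illustrative only).
    """
    path = pathlib.Path(file)
    # The variable is the last dotted chunk before the digits-dash-digits time range
    match = re.match(r'^(?P<prefix>.+)\.(?P<variable>\w+)\.(?P<time_range>\d+-\d+)$', path.stem)
    if match is None:
        # Unparseable files would be reported as invalid assets
        return {'INVALID_ASSET': file}
    return {
        'variable': match.group('variable'),
        'time_range': match.group('time_range'),
        'path': str(path),
    }


row = parse_filename_sketch('g.e11_LENS.GECOIAF.T62_g16.009.pop.h.ECOSYS_XKW.024901-031612.nc')
# row['variable'] == 'ECOSYS_XKW', row['time_range'] == '024901-031612'
```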
Inspect the built catalog
Now that the catalog is built, we can inspect the dataframe which is used to create the catalog by calling .df
on the builder object:
cat_builder.df
|   | component | stream | case | member_id | variable | start_time | end_time | time_range | long_name | units | vertical_levels | frequency | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ocn | pop.h | g.e11_LENS.GECOIAF.T62_g16.009 | 009 | ECOSYS_XKW | 0249-01 | 0316-12 | 024901-031612 | XKW for ecosys fluxes | cm/s | 1 | month_1 | /home/docs/checkouts/readthedocs.org/user_buil... |
| 1 | ocn | pop.h.ecosys.nday1 | g.e11_LENS.GECOIAF.T62_g16.009 | 009 | CaCO3_form_zint | 0249-01-01 | 0316-12-31 | 02490101-03161231 | CaCO3 Formation Vertical Integral | mmol/m^3 cm/s | 1 | day_1 | /home/docs/checkouts/readthedocs.org/user_buil... |
| 2 | ocn | pop.h | g.e11_LENS.GECOIAF.T62_g16.009 | 009 | O2 | 0249-01 | 0316-12 | 024901-031612 | Dissolved Oxygen | mmol/m^3 | 1 | month_1 | /home/docs/checkouts/readthedocs.org/user_buil... |
The resultant dataframe includes the:
Component
Stream
Case
Member ID
Variable
Start time, end time, and time range
Long name and units
Vertical levels
Frequency
Path
We can also check which files were not parsed by calling .invalid_assets:
cat_builder.invalid_assets
This is empty, as expected!
Save the catalog
Now that we have our data catalog, we can save it by specifying the path to the comma separated values file (csv) or compressed csv (csv.gz).
cat_builder.save(
name='cesm_sample_data',
directory='/tmp',
    # Column name containing the file path
    path_column_name='path',
    # Column name containing the variable names
    variable_column_name='variable',
# Data file format - could be netcdf or zarr (in this case, netcdf)
data_format="netcdf",
# Which attributes to groupby when reading in variables using intake-esm
groupby_attrs=["component", "stream", "case"],
# Aggregations which are fed into xarray when reading in data using intake
aggregations=[
{'type': 'union', 'attribute_name': 'variable'},
{
"type": "join_existing",
"attribute_name": "time_range",
"options": {"dim": "time", "coords": "minimal", "compat": "override"},
},
],
)
Successfully wrote ESM catalog json file to: file:///tmp/cesm_sample_data.json
Use the catalog to read in data
data_catalog = intake.open_esm_datastore("/tmp/cesm_sample_data.json")
data_catalog
cesm_sample_data catalog with 2 dataset(s) from 3 asset(s):
|   | unique |
|---|---|
| component | 1 |
| stream | 2 |
| case | 1 |
| member_id | 1 |
| variable | 3 |
| start_time | 2 |
| end_time | 2 |
| time_range | 2 |
| long_name | 3 |
| units | 3 |
| vertical_levels | 1 |
| frequency | 2 |
| path | 3 |
| derived_variable | 0 |
dsets = data_catalog.to_dataset_dict()
dsets
--> The keys in the returned dictionary of datasets are constructed as follows:
'component.stream.case'
{'ocn.pop.h.ecosys.nday1.g.e11_LENS.GECOIAF.T62_g16.009': <xarray.Dataset>
Dimensions: (nlat: 5, nlon: 5, time: 12)
Coordinates:
ULONG (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
TLONG (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
ULAT (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
TLAT (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
* time (time) object 0249-01-02 00:00:00 ... 0249-01-13 00:00:00
Dimensions without coordinates: nlat, nlon
Data variables:
CaCO3_form_zint (time, nlat, nlon) float32 dask.array<chunksize=(12, 5, 5), meta=np.ndarray>
Attributes: (12/16)
intake_esm_vars: ['CaCO3_form_zint']
intake_esm_attrs:component: ocn
intake_esm_attrs:stream: pop.h.ecosys.nday1
intake_esm_attrs:case: g.e11_LENS.GECOIAF.T62_g16.009
intake_esm_attrs:member_id: 9
intake_esm_attrs:variable: CaCO3_form_zint
... ...
intake_esm_attrs:units: mmol/m^3 cm/s
intake_esm_attrs:vertical_levels: 1
intake_esm_attrs:frequency: day_1
intake_esm_attrs:path: /home/docs/checkouts/readthedocs.org/u...
intake_esm_attrs:_data_format_: netcdf
intake_esm_dataset_key: ocn.pop.h.ecosys.nday1.g.e11_LENS.GECO...,
'ocn.pop.h.g.e11_LENS.GECOIAF.T62_g16.009': <xarray.Dataset>
Dimensions: (nlat: 5, nlon: 5, time: 12, sigma: 5)
Coordinates:
TLAT (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
ULONG (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
ULAT (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
TLONG (nlat, nlon) float64 dask.array<chunksize=(5, 5), meta=np.ndarray>
* time (time) object 0249-02-01 00:00:00 ... 0250-01-01 00:00:00
* sigma (sigma) float64 23.4 23.45 23.5 23.55 23.6
Dimensions without coordinates: nlat, nlon
Data variables:
ECOSYS_XKW (time, nlat, nlon) float32 dask.array<chunksize=(12, 5, 5), meta=np.ndarray>
O2 (time, sigma, nlat, nlon) float32 dask.array<chunksize=(12, 5, 5, 5), meta=np.ndarray>
Attributes:
intake_esm_attrs:component: ocn
intake_esm_attrs:stream: pop.h
intake_esm_attrs:case: g.e11_LENS.GECOIAF.T62_g16.009
intake_esm_attrs:member_id: 9
intake_esm_attrs:start_time: 0249-01
intake_esm_attrs:end_time: 0316-12
intake_esm_attrs:time_range: 024901-031612
intake_esm_attrs:vertical_levels: 1
intake_esm_attrs:frequency: month_1
intake_esm_attrs:_data_format_: netcdf
intake_esm_dataset_key: ocn.pop.h.g.e11_LENS.GECOIAF.T62_g16.009}
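The dataset keys above come directly from the groupby_attrs we chose when saving the catalog: rows sharing the same component, stream, and case are combined into one dataset, and the key is those attribute values joined with dots. A simplified sketch of that grouping using pandas (not intake-esm internals):

```python
import pandas as pd

# Toy version of the catalog dataframe, keeping only the relevant columns
df = pd.DataFrame(
    {
        'component': ['ocn', 'ocn', 'ocn'],
        'stream': ['pop.h', 'pop.h.ecosys.nday1', 'pop.h'],
        'case': ['g.e11_LENS.GECOIAF.T62_g16.009'] * 3,
        'variable': ['ECOSYS_XKW', 'CaCO3_form_zint', 'O2'],
    }
)

groupby_attrs = ['component', 'stream', 'case']

# Each unique combination of the groupby attributes becomes one dataset key
keys = ['.'.join(name) for name, _ in df.groupby(groupby_attrs)]
print(keys)
```

This is why our three assets collapse into two datasets: ECOSYS_XKW and O2 share the pop.h stream and are grouped together, while CaCO3_form_zint sits alone in pop.h.ecosys.nday1.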
Let’s plot a quick figure from the dataset!
dsets['ocn.pop.h.ecosys.nday1.g.e11_LENS.GECOIAF.T62_g16.009'].CaCO3_form_zint.isel(time=0).plot();

Conclusion
Being able to easily create intake-esm catalogs from model output is a powerful addition to your analysis toolkit. The data can then be read in quickly, making it easy to take a look at model output or even share your data with others! For more updates on ecgtools, be sure to follow the ecgtools repository on GitHub! Have an idea for another helpful parser? Submit an issue!
import intake_esm # just to display version information
intake_esm.show_versions()
INSTALLED VERSIONS
------------------
cftime: 1.6.2
dask: 2023.7.0
fastprogress: 1.0.3
fsspec: 2023.6.0
gcsfs: 2023.6.0
intake: 0.7.0
intake_esm: 2023.7.7.post2+dirty
netCDF4: 1.6.4
pandas: 2.0.3
requests: 2.31.0
s3fs: 2023.6.0
xarray: 2023.6.0
zarr: 2.15.0