API Reference#

This page provides an auto-generated summary of intake-esm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.

ESM Datastore (intake.open_esm_datastore)#

class intake_esm.core.esm_datastore(*args, **kwargs)[source]

An intake plugin for parsing an ESM (Earth System Model) Catalog and loading assets (netCDF files and/or Zarr stores) into xarray datasets. The in-memory representation for the catalog is a Pandas DataFrame.

  • obj (str, dict, ESMCatalogModel) – The ESM Catalog to use, or a path to a JSON file containing the catalog. If string, this must be a path or URL to an ESM catalog JSON file. If dict, this must be a dict representation of an ESM catalog. This dict must have two keys: ‘esmcat’ and ‘df’. The ‘esmcat’ key must be a dict representation of the ESM catalog and the ‘df’ key must be a Pandas DataFrame containing content that would otherwise be in a CSV file.

  • sep (str, optional) – Delimiter to use when constructing a key for a query, by default ‘.’

  • registry (DerivedVariableRegistry, optional) – Registry of derived variables to use, by default None. If not provided, uses the default registry.

  • read_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the pandas.read_csv() function.

  • columns_with_iterables (list of str, optional) – A list of columns in the csv file containing iterables. Values in columns specified here will be converted with ast.literal_eval when read_csv() is called (i.e., this is a shortcut to passing converters to read_csv_kwargs).

  • storage_options (dict, optional) – Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.

  • intake_kwargs (dict, optional) – Additional keyword arguments passed through to the Catalog base class.
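To make the columns_with_iterables conversion concrete, here is a minimal sketch using only the standard library. It is not intake-esm's implementation: the CSV content, the column names, and the row-by-row loop are assumptions made for illustration. It only shows what applying ast.literal_eval to a column of stringified lists does.

```python
import ast
import csv
import io

# Hypothetical CSV content: the "variable" column stores Python-literal
# lists as quoted strings, which is how iterables survive a CSV round trip.
csv_text = """path,variable
/data/a.nc,"['tas', 'pr']"
/data/b.nc,"['tas']"
"""

# Columns whose cells should be parsed back into Python objects.
columns_with_iterables = ["variable"]

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    for column in columns_with_iterables:
        # literal_eval safely parses literals such as lists, tuples and sets.
        row[column] = ast.literal_eval(row[column])
    rows.append(row)

print(rows[0]["variable"])
```

Without the conversion, each cell would stay a plain string such as "['tas', 'pr']", so membership tests on individual variable names would not behave as expected.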


At import time, this plugin is available in intake’s registry as esm_datastore and can be accessed with intake.open_esm_datastore():

>>> import intake
>>> url = 'https://storage.googleapis.com/cmip6/pangeo-cmip6.json'
>>> cat = intake.open_esm_datastore(url)
>>> cat.df.head()
activity_id institution_id source_id experiment_id  ... variable_id grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1        ssp370  ...          pr         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1        ssp370  ...        prsn         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1        ssp370  ...         tas         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
3  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmax         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
4  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmin         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN

This method takes a key argument and returns a data source corresponding to the assets (files) that will be aggregated into a single xarray dataset.


key (str) – key to use for catalog entry lookup


intake_esm.source.ESMDataSource – A data source by name (key)


KeyError – if key is not found.


>>> cat = intake.open_esm_datastore('mycatalog.json')
>>> data_source = cat['AerChemMIP.BCC.BCC-ESM1.piClim-control.AERmon.gn']
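As a rough illustration of how such a key relates to catalog columns, the sketch below joins one record's values with the sep delimiter and shows the KeyError raised for an unknown key. The column list, the record, and the plain dict used as a lookup table are assumptions chosen to reproduce the key above; the columns that actually form a key depend on the catalog.

```python
# Delimiter used between column values (the documented default is ".").
sep = "."

# Assumed set of columns that make up a key in this hypothetical catalog.
key_columns = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]

# One hypothetical catalog row.
record = {
    "activity_id": "AerChemMIP",
    "institution_id": "BCC",
    "source_id": "BCC-ESM1",
    "experiment_id": "piClim-control",
    "table_id": "AERmon",
    "grid_label": "gn",
    "variable_id": "pr",
}

key = sep.join(str(record[column]) for column in key_columns)

# A plain dict stands in for the catalog's key lookup.
entries = {key: [record]}

try:
    entries["not.a.real.key"]
    missing_key_raises = False
except KeyError:
    missing_key_raises = True

print(key)
```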

Get keys for the catalog entries


list – keys for the catalog entries


Get keys for the catalog entries and their metadata


pandas.DataFrame – keys for the catalog entries and their metadata


>>> import intake
>>> cat = intake.open_esm_datastore('./tests/sample-catalogs/cesm1-lens-netcdf.json')
>>> cat.keys_info()
                component experiment stream
ocn.20C.pop.h         ocn        20C  pop.h
ocn.CTRL.pop.h        ocn       CTRL  pop.h
ocn.RCP85.pop.h       ocn      RCP85  pop.h

Count distinct observations across dataframe columns in the catalog.


>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat.nunique()
activity_id          10
institution_id       23
source_id            48
experiment_id        29
member_id            86
table_id             19
variable_id         187
grid_label            7
zstore            27437
dcpp_init_year       59
dtype: int64
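
The counts above can be read as "number of distinct values per column". A minimal plain-Python sketch of that idea, using hypothetical records instead of a real catalog DataFrame:

```python
# Hypothetical catalog rows; a real catalog holds these in a pandas DataFrame.
records = [
    {"source_id": "BCC-ESM1", "variable_id": "pr", "grid_label": "gn"},
    {"source_id": "BCC-ESM1", "variable_id": "tas", "grid_label": "gn"},
    {"source_id": "CNRM-CM6-1", "variable_id": "pr", "grid_label": "gn"},
]

columns = list(records[0])

# For each column, collect its values into a set and count the members.
nunique = {column: len({row[column] for row in records}) for column in columns}

print(nunique)
```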
search(require_all_on=None, **query)[source]

Search for entries in the catalog.

  • require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.

  • **query – keyword arguments corresponding to user’s query to execute against the dataframe.


cat (esm_datastore) – A new Catalog with a subset of the entries in this Catalog.


>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat.df.head(3)
activity_id institution_id source_id  ... grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> sub_cat.df.head(3)
    activity_id institution_id    source_id  ... grid_label                                             zstore dcpp_init_year
260        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i...            NaN
346        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r2i...            NaN
401        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r3i...            NaN
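
The effect of require_all_on can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's algorithm: the records are hypothetical, toy_search is a made-up helper, and it checks coverage per queried column rather than reproducing intake-esm's exact matching rules.

```python
from collections import defaultdict

# Hypothetical catalog rows.
records = [
    {"source_id": "model-a", "variable_id": "pr"},
    {"source_id": "model-a", "variable_id": "tas"},
    {"source_id": "model-b", "variable_id": "pr"},
]

query = {"variable_id": ["pr", "tas"]}


def toy_search(records, query, require_all_on=None):
    """Hypothetical helper: filter rows, optionally keeping only complete groups."""
    # Step 1: keep rows that match one of the requested values in every queried column.
    matches = [
        row for row in records
        if all(row[column] in values for column, values in query.items())
    ]
    if require_all_on is None:
        return matches

    # Step 2: group the matches by the require_all_on column.
    groups = defaultdict(list)
    for row in matches:
        groups[row[require_all_on]].append(row)

    # Step 3: keep only groups that cover every requested value.
    kept = []
    for rows in groups.values():
        complete = all(
            {row[column] for row in rows} >= set(values)
            for column, values in query.items()
        )
        if complete:
            kept.extend(rows)
    return kept


any_match = toy_search(records, query)
all_match = toy_search(records, query, require_all_on="source_id")
print(len(any_match), sorted({row["source_id"] for row in all_match}))
```

Here model-b is dropped when require_all_on='source_id' is given, because it provides pr but not tas.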

The search method also accepts compiled regular expression objects from re.compile() as patterns.

>>> import re
>>> # Let's search for variables containing "Frac" in their name
>>> pat = re.compile(r'Frac')  # Define a regular expression
>>> sub_cat = cat.search(variable_id=pat)  # search() returns a new catalog
>>> sub_cat.df.head().variable_id
0     residualFrac
1    landCoverFrac
2    landCoverFrac
3     residualFrac
4    landCoverFrac
serialize(name, directory=None, catalog_type='dict', to_csv_kwargs=None, json_dump_kwargs=None, storage_options=None)[source]

Serialize catalog to corresponding json and csv files.

  • name (str) – name to use when creating ESM catalog json file and csv catalog.

  • directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory.

  • catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file.

  • to_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the pandas.DataFrame.to_csv() method.

  • json_dump_kwargs (dict, optional) – Additional keyword arguments passed through to the json.dump() function.

  • storage_options (dict) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.


Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog as a separate CSV file.


>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat_subset = cat.search(
...     source_id='BCC-ESM1',
...     grid_label='gn',
...     table_id='Amon',
...     experiment_id='historical',
... )
>>> cat_subset.serialize(name='cmip6_bcc_esm1', catalog_type='file')
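
The difference between the two catalog_type values can be sketched with the standard library alone. The helper save_catalog, the file names, and the sample rows are hypothetical; the keys catalog_dict and catalog_file follow the ESMCatalogModel fields listed further down this page. The files written by serialize() itself may be named or laid out differently.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_catalog(esmcat, rows, name, directory, catalog_type="dict"):
    """Hypothetical stand-in for serialize(): write <name>.json, plus <name>.csv for 'file'."""
    directory = Path(directory)
    data = dict(esmcat)
    if catalog_type == "file":
        # Table goes into a separate CSV; the JSON only points at it.
        csv_path = directory / f"{name}.csv"
        with csv_path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        data["catalog_file"] = csv_path.name
    elif catalog_type == "dict":
        # Table is embedded in the JSON document itself.
        data["catalog_dict"] = rows
    else:
        raise ValueError(f"unknown catalog_type: {catalog_type!r}")
    json_path = directory / f"{name}.json"
    json_path.write_text(json.dumps(data, indent=2))
    return json_path


rows = [{"source_id": "BCC-ESM1", "variable_id": "pr", "path": "a.zarr"}]
esmcat = {"esmcat_version": "0.1.0", "id": "demo"}

with tempfile.TemporaryDirectory() as tmp:
    as_file = json.loads(save_catalog(esmcat, rows, "demo_file", tmp, "file").read_text())
    csv_written = Path(tmp, "demo_file.csv").exists()
    as_dict = json.loads(save_catalog(esmcat, rows, "demo_dict", tmp, "dict").read_text())

print(sorted(as_file), sorted(as_dict))
```

With 'file', the JSON stays small regardless of how many rows the table has, which is the motivation given in the note above.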

Convert result to an xarray dataset.

This is only possible if the search returned exactly one result.


kwargs (dict) – Parameters forwarded to to_dataset_dict().



to_dataset_dict(xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, preprocess=None, storage_options=None, progressbar=None, aggregate=None, skip_on_error=False, **kwargs)[source]

Load catalog entries into a dictionary of xarray datasets.

Column values, dataset keys and requested variables are added as global attributes on the returned datasets. The names of these attributes can be customized with intake_esm.utils.set_options.

  • xarray_open_kwargs (dict) – Keyword arguments to pass to the xarray.open_dataset() function.

  • xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to the xarray.combine_by_coords() function.

  • preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.

  • storage_options (dict, optional) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.

  • progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.

  • aggregate (bool, optional) – If False, no aggregation will be done.

  • skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.


dsets (dict) – A dictionary of xarray Datasets.
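
The skip_on_error behaviour described above can be pictured as a loop that either re-raises or drops a failing entry. The sketch below is an assumption-laden illustration, not the library's code: load_all, the loader callables, and the keys are all hypothetical, and plain dicts stand in for xarray datasets.

```python
def load_all(loaders, skip_on_error=False):
    """Hypothetical helper: call each loader and collect the results by key."""
    dsets = {}
    for key, loader in loaders.items():
        try:
            dsets[key] = loader()
        except Exception:
            # With skip_on_error=False the first failure aborts the whole load.
            if not skip_on_error:
                raise
    return dsets


def broken_loader():
    raise OSError("asset could not be opened")


loaders = {
    "good.key": lambda: {"pr": [1.0, 2.0]},
    "bad.key": broken_loader,
}

# Failing entries are dropped silently when skip_on_error=True.
dsets = load_all(loaders, skip_on_error=True)

try:
    load_all(loaders, skip_on_error=False)
    strict_mode_raised = False
except OSError:
    strict_mode_raised = True

print(list(dsets), strict_mode_raised)
```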


>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_dataset_dict()
>>> dsets.keys()
dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn'])
>>> dsets['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn']
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
* lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
* lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
* time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
* member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
to_datatree(xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, preprocess=None, storage_options=None, progressbar=None, aggregate=None, skip_on_error=False, levels=None, **kwargs)[source]

Load catalog entries into a tree of xarray datasets.

  • xarray_open_kwargs (dict) – Keyword arguments to pass to the xarray.open_dataset() function.

  • xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to the xarray.combine_by_coords() function.

  • preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.

  • storage_options (dict, optional) – Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.

  • progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.

  • aggregate (bool, optional) – If False, no aggregation will be done.

  • skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.

  • levels (list[str], optional) – List of fields to use as the datatree nodes. WARNING: This will overwrite the fields used to create the unique aggregation keys.


dsets (DataTree) – A tree of xarray Datasets.


>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_datatree()
>>> dsets['CMIP/BCC.BCC-CSM2-MR/historical/Amon/gn'].ds
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
* lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
* lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
* time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
* member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>

Return unique values for given columns in the catalog.

property df

Return pandas DataFrame.

property key_template

Return the string template used to create catalog entry keys.


str – string template used to create catalog entry keys

ESM DataSource#

class intake_esm.source.ESMDataSource(*args, **kwargs)[source]
__init__(key, records, path_column_name, data_format, format_column_name, *, variable_column_name=None, aggregations=None, requested_variables=None, preprocess=None, storage_options=None, xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, intake_kwargs=None)[source]

An intake compatible Data Source for ESM data.

  • key (str) – The key of the data source.

  • records (list of dict) – A list of records, each of which is a dictionary mapping column names to values.

  • path_column_name (str) – The column name of the path.

  • data_format (DataFormat) – The data format of the data.

  • variable_column_name (str, optional) – The column name of the variable name.

  • aggregations (list of Aggregation, optional) – A list of aggregations to apply to the data.

  • requested_variables (list of str, optional) – A list of variables to load.

  • preprocess (callable, optional) – A preprocessing function to apply to the data.

  • storage_options (dict, optional) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.

  • xarray_open_kwargs (dict, optional) – Keyword arguments to pass to open_dataset() function.

  • xarray_combine_by_coords_kwargs (dict, optional) – Keyword arguments to pass to combine_by_coords() function.

  • intake_kwargs (dict, optional) – Additional keyword arguments passed through to the DataSource base class.


Delete open files from memory


Return xarray object (which will have chunks)

ESM Catalog#

class intake_esm.cat.ESMCatalogModel(*, esmcat_version, attributes, assets, aggregation_control=None, id='', catalog_dict=None, catalog_file=None, description=None, title=None, last_updated=None)[source]

Pydantic model for the ESM data catalog defined in https://git.io/JBWoW

classmethod load(json_file, storage_options=None, read_csv_kwargs=None)[source]

Loads the catalog from a file

  • json_file (str or pathlib.Path) – The path to the json file containing the catalog

  • storage_options (dict) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.

  • read_csv_kwargs (dict) – Additional keyword arguments passed through to the pandas.read_csv() function.

model_post_init(context, /)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.


  • self – The BaseModel instance.

  • context – The context.


Return a series of the number of unique values for each column in the catalog.

save(name, *, directory=None, catalog_type='dict', to_csv_kwargs=None, json_dump_kwargs=None, storage_options=None)[source]

Save the catalog to a file.

  • name (str) – The name of the file to save the catalog to.

  • directory (str) – The directory or cloud storage bucket to save the catalog to. If None, use the current directory.

  • catalog_type (str) – The type of catalog to save. Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file. Valid options are ‘dict’ and ‘file’.

  • to_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the pandas.DataFrame.to_csv() method.

  • json_dump_kwargs (dict, optional) – Additional keyword arguments passed through to the json.dump() function.

  • storage_options (dict) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.


Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog as a separate CSV file.

search(*, query, require_all_on=None)[source]

Search for entries in the catalog.

  • query (dict, optional) – A dictionary of query parameters to execute against the dataframe.

  • require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.


catalog (ESMCatalogModel) – A new catalog with the entries satisfying the query criteria.


Return a series of unique values for each column in the catalog.

property columns_with_iterables

Return a set of columns that have iterables.

property df

Return the dataframe.

property has_multiple_variable_assets

Return True if the catalog has multiple variable assets.

model_config = {'arbitrary_types_allowed': True, 'validate_assignment': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

Query Model#

class intake_esm.cat.QueryModel(*, query, columns, require_all_on=None)[source]

A Pydantic model to represent a query to be executed against a catalog.

model_config = {'validate_assignment': False}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

Derived Variable Registry#

class intake_esm.derived.DerivedVariableRegistry[source]

Registry of derived variables

__init__(*args, **kwargs)
classmethod load(name, package=None)[source]

Load a DerivedVariableRegistry from a Python module/file

  • name (str) – The name of the module to load the DerivedVariableRegistry from.

  • package (str, optional) – The package to load the module from. This argument is required when performing a relative import. It specifies the package to use as the anchor point from which to resolve the relative import to an absolute import.


DerivedVariableRegistry – A DerivedVariableRegistry loaded from the Python module.


If you have a folder /home/foo/pythonfiles and want to load a registry defined in registry.py in that directory, make sure the folder is on the Python import path (for example via $PYTHONPATH or sys.path) before calling this function.

>>> import sys
>>> sys.path.insert(0, '/home/foo/pythonfiles')
>>> from intake_esm.derived import DerivedVariableRegistry
>>> registry = DerivedVariableRegistry.load('registry')
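
The name and package parameters read like those of importlib.import_module, so the import-path requirement can be demonstrated with the standard library alone. This sketch assumes that similarity and does not call intake-esm; the module name my_registry_module and its contents are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical module file standing in for registry.py.
    Path(folder, "my_registry_module.py").write_text(
        "registry = {'FOO': 'placeholder for a derived variable'}\n"
    )

    # The folder must be importable before the module can be found by name.
    sys.path.insert(0, folder)
    importlib.invalidate_caches()
    try:
        module = importlib.import_module("my_registry_module")
    finally:
        sys.path.remove(folder)

loaded = module.registry
print(loaded)
```

If the folder is not on the import path, importlib.import_module raises ModuleNotFoundError, which is the failure the note above is warning about.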

Search for a derived variable by name or list of names


variable (typing.Union[str, typing.List[str]]) – The name of the variable to search for.


DerivedVariableRegistry – A DerivedVariableRegistry with the found variables.

update_datasets(*, datasets, variable_key_name, skip_on_error=False)[source]

Given a dictionary of datasets, return a dictionary of datasets with the derived variables

  • datasets (typing.Dict[str, xr.Dataset]) – A dictionary of datasets to apply the derived variables to.

  • variable_key_name (str) – The name of the variable key used in the derived variable query

  • skip_on_error (bool, optional) – If True, skip variables that fail variable derivation.


typing.Dict[str, xr.Dataset] – A dictionary of datasets with the derived variables applied.


Register a derived variable

  • func (typing.Callable) – The function to apply to the dependent variables.

  • variable (str) – The name of the variable to derive.

  • query (typing.Dict[str, typing.Union[typing.Any, typing.List[typing.Any]]]) – The query to use to retrieve dependent variables required to derive variable.

  • prefer_derived (bool, optional (default=False)) – Specify whether to compute this variable on datasets that already contain a variable of the same name. Default (False) is to leave the existing variable.


typing.Callable – The function that was registered.
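
The register and update_datasets descriptions above can be tied together with a toy example. ToyRegistry below is a hypothetical stand-in written for illustration, not intake_esm.derived.DerivedVariableRegistry: it uses plain dicts instead of xarray datasets, ignores the query when applying functions, and its decorator form is an assumption about typical usage.

```python
class ToyRegistry:
    """Hypothetical stand-in for a derived-variable registry."""

    def __init__(self):
        self._entries = {}

    def register(self, *, variable, query, prefer_derived=False):
        def decorator(func):
            self._entries[variable] = {
                "func": func,
                "query": query,
                "prefer_derived": prefer_derived,
            }
            # Returning func unchanged keeps the decorated name usable.
            return func

        return decorator

    def update_datasets(self, datasets):
        updated = {}
        for key, ds in datasets.items():
            ds = dict(ds)  # shallow copy so the input is left untouched
            for variable, entry in self._entries.items():
                # Leave an existing variable alone unless prefer_derived is set.
                if variable in ds and not entry["prefer_derived"]:
                    continue
                ds = entry["func"](ds)
            updated[key] = ds
        return updated


registry = ToyRegistry()


@registry.register(variable="tas_celsius", query={"variable_id": ["tas"]})
def to_celsius(ds):
    ds["tas_celsius"] = [round(value - 273.15, 2) for value in ds["tas"]]
    return ds


datasets = {
    "model-a": {"tas": [273.15, 283.15]},
    "model-b": {"tas": [300.0], "tas_celsius": [99.0]},
}
result = registry.update_datasets(datasets)
print(result["model-a"]["tas_celsius"], result["model-b"]["tas_celsius"])
```

In this toy version, model-b keeps its existing tas_celsius values because prefer_derived defaults to False.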

Derived Variable#

class intake_esm.derived.DerivedVariable(*, func, variable, query, prefer_derived)[source]

Return a list of dependent variables for a given variable

model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

Options for dataset attributes#

class intake_esm.utils.set_options(**kwargs)[source]

Set options for intake_esm in a controlled context.

Currently-supported options:

  • attrs_prefix: The prefix to use in the names of attributes constructed from the catalog’s columns when returning xarray Datasets. Default: intake_esm_attrs.

  • dataset_key: Name of the global attribute where to store the dataset’s key. Default: intake_esm_dataset_key.

  • vars_key: Name of the global attribute where to store the list of requested variables when opening a dataset. Default: intake_esm_vars.


You can use set_options either as a context manager:

>>> import intake
>>> import intake_esm
>>> cat = intake.open_esm_datastore('catalog.json')
>>> with intake_esm.set_options(attrs_prefix='cat'):
...     out = cat.to_dataset_dict()

Or to set global options:

>>> intake_esm.set_options(attrs_prefix='cat', vars_key='cat_vars')
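
One common way to support both usages is a class that applies the options when it is constructed and restores the previous values when the with block exits. The sketch below shows that pattern under the assumption that set_options works along these lines; toy_set_options and the OPTIONS dict are hypothetical names, and only the option names and defaults are taken from the list above.

```python
# Defaults as documented above.
OPTIONS = {
    "attrs_prefix": "intake_esm_attrs",
    "dataset_key": "intake_esm_dataset_key",
    "vars_key": "intake_esm_vars",
}


class toy_set_options:
    """Hypothetical options setter usable globally or as a context manager."""

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(OPTIONS)
        if unknown:
            raise ValueError(f"unknown option(s): {sorted(unknown)}")
        # Remember the old values so __exit__ can restore them.
        self._old = {name: OPTIONS[name] for name in kwargs}
        OPTIONS.update(kwargs)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        OPTIONS.update(self._old)


# Context-manager use: the change is undone when the block ends.
with toy_set_options(attrs_prefix="cat"):
    inside = OPTIONS["attrs_prefix"]
after = OPTIONS["attrs_prefix"]

# Global use: the change stays in effect.
toy_set_options(vars_key="cat_vars")
global_value = OPTIONS["vars_key"]

print(inside, after, global_value)
```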