API Reference#

This page provides an auto-generated summary of intake-esm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.

ESM Datastore (`intake.open_esm_datastore`)#

class intake_esm.core.esm_datastore(*args, **kwargs)[source]

An intake plugin for parsing an ESM (Earth System Model) Catalog and loading assets (netCDF files and/or Zarr stores) into xarray datasets. The in-memory representation for the catalog is a Pandas DataFrame.

Parameters:

obj (str, dict) – If string, this must be a path or URL to an ESM catalog JSON file. If dict, this must be a dict representation of an ESM catalog. This dict must have two keys: ‘esmcat’ and ‘df’. The ‘esmcat’ key must be a dict representation of the ESM catalog and the ‘df’ key must be a Pandas DataFrame containing content that would otherwise be in a CSV file.
sep (str, optional) – Delimiter to use when constructing a key for a query, by default ‘.’
registry (DerivedVariableRegistry, optional) – Registry of derived variables to use, by default None. If not provided, uses the default registry.
read_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the read_csv() function.
columns_with_iterables (list of str, optional) – A list of columns in the csv file containing iterables. Values in columns specified here will be converted with ast.literal_eval when read_csv() is called (i.e., this is a shortcut to passing converters to read_csv_kwargs).
storage_options (dict, optional) – Parameters passed to the backend file-system, such as Google Cloud Storage or Amazon Web Services S3.
intake_kwargs (dict, optional) – Additional keyword arguments passed through to the Catalog base class.
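To make the columns_with_iterables conversion concrete, here is a minimal sketch that uses only the standard library (csv and ast) instead of pandas. The CSV content and column names are made up for illustration; the point is only to show ast.literal_eval turning a stringified list back into a Python list.

```python
import ast
import csv
import io

# Hypothetical catalog CSV: the "variable" column stores a stringified list.
csv_text = """path,variable
/data/a.nc,"['pr', 'tas']"
/data/b.nc,"['tas']"
"""

columns_with_iterables = ['variable']

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    for column in columns_with_iterables:
        # Convert the string "['pr', 'tas']" into the list ['pr', 'tas'].
        row[column] = ast.literal_eval(row[column])
    rows.append(row)

print(rows[0]['variable'])
```

With pandas, the same effect would come from passing converters={'variable': ast.literal_eval} through read_csv_kwargs, which is the longer form that columns_with_iterables abbreviates.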

Examples

At import time, this plugin is available in intake’s registry as esm_datastore and can be accessed with intake.open_esm_datastore():

>>> import intake
>>> url = 'https://storage.googleapis.com/cmip6/pangeo-cmip6.json'
>>> cat = intake.open_esm_datastore(url)
>>> cat.df.head()
activity_id institution_id source_id experiment_id  ... variable_id grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1        ssp370  ...          pr         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1        ssp370  ...        prsn         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1        ssp370  ...         tas         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
3  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmax         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
4  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmin         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN

__getitem__(key)[source]

This method takes a key argument and returns a data source corresponding to the assets (files) that will be aggregated into a single xarray dataset.

Parameters:: key (str) – key to use for catalog entry lookup
Returns:: intake_esm.source.ESMDataSource – A data source by name (key)
Raises:: KeyError – if key is not found.

Examples

>>> cat = intake.open_esm_datastore('mycatalog.json')
>>> data_source = cat['AerChemMIP.BCC.BCC-ESM1.piClim-control.AERmon.gn']

keys()[source]

Get keys for the catalog entries

Returns:: list – keys for the catalog entries

keys_info()[source]

Get keys for the catalog entries and their metadata

Returns:: pandas.DataFrame – keys for the catalog entries and their metadata

Examples

>>> import intake
>>> cat = intake.open_esm_datastore('./tests/sample-catalogs/cesm1-lens-netcdf.json')
>>> cat.keys_info()
                component experiment stream
key
ocn.20C.pop.h         ocn        20C  pop.h
ocn.CTRL.pop.h        ocn       CTRL  pop.h
ocn.RCP85.pop.h       ocn      RCP85  pop.h
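The keys shown above look like the values of the listed columns joined with the sep delimiter. The sketch below reproduces them under that assumption; make_key is a hypothetical helper written for this illustration, not part of intake-esm.

```python
# Records copied from the keys_info() output above.
records = [
    {'component': 'ocn', 'experiment': '20C', 'stream': 'pop.h'},
    {'component': 'ocn', 'experiment': 'CTRL', 'stream': 'pop.h'},
    {'component': 'ocn', 'experiment': 'RCP85', 'stream': 'pop.h'},
]
key_columns = ['component', 'experiment', 'stream']


def make_key(record, columns, sep='.'):
    """Join the values of the key columns with the separator (assumed behaviour)."""
    return sep.join(str(record[column]) for column in columns)


keys = sorted({make_key(record, key_columns) for record in records})
print(keys)
```

Note that the stream value pop.h itself contains a dot, so a key cannot be split back into its columns reliably with the default separator. Passing a different sep (for example '/') avoids that ambiguity.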

nunique()[source]

Count distinct observations across dataframe columns in the catalog.

Examples

>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat.nunique()
activity_id          10
institution_id       23
source_id            48
experiment_id        29
member_id            86
table_id             19
variable_id         187
grid_label            7
zstore            27437
dcpp_init_year       59
dtype: int64

search(require_all_on=None, **query)[source]

Search for entries in the catalog.

Parameters:

require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.
**query – keyword arguments corresponding to user’s query to execute against the dataframe.

Returns:

cat (esm_datastore) – A new Catalog with a subset of the entries in this Catalog.

Examples

>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat.df.head(3)
activity_id institution_id source_id  ... grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN

>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> sub_cat.df.head(3)
    activity_id institution_id    source_id  ... grid_label                                             zstore dcpp_init_year
260        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i...            NaN
346        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r2i...            NaN
401        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r3i...            NaN
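The example above does not use require_all_on. The following pure-Python sketch illustrates what that parameter is described as doing, on a list of dicts rather than a DataFrame. It is a simplified paraphrase, not the library's code: it assumes that values listed for one column are alternatives, that different columns must all match, and that a group is kept only if it contains every combination of the remaining query values. The records and the toy search function are hypothetical.

```python
import itertools
from collections import defaultdict


def search(records, require_all_on=None, **query):
    """Toy version of the search semantics on a list of dicts."""
    # Treat scalar query values as one-element lists.
    query = {col: val if isinstance(val, list) else [val] for col, val in query.items()}

    # A row matches when, for every queried column, its value is one of the requested values.
    matched = [row for row in records if all(row[col] in vals for col, vals in query.items())]
    if require_all_on is None:
        return matched

    group_cols = [require_all_on] if isinstance(require_all_on, str) else list(require_all_on)
    # Columns used for grouping are excluded from the completeness check.
    check_cols = [col for col in query if col not in group_cols]
    required = set(itertools.product(*(query[col] for col in check_cols)))

    groups = defaultdict(list)
    for row in matched:
        groups[tuple(row[col] for col in group_cols)].append(row)

    kept = []
    for rows in groups.values():
        found = {tuple(row[col] for col in check_cols) for row in rows}
        if found == required:  # the group covers every requested combination
            kept.extend(rows)
    return kept


records = [
    {'source_id': 'A', 'experiment_id': 'historical', 'variable_id': 'pr'},
    {'source_id': 'A', 'experiment_id': 'ssp585', 'variable_id': 'pr'},
    {'source_id': 'B', 'experiment_id': 'historical', 'variable_id': 'pr'},
    {'source_id': 'B', 'experiment_id': 'historical', 'variable_id': 'tas'},
]
query = {'experiment_id': ['historical', 'ssp585'], 'variable_id': 'pr'}

loose = search(records, **query)
strict = search(records, require_all_on='source_id', **query)
print(len(loose), [row['source_id'] for row in strict])
```

Here source B has no ssp585 entry, so it survives the plain search but is dropped once all experiments are required for each source_id.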

The search method also accepts compiled regular expression objects from re.compile() as patterns.

>>> import re
>>> # Let's search for variables containing "Frac" in their name
>>> pat = re.compile(r'Frac')  # Define a regular expression
>>> sub_cat = cat.search(variable_id=pat)
>>> sub_cat.df.head().variable_id
0     residualFrac
1    landCoverFrac
2    landCoverFrac
3     residualFrac
4    landCoverFrac
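Because the pattern above matches names that merely contain "Frac", an unanchored pattern can match more than intended. The standard-library sketch below shows the difference between an unanchored and an anchored pattern on a hypothetical list of variable names; it assumes match-anywhere semantics, as the output above suggests, and does not call intake-esm.

```python
import re

# Hypothetical variable names, some taken from the examples on this page.
variable_ids = ['residualFrac', 'landCoverFrac', 'pr', 'tas', 'tasmax', 'tasmin']

contains_frac = re.compile(r'Frac')   # matches anywhere in the name
loose_tas = re.compile(r'tas')        # also matches tasmax and tasmin
exact_tas = re.compile(r'^tas$')      # anchored: matches only "tas"

frac_names = [name for name in variable_ids if contains_frac.search(name)]
loose_names = [name for name in variable_ids if loose_tas.search(name)]
exact_names = [name for name in variable_ids if exact_tas.search(name)]
print(frac_names, loose_names, exact_names)
```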

serialize(name, directory=None, catalog_type='dict', to_csv_kwargs=None, json_dump_kwargs=None, storage_options=None)[source]

Serialize catalog to corresponding json and csv files.

Parameters:

name (str) – name to use when creating ESM catalog json file and csv catalog.
directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory
catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file.
to_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the to_csv() method.
json_dump_kwargs (dict, optional) – Additional keyword arguments passed through to the dump() function.
storage_options (dict, optional) – fsspec parameters passed to the backend file-system, such as Google Cloud Storage or Amazon Web Services S3.

Notes

Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog as a separate CSV file.

Examples

>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat_subset = cat.search(
...     source_id='BCC-ESM1',
...     grid_label='gn',
...     table_id='Amon',
...     experiment_id='historical',
... )
>>> cat_subset.serialize(name='cmip6_bcc_esm1', catalog_type='file')
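The difference between the two catalog_type values can be pictured with a standard-library sketch. It assumes that 'dict' embeds the table in the JSON document under catalog_dict and that 'file' writes a CSV next to it and records its location under catalog_file (both field names appear in the ESMCatalogModel signature on this page). The save helper, the file naming, the metadata values and the rows are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical catalog table and metadata.
rows = [
    {'source_id': 'BCC-ESM1', 'variable_id': 'pr', 'path': 'gs://bucket/pr'},
    {'source_id': 'BCC-ESM1', 'variable_id': 'tas', 'path': 'gs://bucket/tas'},
]
metadata = {'esmcat_version': '0.1.0', 'id': 'demo'}


def save(name, directory, catalog_type='dict'):
    """Write <name>.json, plus <name>.csv when catalog_type is 'file'."""
    directory = Path(directory)
    document = dict(metadata)
    if catalog_type == 'file':
        csv_path = directory / f'{name}.csv'
        with csv_path.open('w', newline='') as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        document['catalog_file'] = str(csv_path)  # JSON only points at the table
    else:
        document['catalog_dict'] = rows  # table embedded in the JSON itself
    json_path = directory / f'{name}.json'
    json_path.write_text(json.dumps(document, indent=2))
    return json_path


with tempfile.TemporaryDirectory() as tmp:
    inline = json.loads(save('inline', tmp, 'dict').read_text())
    external = json.loads(save('external', tmp, 'file').read_text())

print(sorted(inline), sorted(external))
```

With 'file', the JSON document stays the same size no matter how many rows the table has, which is the motivation given in the note above.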

to_dask(**kwargs)[source]

Convert result to an xarray dataset.

This is only possible if the search returned exactly one result.

Parameters:: kwargs (dict) – Parameters forwarded to to_dataset_dict().
Returns:: Dataset
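The "exactly one result" requirement can be sketched as a small guard around a dictionary of datasets. The helper name only_dataset, the exception type and the placeholder values are assumptions for illustration; the actual check inside intake-esm may differ.

```python
def only_dataset(datasets):
    """Return the single value of a one-entry dict, or fail loudly."""
    if len(datasets) != 1:
        raise ValueError(f'expected exactly one dataset, found {len(datasets)}')
    (dataset,) = datasets.values()
    return dataset


# Strings stand in for xarray datasets here.
one = only_dataset({'ocn.20C.pop.h': 'dataset-for-20C'})
print(one)
```

In practice this means narrowing the catalog with search() until a single key remains before calling to_dask().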

to_dataset_dict(xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, preprocess=None, storage_options=None, progressbar=None, aggregate=None, skip_on_error=False, **kwargs)[source]

Load catalog entries into a dictionary of xarray datasets.

Column values, dataset keys and requested variables are added as global attributes on the returned datasets. The names of these attributes can be customized with intake_esm.utils.set_options.

Parameters:

xarray_open_kwargs (dict) – Keyword arguments to pass to the open_dataset() function.
xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to the combine_by_coords() function.
preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.
storage_options (dict, optional) – fsspec parameters passed to the backend file-system, such as Google Cloud Storage or Amazon Web Services S3.
progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.
aggregate (bool, optional) – If False, no aggregation will be done.
skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.

Returns:

dsets (dict) – A dictionary of xarray Datasets.

Examples

>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_dataset_dict()
>>> dsets.keys()
dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn'])
>>> dsets['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn']
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
* lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
* lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
* time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
* member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
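How preprocess and skip_on_error interact can be sketched without xarray. In the stand-in below, plain dicts play the role of datasets, and load_all, open_asset and add_flag are hypothetical names; the sketch only mirrors the parameter descriptions above (preprocess runs on each piece before aggregation, and failing keys are skipped when skip_on_error is True).

```python
def load_all(entries, open_asset, preprocess=None, skip_on_error=False):
    """Open every asset of every key; stand-in for the loading loop."""
    datasets = {}
    for key, assets in entries.items():
        try:
            pieces = []
            for asset in assets:
                piece = open_asset(asset)
                if preprocess is not None:
                    # Applied to each piece before the pieces are combined.
                    piece = preprocess(piece)
                pieces.append(piece)
        except Exception:
            if skip_on_error:
                continue  # drop this key and carry on with the others
            raise
        datasets[key] = pieces  # a real loader would aggregate the pieces here
    return datasets


def open_asset(path):
    # Hypothetical opener: fails for one path to imitate an unreadable file.
    if path == 'broken.nc':
        raise FileNotFoundError(path)
    return {'path': path, 'units': 'kg m-2 s-1'}


def add_flag(piece):
    return {**piece, 'preprocessed': True}


entries = {
    'CMIP.A.historical': ['a_1850.nc', 'a_1900.nc'],
    'CMIP.B.historical': ['broken.nc'],
}
loaded = load_all(entries, open_asset, preprocess=add_flag, skip_on_error=True)
print(list(loaded))
```

With skip_on_error left at False, the same call would stop at the first unreadable asset instead of returning a partial result.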

to_datatree(xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, preprocess=None, storage_options=None, progressbar=None, aggregate=None, skip_on_error=False, levels=None, **kwargs)[source]

Load catalog entries into a tree of xarray datasets.

Parameters:

xarray_open_kwargs (dict) – Keyword arguments to pass to the open_dataset() function.
xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to the combine_by_coords() function.
preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.
storage_options (dict, optional) – Parameters passed to the backend file-system, such as Google Cloud Storage or Amazon Web Services S3.
progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.
aggregate (bool, optional) – If False, no aggregation will be done.
skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.
levels (list[str], optional) – List of fields to use as the datatree nodes. WARNING: This will overwrite the fields used to create the unique aggregation keys.

Returns:

dsets (DataTree) – A tree of xarray Datasets.

Examples

>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_datatree()
>>> dsets['CMIP/BCC.BCC-CSM2-MR/historical/Amon/gn'].ds
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
* lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
* lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
* time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
* member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>

unique()[source]: Return unique values for given columns in the catalog.

property df: Return pandas DataFrame.

property key_template

Return string template used to create catalog entry keys

Returns:: str – string template used to create catalog entry keys

ESM DataSource#

class intake_esm.source.ESMDataSource(*args, **kwargs)[source]

__init__(key, records, path_column_name, data_format, format_column_name, *, variable_column_name=None, aggregations=None, requested_variables=None, preprocess=None, storage_options=None, xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, intake_kwargs=None)[source]

An intake compatible Data Source for ESM data.

Parameters:

key (str) – The key of the data source.
records (list of dict) – A list of records, each of which is a dictionary mapping column names to values.
path_column_name (str) – The column name of the path.
data_format (DataFormat) – The data format of the data.
variable_column_name (str, optional) – The column name of the variable name.
aggregations (list of Aggregation, optional) – A list of aggregations to apply to the data.
requested_variables (list of str, optional) – A list of variables to load.
preprocess (callable, optional) – A preprocessing function to apply to the data.
storage_options (dict, optional) – fsspec parameters passed to the backend file-system, such as Google Cloud Storage or Amazon Web Services S3.
xarray_open_kwargs (dict, optional) – Keyword arguments to pass to the open_dataset() function.
xarray_combine_by_coords_kwargs (dict, optional) – Keyword arguments to pass to the combine_by_coords() function.
intake_kwargs (dict, optional) – Additional keyword arguments passed through to the DataSource base class.

close()[source]: Delete open files from memory

to_dask()[source]: Return xarray object (which will have chunks)

ESM Catalog#

class intake_esm.cat.ESMCatalogModel(*, esmcat_version, attributes, assets, aggregation_control=None, id='', catalog_dict=None, catalog_file=None, description=None, title=None, last_updated=None)[source]

Pydantic model for the ESM data catalog defined in https://git.io/JBWoW

classmethod load(json_file, storage_options=None, read_csv_kwargs=None)[source]

Loads the catalog from a file

Parameters:

json_file (str or pathlib.Path) – The path to the json file containing the catalog
storage_options (dict, optional) – fsspec parameters passed to the backend file-system, such as Google Cloud Storage or Amazon Web Services S3.
read_csv_kwargs (dict) – Additional keyword arguments passed through to the read_csv() function.

model_post_init(__context)

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Args:: self: The BaseModel instance. __context: The context.

nunique()[source]: Return a series of the number of unique values for each column in the catalog.

save(name, *, directory=None, catalog_type='dict', to_csv_kwargs=None, json_dump_kwargs=None, storage_options=None)[source]

Save the catalog to a file.

Parameters:

name (str) – The name of the file to save the catalog to.
directory (str) – The directory or cloud storage bucket to save the catalog to. If None, use the current directory.
catalog_type (str) – The type of catalog to save. Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file. Valid options are ‘dict’ and ‘file’.
to_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the to_csv() method.
json_dump_kwargs (dict, optional) – Additional keyword arguments passed through to the dump() function.
storage_options (dict, optional) – fsspec parameters passed to the backend file-system, such as Google Cloud Storage or Amazon Web Services S3.

Notes

Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog as a separate CSV file.

search(*, query, require_all_on=None)[source]

Search for entries in the catalog.

Parameters:

query (dict, optional) – A dictionary of query parameters to execute against the dataframe.
require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.

Returns:

catalog (ESMCatalogModel) – A new catalog with the entries satisfying the query criteria.

unique()[source]: Return a series of unique values for each column in the catalog.

property columns_with_iterables: Return a set of columns that have iterables.

property df: Return the dataframe.

property has_multiple_variable_assets: Return True if the catalog has multiple variable assets.

model_computed_fields = {}: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {'arbitrary_types_allowed': True, 'validate_assignment': True}: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields = {'aggregation_control': FieldInfo(annotation=Union[AggregationControl, NoneType], required=False), 'assets': FieldInfo(annotation=Assets, required=True), 'attributes': FieldInfo(annotation=list[Attribute], required=True), 'catalog_dict': FieldInfo(annotation=Union[list[dict], NoneType], required=False), 'catalog_file': FieldInfo(annotation=Union[Annotated[str, Strict(strict=True)], NoneType], required=False), 'description': FieldInfo(annotation=Union[Annotated[str, Strict(strict=True)], NoneType], required=False), 'esmcat_version': FieldInfo(annotation=str, required=True, metadata=[Strict(strict=True)]), 'id': FieldInfo(annotation=str, required=False, default=''), 'last_updated': FieldInfo(annotation=Union[datetime, date, NoneType], required=False), 'title': FieldInfo(annotation=Union[Annotated[str, Strict(strict=True)], NoneType], required=False)}

Metadata about the fields defined on the model, a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

Query Model#

class intake_esm.cat.QueryModel(*, query, columns, require_all_on=None)[source]

A Pydantic model to represent a query to be executed against a catalog.

model_computed_fields = {}: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {'validate_assignment': False}: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields = {'columns': FieldInfo(annotation=list[str], required=True), 'query': FieldInfo(annotation=dict[Annotated[str, Strict(strict=True)], Union[Any, list[Any]]], required=True), 'require_all_on': FieldInfo(annotation=Union[str, list[Any], NoneType], required=False)}

Metadata about the fields defined on the model, a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

Derived Variable Registry#

class intake_esm.derived.DerivedVariableRegistry[source]

Registry of derived variables

__init__(*args, **kwargs)

classmethod load(name, package=None)[source]

Load a DerivedVariableRegistry from a Python module/file

Parameters:

name (str) – The name of the module to load the DerivedVariableRegistry from.
package (str, optional) – The package to load the module from. This argument is required when performing a relative import. It specifies the package to use as the anchor point from which to resolve the relative import to an absolute import.

Returns:

DerivedVariableRegistry – A DerivedVariableRegistry loaded from the Python module.

Notes

If you have a folder /home/foo/pythonfiles and want to load a registry defined in registry.py in that folder, make sure the folder is on the Python module search path (sys.path or $PYTHONPATH) before calling this function.

>>> import sys
>>> sys.path.insert(0, '/home/foo/pythonfiles')
>>> from intake_esm.derived import DerivedVariableRegistry
>>> registry = DerivedVariableRegistry.load('registry')
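The description of the package argument mirrors that of importlib.import_module, so the sketch below uses that standard-library function to show the difference between an absolute and a relative module name. It assumes load() resolves names the same way; the package name intake_esm_demo_pkg and its contents are made up, and the sketch does not call intake-esm.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway package containing a registry module.
    package_dir = Path(tmp) / 'intake_esm_demo_pkg'
    package_dir.mkdir()
    (package_dir / '__init__.py').write_text('')
    (package_dir / 'registry.py').write_text('NAME = "demo registry"\n')

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # Absolute name: no package argument needed.
        absolute = importlib.import_module('intake_esm_demo_pkg.registry')
        # Relative name (leading dot): the package anchors the lookup.
        relative = importlib.import_module('.registry', package='intake_esm_demo_pkg')
    finally:
        sys.path.remove(tmp)

print(relative.NAME, relative is absolute)
```

Both calls resolve to the same module object; the relative form simply needs the anchor package to be named explicitly.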

search(variable)[source]

Search for a derived variable by name or list of names

Parameters:: variable (typing.Union[str, typing.List[str]]) – The name of the variable to search for.
Returns:: DerivedVariableRegistry – A DerivedVariableRegistry with the found variables.

update_datasets(*, datasets, variable_key_name, skip_on_error=False)[source]

Given a dictionary of datasets, return a dictionary of datasets with the derived variables applied.

Parameters:

datasets (typing.Dict[str, xr.Dataset]) – A dictionary of datasets to apply the derived variables to.
variable_key_name (str) – The name of the variable key used in the derived variable query
skip_on_error (bool, optional) – If True, skip variables that fail variable derivation.

Returns:

typing.Dict[str, xr.Dataset] – A dictionary of datasets with the derived variables applied.

register[source]

Parameters:

func (typing.Callable) – The function to apply to the dependent variables.
variable (str) – The name of the variable to derive.
query (typing.Dict[str, typing.Union[typing.Any, typing.List[typing.Any]]]) – The query to use to retrieve dependent variables required to derive variable.
prefer_derived (bool, optional (default=False)) – Specify whether to compute this variable on datasets that already contain a variable of the same name. Default (False) is to leave the existing variable.

Returns:

typing.Callable – The function that was registered.
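The parameters above describe a decorator-style registration. The toy class below imitates that pattern with plain dicts as datasets so the effect of prefer_derived is easy to see. ToyRegistry, its update method and the variable names ua, va and wind_speed are hypothetical; this is a sketch of the idea, not intake-esm's DerivedVariableRegistry.

```python
import functools


class ToyRegistry:
    """Stand-in for a derived-variable registry; not intake-esm's class."""

    def __init__(self):
        self.entries = {}

    def register(self, func=None, *, variable, query, prefer_derived=False):
        if func is None:
            # Used as @registry.register(variable=..., query=...): return a decorator.
            return functools.partial(
                self.register, variable=variable, query=query, prefer_derived=prefer_derived
            )
        self.entries[variable] = {'func': func, 'query': query, 'prefer_derived': prefer_derived}
        return func  # the registered function is returned unchanged

    def update(self, dataset):
        for variable, entry in self.entries.items():
            if variable in dataset and not entry['prefer_derived']:
                continue  # keep the variable that is already present
            dataset = entry['func'](dataset)
        return dataset


registry = ToyRegistry()


@registry.register(variable='wind_speed', query={'variable_id': ['ua', 'va']})
def wind_speed(ds):
    ds = dict(ds)
    ds['wind_speed'] = (ds['ua'] ** 2 + ds['va'] ** 2) ** 0.5
    return ds


derived = registry.update({'ua': 3.0, 'va': 4.0})
kept = registry.update({'ua': 3.0, 'va': 4.0, 'wind_speed': -1.0})
print(derived['wind_speed'], kept['wind_speed'])
```

Because prefer_derived defaults to False, the second call leaves the existing wind_speed value untouched.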

Derived Variable#

class intake_esm.derived.DerivedVariable(*, func, variable, query, prefer_derived)[source]

dependent_variables(variable_key_name)[source]: Return a list of dependent variables for a given variable

model_computed_fields = {}: A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields = {'func': FieldInfo(annotation=Callable, required=True), 'prefer_derived': FieldInfo(annotation=bool, required=True), 'query': FieldInfo(annotation=dict[Annotated[str, Strict(strict=True)], Union[Any, list[Any]]], required=True), 'variable': FieldInfo(annotation=str, required=True, metadata=[Strict(strict=True)])}

Metadata about the fields defined on the model, a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

Options for dataset attributes#

class intake_esm.utils.set_options(**kwargs)[source]

Set options for intake_esm in a controlled context.

Currently-supported options:

attrs_prefix: The prefix to use in the names of attributes constructed from the catalog’s columns when returning xarray Datasets. Default: intake_esm_attrs.
dataset_key: Name of the global attribute in which to store the dataset’s key. Default: intake_esm_dataset_key.
vars_key: Name of the global attribute in which to store the list of requested variables when opening a dataset. Default: intake_esm_vars.

Examples

You can use set_options either as a context manager:

>>> import intake
>>> import intake_esm
>>> cat = intake.open_esm_datastore('catalog.json')
>>> with intake_esm.set_options(attrs_prefix='cat'):
...     out = cat.to_dataset_dict()

Or to set global options:

>>> intake_esm.set_options(attrs_prefix='cat', vars_key='cat_vars')
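One common way to support both usages is a class that applies the options when it is constructed and restores the previous values when the with block exits. The sketch below shows that pattern with the default values listed above; toy_set_options and the OPTIONS dictionary are hypothetical, and the sketch is not taken from intake-esm's source.

```python
# Defaults as documented above.
OPTIONS = {
    'attrs_prefix': 'intake_esm_attrs',
    'dataset_key': 'intake_esm_dataset_key',
    'vars_key': 'intake_esm_vars',
}


class toy_set_options:
    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(OPTIONS)
        if unknown:
            raise ValueError(f'unknown option(s): {sorted(unknown)}')
        # Remember the previous values so the context manager can restore them.
        self.old = {name: OPTIONS[name] for name in kwargs}
        OPTIONS.update(kwargs)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        OPTIONS.update(self.old)


with toy_set_options(attrs_prefix='cat'):
    inside = OPTIONS['attrs_prefix']
after = OPTIONS['attrs_prefix']

toy_set_options(vars_key='cat_vars')  # no "with": the change persists
print(inside, after, OPTIONS['vars_key'])
```

The options take effect in the constructor, so calling the class without a with block changes them globally, while the with form undoes the change on exit.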
