API Reference

This page provides an auto-generated summary of intake-esm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.

ESM Datastore (intake.open_esm_datastore)

class intake_esm.core.esm_datastore(*args, **kwargs)[source]

An intake plugin for parsing an ESM (Earth System Model) Collection/catalog and loading assets (netCDF files and/or Zarr stores) into xarray datasets. The in-memory representation for the catalog is a Pandas DataFrame.

Parameters
  • esmcol_obj (str, pandas.DataFrame) – If string, this must be a path or URL to an ESM collection JSON file. If pandas.DataFrame, this must be the catalog content that would otherwise be in a CSV file.

  • esmcol_data (dict, optional) – ESM collection spec information, by default None

  • progressbar (bool, optional) – Will print a progress bar to standard error (stderr) when loading assets into Dataset, by default True

  • sep (str, optional) – Delimiter to use when constructing a key for a query, by default ‘.’

  • csv_kwargs (dict, optional) – Additional keyword arguments passed through to the pandas read_csv() function when reading the catalog CSV file.

  • **kwargs – Additional keyword arguments are passed through to the Catalog base class.

Examples

At import time, this plugin is available in intake’s registry as esm_datastore and can be accessed with intake.open_esm_datastore():

>>> import intake
>>> url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
>>> col = intake.open_esm_datastore(url)
>>> col.df.head()
activity_id institution_id source_id experiment_id  ... variable_id grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1        ssp370  ...          pr         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1        ssp370  ...        prsn         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1        ssp370  ...         tas         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
3  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmax         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
4  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmin         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
classmethod from_df(df, esmcol_data=None, progressbar=True, sep='.', **kwargs)[source]

Create a catalog from the given DataFrame.

Parameters
  • df (pandas.DataFrame) – catalog content that would otherwise be in a CSV file.

  • esmcol_data (dict, optional) – ESM collection spec information, by default None

  • progressbar (bool, optional) – Will print a progress bar to standard error (stderr) when loading assets into Dataset, by default True

  • sep (str, optional) – Delimiter to use when constructing a key for a query, by default ‘.’

Returns

esm_datastore – Catalog object
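The df argument is simply the catalog table that would otherwise live in a CSV file. A minimal illustrative table (column names follow the CMIP6 example above; the store paths are placeholders) could be built like this, assuming intake-esm is installed for the final step:

```python
import pandas as pd

# Minimal illustrative catalog content (columns and paths are placeholders)
df = pd.DataFrame(
    {
        "activity_id": ["CMIP", "CMIP"],
        "source_id": ["BCC-ESM1", "BCC-ESM1"],
        "experiment_id": ["historical", "historical"],
        "variable_id": ["pr", "tas"],
        "zstore": ["gs://bucket/path/pr", "gs://bucket/path/tas"],
    }
)

# With intake-esm installed, this table becomes a catalog:
# cat = intake_esm.core.esm_datastore.from_df(df)
print(list(df.columns))
```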

keys()[source]

Get keys for the catalog entries

Returns

list – keys for the catalog entries
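Conceptually, each key identifies one group of compatible assets, formed by joining the values of the groupby_attrs columns with sep. A rough pandas sketch of that construction (illustrative only, not intake-esm's actual implementation):

```python
import pandas as pd

# Hypothetical catalog rows and grouping columns (groupby_attrs)
df = pd.DataFrame(
    {
        "activity_id": ["CMIP", "CMIP", "ScenarioMIP"],
        "source_id": ["BCC-ESM1", "BCC-ESM1", "BCC-ESM1"],
        "experiment_id": ["historical", "historical", "ssp370"],
    }
)
groupby_attrs = ["activity_id", "source_id", "experiment_id"]
sep = "."

# One key per unique combination of the grouping columns
keys = sorted(df[groupby_attrs].drop_duplicates().agg(sep.join, axis=1))
print(keys)
```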

nunique()[source]

Count distinct observations across dataframe columns in the catalog.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.nunique()
activity_id          10
institution_id       23
source_id            48
experiment_id        29
member_id            86
table_id             19
variable_id         187
grid_label            7
zstore            27437
dcpp_init_year       59
dtype: int64
search(require_all_on=None, **query)[source]

Search for entries in the catalog.

Parameters
  • require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.

  • **query – Keyword arguments corresponding to the user's query to execute against the dataframe.

Returns

cat (esm_datastore) – A new Catalog with a subset of the entries in this Catalog.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.df.head(3)
activity_id institution_id source_id  ... grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
>>> cat = col.search(
...     source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
...     grid_label="gn",
... )
>>> cat.df.head(3)
    activity_id institution_id    source_id  ... grid_label                                             zstore dcpp_init_year
260        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i...            NaN
346        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r2i...            NaN
401        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r3i...            NaN

The search method also accepts compiled regular expression objects from re.compile() as patterns.

>>> import re
>>> # Let's search for variables containing "Frac" in their name
>>> pat = re.compile(r"Frac")  # Define a regular expression
>>> cat = col.search(variable_id=pat)
>>> cat.df.head().variable_id
0     residualFrac
1    landCoverFrac
2    landCoverFrac
3     residualFrac
4    landCoverFrac
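The effect of the require_all_on parameter can be sketched in plain pandas (an illustrative sketch of the semantics, not intake-esm's implementation; the model names are hypothetical): with require_all_on="source_id", a source_id survives only if it has entries satisfying every value in the query.

```python
import pandas as pd

# Hypothetical catalog rows (names are placeholders for illustration)
df = pd.DataFrame(
    {
        "source_id": ["ModelA", "ModelA", "ModelB"],
        "experiment_id": ["historical", "ssp585", "historical"],
    }
)
query = {"experiment_id": ["historical", "ssp585"]}
require_all_on = "source_id"

# First keep rows matching any of the query criteria...
subset = df[df["experiment_id"].isin(query["experiment_id"])]
# ...then keep only groups that cover *all* requested values.
keep = subset.groupby(require_all_on)["experiment_id"].nunique() == len(
    query["experiment_id"]
)
result = subset[subset[require_all_on].isin(keep[keep].index)]
print(sorted(result["source_id"].unique()))
```

Here ModelB is dropped because it has no ssp585 entry, mirroring what search(require_all_on="source_id", experiment_id=["historical", "ssp585"]) would return.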
serialize(name, directory=None, catalog_type='dict')[source]

Serialize collection/catalog to corresponding json and csv files.

Parameters
  • name (str) – name to use when creating ESM collection json file and csv catalog.

  • directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory

  • catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file.

Notes

Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog as a separate CSV file.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col_subset = col.search(
...     source_id="BCC-ESM1",
...     grid_label="gn",
...     table_id="Amon",
...     experiment_id="historical",
... )
>>> col_subset.serialize(name="cmip6_bcc_esm1", catalog_type="file")
Writing csv catalog to: cmip6_bcc_esm1.csv.gz
Writing ESM collection json file to: cmip6_bcc_esm1.json
to_dataset_dict(zarr_kwargs=None, cdf_kwargs=None, preprocess=None, storage_options=None, progressbar=None, aggregate=None)[source]

Load catalog entries into a dictionary of xarray datasets.

Parameters
  • zarr_kwargs (dict) – Keyword arguments to pass to the open_zarr() function.

  • cdf_kwargs (dict) – Keyword arguments to pass to the open_dataset() function. If specifying chunks, the chunking is applied to each netCDF file; chunks must therefore refer to dimensions that are present in every netCDF file, or chunking will fail.

  • preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.

  • storage_options (dict, optional) – Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.

  • progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.

  • aggregate (bool, optional) – If False, no aggregation will be done.

Returns

dsets (dict) – A dictionary of xarray Dataset.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("glade-cmip6.json")
>>> cat = col.search(
...     source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
...     grid_label="gn",
... )
>>> dsets = cat.to_dataset_dict()
>>> dsets.keys()
dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn'])
>>> dsets["CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn"]
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
* lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
* lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
* time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
* member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
unique(columns=None)[source]

Return unique values for given columns in the catalog.

Parameters

columns (str, list) – Name of column(s) for which to get unique values.

Returns

info (dict) – Dictionary containing the count and unique values for each column.

Examples

>>> import intake
>>> import pprint
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> uniques = col.unique(columns=["activity_id", "source_id"])
>>> pprint.pprint(uniques)
{'activity_id': {'count': 10,
                'values': ['AerChemMIP',
                            'C4MIP',
                            'CMIP',
                            'DAMIP',
                            'DCPP',
                            'HighResMIP',
                            'LUMIP',
                            'OMIP',
                            'PMIP',
                            'ScenarioMIP']},
'source_id': {'count': 17,
            'values': ['BCC-ESM1',
                        'CNRM-ESM2-1',
                        'E3SM-1-0',
                        'MIROC6',
                        'HadGEM3-GC31-LL',
                        'MRI-ESM2-0',
                        'GISS-E2-1-G-CC',
                        'CESM2-WACCM',
                        'NorCPM1',
                        'GFDL-AM4',
                        'GFDL-CM4',
                        'NESM3',
                        'ECMWF-IFS-LR',
                        'IPSL-CM6A-ATM-HR',
                        'NICAM16-7S',
                        'GFDL-CM4C192',
                        'MPI-ESM1-2-HR']}}
update_aggregation(attribute_name, agg_type=None, options=None, delete=False)[source]

Update aggregation operations info.

Parameters
  • attribute_name (str) – Name of attribute (column) across which to aggregate.

  • agg_type (str, optional) – Type of aggregation operation to apply. Valid values include: join_new, join_existing, union, by default None

  • options (dict, optional) – Aggregation settings passed as keyword arguments to concat() or merge(). For join_existing, it must contain the name of the existing dimension to use (e.g. {'dim': 'time'}), by default None

  • delete (bool, optional) – Whether to delete/remove/disable aggregation operations for a particular attribute, by default False
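The aggregation entries this method manipulates follow the shape used by the ESM collection specification. A sketch of what such entries look like, and the effect of a delete=True call, using plain dictionaries (the attribute names and options here are illustrative):

```python
# Illustrative aggregation entries in ESM-collection-spec shape
aggregations = [
    {"type": "union", "attribute_name": "variable_id"},
    {
        "type": "join_existing",
        "attribute_name": "time_range",
        "options": {"dim": "time"},  # existing dimension to join along
    },
    {"type": "join_new", "attribute_name": "member_id"},
]

# update_aggregation(attribute_name="member_id", delete=True) would have
# an effect like removing the matching entry:
aggregations = [a for a in aggregations if a["attribute_name"] != "member_id"]
print([a["attribute_name"] for a in aggregations])
```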

property agg_columns

List of columns used to merge/concatenate multiple compatible Datasets into a single Dataset.

property data_format

The data format. Valid values are netcdf and zarr. If specified, it means that all data assets in the catalog use the same data format.

property df

Return the catalog content as a pandas DataFrame.

property format_column_name

Name of the column which contains the data format.

property groupby_attrs

Dataframe columns used to determine groups of compatible datasets.

Returns

list – Columns used to determine groups of compatible datasets.

property key_template

Return the string template used to create catalog entry keys.

Returns

str – string template used to create catalog entry keys

property path_column_name

The name of the column containing the path to the asset.

property variable_column_name

Name of the column that contains the variable name.