API Reference

This page provides an auto-generated summary of intake-esm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.

ESM Datastore (intake.open_esm_datastore)

class intake_esm.core.esm_datastore(*args, **kwargs)[source]

An intake plugin for parsing an ESM (Earth System Model) Collection/catalog and loading assets (netCDF files and/or Zarr stores) into xarray datasets. The in-memory representation for the catalog is a Pandas DataFrame.

Parameters
  • esmcol_obj (str, pandas.DataFrame) – If string, this must be a path or URL to an ESM collection JSON file. If pandas.DataFrame, this must be the catalog content that would otherwise be in a CSV file.

  • esmcol_data (dict, optional) – ESM collection spec information, by default None

  • progressbar (bool, optional) – Will print a progress bar to standard error (stderr) when loading assets into Dataset, by default True

  • sep (str, optional) – Delimiter to use when constructing a key for a query, by default ‘.’

  • csv_kwargs (dict, optional) – Additional keyword arguments passed through to the pandas read_csv() function when reading the catalog CSV file.

  • **kwargs – Additional keyword arguments are passed through to the Catalog base class.

Examples

At import time, this plugin is available in intake’s registry as esm_datastore and can be accessed with intake.open_esm_datastore():

>>> import intake
>>> url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
>>> col = intake.open_esm_datastore(url)
>>> col.df.head()
activity_id institution_id source_id experiment_id  ... variable_id grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1        ssp370  ...          pr         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1        ssp370  ...        prsn         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1        ssp370  ...         tas         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
3  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmax         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
4  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmin         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
classmethod from_df(df, esmcol_data=None, progressbar=True, sep='.', **kwargs)[source]

Create a catalog from the given DataFrame.

Parameters
  • df (pandas.DataFrame) – catalog content that would otherwise be in a CSV file.

  • esmcol_data (dict, optional) – ESM collection spec information, by default None

  • progressbar (bool, optional) – Will print a progress bar to standard error (stderr) when loading assets into Dataset, by default True

  • sep (str, optional) – Delimiter to use when constructing a key for a query, by default ‘.’

Returns

esm_datastore – Catalog object
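The df argument is simply the catalog table that would otherwise live in a CSV file. A minimal illustrative table (column names follow the CMIP6 example above; the store paths are placeholders) could be built like this, assuming intake-esm is installed for the final step:

```python
import pandas as pd

# Minimal illustrative catalog content (columns and paths are placeholders)
df = pd.DataFrame(
    {
        "activity_id": ["CMIP", "CMIP"],
        "source_id": ["BCC-ESM1", "BCC-ESM1"],
        "experiment_id": ["historical", "historical"],
        "variable_id": ["pr", "tas"],
        "zstore": ["gs://bucket/path/pr", "gs://bucket/path/tas"],
    }
)

# With intake-esm installed, this table becomes a catalog:
# cat = intake_esm.core.esm_datastore.from_df(df)
print(list(df.columns))
```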

keys()[source]

Get keys for the catalog entries

Returns

list – keys for the catalog entries
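Conceptually, each key identifies one group of compatible assets, formed by joining the values of the groupby_attrs columns with sep. A rough pandas sketch of that construction (illustrative only, not intake-esm's actual implementation):

```python
import pandas as pd

# Hypothetical catalog rows and grouping columns (groupby_attrs)
df = pd.DataFrame(
    {
        "activity_id": ["CMIP", "CMIP", "ScenarioMIP"],
        "source_id": ["BCC-ESM1", "BCC-ESM1", "BCC-ESM1"],
        "experiment_id": ["historical", "historical", "ssp370"],
    }
)
groupby_attrs = ["activity_id", "source_id", "experiment_id"]
sep = "."

# One key per unique combination of the grouping columns
keys = sorted(df[groupby_attrs].drop_duplicates().agg(sep.join, axis=1))
print(keys)
```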

nunique()[source]

Count distinct observations across dataframe columns in the catalog.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.nunique()
activity_id          10
institution_id       23
source_id            48
experiment_id        29
member_id            86
table_id             19
variable_id         187
grid_label            7
zstore            27437
dcpp_init_year       59
dtype: int64
search(require_all_on=None, **query)[source]

Search for entries in the catalog.

Parameters
  • require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.

  • **query – Keyword arguments corresponding to the user's query to execute against the dataframe.

Returns

cat (esm_datastore) – A new Catalog with a subset of the entries in this Catalog.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.df.head(3)
activity_id institution_id source_id  ... grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
>>> cat = col.search(
...     source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
...     grid_label="gn",
... )
>>> cat.df.head(3)
    activity_id institution_id    source_id  ... grid_label                                             zstore dcpp_init_year
260        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i...            NaN
346        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r2i...            NaN
401        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r3i...            NaN

The search method also accepts compiled regular expression objects from re.compile() as patterns.

>>> import re
>>> # Let's search for variables containing "Frac" in their name
>>> pat = re.compile(r"Frac")  # Define a regular expression
>>> cat = col.search(variable_id=pat)
>>> cat.df.head().variable_id
0     residualFrac
1    landCoverFrac
2    landCoverFrac
3     residualFrac
4    landCoverFrac
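The effect of the require_all_on parameter can be sketched in plain pandas (an illustrative sketch of the semantics, not intake-esm's implementation; the model names are hypothetical): with require_all_on="source_id", a source_id survives only if it has entries satisfying every value in the query.

```python
import pandas as pd

# Hypothetical catalog rows (names are placeholders for illustration)
df = pd.DataFrame(
    {
        "source_id": ["ModelA", "ModelA", "ModelB"],
        "experiment_id": ["historical", "ssp585", "historical"],
    }
)
query = {"experiment_id": ["historical", "ssp585"]}
require_all_on = "source_id"

# First keep rows matching any of the query criteria...
subset = df[df["experiment_id"].isin(query["experiment_id"])]
# ...then keep only groups that cover *all* requested values.
keep = subset.groupby(require_all_on)["experiment_id"].nunique() == len(
    query["experiment_id"]
)
result = subset[subset[require_all_on].isin(keep[keep].index)]
print(sorted(result["source_id"].unique()))
```

Here ModelB is dropped because it has no ssp585 entry, mirroring what search(require_all_on="source_id", experiment_id=["historical", "ssp585"]) would return.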
serialize(name, directory=None, catalog_type='dict')[source]

Serialize collection/catalog to corresponding json and csv files.

Parameters
  • name (str) – name to use when creating ESM collection json file and csv catalog.

  • directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory

  • catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file.

Notes

Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog as a separate CSV file.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col_subset = col.search(
...     source_id="BCC-ESM1",
...     grid_label="gn",
...     table_id="Amon",
...     experiment_id="historical",
... )
>>> col_subset.serialize(name="cmip6_bcc_esm1", catalog_type="file")
Writing csv catalog to: cmip6_bcc_esm1.csv.gz
Writing ESM collection json file to: cmip6_bcc_esm1.json
to_dataset_dict(zarr_kwargs=None, cdf_kwargs=None, preprocess=None, storage_options=None, progressbar=None, aggregate=None)[source]

Load catalog entries into a dictionary of xarray datasets.

Parameters
  • zarr_kwargs (dict) – Keyword arguments to pass to the open_zarr() function.

  • cdf_kwargs (dict) – Keyword arguments to pass to the open_dataset() function. If specifying chunks, the chunking is applied to each netCDF file; chunks must therefore refer to dimensions that are present in every netCDF file, or chunking will fail.

  • preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.

  • storage_options (dict, optional) – Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.

  • progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.

  • aggregate (bool, optional) – If False, no aggregation will be done.

Returns

dsets (dict) – A dictionary of xarray Dataset.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("glade-cmip6.json")
>>> cat = col.search(
...     source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
...     grid_label="gn",
... )
>>> dsets = cat.to_dataset_dict()
>>> dsets.keys()
dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn'])
>>> dsets["CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn"]
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
* lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
* lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
* time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
* member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
unique(columns=None)[source]

Return unique values for given columns in the catalog.

Parameters

columns (str, list) – Name of column(s) for which to get unique values.

Returns

info (dict) – Dictionary containing the count and unique values for each column.

Examples

>>> import intake
>>> import pprint
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> uniques = col.unique(columns=["activity_id", "source_id"])
>>> pprint.pprint(uniques)
{'activity_id': {'count': 10,
                'values': ['AerChemMIP',
                            'C4MIP',
                            'CMIP',
                            'DAMIP',
                            'DCPP',
                            'HighResMIP',
                            'LUMIP',
                            'OMIP',
                            'PMIP',
                            'ScenarioMIP']},
'source_id': {'count': 17,
            'values': ['BCC-ESM1',
                        'CNRM-ESM2-1',
                        'E3SM-1-0',
                        'MIROC6',
                        'HadGEM3-GC31-LL',
                        'MRI-ESM2-0',
                        'GISS-E2-1-G-CC',
                        'CESM2-WACCM',
                        'NorCPM1',
                        'GFDL-AM4',
                        'GFDL-CM4',
                        'NESM3',
                        'ECMWF-IFS-LR',
                        'IPSL-CM6A-ATM-HR',
                        'NICAM16-7S',
                        'GFDL-CM4C192',
                        'MPI-ESM1-2-HR']}}
update_aggregation(attribute_name, agg_type=None, options=None, delete=False)[source]

Update aggregation operations info.

Parameters
  • attribute_name (str) – Name of attribute (column) across which to aggregate.

  • agg_type (str, optional) – Type of aggregation operation to apply. Valid values include: join_new, join_existing, union, by default None

  • options (dict, optional) – Aggregation settings passed as keyword arguments to concat() or merge(). For join_existing, it must contain the name of the existing dimension to use (e.g. {'dim': 'time'}), by default None

  • delete (bool, optional) – Whether to delete/remove/disable aggregation operations for a particular attribute, by default False
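The aggregation entries this method manipulates follow the shape used by the ESM collection specification. A sketch of what such entries look like, and the effect of a delete=True call, using plain dictionaries (the attribute names and options here are illustrative):

```python
# Illustrative aggregation entries in ESM-collection-spec shape
aggregations = [
    {"type": "union", "attribute_name": "variable_id"},
    {
        "type": "join_existing",
        "attribute_name": "time_range",
        "options": {"dim": "time"},  # existing dimension to join along
    },
    {"type": "join_new", "attribute_name": "member_id"},
]

# update_aggregation(attribute_name="member_id", delete=True) would have
# an effect like removing the matching entry:
aggregations = [a for a in aggregations if a["attribute_name"] != "member_id"]
print([a["attribute_name"] for a in aggregations])
```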

property agg_columns

List of columns used to merge/concatenate multiple compatible Datasets into a single Dataset.

property data_format

The data format. Valid values are netcdf and zarr. If specified, it means that all data assets in the catalog use the same data format.

property df

Return the catalog content as a pandas DataFrame.

property format_column_name

Name of the column which contains the data format.

property groupby_attrs

Dataframe columns used to determine groups of compatible datasets.

Returns

list – Columns used to determine groups of compatible datasets.

property key_template

Return the string template used to create catalog entry keys.

Returns

str – string template used to create catalog entry keys

property path_column_name

The name of the column containing the path to the asset.

property variable_column_name

Name of the column that contains the variable name.