API Reference#
This page provides an auto-generated summary of intake-esm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.
ESM Datastore (intake.open_esm_datastore)#
- class intake_esm.core.esm_datastore(*args, **kwargs)[source]
An intake plugin for parsing an ESM (Earth System Model) Catalog and loading assets (netCDF files and/or Zarr stores) into xarray datasets. The in-memory representation for the catalog is a Pandas DataFrame.
- Parameters:
  - obj (str, dict, or ESMCatalogModel) – The ESM Catalog to use, or a path to a JSON file containing the catalog. If a string, this must be a path or URL to an ESM catalog JSON file. If a dict, this must be a dict representation of an ESM catalog with two keys: 'esmcat' and 'df'. The 'esmcat' key must be a dict representation of the ESM catalog and the 'df' key must be a Pandas DataFrame containing the content that would otherwise be in a CSV file.
  - sep (str, optional) – Delimiter to use when constructing a key for a query, by default '.'.
  - registry (DerivedVariableRegistry, optional) – Registry of derived variables to use, by default None. If not provided, uses the default registry.
  - read_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the read_csv() function.
  - columns_with_iterables (list of str, optional) – A list of columns in the CSV file containing iterables. Values in the columns specified here are converted with ast.literal_eval when read_csv() is called (i.e., this is a shortcut to passing converters to read_csv_kwargs).
  - storage_options (dict, optional) – Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Services S3.
  - intake_kwargs (dict, optional) – Additional keyword arguments passed through to the Catalog base class.
Examples
At import time, this plugin is available in intake’s registry as esm_datastore and can be accessed with intake.open_esm_datastore():
>>> import intake
>>> url = 'https://storage.googleapis.com/cmip6/pangeo-cmip6.json'
>>> cat = intake.open_esm_datastore(url)
>>> cat.df.head()
  activity_id institution_id source_id experiment_id  ... variable_id grid_label                                             zstore  dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1        ssp370  ...          pr         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
1  AerChemMIP            BCC  BCC-ESM1        ssp370  ...        prsn         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
2  AerChemMIP            BCC  BCC-ESM1        ssp370  ...         tas         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
3  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmax         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
4  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmin         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
- __getitem__(key)[source]
This method takes a key argument and returns a data source corresponding to the assets (files) that will be aggregated into a single xarray dataset.
- Parameters:
  - key (str) – Key to use for catalog entry lookup.
- Returns:
  intake_esm.source.ESMDataSource – A data source by name (key).
- Raises:
  KeyError – If key is not found.
Examples
>>> cat = intake.open_esm_datastore('mycatalog.json')
>>> data_source = cat['AerChemMIP.BCC.BCC-ESM1.piClim-control.AERmon.gn']
- keys_info()[source]
Get keys for the catalog entries and their metadata.
- Returns:
  pandas.DataFrame – Keys for the catalog entries and their metadata.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('./tests/sample-catalogs/cesm1-lens-netcdf.json')
>>> cat.keys_info()
                component experiment stream
key
ocn.20C.pop.h         ocn        20C  pop.h
ocn.CTRL.pop.h        ocn       CTRL  pop.h
ocn.RCP85.pop.h       ocn      RCP85  pop.h
- nunique()[source]
Count distinct observations across dataframe columns in the catalog.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat.nunique()
activity_id          10
institution_id       23
source_id            48
experiment_id        29
member_id            86
table_id             19
variable_id         187
grid_label            7
zstore            27437
dcpp_init_year       59
dtype: int64
- search(require_all_on=None, **query)[source]
Search for entries in the catalog.
- Parameters:
  - require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.
  - **query – Keyword arguments corresponding to the user's query to execute against the dataframe.
- Returns:
  cat (esm_datastore) – A new Catalog with a subset of the entries in this Catalog.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat.df.head(3)
  activity_id institution_id source_id  ... grid_label                                             zstore  dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
1  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
2  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> sub_cat.df.head(3)
    activity_id institution_id    source_id  ... grid_label                                             zstore  dcpp_init_year
260        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i...             NaN
346        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r2i...             NaN
401        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r3i...             NaN
The search method also accepts compiled regular expression objects from re.compile() as patterns.
>>> import re
>>> # Let's search for variables containing "Frac" in their name
>>> pat = re.compile(r'Frac')  # Define a regular expression
>>> cat.search(variable_id=pat)
>>> cat.df.head().variable_id
0     residualFrac
1    landCoverFrac
2    landCoverFrac
3     residualFrac
4    landCoverFrac
- serialize(name, directory=None, catalog_type='dict', to_csv_kwargs=None, json_dump_kwargs=None, storage_options=None)[source]
Serialize catalog to corresponding json and csv files.
- Parameters:
  - name (str) – Name to use when creating the ESM catalog JSON file and CSV catalog.
  - directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory.
  - catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file.
  - to_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the to_csv() method.
  - json_dump_kwargs (dict, optional) – Additional keyword arguments passed through to the dump() function.
  - storage_options (dict) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Services S3.
Notes
Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog table as a separate CSV file.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat_subset = cat.search(
...     source_id='BCC-ESM1',
...     grid_label='gn',
...     table_id='Amon',
...     experiment_id='historical',
... )
>>> cat_subset.serialize(name='cmip6_bcc_esm1', catalog_type='file')
- to_dask(**kwargs)[source]
Convert result to an xarray dataset.
This is only possible if the search returned exactly one result.
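For illustration, a minimal sketch of narrowing a catalog to a single entry before calling to_dask; the catalog path and query values below are assumptions, not part of the documented example:
>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')  # illustrative catalog path
>>> single = cat.search(
...     source_id='BCC-ESM1',
...     experiment_id='historical',
...     table_id='Amon',
...     variable_id='tas',
...     member_id='r1i1p1f1',
...     grid_label='gn',
... )  # keep narrowing the query until exactly one dataset key remains
>>> ds = single.to_dask()  # a single, lazily loaded xarray.Dataset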
- to_dataset_dict(xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, preprocess=None, storage_options=None, progressbar=None, aggregate=None, skip_on_error=False, **kwargs)[source]
Load catalog entries into a dictionary of xarray datasets.
Column values, dataset keys and requested variables are added as global attributes on the returned datasets. The names of these attributes can be customized with intake_esm.utils.set_options.
- Parameters:
  - xarray_open_kwargs (dict) – Keyword arguments to pass to the open_dataset() function.
  - xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to the combine_by_coords() function.
  - preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.
  - storage_options (dict, optional) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Services S3.
  - progressbar (bool) – If True, print a progress bar to standard error (stderr) when loading assets into Dataset.
  - aggregate (bool, optional) – If False, no aggregation will be done.
  - skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.
- Returns:
  dsets (dict[str, xarray.Dataset]) – A dictionary of xarray datasets, keyed by the catalog entry keys.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_dataset_dict()
>>> dsets.keys()
dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn'])
>>> dsets['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn']
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
  * lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
  * lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
- to_datatree(xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, preprocess=None, storage_options=None, progressbar=None, aggregate=None, skip_on_error=False, levels=None, **kwargs)[source]
Load catalog entries into a tree of xarray datasets.
- Parameters:
  - xarray_open_kwargs (dict) – Keyword arguments to pass to the open_dataset() function.
  - xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to the combine_by_coords() function.
  - preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.
  - storage_options (dict, optional) – Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Services S3.
  - progressbar (bool) – If True, print a progress bar to standard error (stderr) when loading assets into Dataset.
  - aggregate (bool, optional) – If False, no aggregation will be done.
  - skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.
  - levels (list[str], optional) – List of fields to use as the datatree nodes. WARNING: This will overwrite the fields used to create the unique aggregation keys.
- Returns:
  dsets (DataTree) – A tree of xarray Dataset.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_datatree()
>>> dsets['CMIP/BCC.BCC-CSM2-MR/historical/Amon/gn'].ds
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
  * lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
  * lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
- unique()[source]
Return unique values for given columns in the catalog.
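A minimal usage sketch (the catalog path is illustrative); the result pairs each catalog column with its unique values:
>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')  # illustrative catalog path
>>> uniques = cat.unique()
>>> uniques['source_id']  # unique values found in the 'source_id' column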
- property df
Return pandas DataFrame.
- property key_template
Return the string template used to create catalog entry keys.
- Returns:
  str – String template used to create catalog entry keys.
ESM DataSource#
- class intake_esm.source.ESMDataSource(*args, **kwargs)[source]
- __init__(key, records, path_column_name, data_format, format_column_name, *, variable_column_name=None, aggregations=None, requested_variables=None, preprocess=None, storage_options=None, xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, intake_kwargs=None)[source]
An intake compatible Data Source for ESM data.
- Parameters:
  - key (str) – The key of the data source.
  - records (list of dict) – A list of records, each of which is a dictionary mapping column names to values.
  - path_column_name (str) – The column name of the path.
  - data_format (DataFormat) – The data format of the data.
  - variable_column_name (str, optional) – The column name of the variable name.
  - aggregations (list of Aggregation, optional) – A list of aggregations to apply to the data.
  - requested_variables (list of str, optional) – A list of variables to load.
  - preprocess (callable, optional) – A preprocessing function to apply to the data.
  - storage_options (dict, optional) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Services S3.
  - xarray_open_kwargs (dict, optional) – Keyword arguments to pass to the open_dataset() function.
  - xarray_combine_by_coords_kwargs (dict, optional) – Keyword arguments to pass to the combine_by_coords() function.
  - intake_kwargs (dict, optional) – Additional keyword arguments passed through to the DataSource base class.
- close()[source]
Delete open files from memory
- to_dask()[source]
Return xarray object (which will have chunks)
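An ESMDataSource is normally obtained by indexing an esm_datastore rather than constructed directly; a minimal sketch, with an illustrative catalog path and key:
>>> import intake
>>> cat = intake.open_esm_datastore('mycatalog.json')  # illustrative catalog path
>>> source = cat['AerChemMIP.BCC.BCC-ESM1.piClim-control.AERmon.gn']  # an ESMDataSource
>>> ds = source.to_dask()  # dask-backed xarray.Dataset
>>> source.close()  # release the open files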
ESM Catalog#
- class intake_esm.cat.ESMCatalogModel(*, esmcat_version, attributes, assets, aggregation_control=None, id='', catalog_dict=None, catalog_file=None, description=None, title=None, last_updated=None)[source]
Pydantic model for the ESM data catalog defined in https://git.io/JBWoW
- classmethod load(json_file, storage_options=None, read_csv_kwargs=None)[source]
Loads the catalog from a file
- Parameters:
  - json_file (str or pathlib.Path) – The path to the JSON file containing the catalog.
  - storage_options (dict) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Services S3.
  - read_csv_kwargs (dict) – Additional keyword arguments passed through to the read_csv() function.
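A minimal sketch of loading a catalog model directly (the catalog path is illustrative):
>>> from intake_esm.cat import ESMCatalogModel
>>> esmcat = ESMCatalogModel.load('pangeo-cmip6.json')
>>> esmcat.df.head()  # the catalog content as a pandas DataFrame
>>> esmcat.nunique()  # number of unique values per column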
- model_post_init(context, /)
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Args:
  self: The BaseModel instance.
  context: The context.
- nunique()[source]
Return a series of the number of unique values for each column in the catalog.
- save(name, *, directory=None, catalog_type='dict', to_csv_kwargs=None, json_dump_kwargs=None, storage_options=None)[source]
Save the catalog to a file.
- Parameters:
  - name (str) – The name of the file to save the catalog to.
  - directory (str) – The directory or cloud storage bucket to save the catalog to. If None, use the current directory.
  - catalog_type (str) – The type of catalog to save. Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file. Valid options are 'dict' and 'file'.
  - to_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the to_csv() method.
  - json_dump_kwargs (dict, optional) – Additional keyword arguments passed through to the dump() function.
  - storage_options (dict) – fsspec parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Services S3.
Notes
Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog table as a separate CSV file.
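A minimal usage sketch (the catalog path and output name are illustrative):
>>> from intake_esm.cat import ESMCatalogModel
>>> esmcat = ESMCatalogModel.load('pangeo-cmip6.json')  # illustrative catalog path
>>> esmcat.save('my-catalog', catalog_type='file')  # write the JSON catalog with the table in a separate CSV file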
- search(*, query, require_all_on=None)[source]
Search for entries in the catalog.
- Parameters:
  - query (dict, optional) – A dictionary of query parameters to execute against the dataframe.
  - require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.
- Returns:
  catalog (ESMCatalogModel) – A new catalog with the entries satisfying the query criteria.
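A rough sketch, assuming the query dict maps column names to the values to match; the catalog path and column values are illustrative:
>>> from intake_esm.cat import ESMCatalogModel
>>> esmcat = ESMCatalogModel.load('pangeo-cmip6.json')  # illustrative catalog path
>>> subset = esmcat.search(query={'experiment_id': ['historical'], 'variable_id': ['tas']})
>>> subset.df.head()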
- unique()[source]
Return a series of unique values for each column in the catalog.
- property columns_with_iterables
Return a set of columns that have iterables.
- property df
Return the dataframe.
- property has_multiple_variable_assets
Return True if the catalog has multiple variable assets.
- model_config = {'arbitrary_types_allowed': True, 'validate_assignment': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Query Model#
- class intake_esm.cat.QueryModel(*, query, columns, require_all_on=None)[source]
A Pydantic model to represent a query to be executed against a catalog.
- model_config = {'validate_assignment': False}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Derived Variable Registry#
- class intake_esm.derived.DerivedVariableRegistry[source]
Registry of derived variables
- __init__(*args, **kwargs)
- classmethod load(name, package=None)[source]
Load a DerivedVariableRegistry from a Python module/file
- Parameters:
  - name (str) – The name of the module to load the DerivedVariableRegistry from.
  - package (str, optional) – The package to load the module from. This argument is required when performing a relative import. It specifies the package to use as the anchor point from which to resolve the relative import to an absolute import.
- Returns:
  DerivedVariableRegistry – A DerivedVariableRegistry loaded from the Python module.
Notes
If you have a folder /home/foo/pythonfiles and want to load a registry defined in registry.py, located in that directory, make sure to add that folder to the $PYTHONPATH before calling this function.
>>> import sys
>>> sys.path.insert(0, '/home/foo/pythonfiles')
>>> from intake_esm.derived import DerivedVariableRegistry
>>> registry = DerivedVariableRegistry.load('registry')
- search(variable)[source]
Search for a derived variable by name or list of names
- Parameters:
  - variable (typing.Union[str, typing.List[str]]) – The name of the variable to search for.
- Returns:
  DerivedVariableRegistry – A DerivedVariableRegistry with the found variables.
- update_datasets(*, datasets, variable_key_name, skip_on_error=False)[source]
Given a dictionary of datasets, return a dictionary of datasets with the derived variables added.
- Parameters:
  - datasets – A dictionary of datasets to update.
  - variable_key_name – The name of the key under which each dataset's variables are listed.
  - skip_on_error (bool, optional) – If True, skip variables that cannot be derived.
- Returns:
  typing.Dict[str, xr.Dataset] – A dictionary of datasets with the derived variables applied.
- register[source]
Register a derived variable
- Parameters:
  - func (typing.Callable) – The function to apply to the dependent variables.
  - variable (str) – The name of the variable to derive.
  - query (typing.Dict[str, typing.Union[typing.Any, typing.List[typing.Any]]]) – The query to use to retrieve dependent variables required to derive variable.
  - prefer_derived (bool, optional, default=False) – Specify whether to compute this variable on datasets that already contain a variable of the same name. The default (False) is to leave the existing variable.
- Returns:
  typing.Callable – The function that was registered.
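register is typically used as a decorator on the deriving function; a minimal sketch in which the derived variable name, query values, and derivation are illustrative assumptions:
>>> from intake_esm.derived import DerivedVariableRegistry
>>> dvr = DerivedVariableRegistry()
>>> @dvr.register(variable='pr_mm_day', query={'variable_id': ['pr']})
... def pr_to_mm_per_day(ds):
...     # hypothetical derivation: convert precipitation flux (kg m-2 s-1) to mm/day
...     ds['pr_mm_day'] = ds['pr'] * 86400
...     return ds
>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json', registry=dvr)  # illustrative catalog path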
Derived Variable#
- class intake_esm.derived.DerivedVariable(*, func, variable, query, prefer_derived)[source]
- dependent_variables(variable_key_name)[source]
Return a list of dependent variables for a given variable
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Options for dataset attributes#
- class intake_esm.utils.set_options(**kwargs)[source]
Set options for intake_esm in a controlled context.
Currently supported options:
  - attrs_prefix: The prefix to use in the names of attributes constructed from the catalog's columns when returning xarray Datasets. Default: intake_esm_attrs.
  - dataset_key: Name of the global attribute in which to store the dataset's key. Default: intake_esm_dataset_key.
  - vars_key: Name of the global attribute in which to store the list of requested variables when opening a dataset. Default: intake_esm_vars.
Examples
You can use set_options either as a context manager:
>>> import intake
>>> import intake_esm
>>> cat = intake.open_esm_datastore('catalog.json')
>>> with intake_esm.set_options(attrs_prefix='cat'):
...     out = cat.to_dataset_dict()
Or to set global options:
>>> intake_esm.set_options(attrs_prefix='cat', vars_key='cat_vars')