API Reference
This page provides an auto-generated summary of intake-esm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.
ESM Datastore (intake.open_esm_datastore)
- class intake_esm.core.esm_datastore(*args, **kwargs)[source]
An intake plugin for parsing an ESM (Earth System Model) Catalog and loading assets (netCDF files and/or Zarr stores) into xarray datasets. The in-memory representation for the catalog is a Pandas DataFrame.
- Parameters:
obj (str, dict, ESMCatalogModel) – The ESM catalog to use. If a string, this must be a path or URL to an ESM catalog JSON file. If a dict, this must be a dict representation of an ESM catalog with two keys: 'esmcat' and 'df'. The 'esmcat' key must be a dict representation of the ESM catalog, and the 'df' key must be a pandas DataFrame containing the content that would otherwise be in a CSV file.
sep (str, optional) – Delimiter to use when constructing a key for a query, by default '.'.
registry (DerivedVariableRegistry, optional) – Registry of derived variables to use, by default None. If not provided, the default registry is used.
read_kwargs (dict, optional) – Additional keyword arguments passed through to the scan_csv() function if the datastore is saved in CSV format, or to scan_parquet() if the datastore is saved in Parquet format.
read_csv_kwargs (dict, optional) – Deprecated alias for read_kwargs.
columns_with_iterables (list of str, optional) – A list of columns in the CSV file containing iterables. Values in the columns specified here are converted with ast.literal_eval when read_csv() is called (i.e., this is a shortcut to passing converters to read_kwargs).
storage_options (dict, optional) – Parameters passed to the backend file system, such as Google Cloud Storage or Amazon Web Services S3.
intake_kwargs (dict, optional) – Additional keyword arguments passed through to the Catalog base class.
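The columns_with_iterables conversion described above can be illustrated with the standard library alone. This is a minimal sketch of the idea (string cells parsed with ast.literal_eval), not intake-esm's own reader; the CSV content and the column names `path` and `variable` are made up for the example.

```python
import ast
import csv
import io

# Hypothetical CSV content: the "variable" column stores a list as a string.
CSV_TEXT = """path,variable
a.nc,"['tas', 'pr']"
b.nc,"['tasmax']"
"""

columns_with_iterables = ["variable"]

rows = []
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    for column in columns_with_iterables:
        # literal_eval only parses Python literals, so it is safe for untrusted text.
        row[column] = ast.literal_eval(row[column])
    rows.append(row)

print(rows[0]["variable"])
```

Without the conversion step, each cell in the `variable` column would stay a plain string such as "['tas', 'pr']" rather than a list.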
Examples
At import time, this plugin is available in intake’s registry as esm_datastore and can be accessed with intake.open_esm_datastore():
>>> import intake
>>> url = 'https://storage.googleapis.com/cmip6/pangeo-cmip6.json'
>>> cat = intake.open_esm_datastore(url)
>>> cat.df.head()
  activity_id institution_id source_id experiment_id  ... variable_id grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1        ssp370  ...          pr         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1        ssp370  ...        prsn         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1        ssp370  ...         tas         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
3  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmax         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
4  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmin         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
- __getitem__(key)[source]
This method takes a key argument and returns a data source corresponding to the assets (files) that will be aggregated into a single xarray dataset.
- Parameters:
key (str) – key to use for catalog entry lookup
- Returns:
intake_esm.source.ESMDataSource – A data source by name (key)
- Raises:
KeyError – if key is not found.
Examples
>>> cat = intake.open_esm_datastore('mycatalog.json')
>>> data_source = cat['AerChemMIP.BCC.BCC-ESM1.piClim-control.AERmon.gn']
- keys_info()[source]
Get keys for the catalog entries and their metadata
- Returns:
pandas.DataFrame – keys for the catalog entries and their metadata
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('./tests/sample-catalogs/cesm1-lens-netcdf.json')
>>> cat.keys_info()
                component experiment stream
key
ocn.20C.pop.h         ocn        20C  pop.h
ocn.CTRL.pop.h        ocn       CTRL  pop.h
ocn.RCP85.pop.h       ocn      RCP85  pop.h
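The keys in the output above look like the column values of each entry joined with the sep delimiter ('.' by default). The following sketch reproduces that relationship for one row; `make_key` and the choice of grouping columns are hypothetical helpers for illustration, not part of intake-esm.

```python
def make_key(record, groupby_columns, sep="."):
    """Join the values of the grouping columns into a single catalog key."""
    return sep.join(str(record[column]) for column in groupby_columns)


# One row taken from the keys_info() output above.
record = {"component": "ocn", "experiment": "20C", "stream": "pop.h"}
key = make_key(record, ["component", "experiment", "stream"])
print(key)  # ocn.20C.pop.h
```

Because values such as pop.h can themselves contain the delimiter, a key cannot always be split back into its columns unambiguously, which is one reason to choose a different sep for some catalogs.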
- nunique()[source]
Count distinct observations across dataframe columns in the catalog.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat.nunique()
activity_id          10
institution_id       23
source_id            48
experiment_id        29
member_id            86
table_id             19
variable_id         187
grid_label            7
zstore            27437
dcpp_init_year       59
dtype: int64
- search(require_all_on=None, **query)[source]
Search for entries in the catalog.
- Parameters:
require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.
**query – keyword arguments corresponding to user’s query to execute against the dataframe.
- Returns:
cat (esm_datastore) – A new Catalog with a subset of the entries in this Catalog.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat.df.head(3)
  activity_id institution_id source_id  ... grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> sub_cat.df.head(3)
    activity_id institution_id    source_id  ... grid_label                                             zstore dcpp_init_year
260        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i...            NaN
346        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r2i...            NaN
401        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r3i...            NaN
The search method also accepts compiled regular expression objects from re.compile() as patterns.
>>> import re
>>> # Let's search for variables containing "Frac" in their name
>>> pat = re.compile(r'Frac')  # Define a regular expression
>>> sub_cat = cat.search(variable_id=pat)
>>> sub_cat.df.head().variable_id
0     residualFrac
1    landCoverFrac
2    landCoverFrac
3     residualFrac
4    landCoverFrac
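The following sketch shows one way to read the search semantics on plain dictionaries instead of a dataframe. It is an illustration under stated assumptions, not intake-esm's implementation: the `search` and `matches` functions and the sample rows are hypothetical, and require_all_on is interpreted as "keep a group only if it covers every requested value".

```python
import re
from collections import defaultdict


def matches(value, criterion):
    """Return True if one cell value satisfies one query criterion."""
    if isinstance(criterion, re.Pattern):
        return criterion.search(str(value)) is not None
    return value == criterion


def search(rows, require_all_on=None, **query):
    # Normalize scalars to one-element lists so both forms behave alike.
    query = {col: v if isinstance(v, list) else [v] for col, v in query.items()}
    # A row is a hit when every queried column matches at least one criterion.
    hits = [
        row for row in rows
        if all(any(matches(row[col], c) for c in crits) for col, crits in query.items())
    ]
    if require_all_on is None:
        return hits
    group_cols = [require_all_on] if isinstance(require_all_on, str) else list(require_all_on)
    groups = defaultdict(list)
    for row in hits:
        groups[tuple(row[col] for col in group_cols)].append(row)
    kept = []
    for members in groups.values():
        # Keep the group only if each criterion is matched by some row in it.
        if all(
            all(any(matches(r[col], c) for r in members) for c in crits)
            for col, crits in query.items()
        ):
            kept.extend(members)
    return kept


rows = [
    {"source_id": "A", "experiment_id": "historical", "variable_id": "pr"},
    {"source_id": "A", "experiment_id": "ssp585", "variable_id": "pr"},
    {"source_id": "B", "experiment_id": "historical", "variable_id": "pr"},
    {"source_id": "B", "experiment_id": "historical", "variable_id": "landCoverFrac"},
]
wanted = {"experiment_id": ["historical", "ssp585"], "variable_id": "pr"}
any_match = search(rows, **wanted)
strict = search(rows, require_all_on="source_id", **wanted)
frac = search(rows, variable_id=re.compile(r"Frac"))
```

Here `any_match` keeps the three pr rows, while `strict` drops model B because it has no ssp585 entry. `frac` contains only the landCoverFrac row.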
- serialize(name, directory=None, catalog_type='dict', to_csv_kwargs=None, json_dump_kwargs=None, storage_options=None)[source]
Serialize catalog to corresponding json and csv files.
- Parameters:
name (str) – name to use when creating ESM catalog json file and csv catalog.
directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory.
catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file.
to_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the to_csv() method.
json_dump_kwargs (dict, optional) – Additional keyword arguments passed through to the dump() function.
storage_options (dict) – fsspec parameters passed to the backend file system, such as Google Cloud Storage or Amazon Web Services S3.
Notes
Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog as a separate CSV file.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('pangeo-cmip6.json')
>>> cat_subset = cat.search(
...     source_id='BCC-ESM1',
...     grid_label='gn',
...     table_id='Amon',
...     experiment_id='historical',
... )
>>> cat_subset.serialize(name='cmip6_bcc_esm1', catalog_type='file')
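The difference between the two catalog_type values can be sketched with the standard library. This is an illustration only: `save_sketch`, the file names and the JSON layout are assumptions, and the keys catalog_dict and catalog_file are borrowed from the ESMCatalogModel signature rather than from the actual serialization code.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_sketch(name, rows, directory, catalog_type="dict"):
    """Write <name>.json, embedding the table or pointing at <name>.csv."""
    directory = Path(directory)
    meta = {"id": name}
    if catalog_type == "dict":
        # The whole table lives inside the JSON file.
        meta["catalog_dict"] = rows
    elif catalog_type == "file":
        # The table goes to a separate CSV file; the JSON only references it.
        csv_path = directory / f"{name}.csv"
        with csv_path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        meta["catalog_file"] = str(csv_path)
    else:
        raise ValueError(f"unknown catalog_type: {catalog_type!r}")
    json_path = directory / f"{name}.json"
    json_path.write_text(json.dumps(meta, indent=2))
    return json_path


with tempfile.TemporaryDirectory() as tmp:
    rows = [{"source_id": "BCC-ESM1", "path": "a.nc"}]
    as_dict = json.loads(save_sketch("demo_dict", rows, tmp, "dict").read_text())
    as_file = json.loads(save_sketch("demo_file", rows, tmp, "file").read_text())
    csv_exists = Path(as_file["catalog_file"]).exists()
```

With 'dict' the JSON grows with the number of rows, whereas with 'file' it stays small no matter how large the table becomes.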
- to_dask(**kwargs)[source]
Convert result to an xarray dataset.
This is only possible if the search returned exactly one result.
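The "exactly one result" requirement can be pictured as a guard over a dictionary of datasets. The helper below is hypothetical, and the exception type is an assumption made for the sketch.

```python
def only_dataset(datasets):
    """Return the single value of `datasets`, or raise if there is not exactly one."""
    if len(datasets) != 1:
        raise ValueError(f"expected exactly one dataset, found {len(datasets)}")
    (dataset,) = datasets.values()
    return dataset


single = only_dataset({"ocn.20C.pop.h": "dataset placeholder"})
```

In practice this means narrowing the catalog with search() until a single key remains before converting.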
- to_dataset_dict(xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, preprocess=None, storage_options=None, progressbar=None, aggregate=None, skip_on_error=False, threaded=None, **kwargs)[source]
Load catalog entries into a dictionary of xarray datasets.
Column values, dataset keys and requested variables are added as global attributes on the returned datasets. The names of these attributes can be customized with intake_esm.utils.set_options.
- Parameters:
xarray_open_kwargs (dict) – Keyword arguments to pass to the open_dataset() function.
xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to the combine_by_coords() function.
preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.
storage_options (dict, optional) – fsspec parameters passed to the backend file system, such as Google Cloud Storage or Amazon Web Services S3.
progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.
aggregate (bool, optional) – If False, no aggregation will be done.
skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.
threaded (bool, optional) – If True, use dask.compute() to load datasets in parallel. If False, load datasets sequentially. If None, the environment variable ITK_ESM_THREADING will be used to determine the threading behavior, defaulting to True if the variable is not set. If a value is provided, it overrides the default determined from the environment variable.
- Returns:
dsets (dict) – A dictionary of xarray Dataset objects, keyed by catalog entry key.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_dataset_dict()
>>> dsets.keys()
dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn'])
>>> dsets['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn']
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
  * lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
  * lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
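The precedence described for the threaded parameter (explicit argument first, then the ITK_ESM_THREADING environment variable, then True) can be sketched as below. `resolve_threaded` is a hypothetical helper, and how the environment string is parsed is an assumption: here only a case-insensitive "true" enables threading.

```python
import os


def resolve_threaded(threaded=None, environ=None):
    """Pick the threading behavior: explicit value, else env var, else True."""
    environ = os.environ if environ is None else environ
    if threaded is not None:
        # An explicit argument always wins over the environment.
        return bool(threaded)
    raw = environ.get("ITK_ESM_THREADING", "true")
    return raw.strip().lower() == "true"


default = resolve_threaded(None, environ={})
from_env = resolve_threaded(None, environ={"ITK_ESM_THREADING": "False"})
overridden = resolve_threaded(True, environ={"ITK_ESM_THREADING": "False"})
```

Passing the environment as a parameter keeps the sketch testable without touching the real process environment.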
- to_datatree(xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, preprocess=None, storage_options=None, progressbar=None, aggregate=None, skip_on_error=False, levels=None, **kwargs)[source]
Load catalog entries into a tree of xarray datasets.
- Parameters:
xarray_open_kwargs (dict) – Keyword arguments to pass to the open_dataset() function.
xarray_combine_by_coords_kwargs (dict) – Keyword arguments to pass to the combine_by_coords() function.
preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.
storage_options (dict, optional) – Parameters passed to the backend file system, such as Google Cloud Storage or Amazon Web Services S3.
progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.
aggregate (bool, optional) – If False, no aggregation will be done.
skip_on_error (bool, optional) – If True, skip datasets that cannot be loaded and/or variables we are unable to derive.
levels (list[str], optional) – List of fields to use as the datatree nodes. WARNING: This will overwrite the fields used to create the unique aggregation keys.
- Returns:
dsets (DataTree) – A tree of xarray Dataset.
Examples
>>> import intake
>>> cat = intake.open_esm_datastore('glade-cmip6.json')
>>> sub_cat = cat.search(
...     source_id=['BCC-CSM2-MR', 'CNRM-CM6-1', 'CNRM-ESM2-1'],
...     experiment_id=['historical', 'ssp585'],
...     variable_id='pr',
...     table_id='Amon',
...     grid_label='gn',
... )
>>> dsets = sub_cat.to_datatree()
>>> dsets['CMIP/BCC.BCC-CSM2-MR/historical/Amon/gn'].ds
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
  * lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
  * lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
- unique()[source]
Return unique values for given columns in the catalog.
- property df
Return the pandas DataFrame.
- property interactive
Use itables to display the catalog in an interactive table. Ideally, polars is used for performance, with a fallback to pandas otherwise.
Columns with iterables have to be exploded; otherwise, JavaScript stringification can cause ellipses to be rendered directly into the interactive table, losing actual data and inserting junk.
- property key_template
Return string template used to create catalog entry keys
- Returns:
str – string template used to create catalog entry keys
ESM DataSource
- class intake_esm.source.ESMDataSource(*args, **kwargs)[source]
- __init__(key, records, path_column_name, data_format, format_column_name, *, variable_column_name=None, aggregations=None, requested_variables=None, preprocess=None, storage_options=None, xarray_open_kwargs=None, xarray_combine_by_coords_kwargs=None, intake_kwargs=None, threaded)[source]
An intake compatible Data Source for ESM data.
- Parameters:
key (str) – The key of the data source.
records (list of dict) – A list of records, each of which is a dictionary mapping column names to values.
path_column_name (str) – The column name of the path.
data_format (DataFormat) – The data format of the data.
variable_column_name (str, optional) – The column name of the variable name.
aggregations (list of Aggregation, optional) – A list of aggregations to apply to the data.
requested_variables (list of str, optional) – A list of variables to load.
preprocess (callable, optional) – A preprocessing function to apply to the data.
storage_options (dict, optional) – fsspec parameters passed to the backend file system, such as Google Cloud Storage or Amazon Web Services S3.
xarray_open_kwargs (dict, optional) – Keyword arguments to pass to the open_dataset() function.
xarray_combine_by_coords_kwargs (dict, optional) – Keyword arguments to pass to the combine_by_coords() function.
intake_kwargs (dict, optional) – Additional keyword arguments passed through to the DataSource base class.
threaded (bool, optional) – If True, use dask.compute to load datasets in parallel. If False, load datasets sequentially. If None, the environment variable ITK_ESM_THREADING will be used to determine the threading behavior, defaulting to True if the variable is not set.
- close()[source]
Delete open files from memory
- to_dask()[source]
Return xarray object (which will have chunks)
ESM Catalog
- class intake_esm.cat.ESMCatalogModel(*, esmcat_version, attributes, assets, aggregation_control=None, id='', catalog_dict=None, catalog_file=None, description=None, title=None, last_updated=None)[source]
Pydantic model for the ESM data catalog defined in https://git.io/JBWoW
- classmethod load(json_file, storage_options=None, read_kwargs=None)[source]
Loads the catalog from a file
- Parameters:
json_file (str or pathlib.Path) – The path to the json file containing the catalog.
storage_options (dict) – fsspec parameters passed to the backend file system, such as Google Cloud Storage or Amazon Web Services S3.
read_kwargs (dict, optional) – Additional keyword arguments passed through to the read_csv() function if the datastore is saved in CSV format, or to read_parquet() if the datastore is saved in Parquet format.
- model_post_init(context, /)
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Args:
self: The BaseModel instance.
context: The context.
- nunique()[source]
Return a series of the number of unique values for each column in the catalog.
- save(name, *, directory=None, catalog_type='dict', to_csv_kwargs=None, json_dump_kwargs=None, storage_options=None)[source]
Save the catalog to a file.
- Parameters:
name (str) – The name of the file to save the catalog to.
directory (str) – The directory or cloud storage bucket to save the catalog to. If None, use the current directory.
catalog_type (str) – The type of catalog to save. Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file. Valid options are 'dict' and 'file'.
to_csv_kwargs (dict, optional) – Additional keyword arguments passed through to the to_csv() method.
json_dump_kwargs (dict, optional) – Additional keyword arguments passed through to the dump() function.
storage_options (dict) – fsspec parameters passed to the backend file system, such as Google Cloud Storage or Amazon Web Services S3.
Notes
Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog as a separate CSV file.
- search(*, query, require_all_on=None)[source]
Search for entries in the catalog.
- Parameters:
query (dict, optional) – A dictionary of query parameters to execute against the dataframe.
require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.
- Returns:
catalog (ESMCatalogModel) – A new catalog with the entries satisfying the query criteria.
- unique()[source]
Return a series of unique values for each column in the catalog.
- property columns_with_iterables
Return a set of columns that have iterables.
- property df
Return the pd.DataFrame containing the catalog, creating it if necessary
- property has_multiple_variable_assets
Return True if the catalog has multiple variable assets.
- property lf
Return a pl.LazyFrame containing the catalog, creating it if necessary
- model_config = {'arbitrary_types_allowed': True, 'validate_assignment': True}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- property pl_df
Return a pl.DataFrame containing the catalog, creating it if necessary
Query Model
- class intake_esm.cat.QueryModel(*, query, columns, require_all_on=None)[source]
A Pydantic model to represent a query to be executed against a catalog.
- model_config = {'validate_assignment': False}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
Derived Variable Registry
- class intake_esm.derived.DerivedVariableRegistry[source]
Registry of derived variables
- __init__(*args, **kwargs)
- classmethod load(name, package=None)[source]
Load a DerivedVariableRegistry from a Python module/file
- Parameters:
name (str) – The name of the module to load the DerivedVariableRegistry from.
package (str, optional) – The package to load the module from. This argument is required when performing a relative import. It specifies the package to use as the anchor point from which to resolve the relative import to an absolute import.
- Returns:
DerivedVariableRegistry – A DerivedVariableRegistry loaded from the Python module.
Notes
If you have a folder, /home/foo/pythonfiles, and you want to load a registry defined in registry.py located in that directory, make sure to add the folder to the Python path (for example via sys.path or $PYTHONPATH) before calling this function.
>>> import sys
>>> sys.path.insert(0, '/home/foo/pythonfiles')
>>> from intake_esm.derived import DerivedVariableRegistry
>>> registry = DerivedVariableRegistry.load('registry')
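The name and package parameters read like those of importlib.import_module, so the loading step can be sketched on that basis. This is an assumption about the mechanism, not a copy of the library's code: `load_registry` is a hypothetical helper, the module name is invented, and a plain dict stands in for the registry type.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_registry(name, package=None, registry_type=dict):
    """Import a module and return its first public attribute of registry_type."""
    module = importlib.import_module(name, package=package)
    for attr_name, value in vars(module).items():
        # Skip private and dunder names such as __builtins__ (itself a dict).
        if not attr_name.startswith("_") and isinstance(value, registry_type):
            return value
    raise ValueError(f"no registry found in module {name!r}")


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical module file standing in for registry.py.
    Path(folder, "demo_registry_module.py").write_text(
        "registry = {'total_precip': 'placeholder'}\n"
    )
    sys.path.insert(0, folder)
    importlib.invalidate_caches()  # make the freshly written module discoverable
    try:
        loaded = load_registry("demo_registry_module")
    finally:
        sys.path.remove(folder)
```

The import only succeeds because the folder was placed on sys.path first, which is the point of the note above.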
- search(variable)[source]
Search for a derived variable by name or list of names
- Parameters:
variable (typing.Union[str, typing.List[str]]) – The name of the variable to search for.
- Returns:
DerivedVariableRegistry – A DerivedVariableRegistry with the found variables.
- update_datasets(*, datasets, variable_key_name, skip_on_error=False)[source]
Given a dictionary of datasets, return a dictionary of datasets with the derived variables
- Parameters:
- Returns:
typing.Dict[str, xr.Dataset] – A dictionary of datasets with the derived variables applied.
- register[source]
Register a derived variable
- Parameters:
func (typing.Callable) – The function to apply to the dependent variables.
variable (str) – The name of the variable to derive.
query (typing.Dict[str, typing.Union[typing.Any, typing.List[typing.Any]]]) – The query to use to retrieve dependent variables required to derive variable.
prefer_derived (bool, optional (default=False)) – Specify whether to compute this variable on datasets that already contain a variable of the same name. Default (False) is to leave the existing variable.
- Returns:
typing.Callable – The function that was registered.
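The interplay of register, query and prefer_derived can be sketched with a toy registry that operates on plain dictionaries instead of xarray datasets. `MiniRegistry` is a hypothetical stand-in, not intake_esm.derived.DerivedVariableRegistry, and the assumption that the query lists dependencies under a "variable" key is made for the example only.

```python
class MiniRegistry:
    """Toy registry: maps a derived variable name to (func, query, prefer_derived)."""

    def __init__(self):
        self.entries = {}

    def register(self, *, variable, query, prefer_derived=False):
        def decorator(func):
            self.entries[variable] = (func, query, prefer_derived)
            return func  # hand the registered function back unchanged

        return decorator

    def update_datasets(self, datasets):
        for ds in datasets.values():
            for variable, (func, query, prefer_derived) in self.entries.items():
                if variable in ds and not prefer_derived:
                    continue  # leave the existing variable alone
                if all(dep in ds for dep in query["variable"]):
                    ds[variable] = func(ds)
        return datasets


registry = MiniRegistry()


@registry.register(variable="total", query={"variable": ["a", "b"]})
def total(ds):
    return ds["a"] + ds["b"]


datasets = registry.update_datasets(
    {
        "fresh": {"a": 1, "b": 2},
        "existing": {"a": 1, "b": 2, "total": 99},
    }
)
```

With the default prefer_derived=False, the "existing" dataset keeps its stored value of 99, while "fresh" gains a computed total.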
Derived Variable
- class intake_esm.derived.DerivedVariable(*, func, variable, query, prefer_derived)[source]
- dependent_variables(variable_key_name)[source]
Return a list of dependent variables for a given variable
- model_config = {}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
Frames Model
- class intake_esm.cat.FramesModel(*, df=None, pl_df=None, lf=None)[source]
A Pydantic model to represent our collection of dataframes - pandas, polars, and lazyframe.
- ensure_some()[source]
Make sure that at least one of the dataframes is not None when the model is instantiated.
- nunique()[source]
Return a series of the number of unique values for each column in the catalog.
- property columns_with_iterables
Return a set of columns that have iterables, preferentially using self.lazy > self.polars > self.pandas to minimise overhead.
- property lazy
Return the polars LazyFrame, instantiating it if necessary.
- model_config = {'arbitrary_types_allowed': True, 'validate_assignment': True}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- property pandas
Return the pandas DataFrame, instantiating it if necessary.
- property polars
Return the polars DataFrame, instantiating it if necessary.
Options for dataset attributes
- class intake_esm.utils.set_options(**kwargs)[source]
Set options for intake_esm in a controlled context.
Currently-supported options:
attrs_prefix: The prefix to use in the names of attributes constructed from the catalog’s columns when returning xarray Datasets. Default: intake_esm_attrs.
dataset_key: Name of the global attribute where to store the dataset’s key. Default: intake_esm_dataset_key.
vars_key: Name of the global attribute where to store the list of requested variables when opening a dataset. Default: intake_esm_vars.
Examples
You can use set_options either as a context manager:
>>> import intake
>>> import intake_esm
>>> cat = intake.open_esm_datastore('catalog.json')
>>> with intake_esm.set_options(attrs_prefix='cat'):
...     out = cat.to_dataset_dict()
Or to set global options:
>>> intake_esm.set_options(attrs_prefix='cat', vars_key='cat_vars')
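A class can serve both as a context manager and as a global setter by applying the options in its constructor and restoring the previous values on exit. The sketch below shows that general pattern; `set_options_sketch` and the module-level `OPTIONS` dict are hypothetical, with defaults copied from the option list above.

```python
OPTIONS = {
    "attrs_prefix": "intake_esm_attrs",
    "dataset_key": "intake_esm_dataset_key",
    "vars_key": "intake_esm_vars",
}


class set_options_sketch:
    """Apply options immediately; restore the old values when used with `with`."""

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(OPTIONS)
        if unknown:
            raise ValueError(f"unknown option(s): {sorted(unknown)}")
        self._old = {name: OPTIONS[name] for name in kwargs}
        OPTIONS.update(kwargs)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        OPTIONS.update(self._old)


with set_options_sketch(attrs_prefix="cat"):
    inside = OPTIONS["attrs_prefix"]
after = OPTIONS["attrs_prefix"]

# Called without `with`, the change simply stays in effect.
set_options_sketch(vars_key="cat_vars")
```

Inside the block the prefix is "cat", and it reverts to the default afterwards. The final call leaves vars_key changed for the rest of the session.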