API Reference¶
This page provides an auto-generated summary of intake-esm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.
ESM Datastore (intake.open_esm_datastore)¶
class intake_esm.core.esm_datastore(*args, **kwargs)[source]¶

An intake plugin for parsing an ESM (Earth System Model) collection/catalog and loading assets (netCDF files and/or Zarr stores) into xarray datasets. The in-memory representation of the catalog is a pandas DataFrame.

- Parameters
  - esmcol_obj (str, pandas.DataFrame) – If a string, this must be a path or URL to an ESM collection JSON file. If a pandas.DataFrame, this must be the catalog content that would otherwise be in a CSV file.
  - esmcol_data (dict, optional) – ESM collection spec information, by default None.
  - progressbar (bool, optional) – If True, print a progress bar to standard error (stderr) when loading assets into Dataset, by default True.
  - sep (str, optional) – Delimiter to use when constructing a key for a query, by default '.'.
  - csv_kwargs (dict, optional) – Additional keyword arguments passed through to the read_csv() function.
  - **kwargs – Additional keyword arguments passed through to the Catalog base class.
Examples
At import time, this plugin is available in intake’s registry as esm_datastore and can be accessed with intake.open_esm_datastore():
>>> import intake
>>> url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
>>> col = intake.open_esm_datastore(url)
>>> col.df.head()
  activity_id institution_id source_id experiment_id ... variable_id grid_label                                             zstore  dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1        ssp370 ...          pr         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
1  AerChemMIP            BCC  BCC-ESM1        ssp370 ...        prsn         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
2  AerChemMIP            BCC  BCC-ESM1        ssp370 ...         tas         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
3  AerChemMIP            BCC  BCC-ESM1        ssp370 ...      tasmax         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
4  AerChemMIP            BCC  BCC-ESM1        ssp370 ...      tasmin         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
classmethod from_df(df, esmcol_data=None, progressbar=True, sep='.', **kwargs)[source]¶

Create a catalog from the given DataFrame.

- Parameters
  - df (pandas.DataFrame) – Catalog content that would otherwise be in a CSV file.
  - esmcol_data (dict, optional) – ESM collection spec information, by default None.
  - progressbar (bool, optional) – If True, print a progress bar to standard error (stderr) when loading assets into Dataset, by default True.
  - sep (str, optional) – Delimiter to use when constructing a key for a query, by default '.'.
- Returns
  - esm_datastore – Catalog object.
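As a sketch of the input from_df expects, the following builds a minimal catalog DataFrame with pandas. The column names mirror the CMIP6 examples on this page but are illustrative, and the asset paths are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical minimal catalog content: each row describes one data asset.
df = pd.DataFrame(
    {
        "source_id": ["BCC-ESM1", "BCC-ESM1"],
        "variable_id": ["pr", "tas"],
        "zstore": ["gs://example-bucket/pr/", "gs://example-bucket/tas/"],
    }
)

# In practice this DataFrame would be handed to
# esm_datastore.from_df(df, esmcol_data=..., sep=".")
print(df.shape)
```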
nunique()[source]¶

Count distinct observations across DataFrame columns in the catalog.
Examples
>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.nunique()
activity_id          10
institution_id       23
source_id            48
experiment_id        29
member_id            86
table_id             19
variable_id         187
grid_label            7
zstore            27437
dcpp_init_year       59
dtype: int64
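Conceptually, nunique() corresponds to pandas' DataFrame.nunique() applied to the catalog's underlying DataFrame (col.df). A self-contained illustration with a toy stand-in DataFrame:

```python
import pandas as pd

# Toy stand-in for a catalog's underlying DataFrame (col.df).
df = pd.DataFrame(
    {
        "activity_id": ["CMIP", "CMIP", "ScenarioMIP"],
        "source_id": ["BCC-ESM1", "MIROC6", "BCC-ESM1"],
    }
)

# Count distinct values per column, as nunique() reports for a catalog.
counts = df.nunique()
print(counts["activity_id"])  # 2
print(counts["source_id"])    # 2
```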
search(require_all_on=None, **query)[source]¶

Search for entries in the catalog.

- Parameters
  - require_all_on (list, str, optional) – A DataFrame column or a list of DataFrame columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.
  - **query – Keyword arguments corresponding to the user's query to execute against the DataFrame.
- Returns
  - cat (esm_datastore) – A new catalog with a subset of the entries in this catalog.
Examples
>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.df.head(3)
  activity_id institution_id source_id ... grid_label                                             zstore  dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1 ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
1  AerChemMIP            BCC  BCC-ESM1 ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
2  AerChemMIP            BCC  BCC-ESM1 ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...             NaN
>>> cat = col.search(
...     source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
...     grid_label="gn",
... )
>>> cat.df.head(3)
    activity_id institution_id    source_id ... grid_label                                             zstore  dcpp_init_year
260        CMIP            BCC  BCC-CSM2-MR ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i...             NaN
346        CMIP            BCC  BCC-CSM2-MR ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r2i...             NaN
401        CMIP            BCC  BCC-CSM2-MR ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r3i...             NaN
The search method also accepts compiled regular expression objects from re.compile() as patterns.

>>> import re
>>> # Let's search for variables containing "Frac" in their name
>>> pat = re.compile(r"Frac")  # Define a regular expression
>>> cat = col.search(variable_id=pat)
>>> cat.df.head().variable_id
0     residualFrac
1    landCoverFrac
2    landCoverFrac
3     residualFrac
4    landCoverFrac
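At the DataFrame level, a regex query like the one above behaves like pandas' str.contains filtering. The following toy sketch illustrates the idea; it is not intake-esm's actual implementation:

```python
import re

import pandas as pd

# Toy stand-in for a catalog's variable_id column.
df = pd.DataFrame({"variable_id": ["pr", "landCoverFrac", "residualFrac", "tas"]})

# A compiled pattern matches anywhere in the value,
# as in col.search(variable_id=pat).
pat = re.compile(r"Frac")
subset = df[df["variable_id"].str.contains(pat)]
print(sorted(subset["variable_id"]))  # ['landCoverFrac', 'residualFrac']
```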
serialize(name, directory=None, catalog_type='dict')[source]¶

Serialize the collection/catalog to corresponding JSON and CSV files.

- Parameters
  - name (str) – Name to use when creating the ESM collection JSON file and CSV catalog.
  - directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory.
  - catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file ('dict') or as a separate CSV file ('file').
Notes
Large catalogs can result in large JSON files. To keep the JSON file size manageable, call serialize with catalog_type='file' to save the catalog as a separate CSV file.
Examples
>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col_subset = col.search(
...     source_id="BCC-ESM1",
...     grid_label="gn",
...     table_id="Amon",
...     experiment_id="historical",
... )
>>> col_subset.serialize(name="cmip6_bcc_esm1", catalog_type="file")
Writing csv catalog to: cmip6_bcc_esm1.csv.gz
Writing ESM collection json file to: cmip6_bcc_esm1.json
to_dataset_dict(zarr_kwargs=None, cdf_kwargs=None, preprocess=None, storage_options=None, progressbar=None, aggregate=None)[source]¶

Load catalog entries into a dictionary of xarray datasets.

- Parameters
  - zarr_kwargs (dict) – Keyword arguments to pass to the open_zarr() function.
  - cdf_kwargs (dict) – Keyword arguments to pass to the open_dataset() function. If specifying chunks, the chunking is applied to each netCDF file. Therefore, chunks must refer to dimensions that are present in each netCDF file, or chunking will fail.
  - preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.
  - storage_options (dict, optional) – Parameters passed to the backend file system, such as Google Cloud Storage or Amazon Web Services S3.
  - progressbar (bool) – If True, print a progress bar to standard error (stderr) when loading assets into Dataset.
  - aggregate (bool, optional) – If False, no aggregation will be done.
- Returns
  - dsets (dict) – A dictionary of xarray Dataset objects, keyed by catalog entry.
Examples
>>> import intake
>>> col = intake.open_esm_datastore("glade-cmip6.json")
>>> cat = col.search(
...     source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
...     grid_label="gn",
... )
>>> dsets = cat.to_dataset_dict()
>>> dsets.keys()
dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn'])
>>> dsets["CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn"]
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
  * lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
  * lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
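The dictionary keys above come from grouping compatible catalog rows and joining the grouping values with the catalog's separator. The following pandas-only sketch illustrates that grouping idea with toy data (the column names and values are illustrative, not from a real catalog):

```python
import pandas as pd

# Toy catalog rows; the grouping columns stand in for groupby_attrs.
df = pd.DataFrame(
    {
        "activity_id": ["CMIP", "CMIP", "ScenarioMIP"],
        "source_id": ["BCC-CSM2-MR", "BCC-CSM2-MR", "BCC-CSM2-MR"],
        "experiment_id": ["historical", "historical", "ssp585"],
        "zstore": ["asset-a", "asset-b", "asset-c"],
    }
)

groupby_attrs = ["activity_id", "source_id", "experiment_id"]
sep = "."

# Each group of compatible rows becomes one dataset; its key joins the
# grouping values with `sep`, e.g. 'CMIP.BCC-CSM2-MR.historical'.
keys = {
    sep.join(name): group["zstore"].tolist()
    for name, group in df.groupby(groupby_attrs)
}
print(sorted(keys))
```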
unique(columns=None)[source]¶

Return unique values for given columns in the catalog.

- Parameters
  - columns (str, list) – Name of the column(s) for which to get unique values.
- Returns
  - info (dict) – Dictionary containing the count and unique values.
Examples
>>> import intake
>>> import pprint
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> uniques = col.unique(columns=["activity_id", "source_id"])
>>> pprint.pprint(uniques)
{'activity_id': {'count': 10,
                 'values': ['AerChemMIP', 'C4MIP', 'CMIP', 'DAMIP', 'DCPP',
                            'HighResMIP', 'LUMIP', 'OMIP', 'PMIP',
                            'ScenarioMIP']},
 'source_id': {'count': 17,
               'values': ['BCC-ESM1', 'CNRM-ESM2-1', 'E3SM-1-0', 'MIROC6',
                          'HadGEM3-GC31-LL', 'MRI-ESM2-0', 'GISS-E2-1-G-CC',
                          'CESM2-WACCM', 'NorCPM1', 'GFDL-AM4', 'GFDL-CM4',
                          'NESM3', 'ECMWF-IFS-LR', 'IPSL-CM6A-ATM-HR',
                          'NICAM16-7S', 'GFDL-CM4C192', 'MPI-ESM1-2-HR']}}
update_aggregation(attribute_name, agg_type=None, options=None, delete=False)[source]¶

Update aggregation operations info.

- Parameters
  - attribute_name (str) – Name of the attribute (column) across which to aggregate.
  - agg_type (str, optional) – Type of aggregation operation to apply. Valid values include: join_new, join_existing, union. By default None.
  - options (dict, optional) – Aggregation settings that are passed as keyword arguments to concat() or merge(). For join_existing, it must contain the name of the existing dimension to use (for example: {'dim': 'time'}). By default None.
  - delete (bool, optional) – Whether to delete/remove/disable aggregation operations for a particular attribute, by default False.
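To make the aggregation-spec terminology concrete, the sketch below shows a hypothetical list of aggregation entries of the kind update_aggregation edits, and a manual equivalent of updating the options for one attribute. The dict contents are illustrative, not taken from a real catalog:

```python
# Hypothetical aggregation entries for an ESM collection; each entry names
# an attribute (column) and how assets should be combined across it.
aggregations = [
    {"type": "union", "attribute_name": "variable_id"},
    {
        "type": "join_existing",
        "attribute_name": "time_range",
        "options": {"dim": "time"},  # existing dimension to join along
    },
]

# update_aggregation(attribute_name, agg_type=..., options=...) edits the
# entry whose attribute_name matches; a manual equivalent of changing the
# options for "time_range":
for agg in aggregations:
    if agg["attribute_name"] == "time_range":
        agg["options"] = {"dim": "time", "coords": "minimal"}

print(aggregations[1]["options"])
```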
property agg_columns¶

List of columns used to merge/concatenate multiple compatible Dataset objects into a single Dataset.
property data_format¶

The data format. Valid values are netcdf and zarr. If specified, all data assets in the catalog use the same data format.
property format_column_name¶

Name of the column which contains the data format.
property groupby_attrs¶

DataFrame columns used to determine groups of compatible datasets.

- Returns
  - list – Columns used to determine groups of compatible datasets.
property key_template¶

Return the string template used to create catalog entry keys.

- Returns
  - str – String template used to create catalog entry keys.
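As a hypothetical illustration of the idea: a key template can be thought of as the groupby columns joined by the catalog's separator, with entry keys produced by filling in a row's values. The attribute names and values below are illustrative:

```python
# Illustrative grouping columns and separator (see sep in the constructor).
groupby_attrs = ["activity_id", "institution_id", "source_id"]
sep = "."

# A template joining the grouping attributes with the separator.
key_template = sep.join("{" + attr + "}" for attr in groupby_attrs)
print(key_template)  # {activity_id}.{institution_id}.{source_id}

# Filling in one row's values yields a catalog entry key.
key = key_template.format(
    activity_id="CMIP", institution_id="BCC", source_id="BCC-ESM1"
)
print(key)  # CMIP.BCC.BCC-ESM1
```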
property path_column_name¶

The name of the column containing the path to the asset.
property variable_column_name¶

Name of the column that contains the variable name.