This page provides an auto-generated summary of intake-esm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.
intake.open_esm_datastore
intake_esm.core.esm_datastore
An intake plugin for parsing an ESM (Earth System Model) Collection/catalog and loading assets (netCDF files and/or Zarr stores) into xarray datasets. The in-memory representation for the catalog is a Pandas DataFrame.
esmcol_obj (str, pandas.DataFrame) – If string, this must be a path or URL to an ESM collection JSON file. If pandas.DataFrame, this must be the catalog content that would otherwise be in a CSV file.
esmcol_data (dict, optional) – ESM collection spec information, by default None
progressbar (bool, optional) – Will print a progress bar to standard error (stderr) when loading assets into Dataset, by default True
sep (str, optional) – Delimiter to use when constructing a key for a query, by default ‘.’
csv_kwargs (dict, optional) – Additional keyword arguments passed through to the read_csv() function.
**kwargs – Additional keyword arguments are passed through to the Catalog base class.
Examples
At import time, this plugin is available in intake’s registry as esm_datastore and can be accessed with intake.open_esm_datastore():
>>> import intake
>>> url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
>>> col = intake.open_esm_datastore(url)
>>> col.df.head()
  activity_id institution_id source_id experiment_id ... variable_id grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1        ssp370 ...          pr         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1        ssp370 ...        prsn         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1        ssp370 ...         tas         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
3  AerChemMIP            BCC  BCC-ESM1        ssp370 ...      tasmax         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
4  AerChemMIP            BCC  BCC-ESM1        ssp370 ...      tasmin         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
from_df
Create a catalog from the given DataFrame.
df (pandas.DataFrame) – catalog content that would otherwise be in a CSV file.
esm_datastore – Catalog object
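As a minimal sketch (the column names and storage paths below are purely illustrative, not part of any real collection), a catalog can be built directly from an in-memory DataFrame:

>>> import pandas as pd
>>> from intake_esm.core import esm_datastore
>>> # Hypothetical two-asset catalog; "variable" and "path" are made-up columns
>>> df = pd.DataFrame(
...     {
...         "variable": ["tas", "pr"],
...         "path": ["s3://bucket/tas.zarr", "s3://bucket/pr.zarr"],
...     }
... )
>>> col = esm_datastore.from_df(df)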
keys
Get keys for the catalog entries
list – keys for the catalog entries
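A sketch of retrieving entry keys (the exact key strings depend on the collection's aggregation settings and are not shown authoritatively here):

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> keys = col.keys()
>>> keys[:1]  # e.g. something like ['AerChemMIP.BCC.BCC-ESM1.ssp370.Amon.gn']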
nunique
Count distinct observations across dataframe columns in the catalog.
>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.nunique()
activity_id          10
institution_id       23
source_id            48
experiment_id        29
member_id            86
table_id             19
variable_id         187
grid_label            7
zstore            27437
dcpp_init_year       59
dtype: int64
search
Search for entries in the catalog.
require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.
**query – keyword arguments corresponding to user’s query to execute against the dataframe.
cat (esm_datastore) – A new Catalog with a subset of the entries in this Catalog.
>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.df.head(3)
  activity_id institution_id source_id ... grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1 ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1 ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1 ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
>>> cat = col.search(
...     source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
...     grid_label="gn",
... )
>>> cat.df.head(3)
    activity_id institution_id    source_id ... grid_label                                             zstore dcpp_init_year
260        CMIP            BCC  BCC-CSM2-MR ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i...            NaN
346        CMIP            BCC  BCC-CSM2-MR ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r2i...            NaN
401        CMIP            BCC  BCC-CSM2-MR ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r3i...            NaN
The search method also accepts compiled regular expression objects from re.compile() as patterns.

>>> import re
>>> # Let's search for variables containing "Frac" in their name
>>> pat = re.compile(r"Frac")  # Define a regular expression
>>> cat = col.search(variable_id=pat)
>>> cat.df.head().variable_id
0     residualFrac
1    landCoverFrac
2    landCoverFrac
3     residualFrac
4    landCoverFrac
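The require_all_on parameter described above can be combined with these queries; a hedged sketch (the particular columns and values are only illustrative) that keeps only source_id values providing every requested experiment:

>>> cat = col.search(
...     require_all_on=["source_id"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
... )

Here, a source_id that has a historical run of pr but no ssp585 run would be excluded entirely.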
serialize
Serialize collection/catalog to corresponding json and csv files.
name (str) – name to use when creating ESM collection json file and csv catalog.
directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory.
catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file.
Notes
Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type='file' to save the catalog as a separate CSV file.
>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col_subset = col.search(
...     source_id="BCC-ESM1",
...     grid_label="gn",
...     table_id="Amon",
...     experiment_id="historical",
... )
>>> col_subset.serialize(name="cmip6_bcc_esm1", catalog_type="file")
Writing csv catalog to: cmip6_bcc_esm1.csv.gz
Writing ESM collection json file to: cmip6_bcc_esm1.json
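The serialized collection can then be reopened like any other ESM collection JSON file (assuming the files written above are in the current directory):

>>> col_new = intake.open_esm_datastore("cmip6_bcc_esm1.json")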
to_dataset_dict
Load catalog entries into a dictionary of xarray datasets.
zarr_kwargs (dict) – Keyword arguments to pass to open_zarr() function
cdf_kwargs (dict) – Keyword arguments to pass to open_dataset() function. If specifying chunks, the chunking is applied to each netCDF file. Therefore, chunks must refer to dimensions that are present in each netCDF file, or chunking will fail.
preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.
storage_options (dict, optional) – Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.
progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.
aggregate (bool, optional) – If False, no aggregation will be done.
dsets (dict) – A dictionary of xarray Dataset.
>>> import intake
>>> col = intake.open_esm_datastore("glade-cmip6.json")
>>> cat = col.search(
...     source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
...     grid_label="gn",
... )
>>> dsets = cat.to_dataset_dict()
>>> dsets.keys()
dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn'])
>>> dsets["CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn"]
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
  * lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
  * lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
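As a sketch of the preprocess hook described above (the dimension name "lev" is purely illustrative and may not exist in these datasets), each dataset can be trimmed before aggregation:

>>> def pick_surface(ds):
...     # Hypothetical: keep only the first vertical level, if one is present
...     if "lev" in ds.dims:
...         ds = ds.isel(lev=0)
...     return ds
>>> dsets = cat.to_dataset_dict(preprocess=pick_surface)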
unique
Return unique values for given columns in the catalog.
columns (str, list) – name of columns for which to get unique values
info (dict) – dictionary containing count, and unique values
>>> import intake
>>> import pprint
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> uniques = col.unique(columns=["activity_id", "source_id"])
>>> pprint.pprint(uniques)
{'activity_id': {'count': 10,
                 'values': ['AerChemMIP', 'C4MIP', 'CMIP', 'DAMIP', 'DCPP',
                            'HighResMIP', 'LUMIP', 'OMIP', 'PMIP',
                            'ScenarioMIP']},
 'source_id': {'count': 17,
               'values': ['BCC-ESM1', 'CNRM-ESM2-1', 'E3SM-1-0', 'MIROC6',
                          'HadGEM3-GC31-LL', 'MRI-ESM2-0', 'GISS-E2-1-G-CC',
                          'CESM2-WACCM', 'NorCPM1', 'GFDL-AM4', 'GFDL-CM4',
                          'NESM3', 'ECMWF-IFS-LR', 'IPSL-CM6A-ATM-HR',
                          'NICAM16-7S', 'GFDL-CM4C192', 'MPI-ESM1-2-HR']}}
update_aggregation
Update aggregation operations info.
attribute_name (str) – Name of attribute (column) across which to aggregate.
agg_type (str, optional) – Type of aggregation operation to apply. Valid values include: join_new, join_existing, union, by default None
options (dict, optional) – Aggregation settings that are passed as keyword arguments to concat() or merge(). For join_existing, it must contain the name of the existing dimension to use (e.g. {'dim': 'time'}), by default None
delete (bool, optional) – Whether to delete/remove/disable aggregation operations for a particular attribute, by default False
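A hedged sketch of the delete flag, assuming the collection aggregates over a member_id column: disabling that aggregation keeps each ensemble member as a separate dataset rather than concatenating them.

>>> col.update_aggregation(attribute_name="member_id", delete=True)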
agg_columns
List of columns used to merge/concatenate multiple compatible Datasets into a single Dataset.
data_format
The data format. Valid values are netcdf and zarr. If specified, it means that all data assets in the catalog use the same data format.
df
Return pandas DataFrame.
format_column_name
Name of the column which contains the data format.
groupby_attrs
Dataframe columns used to determine groups of compatible datasets.
list – Columns used to determine groups of compatible datasets.
key_template
Return string template used to create catalog entry keys
str – string template used to create catalog entry keys
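A sketch relating key_template to sep and groupby_attrs (the exact columns depend on the collection spec, so the commented value is only illustrative):

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.key_template  # e.g. 'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'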
path_column_name
The name of the column containing the path to the asset.
variable_column_name
Name of the column that contains the variable name.