Intake-esm
Motivation
Overview
Installation
Computer simulations of the Earth's climate and weather generate huge amounts of data. These data are often persisted on HPC systems or in the cloud across multiple data assets in a variety of formats (netCDF, Zarr, etc.). Finding, investigating, and loading these data assets into compute-ready data containers costs time and effort. The data user needs to know what data sets are available and which attributes describe each data set before loading a specific data set and analyzing it.
Finding, investigating, and loading these assets into data array containers such as xarray can be a daunting task due to the large number of files a user may be interested in. Intake-esm aims to address these issues by providing the necessary functionality for searching, discovering, and loading data assets.
intake-esm is a data cataloging utility built on top of intake, pandas, and xarray, and it’s pretty awesome!
Opening an ESM collection definition file: An ESM (Earth System Model) collection file is a JSON file that conforms to the ESM Collection Specification. When provided a link/path to an ESM collection file, intake-esm establishes a link to a database (CSV file) that contains data asset locations and associated metadata (i.e., which experiment and model they come from). The collection JSON file can be stored on a local filesystem or hosted on a remote server.
In [1]: import intake

In [2]: col_url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"

In [3]: col = intake.open_esm_datastore(col_url)

In [4]: col
Out[4]: <pangeo-cmip6 catalog with 4287 dataset(s) from 282905 asset(s)>
Search and Discovery: intake-esm provides functionality to execute queries against the catalog:
In [5]: col_subset = col.search(
   ...:     experiment_id=["historical", "ssp585"],
   ...:     table_id="Oyr",
   ...:     variable_id="o2",
   ...:     grid_label="gn",
   ...: )

In [6]: col_subset
Out[6]: <pangeo-cmip6 catalog with 18 dataset(s) from 138 asset(s)>
Access: when the user is satisfied with the results of their query, they can ask intake-esm to load data assets (netCDF/HDF files and/or Zarr stores) into xarray datasets:
In [7]: dset_dict = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

|███████████████████████████████████████████████████████████████| 100.00% [18/18 00:10<00:00]
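Each entry in the returned dictionary is an ordinary xarray dataset keyed as described above. As a quick, illustrative follow-up (not part of the session above; the exact keys depend on the query), you can pull one dataset out by key:

# Illustrative: pick an arbitrary dataset out of the dictionary
ds = dset_dict[list(dset_dict.keys())[0]]
ds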
See documentation for more information.
Intake-esm can be installed from PyPI with pip:
python -m pip install intake-esm
It is also available from conda-forge for conda installations:
conda install -c conda-forge intake-esm
If you encounter any errors or problems with intake-esm, please open an issue at the GitHub main repository.
The intake-esm user guide introduces the main concepts required for accessing Earth System Model (ESM) data catalogs and loading data assets into xarray containers. This guide gives an overview of the functionality available. The guide is split into core and tutorials & examples sections.
Intake-esm is a data cataloging utility built on top of intake, pandas, and xarray. Intake-esm aims to facilitate:
the discovery of Earth's climate and weather datasets.
the ingestion of these datasets into xarray dataset containers.
Its basic usage is shown below. To begin, let's import intake:
import intake
At import time, the intake-esm plugin is available in intake's registry as esm_datastore and can be accessed with the intake.open_esm_datastore() function. For demonstration purposes, we are going to use the catalog for the Community Earth System Model Large Ensemble (CESM LENS) dataset, publicly available in Amazon S3.
Note
You can learn more about the CESM LENS dataset in AWS S3 here
You can load data from an ESM catalog by providing the URL to a valid ESM catalog:
catalog_url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
col = intake.open_esm_datastore(catalog_url)
col
aws-cesm1-le catalog with 56 dataset(s) from 429 asset(s):
The summary above tells us that this catalog contains over 400 data assets. We can get more information on the individual data assets contained in the catalog by inspecting the underlying dataframe, which is created when the catalog is initialized:
col.df.head()
To get unique values for given columns in the catalog, intake-esm provides a unique() method. This method returns a dictionary containing the count and the unique values for each requested column:
col.unique(columns=["component", "frequency", "experiment"])
{'component': {'count': 5, 'values': ['atm', 'ice_nh', 'ice_sh', 'lnd', 'ocn']},
 'frequency': {'count': 6,
               'values': ['daily', 'hourly6-1990-2005', 'hourly6-2026-2035',
                          'hourly6-2071-2080', 'monthly', 'static']},
 'experiment': {'count': 4, 'values': ['20C', 'CTRL', 'HIST', 'RCP85']}}
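If you only need the counts rather than the values themselves, the related nunique() method (documented in the API reference below) returns the number of distinct values per column:

col.nunique()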
The search() method allows the user to perform a query on a catalog using keyword arguments. The keyword argument names must be the names of the columns in the catalog. The search method returns a subset of the catalog with all the entries that match the provided query.
By default, the search() method looks for exact matches
col_subset = col.search(
    component=["ice_nh", "lnd"],
    frequency=["monthly"],
    experiment=["20C", "HIST"],
)
col_subset.df
As pointed out earlier, the search method looks for exact matches by default. However, with the use of wildcards and/or regular expressions, we can find all items with a particular substring in a given column:
# Find all entries with `wind` in their variable_long_name
col.search(variable_long_name="wind*").df
# Find all entries whose variable long name starts with `wind`
col.search(variable_long_name="^wind").df
Intake-esm implements convenience utilities for loading the query results into higher-level xarray datasets. The logic for merging/concatenating the query results into higher-level xarray datasets is provided in the input JSON file and is available under the .aggregation_info property:
col.aggregation_info
AggregationInfo(groupby_attrs=['component', 'experiment', 'frequency'], variable_column_name='variable', aggregations=[{'type': 'union', 'attribute_name': 'variable', 'options': {'compat': 'override'}}], agg_columns=['variable'], aggregation_dict={'variable': {'type': 'union', 'options': {'compat': 'override'}}})
col.aggregation_info.aggregations
[{'type': 'union', 'attribute_name': 'variable', 'options': {'compat': 'override'}}]
# Dataframe columns used to determine groups of compatible datasets.
col.aggregation_info.groupby_attrs  # or col.groupby_attrs
['component', 'experiment', 'frequency']
# List of columns used to merge/concatenate compatible multiple Dataset into a single Dataset.
col.aggregation_info.agg_columns  # or col.agg_columns
['variable']
To load data assets into xarray datasets, we need to use the to_dataset_dict() method. As the name suggests, this method returns a dictionary of aggregate xarray datasets.
dset_dicts = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
--> The keys in the returned dictionary of datasets are constructed as follows: 'component.experiment.frequency'
[key for key in dset_dicts.keys()]
['ice_nh.HIST.monthly', 'ice_nh.20C.monthly', 'lnd.HIST.monthly', 'lnd.20C.monthly']
We can access a particular dataset as follows:
ds = dset_dicts["lnd.20C.monthly"]
print(ds)
<xarray.Dataset> Dimensions: (hist_interval: 2, lat: 192, levgrnd: 15, lon: 288, member_id: 40, time: 1032) Coordinates: * lat (lat) float64 -90.0 -89.06 -88.12 ... 88.12 89.06 90.0 * lon (lon) float32 0.0 1.25 2.5 3.75 ... 356.25 357.5 358.75 * member_id (member_id) int64 1 2 3 4 5 6 7 ... 35 101 102 103 104 105 * time (time) object 1920-01-16 12:00:00 ... 2005-12-16 12:00:00 time_bounds (time, hist_interval) object dask.array<chunksize=(1032, 2), meta=np.ndarray> * levgrnd (levgrnd) float32 0.007100635 0.027925 ... 21.32647 35.17762 Dimensions without coordinates: hist_interval Data variables: FSNO (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> H2OSNO (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> QRUNOFF (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> RAIN (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> SNOW (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> SOILLIQ (member_id, time, levgrnd, lat, lon) float32 dask.array<chunksize=(1, 40, 15, 192, 288), meta=np.ndarray> SOILWATER_10CM (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> Attributes: hostname: tcs nco_openmp_thread_number: 1 username: mudryk title: CLM History file information version: cesm1_1_1_alpha01g Conventions: CF-1.0 case_title: UNSET revision_id: $Id: histFileMod.F90 40539 2012-09-... source: Community Land Model CLM4.0 Surface_dataset: surfdata_0.9x1.25_simyr1850_c110921.nc Initial_conditions_dataset: b.e11.B20TRC5CNBDRD.f09_g16.001.clm... comment: NOTE: None of the variables are wei... PFT_physiological_constants_dataset: pft-physiology.c110425.nc NCO: 4.3.4 intake_esm_varname: FSNO\nH2OSNO\nQRUNOFF\nRAIN\nSNOW\n... intake_esm_dataset_key: lnd.20C.monthly
Let’s create a quick plot for a slice of the data:
ds.SNOW.isel(time=0, member_id=range(1, 24, 4)).plot(col="member_id", col_wrap=3, robust=True)
<xarray.plot.facetgrid.FacetGrid at 0x1657d0820>
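The returned object is a regular xarray dataset, so the usual xarray operations apply. As a purely illustrative follow-up (not part of the original notebook), you could reduce the same snapshot over the ensemble dimension:

# Ensemble-mean snow cover for the first time step (illustrative)
snow_ens_mean = ds.SNOW.isel(time=0).mean(dim="member_id")
snow_ens_mean.plot(robust=True)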
Intake-esm provides functionality to execute queries against the catalog. This notebook provides a more in-depth treatment of the search API in intake-esm, with detailed information that you can refer to when needed.
import warnings

warnings.filterwarnings("ignore")
import intake
The search() method allows the user to perform a query on a catalog using keyword arguments. The keyword argument names must be the names of the columns in the catalog. By default, the search() method looks for exact matches, and is case sensitive:
col.search(experiment="20C", variable_long_name="wind")
<aws-cesm1-le catalog with 0 dataset(s) from 0 asset(s)>
As you can see, the example above returns an empty catalog.
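Because search() also accepts compiled regular expressions (see the API reference), a case-insensitive pattern is one way to avoid this pitfall; a minimal sketch:

import re

# Match "wind" anywhere in variable_long_name, regardless of case (illustrative)
col.search(experiment="20C", variable_long_name=re.compile("wind", re.IGNORECASE)).df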
In some cases, you may not know the exact term to look for. For such cases, intake-esm supports searching for substring matches. With the use of wildcards and/or regular expressions, we can find all items with a particular substring in a given column. Let's search for:
entries from experiment = ‘20C’
all entries whose variable long name contains wind
col.search(experiment="20C", variable_long_name="wind*").df
Now, let’s search for:
all entries whose variable long name starts with wind
col.search(experiment="20C", variable_long_name="^wind").df
require_all_on argument
By default, intake-esm's search() method returns entries that fulfill any of the criteria specified in the query. Intake-esm can return entries that fulfill all query criteria when the user supplies the require_all_on argument. The require_all_on parameter can be a dataframe column or a list of dataframe columns across which all elements must satisfy the query criteria. The require_all_on argument is best explained with the following example.
Let’s define a query for our collection that requests multiple variable_ids and multiple experiment_ids from the Omon table_id, all from 3 different source_ids:
catalog_url = (
    "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
)
col = intake.open_esm_datastore(catalog_url)
col
pangeo-cmip6 catalog with 6539 dataset(s) from 402033 asset(s):
# Define our query
query = dict(
    variable_id=["thetao", "o2"],
    experiment_id=["historical", "ssp245", "ssp585"],
    table_id=["Omon"],
    source_id=["ACCESS-ESM1-5", "AWI-CM-1-1-MR", "FGOALS-f3-L"],
)
Now, let’s use this query to search for all assets in the collection that satisfy any combination of these requests (i.e., with require_all_on=None, which is the default):
col_subset = col.search(**query)
col_subset
pangeo-cmip6 catalog with 8 dataset(s) from 76 asset(s):
# Group by `source_id` and count unique values for a few columns
col_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()
As you can see, the search results above include source_ids for which we only have one of the two variables, and one or two of the three experiments.
We can tell intake-esm to discard any source_id that doesn’t have both variables ["thetao", "o2"] and all three experiments ["historical", "ssp245", "ssp585"] by passing require_all_on=["source_id"] to the search method:
["thetao", "o2"]
["historical", "ssp245", "ssp585"]
require_all_on=["source_id"]
col_subset = col.search(require_all_on=["source_id"], **query)
col_subset
pangeo-cmip6 catalog with 3 dataset(s) from 63 asset(s):
col_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()
Notice that with the require_all_on=["source_id"] option, the only source_id that was returned by our query was the source_id for which all of the variables and experiments were found.
In addition to catalogs of data assets (files) in time-series (single-variable) format, intake-esm supports catalogs with data assets in time-slice (history) format and/or files with multiple variables. For intake-esm to properly work with multi-variable assets,
the variable_column of the catalog must contain iterables (list, tuple, set) of values.
the user must specify a dictionary of functions for converting values in certain columns into iterables. This is done via the csv_kwargs argument.
In the example below, we are going to use the following catalog to demonstrate how to work with multi-variable assets:
# Look at the catalog on disk
!cat multi-variable-catalog.csv
experiment,case,component,stream,variable,member_id,path,time_range
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'TEMP', 'SiO3']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-TEMP-SiO3.050001-050012.nc,050001-050012
As you can see, the variable column contains a list of variables, and this list was serialized as a string: "['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']".
"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']"
To load a catalog whose assets contain multiple variables, we must pass additional information to open_esm_datastore via the csv_kwargs argument. We are going to specify a dictionary of functions for converting values in the variable column into iterables. We use the literal_eval function from the standard ast module:
import ast

import intake
col = intake.open_esm_datastore(
    "multi-variable-collection.json",
    csv_kwargs={"converters": {"variable": ast.literal_eval}},
)
col
sample-multi-variable-cesm1-lens catalog with 1 dataset(s) from 5 asset(s):
The in-memory representation of the catalog now stores the variable column as tuples of values. To confirm that intake-esm has registered this catalog as one with multi-variable assets, we can check the ._multiple_variable_assets property:
col._multiple_variable_assets
True
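To see what this looks like, you can inspect one of the values in the variable column directly (illustrative):

col.df["variable"].iloc[0]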
The search functionality works in the same way:
col_subset = col.search(variable=["O2", "SiO3"])
col_subset.df
Loading data assets into xarray datasets works in the same way too:
col_subset.to_dataset_dict(cdf_kwargs={})
--> The keys in the returned dictionary of datasets are constructed as follows: 'component.experiment.stream'
{'ocn.CTRL.pop.h': <xarray.Dataset> Dimensions: (member_id: 1, nlat: 2, nlon: 2, time: 24) Coordinates: * time (time) object 0500-02-01 00:00:00 ... 0502-02-01 00:00:00 TLAT (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray> TLONG (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray> ULAT (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray> ULONG (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray> * member_id (member_id) int64 5 Dimensions without coordinates: nlat, nlon Data variables: O2 (member_id, time, nlat, nlon) float32 dask.array<chunksize=(1, 12, 2, 2), meta=np.ndarray> SiO3 (member_id, time, nlat, nlon) float32 dask.array<chunksize=(1, 24, 2, 2), meta=np.ndarray> Attributes: cell_methods: cell_methods = time: mean ==> the variable val... history: Fri Oct 11 01:05:51 2013: /glade/apps/opt/nco/... NCO: 4.3.4 nsteps_total: 1953500 source: CCSM POP2, the CCSM Ocean Component intake_esm_varname: O2\nSiO3 title: b.e11.B1850C5CN.f09_g16.005 revision: $Id: tavg.F90 41939 2012-11-14 16:37:23Z mlevy... tavg_sum_qflux: 2678400.0 calendar: All years have exactly 365 days. start_time: This dataset was created on 2013-05-28 at 02:4... Conventions: CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netc... tavg_sum: 2678400.0 contents: Diagnostic and Prognostic Variables nco_openmp_thread_number: 1 intake_esm_dataset_key: ocn.CTRL.pop.h}
This notebook demonstrates how to access Google Cloud CMIP6 data using intake-esm.
url = (
    "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
)
col = intake.open_esm_datastore(url)
col
The summary above tells us that this catalog contains over 268,000 data assets. We can get more information on the individual data assets contained in the catalog by calling the underlying dataframe created when it is initialized:
The first data asset listed in the catalog contains:
the ambient aerosol optical thickness at 550nm (variable_id='od550aer'), as a function of latitude, longitude, time,
in an individual climate model experiment with the Taiwan Earth System Model 1.0 model (source_id='TaiESM1'),
forced by the Historical transient with SSTs prescribed from historical experiment (experiment_id='histSST'),
developed by the Taiwan Research Center for Environmental Changes (institution_id='AS-RCEC'),
run as part of the Aerosols and Chemistry Model Intercomparison Project (activity_id='AerChemMIP')
And is located in Google Cloud Storage at gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/.
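All of these attributes can be read directly off the first row of the underlying dataframe, e.g.:

col.df.iloc[0]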
Let’s query the data to see what models (source_id), experiments (experiment_id) and temporal frequencies (table_id) are available.
import pprint

uni_dict = col.unique(["source_id", "experiment_id", "table_id"])
pprint.pprint(uni_dict, compact=True)
{'experiment_id': {'count': 160, 'values': ['ssp245-stratO3', '1pctCO2-cdr', 'pdSST-piAntSIC', 'pdSST-piArcSIC', 'piClim-HC', 'rcp85-cmip5', 'ssp126-ssp370Lu', 'amip-m4K', 'faf-heat-NA0pct', 'hist-piAer', 'rcp26-cmip5', 'ssp370SST', 'histSST-piO3', 'rcp45-cmip5', 'ssp370SST-ssp126Lu', 'historical-cmip5', 'lig127k', 'piClim-SO2', 'ssp245', 'historical-ext', 'piClim-BC', 'hist-sol', 'piClim-VOC', 'piClim-control', 'hist-aer', 'hist-noLu', 'aqua-control', 'esm-pi-CO2pulse', 'faf-passiveheat', 'piClim-2xDMS', 'dcppC-hindcast-noAgung', 'ssp245-nat', 'abrupt-solm4p', 'ssp370-ssp126Lu', 'amip-4xCO2', 'piClim-anthro', 'dcppC-pac-pacemaker', 'dcppC-amv-pos', 'ssp534-over', 'abrupt-2xCO2', 'piClim-NOx', 'deforest-globe', 'ssp119', 'dcppC-amv-ExTrop-pos', 'ssp370', 'pa-pdSIC', 'abrupt-0p5xCO2', 'piClim-OC', 'aqua-p4K', 'amip-hist', 'piClim-NTCF', 'ssp460', 'amip-future4K', 'pdSST-futArcSIC', 'hist-CO2', 'piControl-cmip5', 'piClim-histall', 'hist-bgc', 'hist-resIPO', 'dcppC-ipv-NexTrop-neg', 'ssp245-cov-fossil', 'land-hist', 'piClim-histaer', 'hist-1950HC', 'piSST-pdSIC', 'hist-GHG-cmip5', 'control-1950', 'past1000', '1pctCO2-rad', 'faf-stress', 'ssp370pdSST', 'dcppC-pac-control', 'futSST-pdSIC', 'pdSST-pdSIC', 'piClim-2xfire', 'dcppC-atl-pacemaker', 'highresSST-future', 'hist-piNTCF', 'lgm', 'dcppC-amv-ExTrop-neg', 'piClim-2xVOC', 'piClim-ghg', 'piClim-histnat', 'histSST-piAer', 'piControl-spinup', 'highresSST-present', 'piClim-N2O', 'faf-heat-NA50pct', 'ssp585', 'piClim-O3', 'land-noLu', 'abrupt-4xCO2', 'histSST', 'hist-1950', 'dcppC-amv-neg', 'piControl', 'esm-pi-cdr-pulse', 'historical', 'piClim-lu', 'dcppC-atl-control', 'amip-p4K', 'esm-piControl-spinup', 'land-hist-altStartYear', 'amip-p4K-lwoff', 'hist-nat-cmip5', 'histSST-piCH4', 'esm-ssp585', 'omip1', 'ssp370SST-lowNTCF', 'histSST-1950HC', 'piClim-2xdust', 'piClim-2xNOx', 'ssp245-cov-strgreen', 'faf-heat', 'dcppC-amv-Trop-pos', 'histSST-piNTCF', 'esm-piControl', 'hist-stratO3', 'piClim-4xCO2', 'dcppC-ipv-NexTrop-pos', 'aqua-4xCO2', 'hist-totalO3', 'ssp434', 'ssp370SST-lowCH4', 'pdSST-futAntSIC', 'dcppC-amv-Trop-neg', 'dcppC-ipv-pos', '1pctCO2', 'abrupt-solp4p', 'piSST-piSIC', 'ssp245-aer', 'piClim-aer', 'aqua-control-lwoff', 'ssp245-GHG', 'piClim-histghg', 'hist-aer-cmip5', 'piClim-2xss', 'hist-volc', 'piClim-CH4', '1pctCO2-bgc', 'dcppC-ipv-neg', 'amip', 'amip-lwoff', 'pa-futArcSIC', 'dcppC-hindcast-noPinatubo', 'ssp245-covid', 'dcppC-hindcast-noElChichon', 'ssp370-lowNTCF', 'esm-hist', 'ssp126', 'hist-nat', 'dcppA-hindcast', 'aqua-p4K-lwoff', 'ssp245-cov-modgreen', 'esm-ssp585-ssp126Lu', 'dcppA-assim', 'faf-all', 'hist-GHG', 'midHolocene', 'faf-water']}, 'source_id': {'count': 84, 'values': ['UKESM1-0-LL', 'EC-Earth3-Veg', 'SAM0-UNICON', 'INM-CM5-H', 'MPI-ESM1-2-HR', 'CESM1-1-CAM5-CMIP5', 'MIROC6', 'EC-Earth3-Veg-LR', 'GFDL-CM4C192', 'BCC-CSM2-HR', 'NorESM2-LM', 'NorESM2-MM', 'NorCPM1', 'CIESM', 'IITM-ESM', 'CESM2-WACCM', 'IPSL-CM6A-ATM-HR', 'CESM2-WACCM-FV2', 'GFDL-CM4', 'ECMWF-IFS-LR', 'IPSL-CM5A2-INCA', 'E3SM-1-0', 'KACE-1-0-G', 'CNRM-CM6-1', 'GFDL-ESM2M', 'MPI-ESM1-2-XR', 'BCC-ESM1', 'CanESM5-CanOE', 'EC-Earth3P-VHR', 'GFDL-ESM4', 'GFDL-AM4', 'AWI-ESM-1-1-LR', 'BCC-CSM2-MR', 'CAS-ESM2-0', 'EC-Earth3-LR', 'AWI-CM-1-1-MR', 'IPSL-CM6A-LR-INCA', 'HadGEM3-GC31-LL', 'CNRM-CM6-1-HR', 'EC-Earth3P-HR', 'MRI-AGCM3-2-H', 'ECMWF-IFS-HR', 'GISS-E2-1-H', 'EC-Earth3-AerChem', 'CAMS-CSM1-0', 'MIROC-ES2L', 'HadGEM3-GC31-HM', 'IPSL-CM6A-LR', 'CMCC-CM2-HR4', 'FGOALS-g3', 'HadGEM3-GC31-MM', 'MPI-ESM-1-2-HAM', 'MPI-ESM1-2-LR', 'MRI-AGCM3-2-S', 'INM-CM4-8', 
'CMCC-CM2-VHR4', 'HadGEM3-GC31-LM', 'EC-Earth3-CC', 'TaiESM1', 'FGOALS-f3-H', 'E3SM-1-1', 'KIOST-ESM', 'CanESM5', 'GISS-E2-1-G', 'ACCESS-ESM1-5', 'CMCC-ESM2', 'GISS-E2-1-G-CC', 'CESM2', 'MRI-ESM2-0', 'CMCC-CM2-SR5', 'GFDL-OM4p5B', 'INM-CM5-0', 'NESM3', 'CESM2-FV2', 'MCM-UA-1-0', 'NorESM1-F', 'FIO-ESM-2-0', 'EC-Earth3P', 'ACCESS-CM2', 'E3SM-1-1-ECA', 'CNRM-ESM2-1', 'EC-Earth3', 'GISS-E2-2-G', 'FGOALS-f3-L']}, 'table_id': {'count': 37, 'values': ['6hrLev', 'Omon', '6hrPlevPt', 'CFmon', 'day', 'SIclim', 'ImonGre', 'Odec', 'Oday', 'AERmonZ', 'Efx', 'AERhr', 'Lmon', 'SIday', 'Aclim', 'E3hr', 'E1hrClimMon', 'EdayZ', 'Oyr', 'EmonZ', 'Ofx', 'Eyr', '3hr', 'CFday', 'Emon', 'Eday', 'fx', 'SImon', 'Oclim', 'AERmon', 'CF3hr', 'IfxGre', 'LImon', 'Eclim', 'AERday', '6hrPlev', 'Amon']}}
In the example below, we are going to search for the following:
variables: o2 which stands for mole_concentration_of_dissolved_molecular_oxygen_in_sea_water
experiments: ['historical', 'ssp585']:
historical: all forcing of the recent past.
ssp585: emission-driven RCP8.5 based on SSP5.
table_id: Oyr which stands for annual mean variables on the ocean grid.
grid_label: gn which stands for data reported on a model’s native grid.
For more details on the CMIP6 vocabulary, please check this website, and Core Controlled Vocabularies (CVs) for use in CMIP6 GitHub repository.
cat = col.search(
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)
cat
pangeo-cmip6 catalog with 23 dataset(s) from 149 asset(s):
cat.df.head()
dset_dict = cat.to_dataset_dict(
    zarr_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)
--> The keys in the returned dictionary of datasets are constructed as follows: 'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
[key for key in dset_dict.keys()]
['ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp585.Oyr.gn', 'CMIP.NCC.NorESM2-LM.historical.Oyr.gn', 'ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp585.Oyr.gn', 'CMIP.MIROC.MIROC-ES2L.historical.Oyr.gn', 'ScenarioMIP.NCAR.CESM2.ssp585.Oyr.gn', 'ScenarioMIP.NCC.NorESM2-LM.ssp585.Oyr.gn', 'CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn', 'CMIP.MRI.MRI-ESM2-0.historical.Oyr.gn', 'CMIP.MPI-M.MPI-ESM1-2-HR.historical.Oyr.gn', 'ScenarioMIP.NCC.NorESM2-MM.ssp585.Oyr.gn', 'CMIP.CCCma.CanESM5-CanOE.historical.Oyr.gn', 'ScenarioMIP.CCCma.CanESM5-CanOE.ssp585.Oyr.gn', 'ScenarioMIP.DWD.MPI-ESM1-2-HR.ssp585.Oyr.gn', 'CMIP.HAMMOZ-Consortium.MPI-ESM-1-2-HAM.historical.Oyr.gn', 'ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Oyr.gn', 'CMIP.MPI-M.MPI-ESM1-2-LR.historical.Oyr.gn', 'CMIP.CSIRO.ACCESS-ESM1-5.historical.Oyr.gn', 'CMIP.NCC.NorESM2-MM.historical.Oyr.gn', 'ScenarioMIP.CCCma.CanESM5.ssp585.Oyr.gn', 'ScenarioMIP.MIROC.MIROC-ES2L.ssp585.Oyr.gn', 'ScenarioMIP.MRI.MRI-ESM2-0.ssp585.Oyr.gn', 'CMIP.CCCma.CanESM5.historical.Oyr.gn', 'ScenarioMIP.MPI-M.MPI-ESM1-2-LR.ssp585.Oyr.gn']
ds = dset_dict["CMIP.CCCma.CanESM5.historical.Oyr.gn"]
print(ds)
<xarray.Dataset> Dimensions: (bnds: 2, i: 360, j: 291, lev: 45, member_id: 35, time: 165, vertices: 4) Coordinates: * i (i) int32 0 1 2 3 4 5 6 ... 353 354 355 356 357 358 359 * j (j) int32 0 1 2 3 4 5 6 ... 284 285 286 287 288 289 290 latitude (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray> * lev (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03 lev_bnds (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray> longitude (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray> * time (time) object 1850-07-02 12:00:00 ... 2014-07-02 12:0... time_bnds (time, bnds) object dask.array<chunksize=(165, 2), meta=np.ndarray> * member_id (member_id) <U9 'r10i1p1f1' 'r10i1p2f1' ... 'r9i1p2f1' Dimensions without coordinates: bnds, vertices Data variables: o2 (member_id, time, lev, j, i) float32 dask.array<chunksize=(1, 12, 45, 291, 360), meta=np.ndarray> vertices_latitude (j, i, vertices) float64 dask.array<chunksize=(291, 360, 4), meta=np.ndarray> vertices_longitude (j, i, vertices) float64 dask.array<chunksize=(291, 360, 4), meta=np.ndarray> Attributes: parent_activity_id: CMIP branch_method: Spin-up documentation parent_experiment_id: piControl parent_time_units: days since 1850-01-01 0:0:0.0 cmor_version: 3.4.0 mip_era: CMIP6 variant_label: r9i1p2f1 grid: ORCA1 tripolar grid, 1 deg with refinement t... source_id: CanESM5 Conventions: CF-1.7 CMIP-6.2 frequency: yr sub_experiment_id: none history: 2019-05-02T13:53:53Z ;rewrote data to be con... forcing_index: 1 institution: Canadian Centre for Climate Modelling and An... product: model-output branch_time_in_child: 0.0 status: 2019-10-25;created;by nhn2@columbia.edu YMDH_branch_time_in_child: 1850:01:01:00 realm: ocnBgchem references: Geophysical Model Development Special issue ... grid_label: gn license: CMIP6 model data produced by The Government ... source_type: AOGCM title: CanESM5 output prepared for CMIP6 YMDH_branch_time_in_parent: 5950:01:01:00 further_info_url: https://furtherinfo.es-doc.org/CMIP6.CCCma.C... experiment_id: historical data_specs_version: 01.00.29 table_info: Creation Date:(20 February 2019) MD5:374fbe5... CCCma_model_hash: Unknown activity_id: CMIP realization_index: 9 initialization_index: 1 source: CanESM5 (2019): \naerosol: interactive\natmo... parent_source_id: CanESM5 sub_experiment: none CCCma_parent_runid: p2-pictrl tracking_id: hdl:21.14100/41426118-701c-482b-ae16-82932e4... creation_date: 2019-05-30T08:58:45Z experiment: all-forcing simulation of the recent past CCCma_runid: p2-his09 institution_id: CCCma variable_id: o2 contact: ec.cccma.info-info.ccmac.ec@canada.ca table_id: Oyr branch_time_in_parent: 1496500.0 nominal_resolution: 100 km parent_mip_era: CMIP6 intake_esm_varname: ['o2'] version: v20190429 external_variables: areacello volcello intake_esm_dataset_key: CMIP.CCCma.CanESM5.historical.Oyr.gn
ds.o2.isel(time=0, lev=0, member_id=range(1, 24, 4)).plot(col="member_id", col_wrap=3, robust=True)
<xarray.plot.facetgrid.FacetGrid at 0x7f99eb8a4910>
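As with any xarray dataset, further reductions are straightforward. For example (illustrative only; an unweighted mean over the native grid cells, so not an area-weighted average):

# Unweighted horizontal mean of surface oxygen for the first ensemble member (illustrative)
o2_surface = ds.o2.isel(member_id=0, lev=0).mean(dim=["j", "i"])
o2_surface.plot()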
When comparing many models it is often necessary to preprocess (e.g. rename certain variables) them before running some analysis step. The preprocess argument lets the user pass a function, which is executed for each loaded asset before aggregations.
cat_pp = col.search(
    experiment_id=["historical"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
    source_id=["IPSL-CM6A-LR", "CanESM5"],
    member_id="r10i1p1f1",
)
cat_pp.df
# load the example
dset_dict_raw = cat_pp.to_dataset_dict(zarr_kwargs={"consolidated": True})
for k, ds in dset_dict_raw.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")
dataset key=CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn
	dimensions=['axis_nbounds', 'member_id', 'nvertex', 'olevel', 'time', 'x', 'y']

dataset key=CMIP.CCCma.CanESM5.historical.Oyr.gn
	dimensions=['bnds', 'i', 'j', 'lev', 'member_id', 'time', 'vertices']
Note that both models follow a different naming scheme. We can define a little helper function and pass it to .to_dataset_dict() to fix this. For demonstration purposes we will focus on the vertical level dimension which is called lev in CanESM5 and olevel in IPSL-CM6A-LR.
def helper_func(ds):
    """Rename `olevel` dim to `lev`"""
    ds = ds.copy()
    # a short example
    if "olevel" in ds.dims:
        ds = ds.rename({"olevel": "lev"})
    return ds
dset_dict_fixed = cat_pp.to_dataset_dict(zarr_kwargs={"consolidated": True}, preprocess=helper_func)
for k, ds in dset_dict_fixed.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")
dataset key=CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn
	dimensions=['axis_nbounds', 'lev', 'member_id', 'nvertex', 'time', 'x', 'y']

dataset key=CMIP.CCCma.CanESM5.historical.Oyr.gn
	dimensions=['bnds', 'i', 'j', 'lev', 'member_id', 'time', 'vertices']
This was just an example for one dimension.
Check out cmip6-preprocessing package for a full renaming function for all available CMIP6 models and some other utilities.
The in-memory representation of an Earth System Model (ESM) catalog is a pandas dataframe, and is accessible via the .df property:
url = (
    "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
)
col = intake.open_esm_datastore(url)
col.df.head()
In this notebook we will go through some examples showing how to manipulate this dataframe outside of intake-esm.
Let’s say we are interested in datasets with the following attributes:
experiment_id=["historical"]
table_id="Amon"
variable_id="tas"
source_id=['TaiESM1', 'AWI-CM-1-1-MR', 'AWI-ESM-1-1-LR', 'BCC-CSM2-MR', 'BCC-ESM1', 'CAMS-CSM1-0', 'CAS-ESM2-0', 'UKESM1-0-LL']
In addition to these attributes, we are interested in the first ensemble member (member_id) of each model (source_id) only.
This can be achieved in two steps:
We can run a query against the catalog:
col_subset = col.search(
    experiment_id=["historical"],
    table_id="Amon",
    variable_id="tas",
    source_id=[
        "TaiESM1",
        "AWI-CM-1-1-MR",
        "AWI-ESM-1-1-LR",
        "BCC-CSM2-MR",
        "BCC-ESM1",
        "CAMS-CSM1-0",
        "CAS-ESM2-0",
        "UKESM1-0-LL",
    ],
)
col_subset
pangeo-cmip6 catalog with 9 dataset(s) from 38 asset(s):
The subsetted catalog contains the following number of member_ids per source_id:
col_subset.df.groupby("source_id")["member_id"].nunique()
source_id
AWI-CM-1-1-MR      5
AWI-ESM-1-1-LR     1
BCC-CSM2-MR        3
BCC-ESM1           3
CAMS-CSM1-0        3
CAS-ESM2-0         4
TaiESM1            1
UKESM1-0-LL       18
Name: member_id, dtype: int64
To get the first member_id for each source_id, we group the dataframe by source_id and use the .first() function to retrieve the first member_id:
grouped = col_subset.df.groupby(["source_id"])
df = grouped.first().reset_index()

# Confirm that we have one ensemble member per source_id
df.groupby("source_id")["member_id"].nunique()
source_id
AWI-CM-1-1-MR     1
AWI-ESM-1-1-LR    1
BCC-CSM2-MR       1
BCC-ESM1          1
CAMS-CSM1-0       1
CAS-ESM2-0        1
TaiESM1           1
UKESM1-0-LL       1
Name: member_id, dtype: int64
col_subset.df = df
col_subset
pangeo-cmip6 catalog with 8 dataset(s) from 8 asset(s):
dsets = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
[key for key in dsets]
['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'CMIP.AWI.AWI-CM-1-1-MR.historical.Amon.gn', 'CMIP.CAS.CAS-ESM2-0.historical.Amon.gn', 'CMIP.AWI.AWI-ESM-1-1-LR.historical.Amon.gn', 'CMIP.AS-RCEC.TaiESM1.historical.Amon.gn', 'CMIP.MOHC.UKESM1-0-LL.historical.Amon.gn', 'CMIP.BCC.BCC-ESM1.historical.Amon.gn', 'CMIP.CAMS.CAMS-CSM1-0.historical.Amon.gn']
print(dsets["CMIP.CAS.CAS-ESM2-0.historical.Amon.gn"])
<xarray.Dataset> Dimensions: (bnds: 2, lat: 128, lon: 256, member_id: 1, time: 1980) Coordinates: height float64 ... * lat (lat) float64 -90.0 -88.58 -87.17 -85.75 ... 87.17 88.58 90.0 lat_bnds (lat, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray> * lon (lon) float64 0.0 1.406 2.812 4.219 ... 354.4 355.8 357.2 358.6 lon_bnds (lon, bnds) float64 dask.array<chunksize=(256, 2), meta=np.ndarray> * time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00 time_bnds (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray> * member_id (member_id) <U8 'r1i1p1f1' Dimensions without coordinates: bnds Data variables: tas (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 381, 128, 256), meta=np.ndarray> Attributes: Conventions: CF-1.7 CMIP-6.2 activity_id: CMIP branch_method: standard branch_time_in_child: 0.0 branch_time_in_parent: 0.0 cmor_version: 3.5.0 contact: Zhang He (zhanghe@mail.iap.ac.cn) creation_date: 2020-03-02T12:28:26Z data_specs_version: 01.00.31 experiment: all-forcing simulation of the recent past experiment_id: historical external_variables: areacella forcing_index: 1 frequency: mon further_info_url: https://furtherinfo.es-doc.org/CMIP6.CAS.CAS-ESM... grid: native atmosphere regular grid (128x256 latxlon) grid_label: gn history: 2020-03-02T12:28:26Z ;rewrote data to be consist... initialization_index: 1 institution: Chinese Academy of Sciences, Beijing 100029, China institution_id: CAS license: CMIP6 model data produced by Institute of Atmosp... mip_era: CMIP6 nominal_resolution: 100 km parent_activity_id: CMIP parent_experiment_id: piControl parent_mip_era: CMIP6 parent_source_id: CAS-ESM2-0 parent_time_units: days since 1850-01-01 parent_variant_label: r1i1p1f1 physics_index: 1 product: model-output realization_index: 1 realm: atmos run_variant: 3rd realization source: CAS-ESM 2.0 (2019): \naerosol: IAP AACM\natmos: ... source_id: CAS-ESM2-0 source_type: AOGCM status: 2020-05-02;created; by gcs.cmip6.ldeo@gmail.com sub_experiment: none sub_experiment_id: none table_id: Amon table_info: Creation Date:(24 July 2019) MD5:b9834a2d0728c0d... title: CAS-ESM2-0 output prepared for CMIP6 tracking_id: hdl:21.14100/22e89a1b-f73e-45be-84dc-7d0aabbeea9d variable_id: tas variant_label: r1i1p1f1 intake_esm_varname: ['tas'] intake_esm_dataset_key: CMIP.CAS.CAS-ESM2-0.historical.Amon.gn
Intake-esm catalogs include two pieces:
An ESM-Collection file: an ESM-Collection file is a simple JSON file that provides metadata about the catalog. The specification for this JSON file is found in the esm-collection-spec repository. An abridged sketch of such a file is shown after the catalog example below.
A catalog file: the catalog file is a CSV file that lists the catalog contents. This file includes one row per dataset granule (e.g. a NetCDF file or Zarr dataset). The columns in this CSV must match the attributes and assets listed in the ESM-Collection file. A short example of a catalog file is shown below:
activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore,dcpp_init_year
AerChemMIP,BCC,BCC-ESM1,piClim-CH4,r1i1p1f1,Amon,ch4,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/piClim-CH4/r1i1p1f1/Amon/ch4/gn/,
AerChemMIP,BCC,BCC-ESM1,piClim-CH4,r1i1p1f1,Amon,clt,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/piClim-CH4/r1i1p1f1/Amon/clt/gn/,
AerChemMIP,BCC,BCC-ESM1,piClim-CH4,r1i1p1f1,Amon,co2,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/piClim-CH4/r1i1p1f1/Amon/co2/gn/,
AerChemMIP,BCC,BCC-ESM1,piClim-CH4,r1i1p1f1,Amon,evspsbl,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/piClim-CH4/r1i1p1f1/Amon/evspsbl/gn/,
...
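For orientation, a heavily abridged ESM-Collection file describing such a catalog might look roughly like the following. Field names follow the esm-collection-spec; treat this as an illustrative sketch rather than a complete, validated example:

{
  "esmcat_version": "0.1.0",
  "id": "example-cmip6",
  "description": "Example CMIP6 catalog (illustrative)",
  "catalog_file": "example-cmip6.csv",
  "attributes": [
    {"column_name": "activity_id"},
    {"column_name": "source_id"},
    {"column_name": "variable_id"}
  ],
  "assets": {"column_name": "zstore", "format": "zarr"},
  "aggregation_control": {
    "variable_column_name": "variable_id",
    "groupby_attrs": ["activity_id", "institution_id", "source_id", "experiment_id", "table_id", "grid_label"],
    "aggregations": [
      {"type": "union", "attribute_name": "variable_id"},
      {"type": "join_new", "attribute_name": "member_id", "options": {"coords": "minimal", "compat": "override"}}
    ]
  }
}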
The table below is an incomplete list of existing catalogs. Please feel free to add to this list or raise an issue on GitHub.
CMIP6-GLADE
Description: CMIP6 data accessible on the NCAR’s GLADE disk storage system
Platform: NCAR-GLADE
Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json
Data Format: netCDF
Documentation Page: https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html
CMIP6-CESM2-Timeseries
Description: CESM2 raw output (non-cmorized) that went into CMIP6 data
Platform: NCAR-CAMPAIGN
Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/campaign-cesm2-cmip6-timeseries.json
Documentation Page: http://www.cesm.ucar.edu/models/cesm2/
CMIP5-GLADE
Description: CMIP5 data accessible on the NCAR’s GLADE disk storage system
Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip5.json
Documentation Page: https://pcmdi.llnl.gov/mips/cmip5/guide.html
CESM1-LENS-AWS
Description: CESM1 Large Ensemble data publicly available on Amazon S3
Platform: AWS S3 (us-west-2 region)
Catalog path or url: https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json
Data Format: Zarr
Documentation Page: https://doi.org/10.26024/wt24-5j82
CESM1-LENS-GLADE
Description: CESM1 Large Ensemble data stored on NCAR’s GLADE disk storage system
Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm1-le.json
Documentation Page: https://doi.org/10.5065/d6j101d1
CMIP6-GCP
Description: CMIP6 Zarr data residing in Pangeo’s Google Storage
Platform: Google Cloud Platform
Catalog path or url: https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json
CMIP6-MISTRAL
Description: CMIP6 data accessible on the DKRZ’s MISTRAL disk storage system
Platform: DKRZ (German Climate Computing Centre)-MISTRAL
Catalog path or url: /work/ik1017/Catalogs/mistral-cmip6.json
CMIP5-MISTRAL
Description: CMIP5 data accessible on the DKRZ’s MISTRAL disk storage system
Catalog path or url: /work/ik1017/Catalogs/mistral-cmip5.json
MiKlip-MISTRAL
Description: Data from MiKlip projects at the Max Planck Institute for Meteorology (MPI-M)
Catalog path or url: /work/ik1017/Catalogs/mistral-miklip.json
Documentation Page: https://www.fona-miklip.de/
MPI-GE-MISTRAL
Description: Max Planck Institute Grand Ensemble cmorized by CMIP5-standards
Catalog path or url: /work/ik1017/Catalogs/mistral-MPI-GE.json
Documentation Page: https://doi.org/10/gf3kgt
CMIP6-LDEO-OpenDAP
Description: CMIP6 data accessible via Hyrax OpenDAP Server at Lamont-Doherty Earth Observatory
Platform: LDEO-OpenDAP
Catalog path or url: http://haden.ldeo.columbia.edu/catalogs/hyrax_cmip6.json
Some of these catalogs are also stored in intake-esm-datastore GitHub repository at https://github.com/NCAR/intake-esm-datastore/tree/master/catalogs
NCAR’s CMIP Analysis Platform (CMIP AP) includes a large collection of CMIP5 and CMIP6 data sets.
Use this form to request new data be added to the CMIP AP. Typically requests are fulfilled within two weeks. Contact CISL if you have further questions. Intake-ESM catalogs are regularly updated following the addition (or removal) of data from the platform.
NCAR has created multiple Intake ESM catalogs that work on datasets stored on GLADE. Those catalogs are listed below:
This page provides an auto-generated summary of intake-esm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.
intake.open_esm_datastore
class intake_esm.core.esm_datastore
An intake plugin for parsing an ESM (Earth System Model) Collection/catalog and loading assets (netCDF files and/or Zarr stores) into xarray datasets. The in-memory representation for the catalog is a Pandas DataFrame.
esmcol_obj (str, pandas.DataFrame) – If string, this must be a path or URL to an ESM collection JSON file. If pandas.DataFrame, this must be the catalog content that would otherwise be in a CSV file.
esmcol_data (dict, optional) – ESM collection spec information, by default None
progressbar (bool, optional) – Will print a progress bar to standard error (stderr) when loading assets into Dataset, by default True
sep (str, optional) – Delimiter to use when constructing a key for a query, by default ‘.’
csv_kwargs (dict, optional) – Additional keyword arguments passed through to the read_csv() function.
**kwargs – Additional keyword arguments are passed through to the Catalog base class.
Examples
At import time, this plugin is available in intake’s registry as esm_datastore and can be accessed with intake.open_esm_datastore():
>>> import intake
>>> url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
>>> col = intake.open_esm_datastore(url)
>>> col.df.head()
  activity_id institution_id source_id experiment_id ... variable_id grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1        ssp370 ...          pr         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1        ssp370 ...        prsn         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1        ssp370 ...         tas         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
3  AerChemMIP            BCC  BCC-ESM1        ssp370 ...      tasmax         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
4  AerChemMIP            BCC  BCC-ESM1        ssp370 ...      tasmin         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
from_df
Create catalog from the given dataframe
df (pandas.DataFrame) – catalog content that would otherwise be in a CSV file.
esm_datastore – Catalog object
keys
Get keys for the catalog entries
list – keys for the catalog entries
nunique
Count distinct observations across dataframe columns in the catalog.
>>> import intake >>> col = intake.open_esm_datastore("pangeo-cmip6.json") >>> col.nunique() activity_id 10 institution_id 23 source_id 48 experiment_id 29 member_id 86 table_id 19 variable_id 187 grid_label 7 zstore 27437 dcpp_init_year 59 dtype: int64
search
Search for entries in the catalog.
require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.
**query – keyword arguments corresponding to user’s query to execute against the dataframe.
cat (esm_datastore) – A new Catalog with a subset of the entries in this Catalog.
>>> import intake >>> col = intake.open_esm_datastore("pangeo-cmip6.json") >>> col.df.head(3) activity_id institution_id source_id ... grid_label zstore dcpp_init_year 0 AerChemMIP BCC BCC-ESM1 ... gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1... NaN 1 AerChemMIP BCC BCC-ESM1 ... gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1... NaN 2 AerChemMIP BCC BCC-ESM1 ... gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1... NaN
>>> cat = col.search(
...     source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
...     grid_label="gn",
... )
>>> cat.df.head(3)
    activity_id institution_id    source_id ... grid_label                                             zstore dcpp_init_year
260        CMIP            BCC  BCC-CSM2-MR ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i...            NaN
346        CMIP            BCC  BCC-CSM2-MR ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r2i...            NaN
401        CMIP            BCC  BCC-CSM2-MR ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r3i...            NaN
The search method also accepts compiled regular expression objects from re.compile() as patterns.
>>> import re
>>> # Let's search for variables containing "Frac" in their name
>>> pat = re.compile(r"Frac")  # Define a regular expression
>>> cat.search(variable_id=pat)
>>> cat.df.head().variable_id
0     residualFrac
1    landCoverFrac
2    landCoverFrac
3     residualFrac
4    landCoverFrac
serialize
Serialize collection/catalog to corresponding json and csv files.
name (str) – name to use when creating ESM collection json file and csv catalog.
directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory
catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file.
Notes
Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type=’file’ to save catalog as a separate CSV file.
>>> import intake >>> col = intake.open_esm_datastore("pangeo-cmip6.json") >>> col_subset = col.search( ... source_id="BCC-ESM1", ... grid_label="gn", ... table_id="Amon", ... experiment_id="historical", ... ) >>> col_subset.serialize(name="cmip6_bcc_esm1", catalog_type="file") Writing csv catalog to: cmip6_bcc_esm1.csv.gz Writing ESM collection json file to: cmip6_bcc_esm1.json
to_dataset_dict
Load catalog entries into a dictionary of xarray datasets.
zarr_kwargs (dict) – Keyword arguments to pass to open_zarr() function
cdf_kwargs (dict) – Keyword arguments to pass to open_dataset() function. If specifying chunks, the chunking is applied to each netcdf file. Therefore, chunks must refer to dimensions that are present in each netcdf file, or chunking will fail.
preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.
storage_options (dict, optional) – Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.
progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.
aggregate (bool, optional) – If False, no aggregation will be done.
dsets (dict) – A dictionary of xarray Dataset.
>>> import intake >>> col = intake.open_esm_datastore("glade-cmip6.json") >>> cat = col.search( ... source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"], ... experiment_id=["historical", "ssp585"], ... variable_id="pr", ... table_id="Amon", ... grid_label="gn", ... ) >>> dsets = cat.to_dataset_dict() >>> dsets.keys() dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn']) >>> dsets["CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn"] <xarray.Dataset> Dimensions: (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980) Coordinates: * lon (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9 * lat (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14 * time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00 * member_id (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1' Dimensions without coordinates: bnds Data variables: lat_bnds (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray> lon_bnds (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray> time_bnds (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray> pr (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
unique
Return unique values for given columns in the catalog.
columns (str, list) – name of columns for which to get unique values
info (dict) – dictionary containing count, and unique values
>>> import intake
>>> import pprint
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> uniques = col.unique(columns=["activity_id", "source_id"])
>>> pprint.pprint(uniques)
{'activity_id': {'count': 10,
                 'values': ['AerChemMIP', 'C4MIP', 'CMIP', 'DAMIP', 'DCPP',
                            'HighResMIP', 'LUMIP', 'OMIP', 'PMIP', 'ScenarioMIP']},
 'source_id': {'count': 17,
               'values': ['BCC-ESM1', 'CNRM-ESM2-1', 'E3SM-1-0', 'MIROC6',
                          'HadGEM3-GC31-LL', 'MRI-ESM2-0', 'GISS-E2-1-G-CC',
                          'CESM2-WACCM', 'NorCPM1', 'GFDL-AM4', 'GFDL-CM4',
                          'NESM3', 'ECMWF-IFS-LR', 'IPSL-CM6A-ATM-HR',
                          'NICAM16-7S', 'GFDL-CM4C192', 'MPI-ESM1-2-HR']}}
update_aggregation
Updates aggregation operations info.
attribute_name (str) – Name of attribute (column) across which to aggregate.
agg_type (str, optional) – Type of aggregation operation to apply. Valid values include: join_new, join_existing, union, by default None
options (dict, optional) – Aggregation settings that are passed as keyword arguments to concat() or merge(). For join_existing, it must contain the name of the existing dimension to use (e.g., something like {'dim': 'time'}), by default None
delete (bool, optional) – Whether to delete/remove/disable aggregation operations for a particular attribute, by default False
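A minimal sketch of how this might be called (assuming member_id is one of the aggregation attributes defined in the loaded catalog):

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> # Disable the aggregation over member_id so each ensemble member stays a separate dataset
>>> col.update_aggregation("member_id", delete=True)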
agg_columns
List of columns used to merge/concatenate compatible multiple Dataset into a single Dataset.
data_format
The data format. Valid values are netcdf and zarr. If specified, it means that all data assets in the catalog use the same data format.
df
Return pandas DataFrame.
DataFrame
format_column_name
Name of the column which contains the data format.
groupby_attrs
Dataframe columns used to determine groups of compatible datasets.
list – Columns used to determine groups of compatible datasets.
key_template
Return string template used to create catalog entry keys
str – string template used to create catalog entry keys
path_column_name
The name of the column containing the path to the asset.
variable_column_name
Name of the column that contains the variable name.
Contribution Guide
Feature requests and feedback
Report bugs
Fix bugs
Write documentation
Preparing Pull Requests
Interested in helping build intake-esm? Have code from your work that you believe others will find useful? Have a few minutes to tackle an issue?
Contributions are highly welcomed and appreciated. Every little help counts, so do not hesitate!
The following sections cover some general guidelines regarding development in intake-esm for maintainers and contributors. Nothing here is set in stone; feel free to suggest improvements or changes to the workflow.
We’d also like to hear your proposals and suggestions. Feel free to submit them as issues on intake-esm’s GitHub issue tracker and:
Explain in detail how they should work.
Keep the scope as narrow as possible. This will make it easier to implement.
Report bugs for intake-esm in the issue tracker.
If you are reporting a bug, please include:
Your operating system name and version.
Any details about your local setup that might be helpful in troubleshooting, specifically the Python interpreter version, installed libraries, and intake-esm version.
Detailed steps to reproduce the bug.
If you can write a demonstration test that currently fails but should pass (xfail), that is a very useful commit to make as well, even if you cannot fix the bug itself.
Look through the GitHub issues for bugs.
Talk to developers to find out how you can fix specific bugs.
intake-esm could always use more documentation. What exactly is needed?
More complementary documentation. Have you perhaps found something unclear?
Docstrings. There can never be too many of them.
Blog posts, articles and such – they’re all very appreciated.
You can also edit documentation files directly in the GitHub web interface, without using a local copy. This can be convenient for small fixes.
Build the documentation locally with the following command:
$ make docs
Fork the intake-esm GitHub repository.
Clone your fork locally using git, connect your repository to the upstream (main project), and create a branch:
$ git clone git@github.com:YOUR_GITHUB_USERNAME/intake-esm.git
$ cd intake-esm
$ git remote add upstream git@github.com:intake/intake-esm.git
Now, to fix a bug or add a feature, create your own branch off "master":
$ git checkout -b your-bugfix-feature-branch-name master
If you need some help with Git, follow this quick start guide: https://git.wiki.kernel.org/index.php/QuickStart
Install dependencies into a new conda environment:
$ conda env update -f ci/environment.yml
$ conda activate intake-esm-dev
Make an editable install of intake-esm by running:
$ python -m pip install -e .
Install pre-commit (https://pre-commit.com) hooks on the intake-esm repo:
$ pre-commit install
Afterwards, pre-commit will run whenever you commit.
pre-commit is a framework for managing and maintaining multi-language pre-commit hooks to ensure code style and formatting are consistent.
Now you have an environment called intake-esm-dev that you can work in. You’ll need to activate it again after closing your terminal or restarting your system.
(Optional) Run all the tests
Now running tests is as simple as issuing this command:
$ pytest --cov=./
This command will run tests via the pytest tool.
Commit and push once your tests pass and you are happy with your change(s):
When committing, pre-commit will re-format the files if necessary.
$ git commit -a -m "<commit message>"
$ git push -u
Finally, submit a pull request through the GitHub website using this data:
head-fork: YOUR_GITHUB_USERNAME/intake-esm
compare: your-branch-name
base-fork: intake/intake-esm
base: master # if it's a bugfix or feature
(full changelog)
Fix memory error when computing unique values #313 (@andersy005)
📦 Drop support for Python 3.6 #311 (@andersy005)
⬆️ Upgrade dependencies & pin versions in CI environment #314 (@andersy005)
💚 Fix failing upstream-dev CI #310 (@andersy005)
Update MPI catalogs for MISTRAL #308 (@aaronspring)
(GitHub contributors page for this release)
@aaronspring | @andersy005 | @jbusecke
🐛 Disable _requested_variables for single variable assets #306 (@andersy005)
Update changelog in preparation for new release #307 (@andersy005)
Use github-activity to update list of contributors #302 (@andersy005)
Add nbqa & Update prettier commit hooks #300 (@andersy005)
Update pre-commit and GH actions #299 (@andersy005)
@andersy005 | @dcherian | @jbusecke | @naomi-henderson | @Recalculate
✨ Support multiple variable assets/files. (GH#287) @andersy005
✨ Add utility function for printing version information. (GH#284) @andersy005
💥 Remove unnecessary logging bits. (GH#297) @andersy005
✔️ Fix test failures. (GH#280) @andersy005
Fix TypeError bug in .search() method when using wildcard and regular expressions. (GH#285) @andersy005
Use file like object when dealing with netcdf in the cloud. (GH#292) @andersy005
📚 Fix ReadtheDocs documentation builds. (GH#286) @andersy005
📚 Migrate docs from restructured text to markdown via myst-parsers. (GH#296) @andersy005
🔨 Refactor documentation contents & add new notebooks. (GH#298) @andersy005
Fix import errors due to intake/intake#526. (GH#282) @andersy005
Migrate CI from CircleCI to GitHub Actions. (GH#283) @andersy005
Use mamba to speed up CI testing. (GH#293) @andersy005
Enable dependabot updates. (GH#294) @andersy005
Test against Python 3.9. (GH#295) @andersy005
@andersy005 | @dcherian | @jbusecke | @jukent | @sherimickelson
Support regular expression objects in search() (GH#236) @andersy005
Support wildcard expressions in search() (GH#259) @andersy005
Expose attributes used when aggregating/combining datasets (GH#268) @andersy005
Support turning aggregations off (GH#269) @andersy005
Improve error messages (GH#270) @andersy005
Expose aggregations options passed to xarray during datasets aggregation (GH#272) @andersy005
Reset _entries dict after updating aggregations (GH#274) @andersy005
Update to_dataset_dict() docstring to inform users on how cdf_kwargs argument is used in regards to chunking (GH#278) @bonnland
Update pre-commit hooks & GitHub actions (GH#260) @andersy005
Update badges (GH#258) @andersy005
Update upstream environment (GH#263) @andersy005
Refactor search functionality into a standalone module (GH#267) @andersy005
Fix dask/concurrent.futures parallelism (GH#271) @andersy005
Increase test coverage to ~100% (GH#273) @andersy005
Bump minimum required versions (GH#275) @andersy005
@andersy005 | @bonnland | @dcherian | @jeffdlb | @jukent | @kmpaul | @markusritschel | @martindurant | @matt-long
Add df property setter (GH#247) @andersy005
Use Pandas sphinx theme (GH#244) @andersy005
Update documentation tutorial (GH#252) @andersy005 & @charlesbluca
Fix anti-patterns and other bug risks (GH#251) @andersy005
Sync with intake’s Entry unification (GH#249) @andersy005
@andersy005 | @jhamman | @martindurant
Provide informative message/warnings from empty queries. (GH#235) @andersy005
Replace tqdm progressbar with fastprogress. (GH#238) @andersy005
Add catalog_file attribute to esm_datastore class. (GH#240) @andersy005
@andersy005 | @bonnland | @dcherian | @jbusecke | @jeffdlb | @kmpaul | @markusritschel
Add html representation for the catalog object. (GH#229) @andersy005
Move logic for assets aggregation into ESMGroupDataSource() and add a few basic dict-like methods (keys(), len(), getitem(), contains()) to the catalog object. (GH#194) @andersy005 & @jhamman & @kmpaul
Support columns with iterables in unique() and nunique(). (GH#223) @andersy005
Revert back to using concurrent.futures to address failures due to dask’s distributed scheduler. (GH#225) & (GH#226)
Increase test coverage. (GH#222) @andersy005
@andersy005 | @bonnland | @dcherian | @jbusecke | @jhamman | @kmpaul | @sherimickelson
Support single file catalogs. (GH#195) @bonnland
Add progressbar argument to to_dataset_dict(). This allows the user to override the default progressbar value used during the class instantiation. (GH#204) @andersy005
Enhanced search: enforce query criteria via the require_all_on argument of the search() method. (GH#202) & (GH#207) & (GH#209) @andersy005 & @jbusecke
Support relative paths for catalog files. (GH#208) @andersy005
Use raw path if protocol is None. (GH#210) @andersy005
GitHub Action to publish package to PyPI on release. (GH#190) @andersy005
Remove unnecessary inheritance. (GH#193) @andersy005
Update linting GitHub action to run on all pull requests. (GH#196) @andersy005
@andersy005 | @bonnland | @dcherian | @jbusecke | @jhamman | @kmpaul
Add optional preprocess argument to to_dataset_dict() (GH#155) @matt-long
Allow users to disable dataset aggregations by passing aggregate=False to to_dataset_dict() (GH#164) @matt-long
Avoid manipulating dataset coordinates by using data_vars=varname when concatenating datasets via xarray's concat() (GH#174) @andersy005
Support loading netCDF assets from OPeNDAP endpoints (GH#176) @andersy005
Add serialize() method to serialize collection/catalog (GH#179) @andersy005
Allow passing extra storage options to the backend file system via to_dataset_dict() (GH#180) @bonnland
Provide informational messages to the user via Logging module (GH#186) @andersy005
Remove the caching option (GH#158) @matt-long
Preserve encoding when aggregating datasets (GH#161) @matt-long
Sort aggregations to make sure join_existing is always done before join_new (GH#171) @andersy005
Add example for preprocessing function (GH#168) @jbusecke
Add FAQ style document to documentation (GH#182) & (GH#177) @andersy005 & @jhamman
Simplify group loading by using concurrent.futures (GH#185) @andersy005
@andersy005 | @bonnland | @dcherian | @jbusecke | @jhamman | @matt-long | @naomi-henderson | @Recalculate | @sebasblancogonz
Rewrite intake-esm’s core based on the esm-collection-spec Earth System Model Collection specification (GH#135) @andersy005, @matt-long, @rabernat
Replaced intake_esm.core.esm_metadatastore with intake_esm.core.esm_datastore; see the API reference for more details.
intake-esm won’t build collection catalogs anymore. intake-esm now expects an ESM collection JSON file as input. This JSON should conform to the Earth System Model Collection specification.
@aaronspring | @andersy005 | @bonnland | @dcherian | @n-henderson | @naomi-henderson | @rabernat
Add mistral data holdings to intake-esm-datastore (GH#133) @aaronspring
Add support for NA-CORDEX data holdings. (GH#115) @jukent
Replace .csv with netCDF as serialization format when saving the built collection to disk. With netCDF, we can record very useful information into the global attributes of the netCDF dataset. (GH#119) @andersy005
Add string representation of ESMMetadataStoreCatalog object (GH#122) @andersy005
Automatically build missing collections by calling esm_metadatastore(collection_name="GLADE-CMIP5") when the specified collection is part of the curated collections in intake-esm-datastore. (GH#124) @andersy005
In [1]: import intake
In [2]: col = intake.open_esm_metadatastore(collection_name="GLADE-CMIP5")
In [3]: # if "GLADE-CMIP5" collection isn't built already, the above is equivalent to:
In [4]: col = intake.open_esm_metadatastore(collection_input_definition="GLADE-CMIP5")
Revert back to using official DRS attributes when building CMIP5 and CMIP6 collections. (GH#126) @andersy005
Add .df property for interfacing with the built collection via a dataframe, to maintain backwards compatibility. (GH#127) @andersy005
Add unique() and nunique() methods for summarizing count and unique values in a collection. (GH#128) @andersy005
In [1]: import intake
In [2]: col = intake.open_esm_metadatastore(collection_name="GLADE-CMIP5")
In [3]: col
Out[3]: GLADE-CMIP5 collection catalogue with 615853 entries:
	> 3 resource(s)
	> 1 resource_type(s)
	> 1 direct_access(s)
	> 1 activity(s)
	> 218 ensemble_member(s)
	> 51 experiment(s)
	> 312093 file_basename(s)
	> 615853 file_fullpath(s)
	> 6 frequency(s)
	> 25 institute(s)
	> 15 mip_table(s)
	> 53 model(s)
	> 7 modeling_realm(s)
	> 3 product(s)
	> 9121 temporal_subset(s)
	> 454 variable(s)
	> 489 version(s)
In [4]: col.nunique()
resource                 3
resource_type            1
direct_access            1
activity                 1
ensemble_member        218
experiment              51
file_basename       312093
file_fullpath       615853
frequency                6
institute               25
mip_table               15
model                   53
modeling_realm           7
product                  3
temporal_subset       9121
variable               454
version                489
dtype: int64
In [5]: col.unique(columns=['frequency', 'modeling_realm'])
{'frequency': {'count': 6, 'values': ['mon', 'day', '6hr', 'yr', '3hr', 'fx']},
 'modeling_realm': {'count': 7,
                    'values': ['atmos', 'land', 'ocean', 'seaIce', 'ocnBgchem',
                               'landIce', 'aerosol']}}
For CMIP6, extract grid_label from directory path instead of file name. (GH#127) @andersy005
Support building collections using inputs from intake-esm-datastore repository. (GH#79) @andersy005
Ensure that requested files are available locally before loading data into xarray datasets. (GH#82) @andersy005 and @matt-long
Split collection definitions out of config. (GH#83) @matt-long
Add intake-esm-builder, a CLI tool for building collection from the command line. (GH#89) @andersy005
Add support for CESM-LENS data holdings residing in AWS S3. (GH#98) @andersy005
Sort collection upon creation according to order-by-columns, pass urlpath through stack for use in parsing collection filenames (GH#100) @pbranson
Fix bug in _list_files_hsi() to return list instead of filter object. (GH#81) @matt-long and @andersy005
cesm._get_file_attrs fixed to break loop when longest stream is matched. (GH#80) @matt-long
Restore non_dim_coords to data variables all the time. (GH#90) @andersy005
Fix bug in intake_esm/cesm.py that caused intake-esm to exclude hourly (1hr, 6hr, etc..) CESM-LE data. (GH#110) @andersy005
Fix bugs in intake_esm/cmip.py that caused improper regular expression matching for table_id and grid_label. (GH#113) & (GH#111) @naomi-henderson and @andersy005
Refactor existing functionality to make intake-esm robust and extensible. (GH#77) @andersy005
Add aggregate._override_coords function to override dim coordinates except time in case there’s floating point precision difference. (GH#108) @andersy005
Fix CESM-LE ice component peculiarities that caused intake-esm to load data improperly. The fix separates variables for the ice component into two separate components:
ice_sh: for the southern hemisphere
ice_nh: for the northern hemisphere
(GH#114) @andersy005
Add implementation for The Gridded Meteorological Ensemble Tool (GMET) data holdings (GH#61) @andersy005
Allow users to specify exclude_dirs for CMIP collections (GH#63) & (GH#62) @andersy005
Keep CMIP6 tracking_id in merge_keys (GH#67) @andersy005
Add implementation for ERA5 datasets (GH#68) @andersy005
Add implementations for CMIPCollection and CMIPSource (GH#38) @andersy005
Add support for CMIP6 data (GH#46) @andersy005
Add implementation for The Max Planck Institute Grand Ensemble (MPI-GE) data holdings (GH#52) & (GH#51) @aaronspring and @andersy005
Return dictionary of datasets all the time for consistency (GH#56) @andersy005
Include multiple netcdf files in same subdirectory (GH#55) & (GH#54) @naomi-henderson and @andersy005
Allow CMIP integration (GH#35) @andersy005
Fix bug on build catalog and move exclude_dirs to locations (GH#33) @matt-long
Change logger, update dev-environment dependencies, and fix formatting in input.yml (GH#31) @matt-long
Update CircleCI workflow (GH#32) @andersy005
Rename package from intake-cesm to intake-esm (GH#34) @andersy005