Intake-esm is a data cataloging utility built on top of intake, pandas, and xarray. Intake-esm aims to facilitate:
the discovery of Earth's climate and weather datasets.
the ingestion of these datasets into xarray dataset containers.
Its basic usage is shown below. To begin, let's import intake:
import intake
At import time, the intake-esm plugin is registered in intake's registry as esm_datastore and can be accessed with the intake.open_esm_datastore() function. For demonstration purposes, we are going to use the catalog for the Community Earth System Model Large Ensemble (CESM LENS) dataset, publicly available in Amazon S3.
Note
You can learn more about the CESM LENS dataset in AWS S3 here
You can load data from an ESM catalog by providing the URL to a valid ESM catalog file:
catalog_url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
col = intake.open_esm_datastore(catalog_url)
col
aws-cesm1-le catalog with 56 dataset(s) from 429 asset(s):
The summary above tells us that this catalog contains 56 datasets built from 429 data assets. We can get more information on the individual data assets by inspecting the underlying dataframe, which is created when the catalog is initialized:
col.df.head()
To get the unique values for given columns in the catalog, intake-esm provides a unique() method. This method returns a dictionary containing the count and unique values for each requested column:
col.unique(columns=["component", "frequency", "experiment"])
{'component': {'count': 5, 'values': ['atm', 'ice_nh', 'ice_sh', 'lnd', 'ocn']},
 'frequency': {'count': 6,
               'values': ['daily', 'hourly6-1990-2005', 'hourly6-2026-2035',
                          'hourly6-2071-2080', 'monthly', 'static']},
 'experiment': {'count': 4, 'values': ['20C', 'CTRL', 'HIST', 'RCP85']}}
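Under the hood, a summary like this can be reproduced with ordinary pandas operations on the catalog's dataframe. A minimal sketch, using a small toy dataframe standing in for col.df (the column names and values below are illustrative, not the full catalog):

```python
import pandas as pd

# Toy stand-in for the catalog's underlying dataframe (col.df)
df = pd.DataFrame(
    {
        "component": ["atm", "lnd", "ocn", "lnd"],
        "experiment": ["20C", "20C", "RCP85", "HIST"],
    }
)

# Rough equivalent of col.unique(columns=[...]): count and list distinct values
summary = {
    col: {"count": df[col].nunique(), "values": sorted(df[col].unique())}
    for col in ["component", "experiment"]
}
print(summary)
```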
The search() method allows the user to perform a query on a catalog using keyword arguments. The keyword argument names must be the names of the columns in the catalog. The search method returns a subset of the catalog with all the entries that match the provided query.
By default, the search() method looks for exact matches:
col_subset = col.search(
    component=["ice_nh", "lnd"],
    frequency=["monthly"],
    experiment=["20C", "HIST"],
)
col_subset.df
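Conceptually, this exact-match search amounts to a row-wise isin filter on the underlying dataframe, keeping rows whose value in every queried column is among the allowed values. A rough sketch of the mechanics with a toy dataframe (not intake-esm's actual implementation):

```python
import pandas as pd

# Toy stand-in for col.df
df = pd.DataFrame(
    {
        "component": ["atm", "ice_nh", "lnd", "ocn"],
        "experiment": ["20C", "20C", "HIST", "RCP85"],
    }
)

query = {"component": ["ice_nh", "lnd"], "experiment": ["20C", "HIST"]}

# A row survives only if it matches every queried column
mask = pd.Series(True, index=df.index)
for column, values in query.items():
    mask &= df[column].isin(values)

subset = df[mask]
print(subset)
```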
As pointed out earlier, the search() method looks for exact matches by default. However, with the use of wildcards and/or regular expressions, we can find all items with a particular substring in a given column:
# Find all entries with `wind` in their variable_long_name
col.search(variable_long_name="wind*").df
# Find all entries whose variable long name starts with `wind`
col.search(variable_long_name="^wind").df
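Pattern searches like these presumably reduce to regular-expression matching on the column's strings. The distinction between a substring match and an anchored match can be sketched with plain pandas string methods on toy data (not the real catalog):

```python
import pandas as pd

names = pd.Series(
    ["Wind stress", "Zonal wind", "Surface temperature", "Wind speed at 10 m"]
)

# Substring match (case-insensitive): "wind" anywhere in the name
contains_wind = names[names.str.contains("wind", case=False)]

# Anchored regex: names that *start* with "wind"
starts_with_wind = names[names.str.match("^wind", case=False)]

print(list(contains_wind))
print(list(starts_with_wind))
```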
Intake-esm implements convenience utilities for loading the query results into higher-level xarray datasets. The logic for merging/concatenating the query results into higher-level xarray datasets is specified in the input JSON file and is available under the .aggregation_info property:
col.aggregation_info
AggregationInfo(groupby_attrs=['component', 'experiment', 'frequency'], variable_column_name='variable', aggregations=[{'type': 'union', 'attribute_name': 'variable', 'options': {'compat': 'override'}}], agg_columns=['variable'], aggregation_dict={'variable': {'type': 'union', 'options': {'compat': 'override'}}})
col.aggregation_info.aggregations
[{'type': 'union', 'attribute_name': 'variable', 'options': {'compat': 'override'}}]
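The 'union' aggregation over the variable column roughly corresponds to merging single-variable datasets into one multi-variable dataset. A sketch with xarray on synthetic data (the variable names here are illustrative):

```python
import numpy as np
import xarray as xr

# Two single-variable datasets sharing coordinates, as if each catalog
# asset carried one variable
time = np.arange(3)
ds_snow = xr.Dataset({"SNOW": ("time", np.zeros(3))}, coords={"time": time})
ds_rain = xr.Dataset({"RAIN": ("time", np.ones(3))}, coords={"time": time})

# A 'union' aggregation combines the variables into a single dataset;
# compat="override" skips equality checks and picks the first variant
merged = xr.merge([ds_snow, ds_rain], compat="override")
print(list(merged.data_vars))
```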
# Dataframe columns used to determine groups of compatible datasets.
col.aggregation_info.groupby_attrs  # or col.groupby_attrs
['component', 'experiment', 'frequency']
# List of columns used to merge/concatenate multiple compatible Datasets into a single Dataset.
col.aggregation_info.agg_columns  # or col.agg_columns
['variable']
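Taken together, the groupby_attrs and agg_columns describe a familiar split-apply-combine pattern: rows sharing the groupby attributes form one dataset, and the variables within each group are combined. A sketch of that grouping with pandas on a toy catalog dataframe (the rows below are made up for illustration):

```python
import pandas as pd

# Toy catalog rows: one row per asset, one variable per row
df = pd.DataFrame(
    {
        "component": ["lnd", "lnd", "ice_nh", "ice_nh"],
        "experiment": ["20C", "20C", "HIST", "HIST"],
        "frequency": ["monthly"] * 4,
        "variable": ["SNOW", "RAIN", "aice", "hi"],
    }
)

groupby_attrs = ["component", "experiment", "frequency"]

# Each group becomes one dataset; its key joins the groupby attributes
groups = {
    ".".join(key): list(group["variable"])
    for key, group in df.groupby(groupby_attrs)
}
print(groups)
```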
To load data assets into xarray datasets, we use the to_dataset_dict() method. As its name suggests, this method returns a dictionary of aggregated xarray datasets.
dset_dicts = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
--> The keys in the returned dictionary of datasets are constructed as follows: 'component.experiment.frequency'
list(dset_dicts.keys())
['ice_nh.HIST.monthly', 'ice_nh.20C.monthly', 'lnd.HIST.monthly', 'lnd.20C.monthly']
We can access a particular dataset as follows:
ds = dset_dicts["lnd.20C.monthly"]
print(ds)
<xarray.Dataset> Dimensions: (hist_interval: 2, lat: 192, levgrnd: 15, lon: 288, member_id: 40, time: 1032) Coordinates: * lat (lat) float64 -90.0 -89.06 -88.12 ... 88.12 89.06 90.0 * lon (lon) float32 0.0 1.25 2.5 3.75 ... 356.25 357.5 358.75 * member_id (member_id) int64 1 2 3 4 5 6 7 ... 35 101 102 103 104 105 * time (time) object 1920-01-16 12:00:00 ... 2005-12-16 12:00:00 time_bounds (time, hist_interval) object dask.array<chunksize=(1032, 2), meta=np.ndarray> * levgrnd (levgrnd) float32 0.007100635 0.027925 ... 21.32647 35.17762 Dimensions without coordinates: hist_interval Data variables: FSNO (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> H2OSNO (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> QRUNOFF (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> RAIN (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> SNOW (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> SOILLIQ (member_id, time, levgrnd, lat, lon) float32 dask.array<chunksize=(1, 40, 15, 192, 288), meta=np.ndarray> SOILWATER_10CM (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray> Attributes: hostname: tcs nco_openmp_thread_number: 1 username: mudryk title: CLM History file information version: cesm1_1_1_alpha01g Conventions: CF-1.0 case_title: UNSET revision_id: $Id: histFileMod.F90 40539 2012-09-... source: Community Land Model CLM4.0 Surface_dataset: surfdata_0.9x1.25_simyr1850_c110921.nc Initial_conditions_dataset: b.e11.B20TRC5CNBDRD.f09_g16.001.clm... comment: NOTE: None of the variables are wei... PFT_physiological_constants_dataset: pft-physiology.c110425.nc NCO: 4.3.4 intake_esm_varname: FSNO\nH2OSNO\nQRUNOFF\nRAIN\nSNOW\n... intake_esm_dataset_key: lnd.20C.monthly
Let’s create a quick plot for a slice of the data:
ds.SNOW.isel(time=0, member_id=range(1, 24, 4)).plot(col="member_id", col_wrap=3, robust=True)
<xarray.plot.facetgrid.FacetGrid at 0x1657d0820>
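Beyond plotting a single time slice, a natural next step is reducing over space, e.g. an area-mean time series per ensemble member. A small self-contained xarray sketch using synthetic data with the same dimension layout as ds.SNOW (not the actual CESM output above, and without the area weighting a real analysis would need):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in shaped like ds.SNOW: (member_id, time, lat, lon)
snow = xr.DataArray(
    np.random.default_rng(0).random((2, 3, 4, 5)),
    dims=("member_id", "time", "lat", "lon"),
    name="SNOW",
)

# Unweighted spatial mean, leaving one time series per ensemble member
series = snow.mean(dim=["lat", "lon"])
print(series.dims, series.shape)
```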