Understanding intake-ESM keys and how to use them#
Intake-ESM aggregates your datasets using keys. Here, we dig into what exactly these keys are, how they are constructed, and how you can change them. Understanding how this works will help you control how your datasets are merged together, and remove the mystery behind these strings of text.
Import packages and spin up a Dask cluster#
We start by importing intake and a Client from dask.distributed (the distributed package):
import intake
from distributed import Client
client = Client()
Investigate a CMIP6 catalog#
Let’s start with a sample CMIP6 catalog! This is a fairly large dataset.
url = "https://raw.githubusercontent.com/intake/intake-esm/main/tutorial-catalogs/GOOGLE-CMIP6.json"
catalog = intake.open_esm_datastore(url)
catalog.df.head()
| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CMIP | IPSL | IPSL-CM6A-LR | historical | r2i1p1f1 | Amon | va | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 1 | CMIP | IPSL | IPSL-CM6A-LR | historical | r2i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 2 | CMIP | IPSL | IPSL-CM6A-LR | historical | r8i1p1f1 | Oyr | o2 | gn | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 3 | CMIP | IPSL | IPSL-CM6A-LR | historical | r30i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 4 | CMIP | IPSL | IPSL-CM6A-LR | historical | r30i1p1f1 | Amon | va | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
Typically, the next step is to search the catalog and load your datasets using to_dataset_dict() or to_datatree():
catalog_subset = catalog.search(variable_id='ua')
dsets = catalog_subset.to_dataset_dict()
print(dsets)
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/latest/intake_esm/source.py:109: UserWarning: The specified chunks separate the stored chunks along dimension "time" starting at index 1. This could degrade performance. Instead, consider rechunking after loading.
ds = xr.open_dataset(url, **xarray_open_kwargs)
/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/latest/intake_esm/source.py:109: UserWarning: The specified chunks separate the stored chunks along dimension "plev" starting at index 1. This could degrade performance. Instead, consider rechunking after loading.
ds = xr.open_dataset(url, **xarray_open_kwargs)
{'CMIP.CCCma.CanESM5.historical.Amon.gn': <xarray.Dataset> Size: 80GB
Dimensions: (member_id: 65, dcpp_init_year: 1, time: 1980, plev: 19,
lat: 64, lon: 128, bnds: 2)
Coordinates:
* member_id (member_id) object 520B 'r10i1p1f1' ... 'r9i1p2f1'
* dcpp_init_year (dcpp_init_year) float64 8B nan
* time (time) object 16kB 1850-01-16 12:00:00 ... 2014-12-16 12:...
* plev (plev) float64 152B 1e+05 9.25e+04 8.5e+04 ... 500.0 100.0
* lat (lat) float64 512B -87.86 -85.1 -82.31 ... 82.31 85.1 87.86
* lon (lon) float64 1kB 0.0 2.812 5.625 ... 351.6 354.4 357.2
lat_bnds (lat, bnds) float64 1kB dask.array<chunksize=(64, 2), meta=np.ndarray>
lon_bnds (lon, bnds) float64 2kB dask.array<chunksize=(128, 2), meta=np.ndarray>
time_bnds (time, bnds) object 32kB dask.array<chunksize=(1980, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
ua (member_id, dcpp_init_year, time, plev, lat, lon) float32 80GB dask.array<chunksize=(1, 1, 174, 19, 64, 128), meta=np.ndarray>
Attributes: (12/46)
Conventions: CF-1.7 CMIP-6.2
YMDH_branch_time_in_child: 1850:01:01:00
activity_id: CMIP
branch_method: Spin-up documentation
branch_time_in_child: 0.0
contact: ec.cccma.info-info.ccmac.ec@canada.ca
... ...
intake_esm_attrs:table_id: Amon
intake_esm_attrs:variable_id: ua
intake_esm_attrs:grid_label: gn
intake_esm_attrs:version: 20190429
intake_esm_attrs:_data_format_: zarr
intake_esm_dataset_key: CMIP.CCCma.CanESM5.historical.Amon.gn, 'CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr': <xarray.Dataset> Size: 99GB
Dimensions: (member_id: 32, dcpp_init_year: 1, time: 1980, plev: 19,
lat: 143, lon: 144, axis_nbounds: 2)
Coordinates:
* member_id (member_id) object 256B 'r10i1p1f1' ... 'r9i1p1f1'
* dcpp_init_year (dcpp_init_year) float64 8B nan
* time (time) datetime64[ns] 16kB 1850-01-16T12:00:00 ... 2014-1...
* plev (plev) float32 76B 1e+05 9.25e+04 8.5e+04 ... 500.0 100.0
* lat (lat) float32 572B -90.0 -88.73 -87.46 ... 87.46 88.73 90.0
* lon (lon) float32 576B 0.0 2.5 5.0 7.5 ... 352.5 355.0 357.5
time_bounds (time, axis_nbounds) datetime64[ns] 32kB dask.array<chunksize=(1980, 2), meta=np.ndarray>
Dimensions without coordinates: axis_nbounds
Data variables:
ua (member_id, dcpp_init_year, time, plev, lat, lon) float32 99GB dask.array<chunksize=(1, 1, 1, 1, 143, 144), meta=np.ndarray>
Attributes: (12/48)
Conventions: CF-1.7 CMIP-6.2
EXPID: historical
activity_id: CMIP
branch_method: standard
branch_time_in_child: 0.0
contact: ipsl-cmip6@listes.ipsl.fr
... ...
intake_esm_attrs:variable_id: ua
intake_esm_attrs:grid_label: gr
intake_esm_attrs:_data_format_: zarr
NCO: "4.6.0"
variant_info: Restart from another point in piControl...
intake_esm_dataset_key: CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr}
Investigating the keys#
The keys for these datasets include some helpful information - but you might be wondering what this all means and where this text comes from…
print(list(dsets))
['CMIP.CCCma.CanESM5.historical.Amon.gn', 'CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr']
When intake-esm aggregates these datasets, it uses some pre-determined metadata, defined in the catalog file. We can look at which fields are used for aggregation, or merging of the datasets, using the following:
print(catalog.esmcat.aggregation_control.groupby_attrs)
['activity_id', 'institution_id', 'source_id', 'experiment_id', 'table_id', 'grid_label']
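To make the link between these fields and the key strings concrete, here is a minimal pure-Python sketch. It is not intake-esm's actual implementation: it simply assumes that a key is the row's value for each groupby attribute, joined with a "." separator, which matches the keys printed above. The `build_key` helper is hypothetical, and the sample `row` is copied from the first row of the catalog table (with zstore left out).

```python
# Illustrative sketch only: mimics how the printed keys appear to be assembled.
groupby_attrs = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]

# One catalog row, copied from the first row of the table shown earlier.
row = {
    "activity_id": "CMIP",
    "institution_id": "IPSL",
    "source_id": "IPSL-CM6A-LR",
    "experiment_id": "historical",
    "member_id": "r2i1p1f1",
    "table_id": "Amon",
    "variable_id": "va",
    "grid_label": "gr",
}


def build_key(row, attrs, sep="."):
    """Join the row's values for the given attributes (hypothetical helper)."""
    return sep.join(str(row[attr]) for attr in attrs)


key = build_key(row, groupby_attrs)
print(key)
```

Note that member_id and variable_id are present in the row but do not appear in the key, because they are not listed in the groupby attributes.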
Let’s go back to our data catalog… and find these fields. You’ll notice they are all column labels! These are key components of the metadata.
catalog_subset.df
| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CMIP | IPSL | IPSL-CM6A-LR | historical | r2i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 1 | CMIP | IPSL | IPSL-CM6A-LR | historical | r30i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 2 | CMIP | IPSL | IPSL-CM6A-LR | historical | r8i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 3 | CMIP | IPSL | IPSL-CM6A-LR | historical | r29i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 4 | CMIP | IPSL | IPSL-CM6A-LR | historical | r3i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 92 | CMIP | CCCma | CanESM5 | historical | r31i1p2f1 | Amon | ua | gn | gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... | NaN | 20190429 |
| 93 | CMIP | CCCma | CanESM5 | historical | r33i1p2f1 | Amon | ua | gn | gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... | NaN | 20190429 |
| 94 | CMIP | CCCma | CanESM5 | historical | r6i1p2f1 | Amon | ua | gn | gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... | NaN | 20190429 |
| 95 | CMIP | CCCma | CanESM5 | historical | r6i1p1f1 | Amon | ua | gn | gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... | NaN | 20190429 |
| 96 | CMIP | IPSL | IPSL-CM6A-LR | historical | r32i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20190802 |
97 rows × 11 columns
Using keys_info()#
These groupby attributes are columns in our catalog! This means that the datasets will be aggregated using the hierarchy:
activity_id --> institution_id --> source_id --> experiment_id --> table_id --> grid_label
A clearer way to look at these aggregation variables is the .keys_info() method of the catalog:
catalog_subset.keys_info()
| key | activity_id | institution_id | source_id | experiment_id | table_id | grid_label |
|---|---|---|---|---|---|---|
| CMIP.CCCma.CanESM5.historical.Amon.gn | CMIP | CCCma | CanESM5 | historical | Amon | gn |
| CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr | CMIP | IPSL | IPSL-CM6A-LR | historical | Amon | gr |
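The table above can be mimicked with a few lines of plain Python. The sketch below is an illustration under stated assumptions rather than what keys_info() does internally: it takes four rows from the catalog subset shown earlier, builds each row's key by joining the groupby attribute values with ".", and collects the member_ids that end up under each key.

```python
from collections import defaultdict

groupby_attrs = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]

# Fields shared by every row in this small sample.
common = {"activity_id": "CMIP", "experiment_id": "historical", "table_id": "Amon"}
ipsl = {**common, "institution_id": "IPSL", "source_id": "IPSL-CM6A-LR", "grid_label": "gr"}
cccma = {**common, "institution_id": "CCCma", "source_id": "CanESM5", "grid_label": "gn"}

# Four rows taken from the catalog subset table; only member_id differs within a model.
rows = [
    {**ipsl, "member_id": "r2i1p1f1"},
    {**ipsl, "member_id": "r30i1p1f1"},
    {**cccma, "member_id": "r31i1p2f1"},
    {**cccma, "member_id": "r6i1p1f1"},
]

# Rows that produce the same key are the ones that get merged together.
members_by_key = defaultdict(list)
for row in rows:
    key = ".".join(row[attr] for attr in groupby_attrs)
    members_by_key[key].append(row["member_id"])

for key, members in members_by_key.items():
    print(key, members)
```

Four rows collapse into two keys here, in the same way that the 97 rows of the subset collapse into the two keys listed by keys_info().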
Change our groupby/aggregation controls#
If we instead want to aggregate our datasets at the member_id level, we can extend the groupby attributes as follows:
original_groupby_attributes = catalog.esmcat.aggregation_control.groupby_attrs
new_groupby_attributes = original_groupby_attributes + ["member_id"]
print(new_groupby_attributes)
['activity_id', 'institution_id', 'source_id', 'experiment_id', 'table_id', 'grid_label', 'member_id']
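One detail of the cell above is worth spelling out: `+` builds a new list, so the list of original attributes keeps its six entries. The plain-Python sketch below contrasts this with an in-place append(). Whether intake-esm returns the underlying list or a copy is not something this page establishes, so treating the attributes as a list to copy rather than mutate is a cautious assumption.

```python
original = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]

# `+` returns a brand-new list, so `original` still has six entries here.
extended = original + ["member_id"]
print(len(original), len(extended))

# append() mutates in place: every name bound to the same list sees the change.
alias = original  # a second name for the same list object
alias.append("member_id")
print(original == extended, original is extended)
```

After the append, the two lists hold equal contents but remain separate objects, which is why building the new attributes with `+` is the safer pattern when the original list should stay intact.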
Now that we have our new groupby attributes, we can assign these to our catalog subset.
catalog_subset.esmcat.aggregation_control.groupby_attrs = new_groupby_attributes
Let’s check our new keys! You’ll notice we now have 97 keys, aggregated on
activity_id --> institution_id --> source_id --> experiment_id --> table_id --> grid_label --> member_id
catalog_subset.keys_info()
| key | activity_id | institution_id | source_id | experiment_id | table_id | grid_label | member_id |
|---|---|---|---|---|---|---|---|
| CMIP.CCCma.CanESM5.historical.Amon.gn.r10i1p1f1 | CMIP | CCCma | CanESM5 | historical | Amon | gn | r10i1p1f1 |
| CMIP.CCCma.CanESM5.historical.Amon.gn.r10i1p2f1 | CMIP | CCCma | CanESM5 | historical | Amon | gn | r10i1p2f1 |
| CMIP.CCCma.CanESM5.historical.Amon.gn.r11i1p1f1 | CMIP | CCCma | CanESM5 | historical | Amon | gn | r11i1p1f1 |
| CMIP.CCCma.CanESM5.historical.Amon.gn.r11i1p2f1 | CMIP | CCCma | CanESM5 | historical | Amon | gn | r11i1p2f1 |
| CMIP.CCCma.CanESM5.historical.Amon.gn.r12i1p1f1 | CMIP | CCCma | CanESM5 | historical | Amon | gn | r12i1p1f1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr.r5i1p1f1 | CMIP | IPSL | IPSL-CM6A-LR | historical | Amon | gr | r5i1p1f1 |
| CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr.r6i1p1f1 | CMIP | IPSL | IPSL-CM6A-LR | historical | Amon | gr | r6i1p1f1 |
| CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr.r7i1p1f1 | CMIP | IPSL | IPSL-CM6A-LR | historical | Amon | gr | r7i1p1f1 |
| CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr.r8i1p1f1 | CMIP | IPSL | IPSL-CM6A-LR | historical | Amon | gr | r8i1p1f1 |
| CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr.r9i1p1f1 | CMIP | IPSL | IPSL-CM6A-LR | historical | Amon | gr | r9i1p1f1 |
97 rows × 7 columns
Load our datasets with the new keys#
We can now load the regrouped datasets into a dictionary of datasets using:
dsets = catalog_subset.to_dataset_dict()
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label.member_id'
/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/latest/intake_esm/source.py:109: UserWarning: The specified chunks separate the stored chunks along dimension "time" starting at index 1. This could degrade performance. Instead, consider rechunking after loading.
ds = xr.open_dataset(url, **xarray_open_kwargs)
/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/latest/intake_esm/source.py:109: UserWarning: The specified chunks separate the stored chunks along dimension "plev" starting at index 1. This could degrade performance. Instead, consider rechunking after loading.
ds = xr.open_dataset(url, **xarray_open_kwargs)
If we only want the first key, we can grab it from the list of keys. Notice how we now have our member_id at the end!
first_key = catalog_subset.keys()[0]
first_key
'CMIP.CCCma.CanESM5.historical.Amon.gn.r10i1p1f1'
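Going the other way, a key can be split back into its fields. This is a minimal sketch, assuming the "." separator seen in the keys above and assuming that no field value itself contains a "."; the `key_fields` name is only for illustration.

```python
groupby_attrs = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
    "member_id",
]
first_key = "CMIP.CCCma.CanESM5.historical.Amon.gn.r10i1p1f1"

values = first_key.split(".")

# Guard against field values that themselves contain the separator.
if len(values) != len(groupby_attrs):
    raise ValueError("key does not split cleanly into the groupby attributes")

# Pair each attribute name with its value, in groupby order.
key_fields = dict(zip(groupby_attrs, values))
print(key_fields["source_id"], key_fields["member_id"])
```

Because the order of the values follows the order of the groupby attributes, appending member_id to the list is what places it at the end of the key.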
We can then use the .to_dask() method to load that dataset into our notebook.
ds = catalog_subset[first_key].to_dask()
ds
<xarray.Dataset> Size: 1GB
Dimensions: (member_id: 1, dcpp_init_year: 1, time: 1980, plev: 19,
lat: 64, lon: 128, bnds: 2)
Coordinates:
* member_id (member_id) object 8B 'r10i1p1f1'
* dcpp_init_year (dcpp_init_year) float64 8B nan
* time (time) object 16kB 1850-01-16 12:00:00 ... 2014-12-16 12:...
* plev (plev) float64 152B 1e+05 9.25e+04 8.5e+04 ... 500.0 100.0
* lat (lat) float64 512B -87.86 -85.1 -82.31 ... 82.31 85.1 87.86
* lon (lon) float64 1kB 0.0 2.812 5.625 ... 351.6 354.4 357.2
lat_bnds (lat, bnds) float64 1kB dask.array<chunksize=(64, 2), meta=np.ndarray>
lon_bnds (lon, bnds) float64 2kB dask.array<chunksize=(128, 2), meta=np.ndarray>
time_bnds (time, bnds) object 32kB dask.array<chunksize=(1980, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
ua (member_id, dcpp_init_year, time, plev, lat, lon) float32 1GB dask.array<chunksize=(1, 1, 180, 19, 64, 128), meta=np.ndarray>
Attributes: (12/67)
CCCma_model_hash: 55f484f90aff0e32c5a8e92a42c6b9ae7ffe6224
CCCma_parent_runid: rc3.1-pictrl
CCCma_pycmor_hash: 33c30511acc319a98240633965a04ca99c26427e
CCCma_runid: rc3.1-his10
Conventions: CF-1.7 CMIP-6.2
YMDH_branch_time_in_child: 1850:01:01:00
... ...
intake_esm_attrs:variable_id: ua
intake_esm_attrs:grid_label: gn
intake_esm_attrs:zstore: gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/his...
intake_esm_attrs:version: 20190429
intake_esm_attrs:_data_format_: zarr
intake_esm_dataset_key: CMIP.CCCma.CanESM5.historical.Amon.gn.r...
Compare this dataset with the original catalog configuration#
Compare this to our original catalog, which aggregated one level higher, placing all of the member_ids into the same dataset.
Note
Notice how our metadata now mentions there are 65 member_ids in this dataset, compared to 1 in the previous dataset.
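The dataset sizes reported in the reprs follow directly from the dimensions, which gives a quick way to sanity-check what a change in grouping does. The arithmetic below is a rough sketch that counts only the ua variable (float32, so 4 bytes per value) for CanESM5 and ignores coordinates and bounds; the dimension sizes are copied from the reprs on this page.

```python
import math

# Dimension sizes copied from the CanESM5 dataset reprs in this document.
dims = {"dcpp_init_year": 1, "time": 1980, "plev": 19, "lat": 64, "lon": 128}
bytes_per_value = 4  # ua is stored as float32
n_members = 65  # member_ids grouped under the original CanESM5 key

# Size of ua for a single member_id, then for all members combined.
bytes_per_member = math.prod(dims.values()) * bytes_per_value
bytes_all_members = bytes_per_member * n_members

print(f"one member:  {bytes_per_member / 1e9:.2f} GB")
print(f"{n_members} members: {bytes_all_members / 1e9:.2f} GB")
```

The results are consistent with the roughly 1GB reported for the single-member dataset and the 80GB reported for ua when all 65 members share one key.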
original_ds = catalog[catalog.keys()[0]].to_dask()
original_ds
/home/docs/checkouts/readthedocs.org/user_builds/intake-esm/checkouts/latest/intake_esm/source.py:308: FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlapping variables with the same name. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set compat explicitly.
self._ds = xr.combine_by_coords(
<xarray.Dataset> Size: 160GB
Dimensions: (member_id: 65, dcpp_init_year: 1, time: 1980, plev: 19,
lat: 64, lon: 128, bnds: 2)
Coordinates:
* member_id (member_id) object 520B 'r10i1p1f1' ... 'r9i1p2f1'
* dcpp_init_year (dcpp_init_year) float64 8B nan
* time (time) object 16kB 1850-01-16 12:00:00 ... 2014-12-16 12:...
* plev (plev) float64 152B 1e+05 9.25e+04 8.5e+04 ... 500.0 100.0
* lat (lat) float64 512B -87.86 -85.1 -82.31 ... 82.31 85.1 87.86
* lon (lon) float64 1kB 0.0 2.812 5.625 ... 351.6 354.4 357.2
lat_bnds (lat, bnds) float64 1kB dask.array<chunksize=(64, 2), meta=np.ndarray>
lon_bnds (lon, bnds) float64 2kB dask.array<chunksize=(128, 2), meta=np.ndarray>
time_bnds (time, bnds) object 32kB dask.array<chunksize=(1980, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
ua (member_id, dcpp_init_year, time, plev, lat, lon) float32 80GB dask.array<chunksize=(1, 1, 174, 19, 64, 128), meta=np.ndarray>
va (member_id, dcpp_init_year, time, plev, lat, lon) float32 80GB dask.array<chunksize=(1, 1, 170, 19, 64, 128), meta=np.ndarray>
Attributes: (12/43)
Conventions: CF-1.7 CMIP-6.2
YMDH_branch_time_in_child: 1850:01:01:00
activity_id: CMIP
branch_method: Spin-up documentation
branch_time_in_child: 0.0
contact: ec.cccma.info-info.ccmac.ec@canada.ca
... ...
intake_esm_attrs:experiment_id: historical
intake_esm_attrs:table_id: Amon
intake_esm_attrs:grid_label: gn
intake_esm_attrs:version: 20190429
intake_esm_attrs:_data_format_: zarr
intake_esm_dataset_key: CMIP.CCCma.CanESM5.historical.Amon.gn
Conclusion#
These intake-esm keys can be a bit abstract when you first access your data, but understanding them is essential to understanding how intake-esm aggregates your data, and how you can change these aggregation controls for your desired analysis! We hope this helped demystify these keys.