Understanding intake-ESM keys and how to use them#

Intake-ESM aggregates your datasets and identifies each aggregated dataset by a key. Here, we dig into what exactly these keys are, how they are constructed, and how you can change them. Understanding how this works will help you control how your datasets are merged together, and remove the mystery behind these strings of text.

Import packages and spin up a Dask cluster#

We start by importing intake and a Client from dask.distributed.

import intake
from distributed import Client

client = Client()

Investigate a CMIP6 catalog#

Let’s start with a sample CMIP6 catalog! This is a fairly large dataset.

url = "https://raw.githubusercontent.com/intake/intake-esm/main/tutorial-catalogs/GOOGLE-CMIP6.json"
catalog = intake.open_esm_datastore(url)
catalog.df.head()
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 CMIP IPSL IPSL-CM6A-LR historical r2i1p1f1 Amon va gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
1 CMIP IPSL IPSL-CM6A-LR historical r2i1p1f1 Amon ua gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
2 CMIP IPSL IPSL-CM6A-LR historical r8i1p1f1 Oyr o2 gn gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
3 CMIP IPSL IPSL-CM6A-LR historical r30i1p1f1 Amon ua gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
4 CMIP IPSL IPSL-CM6A-LR historical r30i1p1f1 Amon va gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803

Typically, the next step is to search the catalog and load your datasets using to_dataset_dict() or to_datatree():

catalog_subset = catalog.search(variable_id='ua')
dsets = catalog_subset.to_dataset_dict()
print(dsets)
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
100.00% [2/2 00:26<00:00]
{'CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr': <xarray.Dataset>
Dimensions:         (lat: 143, lon: 144, plev: 19, time: 1980, axis_nbounds: 2,
                     member_id: 32, dcpp_init_year: 1)
Coordinates:
  * lat             (lat) float32 -90.0 -88.73 -87.46 -86.2 ... 87.46 88.73 90.0
  * lon             (lon) float32 0.0 2.5 5.0 7.5 ... 350.0 352.5 355.0 357.5
  * plev            (plev) float32 1e+05 9.25e+04 8.5e+04 ... 1e+03 500.0 100.0
  * time            (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T...
    time_bounds     (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * member_id       (member_id) object 'r10i1p1f1' 'r11i1p1f1' ... 'r9i1p1f1'
  * dcpp_init_year  (dcpp_init_year) float64 nan
Dimensions without coordinates: axis_nbounds
Data variables:
    ua              (member_id, dcpp_init_year, time, plev, lat, lon) float32 dask.array<chunksize=(1, 1, 60, 19, 143, 144), meta=np.ndarray>
Attributes: (12/48)
    Conventions:                      CF-1.7 CMIP-6.2
    EXPID:                            historical
    activity_id:                      CMIP
    branch_method:                    standard
    branch_time_in_child:             0.0
    contact:                          ipsl-cmip6@listes.ipsl.fr
    ...                               ...
    intake_esm_attrs:variable_id:     ua
    intake_esm_attrs:grid_label:      gr
    intake_esm_attrs:_data_format_:   zarr
    NCO:                              "4.6.0"
    variant_info:                     Restart from another point in piControl...
    intake_esm_dataset_key:           CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr, 'CMIP.CCCma.CanESM5.historical.Amon.gn': <xarray.Dataset>
Dimensions:         (lat: 64, bnds: 2, lon: 128, plev: 19, time: 1980,
                     member_id: 65, dcpp_init_year: 1)
Coordinates:
  * lat             (lat) float64 -87.86 -85.1 -82.31 ... 82.31 85.1 87.86
    lat_bnds        (lat, bnds) float64 dask.array<chunksize=(64, 2), meta=np.ndarray>
  * lon             (lon) float64 0.0 2.812 5.625 8.438 ... 351.6 354.4 357.2
    lon_bnds        (lon, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
  * plev            (plev) float64 1e+05 9.25e+04 8.5e+04 ... 1e+03 500.0 100.0
  * time            (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds       (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * member_id       (member_id) object 'r10i1p1f1' 'r10i1p2f1' ... 'r9i1p2f1'
  * dcpp_init_year  (dcpp_init_year) float64 nan
Dimensions without coordinates: bnds
Data variables:
    ua              (member_id, dcpp_init_year, time, plev, lat, lon) float32 dask.array<chunksize=(1, 1, 60, 19, 64, 128), meta=np.ndarray>
Attributes: (12/47)
    Conventions:                      CF-1.7 CMIP-6.2
    YMDH_branch_time_in_child:        1850:01:01:00
    activity_id:                      CMIP
    branch_method:                    Spin-up documentation
    branch_time_in_child:             0.0
    contact:                          ec.cccma.info-info.ccmac.ec@canada.ca
    ...                               ...
    intake_esm_attrs:table_id:        Amon
    intake_esm_attrs:variable_id:     ua
    intake_esm_attrs:grid_label:      gn
    intake_esm_attrs:version:         20190429
    intake_esm_attrs:_data_format_:   zarr
    intake_esm_dataset_key:           CMIP.CCCma.CanESM5.historical.Amon.gn}

Investigating the keys#

The keys for these datasets include some helpful information - but you might be wondering what this all means and where this text comes from…

print(list(dsets))
['CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr', 'CMIP.CCCma.CanESM5.historical.Amon.gn']

When intake-esm aggregates these datasets, it uses pre-determined metadata fields defined in the catalog file. We can look at which fields are used for aggregation, or merging of the datasets, using the following:

print(catalog.esmcat.aggregation_control.groupby_attrs)
['activity_id', 'institution_id', 'source_id', 'experiment_id', 'table_id', 'grid_label']
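
To make the link between these fields and the key strings concrete, here is a minimal sketch. It assumes a key is simply the row's values for the groupby attributes joined with a dot, which matches the keys printed above; build_key is a hypothetical helper written for illustration, not part of intake-esm.

```python
# Illustrative only: a hypothetical helper, not intake-esm's internal code.
GROUPBY_ATTRS = [
    "activity_id", "institution_id", "source_id",
    "experiment_id", "table_id", "grid_label",
]


def build_key(row, groupby_attrs, sep="."):
    """Join the row's values for each groupby attribute, in order."""
    return sep.join(str(row[attr]) for attr in groupby_attrs)


# One catalog row, copied from the catalog preview (zstore omitted).
row = {
    "activity_id": "CMIP",
    "institution_id": "IPSL",
    "source_id": "IPSL-CM6A-LR",
    "experiment_id": "historical",
    "member_id": "r2i1p1f1",
    "table_id": "Amon",
    "variable_id": "ua",
    "grid_label": "gr",
    "version": 20180803,
}

key = build_key(row, GROUPBY_ATTRS)
print(key)
```

Columns that are not groupby attributes (member_id, variable_id, version) play no part in the key, so every row that agrees on the six groupby fields maps to the same key.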

Let’s go back to our data catalog… and find these fields. You’ll notice they are all column labels! These are key components of the metadata.

catalog_subset.df
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 CMIP IPSL IPSL-CM6A-LR historical r2i1p1f1 Amon ua gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
1 CMIP IPSL IPSL-CM6A-LR historical r30i1p1f1 Amon ua gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
2 CMIP IPSL IPSL-CM6A-LR historical r8i1p1f1 Amon ua gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
3 CMIP IPSL IPSL-CM6A-LR historical r29i1p1f1 Amon ua gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
4 CMIP IPSL IPSL-CM6A-LR historical r3i1p1f1 Amon ua gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
... ... ... ... ... ... ... ... ... ... ... ...
92 CMIP CCCma CanESM5 historical r31i1p2f1 Amon ua gn gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... NaN 20190429
93 CMIP CCCma CanESM5 historical r33i1p2f1 Amon ua gn gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... NaN 20190429
94 CMIP CCCma CanESM5 historical r6i1p2f1 Amon ua gn gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... NaN 20190429
95 CMIP CCCma CanESM5 historical r6i1p1f1 Amon ua gn gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... NaN 20190429
96 CMIP IPSL IPSL-CM6A-LR historical r32i1p1f1 Amon ua gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20190802

97 rows × 11 columns

Using keys_info()#

These groupby attributes are columns in our catalog! This means that the datasets will be aggregated using the hierarchy:

activity_id --> institution_id --> source_id --> experiment_id --> table_id --> grid_label

A clearer way of looking at these aggregation variables is the .keys_info() method of the catalog:

catalog_subset.keys_info()
activity_id institution_id source_id experiment_id table_id grid_label
key
CMIP.CCCma.CanESM5.historical.Amon.gn CMIP CCCma CanESM5 historical Amon gn
CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr CMIP IPSL IPSL-CM6A-LR historical Amon gr

Change our groupby/aggregation controls#

If we instead want to aggregate our datasets at the member_id level, we can change the groupby attributes as follows:

original_groupby_attributes = catalog.esmcat.aggregation_control.groupby_attrs
new_groupby_attributes = original_groupby_attributes + ["member_id"]
print(new_groupby_attributes)
['activity_id', 'institution_id', 'source_id', 'experiment_id', 'table_id', 'grid_label', 'member_id']
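
A small Python detail in the snippet above is worth spelling out: the + operator builds a new list, so the original list of attributes is left as it was. The sketch below uses an ordinary list as a stand-in for the catalog attribute. Whether the catalog hands back its underlying list or a copy is not something this page establishes, so treat the in-place case as a general caution rather than a statement about intake-esm.

```python
# Plain-Python illustration; `groupby_attrs` is an ordinary list standing in
# for the catalog attribute (an assumption made for this example).
groupby_attrs = [
    "activity_id", "institution_id", "source_id",
    "experiment_id", "table_id", "grid_label",
]

# `+` returns a new list, so the original is untouched.
extended = groupby_attrs + ["member_id"]
original_length_after_plus = len(groupby_attrs)
print(original_length_after_plus, len(extended))

# `append` mutates whatever list object the name refers to,
# including any other name bound to the same list.
alias = groupby_attrs
alias.append("member_id")
print(len(groupby_attrs))
```

Building a new list, as the tutorial does, keeps the change explicit: nothing is altered until the new list is assigned.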

Now that we have our new groupby attributes, we can assign these to our catalog subset.

catalog_subset.esmcat.aggregation_control.groupby_attrs = new_groupby_attributes

Let’s check our new keys! You’ll notice we now have 97 keys, aggregated on:

activity_id --> institution_id --> source_id --> experiment_id --> table_id --> grid_label --> member_id

catalog_subset.keys_info()
activity_id institution_id source_id experiment_id table_id grid_label member_id
key
CMIP.CCCma.CanESM5.historical.Amon.gn.r10i1p1f1 CMIP CCCma CanESM5 historical Amon gn r10i1p1f1
CMIP.CCCma.CanESM5.historical.Amon.gn.r10i1p2f1 CMIP CCCma CanESM5 historical Amon gn r10i1p2f1
CMIP.CCCma.CanESM5.historical.Amon.gn.r11i1p1f1 CMIP CCCma CanESM5 historical Amon gn r11i1p1f1
CMIP.CCCma.CanESM5.historical.Amon.gn.r11i1p2f1 CMIP CCCma CanESM5 historical Amon gn r11i1p2f1
CMIP.CCCma.CanESM5.historical.Amon.gn.r12i1p1f1 CMIP CCCma CanESM5 historical Amon gn r12i1p1f1
... ... ... ... ... ... ... ...
CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr.r5i1p1f1 CMIP IPSL IPSL-CM6A-LR historical Amon gr r5i1p1f1
CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr.r6i1p1f1 CMIP IPSL IPSL-CM6A-LR historical Amon gr r6i1p1f1
CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr.r7i1p1f1 CMIP IPSL IPSL-CM6A-LR historical Amon gr r7i1p1f1
CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr.r8i1p1f1 CMIP IPSL IPSL-CM6A-LR historical Amon gr r8i1p1f1
CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr.r9i1p1f1 CMIP IPSL IPSL-CM6A-LR historical Amon gr r9i1p1f1

97 rows × 7 columns
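
The jump from 2 keys to 97 can be reproduced on a handful of rows. The sketch below groups a few rows from the catalog subset (only the relevant columns) under both sets of groupby attributes. group_members and make_row are hypothetical helpers, and the dot-joined key is the same assumption as before, not intake-esm's own code.

```python
from collections import defaultdict


def group_members(rows, groupby_attrs, sep="."):
    """Map each key to the member_ids of the rows that share it."""
    groups = defaultdict(list)
    for row in rows:
        key = sep.join(row[attr] for attr in groupby_attrs)
        groups[key].append(row["member_id"])
    return dict(groups)


def make_row(institution_id, source_id, grid_label, member_id):
    """Build a row with the fields shared by every entry in the subset."""
    return {
        "activity_id": "CMIP",
        "institution_id": institution_id,
        "source_id": source_id,
        "experiment_id": "historical",
        "table_id": "Amon",
        "grid_label": grid_label,
        "member_id": member_id,
    }


# Five rows taken from the catalog subset shown earlier.
rows = [
    make_row("IPSL", "IPSL-CM6A-LR", "gr", "r2i1p1f1"),
    make_row("IPSL", "IPSL-CM6A-LR", "gr", "r30i1p1f1"),
    make_row("IPSL", "IPSL-CM6A-LR", "gr", "r8i1p1f1"),
    make_row("CCCma", "CanESM5", "gn", "r31i1p2f1"),
    make_row("CCCma", "CanESM5", "gn", "r33i1p2f1"),
]

coarse_attrs = ["activity_id", "institution_id", "source_id",
                "experiment_id", "table_id", "grid_label"]
fine_attrs = coarse_attrs + ["member_id"]

coarse = group_members(rows, coarse_attrs)
fine = group_members(rows, fine_attrs)

for key, members in coarse.items():
    print(key, len(members))
print(len(fine))
```

With the original attributes, rows that differ only in member_id collapse into one key; once member_id is part of the grouping, each of these rows gets a key of its own.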

Load our datasets with the new keys#

We can now load the datasets into a dictionary that uses the new keys:

dsets = catalog_subset.to_dataset_dict()
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label.member_id'
100.00% [99/99 00:37<00:00]

If we only want the first key, we can grab it from the list of keys. Notice how member_id now appears at the end!

first_key = catalog_subset.keys()[0]
first_key
'CMIP.CCCma.CanESM5.historical.Amon.gn.r10i1p1f1'
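
Going the other way, a key can be split back into named fields by pairing its parts with the groupby attributes. This is a sketch under the assumption that no field value contains the separator itself (true for the keys on this page); split_key is a hypothetical helper, not an intake-esm function.

```python
GROUPBY_ATTRS = [
    "activity_id", "institution_id", "source_id",
    "experiment_id", "table_id", "grid_label", "member_id",
]


def split_key(key, groupby_attrs, sep="."):
    """Pair each part of the key with the attribute it came from."""
    values = key.split(sep)
    if len(values) != len(groupby_attrs):
        # A mismatch means a value contains the separator, or the key was
        # built with a different set of groupby attributes.
        raise ValueError(
            f"expected {len(groupby_attrs)} fields, got {len(values)}: {key!r}"
        )
    return dict(zip(groupby_attrs, values))


fields = split_key("CMIP.CCCma.CanESM5.historical.Amon.gn.r10i1p1f1", GROUPBY_ATTRS)
print(fields["source_id"], fields["member_id"])
```

The length check matters because a key only makes sense relative to the groupby attributes that produced it: a six-part key from the original configuration would be rejected here rather than silently mislabeled.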

We can then use the .to_dask() method to load that dataset into our notebook.

ds = catalog_subset[first_key].to_dask()
ds
<xarray.Dataset>
Dimensions:         (lat: 64, bnds: 2, lon: 128, plev: 19, time: 1980,
                     member_id: 1, dcpp_init_year: 1)
Coordinates:
  * lat             (lat) float64 -87.86 -85.1 -82.31 ... 82.31 85.1 87.86
    lat_bnds        (lat, bnds) float64 dask.array<chunksize=(64, 2), meta=np.ndarray>
  * lon             (lon) float64 0.0 2.812 5.625 8.438 ... 351.6 354.4 357.2
    lon_bnds        (lon, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
  * plev            (plev) float64 1e+05 9.25e+04 8.5e+04 ... 1e+03 500.0 100.0
  * time            (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds       (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * member_id       (member_id) object 'r10i1p1f1'
  * dcpp_init_year  (dcpp_init_year) float64 nan
Dimensions without coordinates: bnds
Data variables:
    ua              (member_id, dcpp_init_year, time, plev, lat, lon) float32 dask.array<chunksize=(1, 1, 60, 19, 64, 128), meta=np.ndarray>
Attributes: (12/69)
    CCCma_model_hash:                 55f484f90aff0e32c5a8e92a42c6b9ae7ffe6224
    CCCma_parent_runid:               rc3.1-pictrl
    CCCma_pycmor_hash:                33c30511acc319a98240633965a04ca99c26427e
    CCCma_runid:                      rc3.1-his10
    Conventions:                      CF-1.7 CMIP-6.2
    YMDH_branch_time_in_child:        1850:01:01:00
    ...                               ...
    intake_esm_attrs:variable_id:     ua
    intake_esm_attrs:grid_label:      gn
    intake_esm_attrs:zstore:          gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/his...
    intake_esm_attrs:version:         20190429
    intake_esm_attrs:_data_format_:   zarr
    intake_esm_dataset_key:           CMIP.CCCma.CanESM5.historical.Amon.gn.r...

Compare this dataset with the original catalog configuration#

Compare this to our original catalog, which aggregated one level higher, placing all of the member_ids into the same dataset.

Note

Notice how the metadata now reports 65 member_ids in this dataset, compared to 1 in the previous dataset.

original_ds = catalog[catalog.keys()[0]].to_dask()
original_ds
<xarray.Dataset>
Dimensions:         (lat: 64, bnds: 2, lon: 128, plev: 19, time: 1980,
                     member_id: 65, dcpp_init_year: 1)
Coordinates:
  * lat             (lat) float64 -87.86 -85.1 -82.31 ... 82.31 85.1 87.86
    lat_bnds        (lat, bnds) float64 dask.array<chunksize=(64, 2), meta=np.ndarray>
  * lon             (lon) float64 0.0 2.812 5.625 8.438 ... 351.6 354.4 357.2
    lon_bnds        (lon, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
  * plev            (plev) float64 1e+05 9.25e+04 8.5e+04 ... 1e+03 500.0 100.0
  * time            (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds       (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * member_id       (member_id) object 'r10i1p1f1' 'r10i1p2f1' ... 'r9i1p2f1'
  * dcpp_init_year  (dcpp_init_year) float64 nan
Dimensions without coordinates: bnds
Data variables:
    ua              (member_id, dcpp_init_year, time, plev, lat, lon) float32 dask.array<chunksize=(1, 1, 60, 19, 64, 128), meta=np.ndarray>
    va              (member_id, dcpp_init_year, time, plev, lat, lon) float32 dask.array<chunksize=(1, 1, 60, 19, 64, 128), meta=np.ndarray>
Attributes: (12/44)
    Conventions:                      CF-1.7 CMIP-6.2
    YMDH_branch_time_in_child:        1850:01:01:00
    activity_id:                      CMIP
    branch_method:                    Spin-up documentation
    branch_time_in_child:             0.0
    contact:                          ec.cccma.info-info.ccmac.ec@canada.ca
    ...                               ...
    intake_esm_attrs:experiment_id:   historical
    intake_esm_attrs:table_id:        Amon
    intake_esm_attrs:grid_label:      gn
    intake_esm_attrs:version:         20190429
    intake_esm_attrs:_data_format_:   zarr
    intake_esm_dataset_key:           CMIP.CCCma.CanESM5.historical.Amon.gn

Conclusion#

These intake-esm keys can be a bit abstract when first accessing your data, but understanding them is essential to understanding how intake-esm aggregates your data, and how you can change these aggregation controls for your desired analysis! We hope this helped demystify these keys.

import intake_esm  # just to display version information
intake_esm.show_versions()
INSTALLED VERSIONS
------------------

cftime: 1.6.3
dask: 2024.1.1
fastprogress: 1.0.3
fsspec: 2024.2.0
gcsfs: 2024.2.0
intake: 0.7.0
intake_esm: 2024.2.6.post0+dirty
netCDF4: 1.6.5
pandas: 2.2.0
requests: 2.31.0
s3fs: 2024.2.0
xarray: 2024.1.1
zarr: 2.16.1