Modify a catalog
import intake
The in-memory representation of an Earth System Model (ESM) catalog is a Pandas DataFrame, and is accessible via the .df property:
url = "https://raw.githubusercontent.com/intake/intake-esm/main/tutorial-catalogs/GOOGLE-CMIP6.json"
cat = intake.open_esm_datastore(url)
cat.df.head()
| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CMIP | IPSL | IPSL-CM6A-LR | historical | r2i1p1f1 | Amon | va | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 1 | CMIP | IPSL | IPSL-CM6A-LR | historical | r2i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 2 | CMIP | IPSL | IPSL-CM6A-LR | historical | r8i1p1f1 | Oyr | o2 | gn | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 3 | CMIP | IPSL | IPSL-CM6A-LR | historical | r30i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 4 | CMIP | IPSL | IPSL-CM6A-LR | historical | r30i1p1f1 | Amon | va | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
This notebook walks through examples of modifying this dataframe and shows how the modified catalog behaves when data is loaded.
Note
Pandas is a powerful tool for data manipulation. If you are not familiar with it, we recommend reading the pandas documentation.
Note
Intake-ESM is currently transitioning from pandas to polars internally in order to handle larger catalogs more efficiently. There are, however, no plans to deprecate the pandas-based esm_datastore.df attribute, or to change this attribute to return a polars dataframe instead.
For more information on the internal changes, or if you wish to access the polars dataframe directly, please refer to the API documentation.
Use case 1: complex search queries
Let’s say we are interested in datasets with the following attributes:
experiment_id=["historical"]
table_id="Amon"
variable_id="ua"
In addition to these attributes, we are interested in the first ensemble member (member_id) of each model (source_id) only.
This can be achieved in three steps:
Step 1: run a query against the catalog
We can run a query against the catalog:
cat_subset = cat.search(
experiment_id=["historical"],
table_id="Amon",
variable_id="ua",
)
cat_subset
GOOGLE-CMIP6 catalog with 2 dataset(s) from 97 asset(s):
| | unique |
|---|---|
| activity_id | 1 |
| institution_id | 2 |
| source_id | 2 |
| experiment_id | 1 |
| member_id | 72 |
| table_id | 1 |
| variable_id | 1 |
| grid_label | 2 |
| zstore | 97 |
| dcpp_init_year | 1 |
| version | 3 |
| derived_variable_id | 0 |
Step 2: select the first member_id for each source_id
The subsetted catalog contains the following number of unique member_id
values per source_id:
cat_subset.df.groupby("source_id")["member_id"].nunique()
source_id
CanESM5 65
IPSL-CM6A-LR 32
Name: member_id, dtype: int64
To get the first member_id for each source_id, we group the dataframe by
source_id and use the .first() method to retrieve the first member_id:
grouped = cat_subset.df.groupby(["source_id"])
df = grouped.first().reset_index()
# Confirm that we have one ensemble member per source_id
df.groupby("source_id")["member_id"].nunique()
source_id
CanESM5 1
IPSL-CM6A-LR 1
Name: member_id, dtype: int64
df
| | source_id | activity_id | institution_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CanESM5 | CMIP | CCCma | historical | r11i1p1f1 | Amon | ua | gn | gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... | NaN | 20190429 |
| 1 | IPSL-CM6A-LR | CMIP | IPSL | historical | r2i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
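Note that .first() takes the first non-null entry of each column within a group, in catalog row order. That is not necessarily the lowest-numbered ensemble member: CanESM5 resolves to r11i1p1f1 in the table above. If the lowest-numbered member is wanted instead, one option is to sort on the numeric indices of the label before grouping. The following is a minimal sketch using only the standard library; it assumes that every member_id follows the CMIP6 variant-label pattern r<N>i<N>p<N>f<N>, and member_sort_key is a hypothetical helper, not part of intake-esm.

```python
import re

# Assumption: labels follow the CMIP6 variant-label pattern, e.g. "r11i1p1f1"
# (realization, initialization, physics and forcing indices).
_MEMBER_PATTERN = re.compile(r"r(\d+)i(\d+)p(\d+)f(\d+)")


def member_sort_key(member_id: str) -> tuple:
    """Return the four indices of a variant label as integers (hypothetical helper)."""
    match = _MEMBER_PATTERN.fullmatch(member_id)
    if match is None:
        raise ValueError(f"unrecognised member_id: {member_id!r}")
    return tuple(int(part) for part in match.groups())


# Member labels taken from the catalog preview at the top of this page.
members = ["r2i1p1f1", "r8i1p1f1", "r30i1p1f1", "r11i1p1f1"]

# A plain string sort would place "r11..." before "r2..."; the numeric key does not.
ordered = sorted(members, key=member_sort_key)
lowest = min(members, key=member_sort_key)
```

The same key function could be mapped over the member_id column to build a sortable helper column before calling groupby.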
Step 3: attach the new dataframe to our catalog object
cat_subset.esmcat._df = df
cat_subset
GOOGLE-CMIP6 catalog with 2 dataset(s) from 2 asset(s):
| | unique |
|---|---|
| source_id | 2 |
| activity_id | 1 |
| institution_id | 2 |
| experiment_id | 1 |
| member_id | 2 |
| table_id | 1 |
| variable_id | 1 |
| grid_label | 2 |
| zstore | 2 |
| dcpp_init_year | 1 |
| version | 2 |
| derived_variable_id | 0 |
Let’s load the subsetted catalog into a dictionary of datasets:
dsets = cat_subset.to_dataset_dict()
[key for key in dsets]
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
['CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr',
'CMIP.CCCma.CanESM5.historical.Amon.gn']
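The key pattern printed above can be reproduced by hand, which is useful when you need to predict a key before loading any data. The sketch below joins the values of the six listed attributes for a single catalog row with a "." separator. It only mirrors the pattern shown in the message above and is not intake-esm's internal code; dataset_key is a hypothetical helper, and the row is written as a plain dict for illustration.

```python
# Attribute names in the order given by the key pattern above.
groupby_attrs = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]

# One row of the subsetted catalog, written as a plain dict.
row = {
    "source_id": "IPSL-CM6A-LR",
    "activity_id": "CMIP",
    "institution_id": "IPSL",
    "experiment_id": "historical",
    "member_id": "r2i1p1f1",
    "table_id": "Amon",
    "variable_id": "ua",
    "grid_label": "gr",
}


def dataset_key(row: dict, attrs: list, sep: str = ".") -> str:
    """Join the listed attribute values of one row (hypothetical helper)."""
    return sep.join(str(row[name]) for name in attrs)


key = dataset_key(row, groupby_attrs)
```

Columns that are not part of the pattern, such as member_id and variable_id, do not appear in the key; they show up as a dimension or as data variables of the loaded dataset instead.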
dsets["CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr"]
<xarray.Dataset> Size: 3GB
Dimensions: (member_id: 1, dcpp_init_year: 1, time: 1980, plev: 19,
lat: 143, lon: 144, axis_nbounds: 2)
Coordinates:
* member_id (member_id) object 8B 'r2i1p1f1'
* dcpp_init_year (dcpp_init_year) float64 8B nan
* time (time) datetime64[ns] 16kB 1850-01-16T12:00:00 ... 2014-1...
* plev (plev) float32 76B 1e+05 9.25e+04 8.5e+04 ... 500.0 100.0
* lat (lat) float32 572B -90.0 -88.73 -87.46 ... 87.46 88.73 90.0
* lon (lon) float32 576B 0.0 2.5 5.0 7.5 ... 352.5 355.0 357.5
time_bounds (time, axis_nbounds) datetime64[ns] 32kB dask.array<chunksize=(1980, 2), meta=np.ndarray>
Dimensions without coordinates: axis_nbounds
Data variables:
ua (member_id, dcpp_init_year, time, plev, lat, lon) float32 3GB dask.array<chunksize=(1, 1, 60, 19, 143, 144), meta=np.ndarray>
Attributes: (12/65)
CMIP6_CV_version: cv=6.2.3.5-2-g63b123e
Conventions: CF-1.7 CMIP-6.2
EXPID: historical
activity_id: CMIP
branch_method: standard
branch_time_in_child: 0.0
... ...
intake_esm_attrs:variable_id: ua
intake_esm_attrs:grid_label: gr
intake_esm_attrs:zstore: gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR...
intake_esm_attrs:version: 20180803
intake_esm_attrs:_data_format_: zarr
intake_esm_dataset_key: CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr
Use case 2: save a catalog subset as a new catalog
Another use case is to save a subset of the catalog as a new catalog. This is highly useful when you want to share a subset of the catalog or preserve a copy of the catalog for future use.
Tip
We highly recommend that you save the subset of the catalog which you use in your analysis. Remote catalogs can change over time, and you may want to preserve a copy of the original catalog to ensure reproducibility of your analysis.
To save a subset of the catalog as a new catalog, we can use the serialize() method:
import tempfile
directory = tempfile.gettempdir()
cat_subset.serialize(directory=directory, name="my_catalog_subset")
Successfully wrote ESM catalog json file to: file:///tmp/my_catalog_subset.json
By default, the serialize() method will write a single JSON file containing the catalog subset.
!cat {directory}/my_catalog_subset.json
{
"esmcat_version": "0.1.0",
"attributes": [
{
"column_name": "activity_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
},
{
"column_name": "source_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
},
{
"column_name": "institution_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
},
{
"column_name": "experiment_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
},
{
"column_name": "member_id",
"vocabulary": ""
},
{
"column_name": "table_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
},
{
"column_name": "variable_id",
"vocabulary": ""
},
{
"column_name": "grid_label",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
},
{
"column_name": "version",
"vocabulary": ""
},
{
"column_name": "dcpp_start_year",
"vocabulary": ""
}
],
"assets": {
"column_name": "zstore",
"format": "zarr",
"format_column_name": null
},
"aggregation_control": {
"variable_column_name": "variable_id",
"groupby_attrs": [
"activity_id",
"institution_id",
"source_id",
"experiment_id",
"table_id",
"grid_label"
],
"aggregations": [
{
"type": "union",
"attribute_name": "variable_id",
"options": {}
},
{
"type": "join_new",
"attribute_name": "member_id",
"options": {
"coords": "minimal",
"compat": "override"
}
},
{
"type": "join_new",
"attribute_name": "dcpp_init_year",
"options": {
"coords": "minimal",
"compat": "override"
}
}
]
},
"id": "my_catalog_subset",
"description": "This is an ESM catalog for CMIP6 Zarr data residing in Pangeo's Google Storage.",
"title": null,
"last_updated": "2026-04-06T18:36:44Z",
"catalog_dict": [
{
"source_id": "CanESM5",
"activity_id": "CMIP",
"institution_id": "CCCma",
"experiment_id": "historical",
"member_id": "r11i1p1f1",
"table_id": "Amon",
"variable_id": "ua",
"grid_label": "gn",
"zstore": "gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical/r11i1p1f1/Amon/ua/gn/v20190429/",
"dcpp_init_year": NaN,
"version": 20190429
},
{
"source_id": "IPSL-CM6A-LR",
"activity_id": "CMIP",
"institution_id": "IPSL",
"experiment_id": "historical",
"member_id": "r2i1p1f1",
"table_id": "Amon",
"variable_id": "ua",
"grid_label": "gr",
"zstore": "gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/ua/gr/v20180803/",
"dcpp_init_year": NaN,
"version": 20180803
}
]
}
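One detail of the file above is worth knowing if you plan to read it with tools other than intake-esm: missing values are written as the bare token NaN, which is not part of the JSON standard. Python's json module accepts it by default, but stricter parsers may not. The sketch below shows both behaviours on a fragment shaped like one catalog_dict entry; reject_constant is a hypothetical helper used to emulate a strict reader.

```python
import json
import math

# A fragment shaped like one "catalog_dict" entry from the file above.
text = '{"source_id": "CanESM5", "dcpp_init_year": NaN, "version": 20190429}'

# Python's json module accepts the NaN token by default and returns float("nan").
entry = json.loads(text)
is_missing = math.isnan(entry["dcpp_init_year"])


def reject_constant(name: str):
    """Emulate a strict parser by refusing NaN, Infinity and -Infinity."""
    raise ValueError(f"non-standard JSON constant: {name}")


# With parse_constant set, the same text is rejected.
try:
    json.loads(text, parse_constant=reject_constant)
    strict_ok = True
except ValueError:
    strict_ok = False
```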
For large catalogs, we recommend that you write the catalog subset to its own CSV file. This can be achieved by setting catalog_type to file:
cat_subset.serialize(directory=directory, name="my_catalog_subset", catalog_type="file")
Successfully wrote ESM catalog json file to: file:///tmp/my_catalog_subset.json
!cat {directory}/my_catalog_subset.json
!cat {directory}/my_catalog_subset.csv
{
"esmcat_version": "0.1.0",
"attributes": [
{
"column_name": "activity_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
},
{
"column_name": "source_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
},
{
"column_name": "institution_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
},
{
"column_name": "experiment_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
},
{
"column_name": "member_id",
"vocabulary": ""
},
{
"column_name": "table_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
},
{
"column_name": "variable_id",
"vocabulary": ""
},
{
"column_name": "grid_label",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
},
{
"column_name": "version",
"vocabulary": ""
},
{
"column_name": "dcpp_start_year",
"vocabulary": ""
}
],
"assets": {
"column_name": "zstore",
"format": "zarr",
"format_column_name": null
},
"aggregation_control": {
"variable_column_name": "variable_id",
"groupby_attrs": [
"activity_id",
"institution_id",
"source_id",
"experiment_id",
"table_id",
"grid_label"
],
"aggregations": [
{
"type": "union",
"attribute_name": "variable_id",
"options": {}
},
{
"type": "join_new",
"attribute_name": "member_id",
"options": {
"coords": "minimal",
"compat": "override"
}
},
{
"type": "join_new",
"attribute_name": "dcpp_init_year",
"options": {
"coords": "minimal",
"compat": "override"
}
}
]
},
"id": "my_catalog_subset",
"description": "This is an ESM catalog for CMIP6 Zarr data residing in Pangeo's Google Storage.",
"title": null,
"last_updated": "2026-04-06T18:36:44Z",
"catalog_file": "file:///tmp/my_catalog_subset.csv"
}
source_id,activity_id,institution_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore,dcpp_init_year,version
CanESM5,CMIP,CCCma,historical,r11i1p1f1,Amon,ua,gn,gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical/r11i1p1f1/Amon/ua/gn/v20190429/,,20190429
IPSL-CM6A-LR,CMIP,IPSL,historical,r2i1p1f1,Amon,ua,gr,gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/ua/gr/v20180803/,,20180803
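With catalog_type="file", the JSON file carries only the metadata and points to the CSV through its catalog_file entry. The sketch below follows that pointer using only the standard library, which can be handy for a quick inspection without intake-esm installed. It writes a trimmed stand-in for the two files into a fresh temporary directory (fewer columns than the real output, and only the id and catalog_file fields in the JSON), so the file contents here are illustrative rather than a copy of what serialize() produces.

```python
import csv
import json
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

# Build a trimmed stand-in for the JSON/CSV pair shown above.
workdir = Path(tempfile.mkdtemp())
csv_path = workdir / "my_catalog_subset.csv"
csv_path.write_text(
    "source_id,member_id,zstore,dcpp_init_year,version\n"
    "CanESM5,r11i1p1f1,"
    "gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical/r11i1p1f1/Amon/ua/gn/v20190429/,,20190429\n"
    "IPSL-CM6A-LR,r2i1p1f1,"
    "gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/ua/gr/v20180803/,,20180803\n"
)
json_path = workdir / "my_catalog_subset.json"
json_path.write_text(
    json.dumps({"id": "my_catalog_subset", "catalog_file": csv_path.as_uri()})
)

# Read the metadata, then convert the file:// URL back into a local path.
meta = json.loads(json_path.read_text())
table_path = Path(url2pathname(urlparse(meta["catalog_file"]).path))

# Load the table rows as dictionaries keyed by column name.
with table_path.open(newline="") as handle:
    rows = list(csv.DictReader(handle))
```

Note that the csv module returns the empty dcpp_init_year field as an empty string, whereas pandas would read it as NaN.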
Conclusion
Intake-ESM provides a powerful search API. However, there are cases where you may want to modify the catalog by using pandas directly. In this notebook we showed how to do that, how to attach the modified dataframe to the catalog object, and how to save the modified catalog as a new catalog.