Modify a catalog#
import intake
The in-memory representation of an Earth System Model (ESM) catalog is a Pandas DataFrame, and is accessible via the .df property:
url ="https://raw.githubusercontent.com/intake/intake-esm/main/tutorial-catalogs/GOOGLE-CMIP6.json"
cat = intake.open_esm_datastore(url)
cat.df.head()
| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CMIP | IPSL | IPSL-CM6A-LR | historical | r2i1p1f1 | Amon | va | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 1 | CMIP | IPSL | IPSL-CM6A-LR | historical | r2i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 2 | CMIP | IPSL | IPSL-CM6A-LR | historical | r8i1p1f1 | Oyr | o2 | gn | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 3 | CMIP | IPSL | IPSL-CM6A-LR | historical | r30i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
| 4 | CMIP | IPSL | IPSL-CM6A-LR | historical | r30i1p1f1 | Amon | va | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
In this notebook we will go through some examples showing how to modify this dataframe and how those modifications affect the catalog's behavior during the data loading step.
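Standard Pandas operations apply directly to cat.df. The snippet below is a minimal sketch (using only columns shown above) for getting a quick overview of the catalog contents:
print(len(cat.df))                     # total number of assets in the catalog
print(cat.df["source_id"].nunique())   # number of distinct models
print(cat.df["variable_id"].unique())  # variables available in this catalog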
Note
Pandas is a powerful tool for data manipulation. If you are not familiar with it, we recommend reading the Pandas documentation.
Use case 1: complex search queries#
Let’s say we are interested in datasets with the following attributes:
experiment_id=["historical"]
table_id="Amon"
variable_id="ua"
In addition to these attributes, we are interested in the first ensemble member (member_id) of each model (source_id) only.
This can be achieved in three steps:
Step 1: run a query against the catalog#
We can run a query against the catalog:
cat_subset = cat.search(
experiment_id=["historical"],
table_id="Amon",
variable_id="ua",
)
cat_subset
GOOGLE-CMIP6 catalog with 2 dataset(s) from 97 asset(s):
| | unique |
|---|---|
| activity_id | 1 |
| institution_id | 2 |
| source_id | 2 |
| experiment_id | 1 |
| member_id | 72 |
| table_id | 1 |
| variable_id | 1 |
| grid_label | 2 |
| zstore | 97 |
| dcpp_init_year | 0 |
| version | 3 |
| derived_variable_id | 0 |
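For stricter queries, intake-esm's search() also accepts a require_all_on argument, which keeps only groups (for example, models) that satisfy every value of a query criterion. The snippet below is an optional sketch that is not used in the rest of this notebook; it keeps only models providing both "ua" and "va":
# Optional: keep only models (source_id) that provide *all* requested variables
cat_strict = cat.search(
    require_all_on=["source_id"],
    experiment_id=["historical"],
    table_id="Amon",
    variable_id=["ua", "va"],
)
cat_strict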
Step 2: select the first member_id for each source_id#
The subsetted catalog contains the following number of member_id per source_id:
cat_subset.df.groupby("source_id")["member_id"].nunique()
source_id
CanESM5 65
IPSL-CM6A-LR 32
Name: member_id, dtype: int64
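Before picking a member, it can be helpful to see which member_id values are actually present; a minimal sketch using the same grouped dataframe:
# List the available ensemble members for each model
cat_subset.df.groupby("source_id")["member_id"].unique()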
To get the first member_id for each source_id, we group the dataframe by source_id and use the .first() method to retrieve the first member_id:
grouped = cat_subset.df.groupby(["source_id"])
df = grouped.first().reset_index()
# Confirm that we have one ensemble member per source_id
df.groupby("source_id")["member_id"].nunique()
source_id
CanESM5 1
IPSL-CM6A-LR 1
Name: member_id, dtype: int64
df
| | source_id | activity_id | institution_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CanESM5 | CMIP | CCCma | historical | r11i1p1f1 | Amon | ua | gn | gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... | NaN | 20190429 |
| 1 | IPSL-CM6A-LR | CMIP | IPSL | historical | r2i1p1f1 | Amon | ua | gr | gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... | NaN | 20180803 |
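Note that .first() simply returns the first row of each group in the dataframe's current order, which is not necessarily the lowest realization number. If you need a deterministic choice, you could sort before grouping; the snippet below is an illustrative sketch that sorts lexically on member_id (a purely string-based ordering, not used in the rest of this notebook):
# Sort so that .first() picks a well-defined member per model
df_sorted = (
    cat_subset.df.sort_values(["source_id", "member_id"])
    .groupby("source_id")
    .first()
    .reset_index()
)
df_sorted[["source_id", "member_id"]]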
Step 3: attach the new dataframe to our catalog object#
cat_subset.esmcat._df = df
cat_subset
GOOGLE-CMIP6 catalog with 2 dataset(s) from 2 asset(s):
| | unique |
|---|---|
| source_id | 2 |
| activity_id | 1 |
| institution_id | 2 |
| experiment_id | 1 |
| member_id | 2 |
| table_id | 1 |
| variable_id | 1 |
| grid_label | 2 |
| zstore | 2 |
| dcpp_init_year | 0 |
| version | 2 |
| derived_variable_id | 0 |
Let’s load the subsetted catalog into a dictionary of datasets:
dsets = cat_subset.to_dataset_dict()
[key for key in dsets]
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
['CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr',
'CMIP.CCCma.CanESM5.historical.Amon.gn']
dsets["CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr"]
<xarray.Dataset> Size: 3GB
Dimensions:         (lat: 143, lon: 144, plev: 19, time: 1980, axis_nbounds: 2, member_id: 1, dcpp_init_year: 1)
Coordinates:
  * lat             (lat) float32 572B -90.0 -88.73 -87.46 ... 87.46 88.73 90.0
  * lon             (lon) float32 576B 0.0 2.5 5.0 7.5 ... 352.5 355.0 357.5
  * plev            (plev) float32 76B 1e+05 9.25e+04 8.5e+04 ... 500.0 100.0
  * time            (time) datetime64[ns] 16kB 1850-01-16T12:00:00 ... 2014-1...
    time_bounds     (time, axis_nbounds) datetime64[ns] 32kB dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * member_id       (member_id) object 8B 'r2i1p1f1'
  * dcpp_init_year  (dcpp_init_year) float64 8B nan
Dimensions without coordinates: axis_nbounds
Data variables:
    ua              (member_id, dcpp_init_year, time, plev, lat, lon) float32 3GB dask.array<chunksize=(1, 1, 60, 19, 143, 144), meta=np.ndarray>
Attributes: (12/67)
    CMIP6_CV_version:                cv=6.2.3.5-2-g63b123e
    Conventions:                     CF-1.7 CMIP-6.2
    EXPID:                           historical
    activity_id:                     CMIP
    branch_method:                   standard
    branch_time_in_child:            0.0
    ...                              ...
    intake_esm_attrs:variable_id:    ua
    intake_esm_attrs:grid_label:     gr
    intake_esm_attrs:zstore:         gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR...
    intake_esm_attrs:version:        20180803
    intake_esm_attrs:_data_format_:  zarr
    intake_esm_dataset_key:          CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr
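The values of the returned dictionary are ordinary xarray datasets, so standard xarray operations apply. The snippet below is a minimal sketch; the selection and reduction remain lazy dask operations until you call .compute(), plot, or write the result:
# Work with the loaded dataset like any other xarray dataset
ds = dsets["CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr"]
ua_850 = ds["ua"].sel(plev=85000, method="nearest")  # zonal wind near 850 hPa (plev is in Pa)
ua_850_mean = ua_850.mean(dim="time")                # time-mean field, still lazy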
Use case 2: save a catalog subset as a new catalog#
Another use case is to save a subset of the catalog as a new catalog. This is useful when you want to share the subset or preserve a copy of it for future use.
Tip
We highly recommend saving the subset of the catalog that you use in your analysis. Remote catalogs can change over time, and preserving a copy of the original helps ensure the reproducibility of your analysis.
To save a subset of the catalog as a new catalog, we can use the serialize() method:
import tempfile
directory = tempfile.gettempdir()
cat_subset.serialize(directory=directory, name="my_catalog_subset")
Successfully wrote ESM catalog json file to: file:///tmp/my_catalog_subset.json
By default, the serialize() method writes a single JSON file containing the catalog subset.
!cat {directory}/my_catalog_subset.json
{
"esmcat_version": "0.1.0",
"attributes": [
{
"column_name": "activity_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
},
{
"column_name": "source_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
},
{
"column_name": "institution_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
},
{
"column_name": "experiment_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
},
{
"column_name": "member_id",
"vocabulary": ""
},
{
"column_name": "table_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
},
{
"column_name": "variable_id",
"vocabulary": ""
},
{
"column_name": "grid_label",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
},
{
"column_name": "version",
"vocabulary": ""
},
{
"column_name": "dcpp_start_year",
"vocabulary": ""
}
],
"assets": {
"column_name": "zstore",
"format": "zarr",
"format_column_name": null
},
"aggregation_control": {
"variable_column_name": "variable_id",
"groupby_attrs": [
"activity_id",
"institution_id",
"source_id",
"experiment_id",
"table_id",
"grid_label"
],
"aggregations": [
{
"type": "union",
"attribute_name": "variable_id",
"options": {}
},
{
"type": "join_new",
"attribute_name": "member_id",
"options": {
"coords": "minimal",
"compat": "override"
}
},
{
"type": "join_new",
"attribute_name": "dcpp_init_year",
"options": {
"coords": "minimal",
"compat": "override"
}
}
]
},
"id": "my_catalog_subset",
"description": "This is an ESM catalog for CMIP6 Zarr data residing in Pangeo's Google Storage.",
"title": null,
"last_updated": "2024-10-07T15:59:41Z",
"catalog_dict": [
{
"source_id": "CanESM5",
"activity_id": "CMIP",
"institution_id": "CCCma",
"experiment_id": "historical",
"member_id": "r11i1p1f1",
"table_id": "Amon",
"variable_id": "ua",
"grid_label": "gn",
"zstore": "gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical/r11i1p1f1/Amon/ua/gn/v20190429/",
"dcpp_init_year": NaN,
"version": 20190429
},
{
"source_id": "IPSL-CM6A-LR",
"activity_id": "CMIP",
"institution_id": "IPSL",
"experiment_id": "historical",
"member_id": "r2i1p1f1",
"table_id": "Amon",
"variable_id": "ua",
"grid_label": "gr",
"zstore": "gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/ua/gr/v20180803/",
"dcpp_init_year": NaN,
"version": 20180803
}
]
}
For large catalogs, we recommend that you write the catalog subset to its own CSV file. This can be achieved by setting catalog_type to "file":
cat_subset.serialize(directory=directory, name="my_catalog_subset", catalog_type="file")
Successfully wrote ESM catalog json file to: file:///tmp/my_catalog_subset.json
!cat {directory}/my_catalog_subset.json
!cat {directory}/my_catalog_subset.csv
{
"esmcat_version": "0.1.0",
"attributes": [
{
"column_name": "activity_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
},
{
"column_name": "source_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
},
{
"column_name": "institution_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
},
{
"column_name": "experiment_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
},
{
"column_name": "member_id",
"vocabulary": ""
},
{
"column_name": "table_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
},
{
"column_name": "variable_id",
"vocabulary": ""
},
{
"column_name": "grid_label",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
},
{
"column_name": "version",
"vocabulary": ""
},
{
"column_name": "dcpp_start_year",
"vocabulary": ""
}
],
"assets": {
"column_name": "zstore",
"format": "zarr",
"format_column_name": null
},
"aggregation_control": {
"variable_column_name": "variable_id",
"groupby_attrs": [
"activity_id",
"institution_id",
"source_id",
"experiment_id",
"table_id",
"grid_label"
],
"aggregations": [
{
"type": "union",
"attribute_name": "variable_id",
"options": {}
},
{
"type": "join_new",
"attribute_name": "member_id",
"options": {
"coords": "minimal",
"compat": "override"
}
},
{
"type": "join_new",
"attribute_name": "dcpp_init_year",
"options": {
"coords": "minimal",
"compat": "override"
}
}
]
},
"id": "my_catalog_subset",
"description": "This is an ESM catalog for CMIP6 Zarr data residing in Pangeo's Google Storage.",
"title": null,
"last_updated": "2024-10-07T15:59:42Z",
"catalog_file": "file:///tmp/my_catalog_subset.csv"
}
source_id,activity_id,institution_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore,dcpp_init_year,version
CanESM5,CMIP,CCCma,historical,r11i1p1f1,Amon,ua,gn,gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical/r11i1p1f1/Amon/ua/gn/v20190429/,,20190429
IPSL-CM6A-LR,CMIP,IPSL,historical,r2i1p1f1,Amon,ua,gr,gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/ua/gr/v20180803/,,20180803
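Either form can be reopened later exactly like the original remote catalog. A minimal sketch, assuming the temporary directory used above:
# Reopen the serialized catalog to confirm the round trip
cat_saved = intake.open_esm_datastore(f"{directory}/my_catalog_subset.json")
cat_saved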
Conclusion#
Intake-ESM provides a powerful search API; however, there are cases where you may want to modify the catalog using pandas directly. In this notebook we showed how to do that, how to attach the modified dataframe to the catalog object, and how to save the modified catalog as a new catalog.
import intake_esm
intake_esm.show_versions()
INSTALLED VERSIONS
------------------
cftime: 1.6.4
dask: 2024.9.1
fastprogress: 1.0.3
fsspec: 2024.9.0
gcsfs: 2024.9.0post1
intake: 0.7.0
intake_esm: 2024.2.6.post17+gecd3833.d20241007
netCDF4: 1.7.1
pandas: 2.2.3
requests: 2.32.3
s3fs: 2024.9.0
xarray: 2024.9.0
zarr: 2.18.3