Modify a catalog#

import intake

The in-memory representation of an Earth System Model (ESM) catalog is a Pandas DataFrame, and is accessible via the .df property:

url = "https://raw.githubusercontent.com/intake/intake-esm/main/tutorial-catalogs/GOOGLE-CMIP6.json"
cat = intake.open_esm_datastore(url)
cat.df.head()

	activity_id	institution_id	source_id	experiment_id	member_id	table_id	variable_id	grid_label	zstore	dcpp_init_year	version
0	CMIP	IPSL	IPSL-CM6A-LR	historical	r2i1p1f1	Amon	va	gr	gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor...	NaN	20180803
1	CMIP	IPSL	IPSL-CM6A-LR	historical	r2i1p1f1	Amon	ua	gr	gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor...	NaN	20180803
2	CMIP	IPSL	IPSL-CM6A-LR	historical	r8i1p1f1	Oyr	o2	gn	gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor...	NaN	20180803
3	CMIP	IPSL	IPSL-CM6A-LR	historical	r30i1p1f1	Amon	ua	gr	gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor...	NaN	20180803
4	CMIP	IPSL	IPSL-CM6A-LR	historical	r30i1p1f1	Amon	va	gr	gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor...	NaN	20180803

In this notebook we go through some examples that show how to modify this dataframe and how those modifications affect the data loading steps.

Note

Pandas is a powerful tool for data manipulation. If you are not familiar with it, we recommend reading the Pandas documentation.

Use case 1: complex search queries#

Let’s say we are interested in datasets with the following attributes:

experiment_id=["historical"]
table_id="Amon"
variable_id="ua"

In addition to these attributes, we are interested in the first ensemble member (member_id) of each model (source_id) only.

This can be achieved in three steps:

Step 1: run a query against the catalog#

We can run a query against the catalog:

cat_subset = cat.search(
    experiment_id=["historical"],
    table_id="Amon",
    variable_id="ua",
)
cat_subset

GOOGLE-CMIP6 catalog with 2 dataset(s) from 97 asset(s):

	unique
activity_id	1
institution_id	2
source_id	2
experiment_id	1
member_id	72
table_id	1
variable_id	1
grid_label	2
zstore	97
dcpp_init_year	0
version	3
derived_variable_id	0
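
For readers who want to see what such a query amounts to at the dataframe level, here is a minimal sketch that applies the same three exact-match filters with plain pandas. The toy dataframe and the `query` dictionary are made up for illustration, and this is not how intake-esm implements `search()` internally; it only reproduces exact matching on a few columns.

```python
import pandas as pd

# Toy stand-in for cat.df (hypothetical rows, only the columns we filter on).
df = pd.DataFrame(
    {
        "experiment_id": ["historical", "historical", "ssp585", "historical"],
        "table_id": ["Amon", "Amon", "Amon", "Oyr"],
        "variable_id": ["ua", "va", "ua", "o2"],
    }
)

# Same attributes as the query above; values may be a string or a list.
query = {"experiment_id": ["historical"], "table_id": "Amon", "variable_id": "ua"}

# Start with every row selected and narrow the mask column by column.
mask = pd.Series(True, index=df.index)
for column, wanted in query.items():
    values = [wanted] if isinstance(wanted, str) else list(wanted)
    mask &= df[column].isin(values)

subset = df[mask]
print(subset)
```

Only the first toy row satisfies all three conditions. Working on the mask directly is what makes it possible to add conditions that a keyword search cannot express, which is the motivation for the next step.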

Step 2: select the first `member_id` for each `source_id`#

The subsetted catalog contains the following number of unique member_id values per source_id:

cat_subset.df.groupby("source_id")["member_id"].nunique()

source_id
CanESM5         65
IPSL-CM6A-LR    32
Name: member_id, dtype: int64

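To get the first member_id for each source_id, we group the dataframe by source_id and use the .first() method, which returns the first non-null value of each column within every group. Rows are taken in catalog order, so "first" means the first member_id listed for a source_id, not necessarily the lowest-numbered one: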

grouped = cat_subset.df.groupby(["source_id"])
df = grouped.first().reset_index()

# Confirm that we have one ensemble member per source_id

df.groupby("source_id")["member_id"].nunique()

source_id
CanESM5         1
IPSL-CM6A-LR    1
Name: member_id, dtype: int64

df

	source_id	activity_id	institution_id	experiment_id	member_id	table_id	variable_id	grid_label	zstore	dcpp_init_year	version
0	CanESM5	CMIP	CCCma	historical	r11i1p1f1	Amon	ua	gn	gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical...	NaN	20190429
1	IPSL-CM6A-LR	CMIP	IPSL	historical	r2i1p1f1	Amon	ua	gr	gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor...	NaN	20180803
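
In the result above, CanESM5 ends up with r11i1p1f1 because that row comes first in the catalog. If you would rather pick the member with the lowest realization number, one option is to sort before de-duplicating. The sketch below is a variation on the step above and runs on a toy dataframe; the helper `realization_index`, the temporary `_realization` column, and the placeholder zstore values are all hypothetical, and the regular expression assumes member_id follows the usual rXiXpXfX pattern.

```python
import re

import pandas as pd

# Toy stand-in for cat_subset.df (placeholder zstore values).
df = pd.DataFrame(
    {
        "source_id": ["CanESM5", "CanESM5", "CanESM5", "IPSL-CM6A-LR", "IPSL-CM6A-LR"],
        "member_id": ["r11i1p1f1", "r1i1p1f1", "r2i1p1f1", "r2i1p1f1", "r10i1p1f1"],
        "zstore": ["store-a", "store-b", "store-c", "store-d", "store-e"],
    }
)


def realization_index(member_id: str) -> int:
    """Return the integer after 'r' in a member_id such as 'r11i1p1f1'."""
    match = re.search(r"r(\d+)i\d+p\d+f\d+", member_id)
    if match is None:
        raise ValueError(f"unexpected member_id format: {member_id!r}")
    return int(match.group(1))


# Sort numerically by realization within each model, so r1 precedes r11.
ordered = df.assign(_realization=df["member_id"].map(realization_index)).sort_values(
    ["source_id", "_realization"], kind="stable"
)

# Keep the first complete row per model, then drop the helper column.
lowest = (
    ordered.drop_duplicates("source_id", keep="first")
    .drop(columns="_realization")
    .reset_index(drop=True)
)
print(lowest)
```

Unlike `.first()`, `drop_duplicates` keeps whole rows, so a missing value in one column is never filled in from a different row of the same group.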

Step 3: attach the new dataframe to our catalog object#

We assign the new dataframe to the catalog's esmcat object. Note that _df is a private attribute (leading underscore), so this relies on an implementation detail that may change between intake-esm versions:

cat_subset.esmcat._df = df
cat_subset

GOOGLE-CMIP6 catalog with 2 dataset(s) from 2 asset(s):

	unique
source_id	2
activity_id	1
institution_id	2
experiment_id	1
member_id	2
table_id	1
variable_id	1
grid_label	2
zstore	2
dcpp_init_year	0
version	2
derived_variable_id	0
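
Note that source_id now appears first in the summary: `groupby(...).first().reset_index()` moves the grouping column to the front. If you prefer to keep the original column order, a small sketch like the following restores it. It uses a toy dataframe with hypothetical rows, and the names `original`, `reduced`, and `restored` are only for illustration.

```python
import pandas as pd

# Toy stand-in for the dataframe before grouping.
original = pd.DataFrame(
    {
        "activity_id": ["CMIP", "CMIP", "CMIP"],
        "source_id": ["CanESM5", "CanESM5", "IPSL-CM6A-LR"],
        "member_id": ["r11i1p1f1", "r1i1p1f1", "r2i1p1f1"],
    }
)

# Grouping makes source_id the index; reset_index puts it back as column 0.
reduced = original.groupby("source_id").first().reset_index()
print(list(reduced.columns))

# Select the columns in their original order.
restored = reduced[original.columns]
print(list(restored.columns))
```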

Let’s load the subsetted catalog into a dictionary of datasets:

dsets = cat_subset.to_dataset_dict()
list(dsets)

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

100.00% [2/2 00:00<00:00]

['CMIP.CCCma.CanESM5.historical.Amon.gn',
 'CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr']
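
To make the key template printed above concrete, here is a minimal sketch that rebuilds one key by joining the listed attributes with a dot. The `row` dictionary copies the IPSL entry from the dataframe shown earlier; the code only illustrates the printed template and is not taken from intake-esm itself.

```python
# Attributes named in the key template, in the order they are printed.
groupby_attrs = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]

# One catalog row, copied from the subset dataframe.
row = {
    "activity_id": "CMIP",
    "institution_id": "IPSL",
    "source_id": "IPSL-CM6A-LR",
    "experiment_id": "historical",
    "member_id": "r2i1p1f1",
    "table_id": "Amon",
    "variable_id": "ua",
    "grid_label": "gr",
}

# member_id and variable_id are not part of the key.
key = ".".join(str(row[attr]) for attr in groupby_attrs)
print(key)
```

Because member_id and variable_id are left out of the key, rows that differ only in those columns end up in the same dataset.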

dsets["CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr"]

<xarray.Dataset>
Dimensions:         (lat: 143, lon: 144, plev: 19, time: 1980, axis_nbounds: 2,
                     member_id: 1, dcpp_init_year: 1)
Coordinates:
  * lat             (lat) float32 -90.0 -88.73 -87.46 -86.2 ... 87.46 88.73 90.0
  * lon             (lon) float32 0.0 2.5 5.0 7.5 ... 350.0 352.5 355.0 357.5
  * plev            (plev) float32 1e+05 9.25e+04 8.5e+04 ... 1e+03 500.0 100.0
  * time            (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T...
    time_bounds     (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * member_id       (member_id) object 'r2i1p1f1'
  * dcpp_init_year  (dcpp_init_year) float64 nan
Dimensions without coordinates: axis_nbounds
Data variables:
    ua              (member_id, dcpp_init_year, time, plev, lat, lon) float32 dask.array<chunksize=(1, 1, 60, 19, 143, 144), meta=np.ndarray>
Attributes: (12/67)
    CMIP6_CV_version:                 cv=6.2.3.5-2-g63b123e
    Conventions:                      CF-1.7 CMIP-6.2
    EXPID:                            historical
    activity_id:                      CMIP
    branch_method:                    standard
    branch_time_in_child:             0.0
    ...                               ...
    intake_esm_attrs:variable_id:     ua
    intake_esm_attrs:grid_label:      gr
    intake_esm_attrs:zstore:          gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR...
    intake_esm_attrs:version:         20180803
    intake_esm_attrs:_data_format_:   zarr
    intake_esm_dataset_key:           CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr

Use case 2: save a catalog subset as a new catalog#

Another use case is to save a subset of the catalog as a new catalog. This is particularly useful when you want to share a subset of the catalog or preserve a copy of it for future use.

Tip

We highly recommend that you save the subset of the catalog which you use in your analysis. Remote catalogs can change over time, and you may want to preserve a copy of the original catalog to ensure reproducibility of your analysis.

To save a subset of the catalog as a new catalog, we can use the serialize() method:

import tempfile
directory = tempfile.gettempdir()
cat_subset.serialize(directory=directory, name="my_catalog_subset")

Successfully wrote ESM catalog json file to: file:///tmp/my_catalog_subset.json

By default, the serialize() method will write a single JSON file containing the catalog subset.

!cat {directory}/my_catalog_subset.json

{
  "esmcat_version": "0.1.0",
  "attributes": [
    {
      "column_name": "activity_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
    },
    {
      "column_name": "source_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
    },
    {
      "column_name": "institution_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
    },
    {
      "column_name": "experiment_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
    },
    {
      "column_name": "member_id",
      "vocabulary": ""
    },
    {
      "column_name": "table_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
    },
    {
      "column_name": "variable_id",
      "vocabulary": ""
    },
    {
      "column_name": "grid_label",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
    },
    {
      "column_name": "version",
      "vocabulary": ""
    },
    {
      "column_name": "dcpp_start_year",
      "vocabulary": ""
    }
  ],
  "assets": {
    "column_name": "zstore",
    "format": "zarr",
    "format_column_name": null
  },
  "aggregation_control": {
    "variable_column_name": "variable_id",
    "groupby_attrs": [
      "activity_id",
      "institution_id",
      "source_id",
      "experiment_id",
      "table_id",
      "grid_label"
    ],
    "aggregations": [
      {
        "type": "union",
        "attribute_name": "variable_id",
        "options": {}
      },
      {
        "type": "join_new",
        "attribute_name": "member_id",
        "options": {
          "coords": "minimal",
          "compat": "override"
        }
      },
      {
        "type": "join_new",
        "attribute_name": "dcpp_init_year",
        "options": {
          "coords": "minimal",
          "compat": "override"
        }
      }
    ]
  },
  "id": "my_catalog_subset",
  "description": "This is an ESM catalog for CMIP6 Zarr data residing in Pangeo's Google Storage.",
  "title": null,
  "last_updated": "2022-09-18T01:58:57Z",
  "catalog_dict": [
    {
      "source_id": "CanESM5",
      "activity_id": "CMIP",
      "institution_id": "CCCma",
      "experiment_id": "historical",
      "member_id": "r11i1p1f1",
      "table_id": "Amon",
      "variable_id": "ua",
      "grid_label": "gn",
      "zstore": "gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical/r11i1p1f1/Amon/ua/gn/v20190429/",
      "dcpp_init_year": NaN,
      "version": 20190429
    },
    {
      "source_id": "IPSL-CM6A-LR",
      "activity_id": "CMIP",
      "institution_id": "IPSL",
      "experiment_id": "historical",
      "member_id": "r2i1p1f1",
      "table_id": "Amon",
      "variable_id": "ua",
      "grid_label": "gr",
      "zstore": "gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/ua/gr/v20180803/",
      "dcpp_init_year": NaN,
      "version": 20180803
    }
  ]
}
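
One detail worth knowing when reading this file with other tools: missing values are written as a bare NaN, which Python's json module accepts by default but which is not part of strict JSON. The sketch below parses a shortened copy of the file above (only a few keys are kept, for brevity) and then shows how a strict parser would react; the helper `reject_constant` is a hypothetical name used for illustration.

```python
import json
import math

# Shortened copy of the serialized catalog (most keys omitted).
text = """
{
  "id": "my_catalog_subset",
  "catalog_dict": [
    {"source_id": "CanESM5", "member_id": "r11i1p1f1",
     "dcpp_init_year": NaN, "version": 20190429},
    {"source_id": "IPSL-CM6A-LR", "member_id": "r2i1p1f1",
     "dcpp_init_year": NaN, "version": 20180803}
  ]
}
"""

# Python's json module turns the bare NaN token into float("nan").
catalog = json.loads(text)
rows = catalog["catalog_dict"]
source_ids = [row["source_id"] for row in rows]
all_missing = all(math.isnan(row["dcpp_init_year"]) for row in rows)
print(source_ids, all_missing)


def reject_constant(name):
    """Emulate a strict parser by refusing NaN, Infinity and -Infinity."""
    raise ValueError(f"non-standard JSON constant: {name}")


try:
    json.loads(text, parse_constant=reject_constant)
    strict_ok = True
except ValueError:
    strict_ok = False
print(strict_ok)
```

If the file has to be consumed by a strict JSON parser, the CSV layout described next avoids the issue, since missing values are simply empty fields there.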

For large catalogs, we recommend writing the catalog entries to a separate CSV file that the JSON file references through its catalog_file field. This can be achieved by setting catalog_type to file:

cat_subset.serialize(directory=directory, name="my_catalog_subset", catalog_type="file")

Successfully wrote ESM catalog json file to: file:///tmp/my_catalog_subset.json

!cat {directory}/my_catalog_subset.json
!cat {directory}/my_catalog_subset.csv

{
  "esmcat_version": "0.1.0",
  "attributes": [
    {
      "column_name": "activity_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
    },
    {
      "column_name": "source_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
    },
    {
      "column_name": "institution_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
    },
    {
      "column_name": "experiment_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
    },
    {
      "column_name": "member_id",
      "vocabulary": ""
    },
    {
      "column_name": "table_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
    },
    {
      "column_name": "variable_id",
      "vocabulary": ""
    },
    {
      "column_name": "grid_label",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
    },
    {
      "column_name": "version",
      "vocabulary": ""
    },
    {
      "column_name": "dcpp_start_year",
      "vocabulary": ""
    }
  ],
  "assets": {
    "column_name": "zstore",
    "format": "zarr",
    "format_column_name": null
  },
  "aggregation_control": {
    "variable_column_name": "variable_id",
    "groupby_attrs": [
      "activity_id",
      "institution_id",
      "source_id",
      "experiment_id",
      "table_id",
      "grid_label"
    ],
    "aggregations": [
      {
        "type": "union",
        "attribute_name": "variable_id",
        "options": {}
      },
      {
        "type": "join_new",
        "attribute_name": "member_id",
        "options": {
          "coords": "minimal",
          "compat": "override"
        }
      },
      {
        "type": "join_new",
        "attribute_name": "dcpp_init_year",
        "options": {
          "coords": "minimal",
          "compat": "override"
        }
      }
    ]
  },
  "id": "my_catalog_subset",
  "description": "This is an ESM catalog for CMIP6 Zarr data residing in Pangeo's Google Storage.",
  "title": null,
  "last_updated": "2022-09-18T01:58:58Z",
  "catalog_file": "file:///tmp/my_catalog_subset.csv"
}

source_id,activity_id,institution_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore,dcpp_init_year,version
CanESM5,CMIP,CCCma,historical,r11i1p1f1,Amon,ua,gn,gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical/r11i1p1f1/Amon/ua/gn/v20190429/,,20190429
IPSL-CM6A-LR,CMIP,IPSL,historical,r2i1p1f1,Amon,ua,gr,gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/ua/gr/v20180803/,,20180803
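
The CSV file is an ordinary table, so it can be inspected with pandas independently of intake-esm. As a small sketch, the snippet below reads the two rows shown above from an in-memory string (used here instead of the file on disk so the example is self-contained) and checks how the empty dcpp_init_year field is interpreted.

```python
import io

import pandas as pd

# The CSV content shown above, embedded as a string.
csv_text = (
    "source_id,activity_id,institution_id,experiment_id,member_id,table_id,"
    "variable_id,grid_label,zstore,dcpp_init_year,version\n"
    "CanESM5,CMIP,CCCma,historical,r11i1p1f1,Amon,ua,gn,"
    "gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical/r11i1p1f1/Amon/ua/gn/v20190429/,,20190429\n"
    "IPSL-CM6A-LR,CMIP,IPSL,historical,r2i1p1f1,Amon,ua,gr,"
    "gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/ua/gr/v20180803/,,20180803\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read back as NaN, matching the in-memory catalog.
print(df.shape)
print(df["dcpp_init_year"].isna().all())
print(df["version"].tolist())
```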

Conclusion#

Intake-ESM provides a powerful search API, but there are cases where you may want to modify the catalog with pandas directly. In this notebook we showed how to do that, how to attach the modified dataframe to the catalog object, and how to save the modified catalog as a new catalog.

import intake_esm
intake_esm.show_versions()

INSTALLED VERSIONS
------------------

cftime: 1.6.1
dask: 2022.6.1
fastprogress: 1.0.3
fsspec: 2022.8.2
gcsfs: 2022.8.2
intake: 0.6.6
intake_esm: 2022.9.18.post0+dirty
netCDF4: 1.6.1
pandas: 1.4.4
requests: 2.28.1
s3fs: 2022.8.2
xarray: 2022.6.0
zarr: 2.12.0
