Modify a catalog

import intake

The in-memory representation of an Earth System Model (ESM) catalog is a pandas DataFrame, accessible via the .df property:

url = "https://raw.githubusercontent.com/intake/intake-esm/main/tutorial-catalogs/GOOGLE-CMIP6.json"
cat = intake.open_esm_datastore(url)
cat.df.head()
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 CMIP IPSL IPSL-CM6A-LR historical r2i1p1f1 Amon va gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
1 CMIP IPSL IPSL-CM6A-LR historical r2i1p1f1 Amon ua gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
2 CMIP IPSL IPSL-CM6A-LR historical r8i1p1f1 Oyr o2 gn gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
3 CMIP IPSL IPSL-CM6A-LR historical r30i1p1f1 Amon ua gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
4 CMIP IPSL IPSL-CM6A-LR historical r30i1p1f1 Amon va gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803

This notebook walks through examples of modifying this dataframe and shows how those changes affect the data loading steps.

Note

Pandas is a powerful tool for data manipulation. If you are not familiar with it, we recommend reading the Pandas documentation.

Use case 1: complex search queries

Let’s say we are interested in datasets with the following attributes:

  • experiment_id=["historical"]

  • table_id="Amon"

  • variable_id="ua"

In addition to these attributes, we are interested in the first ensemble member (member_id) of each model (source_id) only.

This can be achieved in three steps:

Step 1: run a query against the catalog

First, we search the catalog for the attributes listed above:

cat_subset = cat.search(
    experiment_id=["historical"],
    table_id="Amon",
    variable_id="ua",
)
cat_subset

GOOGLE-CMIP6 catalog with 2 dataset(s) from 97 asset(s):

unique
activity_id 1
institution_id 2
source_id 2
experiment_id 1
member_id 72
table_id 1
variable_id 1
grid_label 2
zstore 97
dcpp_init_year 0
version 3
derived_variable_id 0
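
For intuition, a query like the one above can be approximated with plain pandas boolean masks: each keyword is matched against one column, a list means "any of these values", and the conditions on different columns are combined with a logical AND. The sketch below uses a small toy dataframe (the rows are made up for illustration, not taken from the real catalog) and only covers exact matching; the actual search API may support more than this.

```python
import pandas as pd

# Toy rows for illustration only (not the real catalog contents).
df = pd.DataFrame(
    {
        "source_id": ["IPSL-CM6A-LR", "IPSL-CM6A-LR", "CanESM5", "CanESM5"],
        "experiment_id": ["historical", "historical", "historical", "ssp585"],
        "table_id": ["Amon", "Oyr", "Amon", "Amon"],
        "variable_id": ["ua", "o2", "ua", "ua"],
    }
)

# Same keywords as in the search call; scalars and lists are both allowed.
query = {"experiment_id": ["historical"], "table_id": "Amon", "variable_id": "ua"}

# Start with "keep everything" and narrow the mask one column at a time.
mask = pd.Series(True, index=df.index)
for column, wanted in query.items():
    values = wanted if isinstance(wanted, list) else [wanted]
    mask &= df[column].isin(values)

subset = df[mask].reset_index(drop=True)
print(subset)
```

Working on cat.df directly in this way is what makes the more complex selections in the next steps possible.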

Step 2: select the first member_id for each source_id

The subsetted catalog contains the following number of unique member_id values for each source_id:

cat_subset.df.groupby("source_id")["member_id"].nunique()
source_id
CanESM5         65
IPSL-CM6A-LR    32
Name: member_id, dtype: int64

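To keep one member_id per source_id, we group the dataframe by source_id and call .first(), which returns the first non-null value of each column within every group. Here, "first" follows the row order of the catalog, so the selected member is not necessarily r1i1p1f1: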

grouped = cat_subset.df.groupby(["source_id"])
df = grouped.first().reset_index()

# Confirm that we have one ensemble member per source_id
df.groupby("source_id")["member_id"].nunique()
source_id
CanESM5         1
IPSL-CM6A-LR    1
Name: member_id, dtype: int64
df
source_id activity_id institution_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 CanESM5 CMIP CCCma historical r11i1p1f1 Amon ua gn gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical... NaN 20190429
1 IPSL-CM6A-LR CMIP IPSL historical r2i1p1f1 Amon ua gr gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor... NaN 20180803
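
Note that .first() takes the first non-null value of each column in catalog row order, which is why CanESM5 ends up with r11i1p1f1 above. If "first" should instead mean the lowest realization index, one option is to sort before selecting. The following is a minimal sketch on toy data, assuming member_id values follow the r<N>i<N>p<N>f<N> pattern; the helper column name realization is hypothetical, and only the realization index is used for ordering.

```python
import pandas as pd

# Toy rows for illustration only (not the real catalog contents).
df = pd.DataFrame(
    {
        "source_id": ["CanESM5", "CanESM5", "IPSL-CM6A-LR", "IPSL-CM6A-LR"],
        "member_id": ["r11i1p1f1", "r2i1p1f1", "r30i1p1f1", "r8i1p1f1"],
    }
)

# Parse the realization index as an integer so that r2 sorts before r11
# (a plain string sort would put "r11..." first).
realization = df["member_id"].str.extract(r"^r(\d+)", expand=False).astype(int)

first_members = (
    df.assign(realization=realization)
    .sort_values(["source_id", "realization"])
    # drop_duplicates keeps whole rows, so values are never mixed across rows.
    .drop_duplicates(subset="source_id", keep="first")
    .drop(columns="realization")
    .reset_index(drop=True)
)
print(first_members)
```

Because drop_duplicates keeps complete rows, this variant also avoids combining values from different rows when some columns contain missing entries.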

Step 3: attach the new dataframe to our catalog object

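# Replace the catalog's underlying dataframe with the reduced one
# (note that _df is a private attribute, as indicated by the leading underscore)
cat_subset.esmcat._df = df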
cat_subset

GOOGLE-CMIP6 catalog with 2 dataset(s) from 2 asset(s):

unique
source_id 2
activity_id 1
institution_id 2
experiment_id 1
member_id 2
table_id 1
variable_id 1
grid_label 2
zstore 2
dcpp_init_year 0
version 2
derived_variable_id 0

Let’s load the subsetted catalog into a dictionary of datasets:

dsets = cat_subset.to_dataset_dict()
list(dsets)
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
100.00% [2/2 00:16<00:00]
['CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr',
 'CMIP.CCCma.CanESM5.historical.Amon.gn']
dsets["CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr"]
<xarray.Dataset> Size: 3GB
Dimensions:         (lat: 143, lon: 144, plev: 19, time: 1980, axis_nbounds: 2,
                     member_id: 1, dcpp_init_year: 1)
Coordinates:
  * lat             (lat) float32 572B -90.0 -88.73 -87.46 ... 87.46 88.73 90.0
  * lon             (lon) float32 576B 0.0 2.5 5.0 7.5 ... 352.5 355.0 357.5
  * plev            (plev) float32 76B 1e+05 9.25e+04 8.5e+04 ... 500.0 100.0
  * time            (time) datetime64[ns] 16kB 1850-01-16T12:00:00 ... 2014-1...
    time_bounds     (time, axis_nbounds) datetime64[ns] 32kB dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * member_id       (member_id) object 8B 'r2i1p1f1'
  * dcpp_init_year  (dcpp_init_year) float64 8B nan
Dimensions without coordinates: axis_nbounds
Data variables:
    ua              (member_id, dcpp_init_year, time, plev, lat, lon) float32 3GB dask.array<chunksize=(1, 1, 60, 19, 143, 144), meta=np.ndarray>
Attributes: (12/67)
    CMIP6_CV_version:                 cv=6.2.3.5-2-g63b123e
    Conventions:                      CF-1.7 CMIP-6.2
    EXPID:                            historical
    activity_id:                      CMIP
    branch_method:                    standard
    branch_time_in_child:             0.0
    ...                               ...
    intake_esm_attrs:variable_id:     ua
    intake_esm_attrs:grid_label:      gr
    intake_esm_attrs:zstore:          gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR...
    intake_esm_attrs:version:         20180803
    intake_esm_attrs:_data_format_:   zarr
    intake_esm_dataset_key:           CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr
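
The dictionary keys follow the template printed by to_dataset_dict(): the values of the listed columns joined with a dot. As a minimal sketch of that construction (the function name dataset_key is hypothetical and this is not the library's own implementation), using one row copied from the dataframe shown earlier:

```python
# Column names taken from the key template printed by to_dataset_dict().
key_columns = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]

# One catalog row, copied from the dataframe output above.
row = {
    "activity_id": "CMIP",
    "institution_id": "IPSL",
    "source_id": "IPSL-CM6A-LR",
    "experiment_id": "historical",
    "member_id": "r2i1p1f1",
    "table_id": "Amon",
    "variable_id": "ua",
    "grid_label": "gr",
}


def dataset_key(row, columns, sep="."):
    """Join the values of the given columns into a single key string."""
    return sep.join(str(row[name]) for name in columns)


key = dataset_key(row, key_columns)
print(key)
```

Columns that are not part of the template, such as member_id and variable_id, do not appear in the key; in the dataset above they show up as a dimension and a data variable instead.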

Use case 2: save a catalog subset as a new catalog

Another use case is to save a subset of the catalog as a new catalog. This is highly useful when you want to share a subset of the catalog or preserve a copy of the catalog for future use.

Tip

We highly recommend saving the catalog subset that you use in your analysis. Remote catalogs can change over time, and keeping a copy of the catalog as it was when you ran the analysis helps make your results reproducible.

To save a subset of the catalog as a new catalog, we can use the serialize() method:

import tempfile
directory = tempfile.gettempdir()
cat_subset.serialize(directory=directory, name="my_catalog_subset")
Successfully wrote ESM catalog json file to: file:///tmp/my_catalog_subset.json

By default, the serialize() method will write a single JSON file containing the catalog subset.

!cat {directory}/my_catalog_subset.json
{
  "esmcat_version": "0.1.0",
  "attributes": [
    {
      "column_name": "activity_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
    },
    {
      "column_name": "source_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
    },
    {
      "column_name": "institution_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
    },
    {
      "column_name": "experiment_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
    },
    {
      "column_name": "member_id",
      "vocabulary": ""
    },
    {
      "column_name": "table_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
    },
    {
      "column_name": "variable_id",
      "vocabulary": ""
    },
    {
      "column_name": "grid_label",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
    },
    {
      "column_name": "version",
      "vocabulary": ""
    },
    {
      "column_name": "dcpp_start_year",
      "vocabulary": ""
    }
  ],
  "assets": {
    "column_name": "zstore",
    "format": "zarr",
    "format_column_name": null
  },
  "aggregation_control": {
    "variable_column_name": "variable_id",
    "groupby_attrs": [
      "activity_id",
      "institution_id",
      "source_id",
      "experiment_id",
      "table_id",
      "grid_label"
    ],
    "aggregations": [
      {
        "type": "union",
        "attribute_name": "variable_id",
        "options": {}
      },
      {
        "type": "join_new",
        "attribute_name": "member_id",
        "options": {
          "coords": "minimal",
          "compat": "override"
        }
      },
      {
        "type": "join_new",
        "attribute_name": "dcpp_init_year",
        "options": {
          "coords": "minimal",
          "compat": "override"
        }
      }
    ]
  },
  "id": "my_catalog_subset",
  "description": "This is an ESM catalog for CMIP6 Zarr data residing in Pangeo's Google Storage.",
  "title": null,
  "last_updated": "2024-10-07T15:59:41Z",
  "catalog_dict": [
    {
      "source_id": "CanESM5",
      "activity_id": "CMIP",
      "institution_id": "CCCma",
      "experiment_id": "historical",
      "member_id": "r11i1p1f1",
      "table_id": "Amon",
      "variable_id": "ua",
      "grid_label": "gn",
      "zstore": "gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical/r11i1p1f1/Amon/ua/gn/v20190429/",
      "dcpp_init_year": NaN,
      "version": 20190429
    },
    {
      "source_id": "IPSL-CM6A-LR",
      "activity_id": "CMIP",
      "institution_id": "IPSL",
      "experiment_id": "historical",
      "member_id": "r2i1p1f1",
      "table_id": "Amon",
      "variable_id": "ua",
      "grid_label": "gr",
      "zstore": "gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/ua/gr/v20180803/",
      "dcpp_init_year": NaN,
      "version": 20180803
    }
  ]
}
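
Because the rows are embedded under catalog_dict, the file can also be inspected without intake-esm. The sketch below writes a trimmed stand-in for the file above (only a few fields kept) to a temporary directory and reads it back with the standard library. One detail worth knowing: the NaN entries are not valid in strict JSON, but Python's json module accepts them by default.

```python
import json
import math
import tempfile
from pathlib import Path

# A trimmed stand-in for the serialized catalog shown above.
text = """
{
  "id": "my_catalog_subset",
  "catalog_dict": [
    {"source_id": "CanESM5", "member_id": "r11i1p1f1",
     "dcpp_init_year": NaN, "version": 20190429},
    {"source_id": "IPSL-CM6A-LR", "member_id": "r2i1p1f1",
     "dcpp_init_year": NaN, "version": 20180803}
  ]
}
"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_catalog_subset.json"
    path.write_text(text)
    # json.loads parses the non-standard NaN token into float("nan").
    catalog = json.loads(path.read_text())

rows = catalog["catalog_dict"]
print(catalog["id"], "contains", len(rows), "rows")
for row in rows:
    print(row["source_id"], row["member_id"], math.isnan(row["dcpp_init_year"]))
```

Stricter JSON parsers in other tools may reject the NaN token, which is something to keep in mind when sharing the file.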

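For large catalogs, we recommend writing the catalog subset to its own CSV file. This can be achieved by setting catalog_type="file"; the JSON file then points to the CSV file through its catalog_file entry instead of embedding the rows in catalog_dict: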

cat_subset.serialize(directory=directory, name="my_catalog_subset", catalog_type="file")
Successfully wrote ESM catalog json file to: file:///tmp/my_catalog_subset.json
!cat {directory}/my_catalog_subset.json
!cat {directory}/my_catalog_subset.csv
{
  "esmcat_version": "0.1.0",
  "attributes": [
    {
      "column_name": "activity_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
    },
    {
      "column_name": "source_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
    },
    {
      "column_name": "institution_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
    },
    {
      "column_name": "experiment_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
    },
    {
      "column_name": "member_id",
      "vocabulary": ""
    },
    {
      "column_name": "table_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
    },
    {
      "column_name": "variable_id",
      "vocabulary": ""
    },
    {
      "column_name": "grid_label",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
    },
    {
      "column_name": "version",
      "vocabulary": ""
    },
    {
      "column_name": "dcpp_start_year",
      "vocabulary": ""
    }
  ],
  "assets": {
    "column_name": "zstore",
    "format": "zarr",
    "format_column_name": null
  },
  "aggregation_control": {
    "variable_column_name": "variable_id",
    "groupby_attrs": [
      "activity_id",
      "institution_id",
      "source_id",
      "experiment_id",
      "table_id",
      "grid_label"
    ],
    "aggregations": [
      {
        "type": "union",
        "attribute_name": "variable_id",
        "options": {}
      },
      {
        "type": "join_new",
        "attribute_name": "member_id",
        "options": {
          "coords": "minimal",
          "compat": "override"
        }
      },
      {
        "type": "join_new",
        "attribute_name": "dcpp_init_year",
        "options": {
          "coords": "minimal",
          "compat": "override"
        }
      }
    ]
  },
  "id": "my_catalog_subset",
  "description": "This is an ESM catalog for CMIP6 Zarr data residing in Pangeo's Google Storage.",
  "title": null,
  "last_updated": "2024-10-07T15:59:42Z",
  "catalog_file": "file:///tmp/my_catalog_subset.csv"
}
source_id,activity_id,institution_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore,dcpp_init_year,version
CanESM5,CMIP,CCCma,historical,r11i1p1f1,Amon,ua,gn,gs://cmip6/CMIP6/CMIP/CCCma/CanESM5/historical/r11i1p1f1/Amon/ua/gn/v20190429/,,20190429
IPSL-CM6A-LR,CMIP,IPSL,historical,r2i1p1f1,Amon,ua,gr,gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/ua/gr/v20180803/,,20180803
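
The two files are linked only through the catalog_file entry in the JSON file. As a rough illustration of how a reader could follow that link (a standard-library sketch with trimmed stand-in files, not how intake-esm itself loads catalogs), the example below writes both files to a temporary directory, resolves the file:// URI, and reads the CSV rows:

```python
import csv
import json
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "my_catalog_subset.csv"
    json_path = Path(tmp) / "my_catalog_subset.json"

    # Trimmed stand-ins for the two files shown above (only a few columns kept).
    csv_path.write_text(
        "source_id,member_id,dcpp_init_year,version\n"
        "CanESM5,r11i1p1f1,,20190429\n"
        "IPSL-CM6A-LR,r2i1p1f1,,20180803\n"
    )
    json_path.write_text(
        json.dumps({"id": "my_catalog_subset", "catalog_file": csv_path.as_uri()})
    )

    # Read the JSON description, then follow catalog_file to the CSV table.
    description = json.loads(json_path.read_text())
    table_path = Path(url2pathname(urlparse(description["catalog_file"]).path))
    with table_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))

for row in rows:
    # Missing values (NaN in the JSON variant) are empty strings in the CSV.
    print(row["source_id"], row["member_id"], repr(row["dcpp_init_year"]))
```

Since catalog_file holds an absolute URI in the output above, the JSON file may need its catalog_file entry updated if the pair is moved to another location or machine.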

Conclusion

Intake-ESM provides a powerful search API. However, there are cases where you may want to modify the catalog by using pandas directly. In this notebook we showed how to do that, how to attach the modified dataframe to the catalog object, and how to save the modified catalog as a new catalog.

import intake_esm
intake_esm.show_versions()
INSTALLED VERSIONS
------------------

cftime: 1.6.4
dask: 2024.9.1
fastprogress: 1.0.3
fsspec: 2024.9.0
gcsfs: 2024.9.0post1
intake: 0.7.0
intake_esm: 2024.2.6.post17+gecd3833.d20241007
netCDF4: 1.7.1
pandas: 2.2.3
requests: 2.32.3
s3fs: 2024.9.0
xarray: 2024.9.0
zarr: 2.18.3