Manipulating DataFrame (in-memory catalog)

import warnings

warnings.filterwarnings("ignore")
import intake

The in-memory representation of an Earth System Model (ESM) catalog is a pandas dataframe, and is accessible via the .df property:

url = (
    "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
)
col = intake.open_esm_datastore(url)
col.df.head()
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 AerChemMIP AS-RCEC TaiESM1 histSST r1i1p1f1 AERmon od550aer gn gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/... NaN 20200310
1 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrbc gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20190718
2 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrdust gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20191127
3 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmroa gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20190809
4 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrso4 gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20191127

In this notebook we will go through some examples showing how to manipulate this dataframe outside of intake-esm.

Use Case 1: Complex Search Queries

Let’s say we are interested in datasets with the following attributes:

  • experiment_id=["historical"]

  • table_id="Amon"

  • variable_id="tas"

  • source_id=['TaiESM1', 'AWI-CM-1-1-MR', 'AWI-ESM-1-1-LR', 'BCC-CSM2-MR', 'BCC-ESM1', 'CAMS-CSM1-0', 'CAS-ESM2-0', 'UKESM1-0-LL']

In addition to these attributes, we are interested in the first ensemble member (member_id) of each model (source_id) only.

This can be achieved in two steps:

Step 1: Run a query against the catalog

We can run a query against the catalog:

col_subset = col.search(
    experiment_id=["historical"],
    table_id="Amon",
    variable_id="tas",
    source_id=[
        "TaiESM1",
        "AWI-CM-1-1-MR",
        "AWI-ESM-1-1-LR",
        "BCC-CSM2-MR",
        "BCC-ESM1",
        "CAMS-CSM1-0",
        "CAS-ESM2-0",
        "UKESM1-0-LL",
    ],
)
col_subset

pangeo-cmip6 catalog with 9 dataset(s) from 38 asset(s):

unique
activity_id 1
institution_id 7
source_id 8
experiment_id 1
member_id 23
table_id 1
variable_id 1
grid_label 1
zstore 38
dcpp_init_year 0
version 24

Step 2: Select the first member_id for each source_id

The subsetted catalog contains source_id with the following number of member_id per source_id:

col_subset.df.groupby("source_id")["member_id"].nunique()
source_id
AWI-CM-1-1-MR      5
AWI-ESM-1-1-LR     1
BCC-CSM2-MR        3
BCC-ESM1           3
CAMS-CSM1-0        3
CAS-ESM2-0         4
TaiESM1            1
UKESM1-0-LL       18
Name: member_id, dtype: int64

To get the first member_id for each source_id, we group the dataframe by source_id and use the .first() function to retrieve the first member_id:

grouped = col_subset.df.groupby(["source_id"])
df = grouped.first().reset_index()

# Confirm that we have one ensemble member per source_id

df.groupby("source_id")["member_id"].nunique()
source_id
AWI-CM-1-1-MR     1
AWI-ESM-1-1-LR    1
BCC-CSM2-MR       1
BCC-ESM1          1
CAMS-CSM1-0       1
CAS-ESM2-0        1
TaiESM1           1
UKESM1-0-LL       1
Name: member_id, dtype: int64

Step 3: Attach the new dataframe to our catalog object

col_subset.df = df
col_subset

pangeo-cmip6 catalog with 8 dataset(s) from 8 asset(s):

unique
source_id 8
activity_id 1
institution_id 6
experiment_id 1
member_id 2
table_id 1
variable_id 1
grid_label 1
zstore 8
dcpp_init_year 0
version 8
dsets = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
[key for key in dsets]
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
100.00% [8/8 00:00<00:00]
['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn',
 'CMIP.AWI.AWI-CM-1-1-MR.historical.Amon.gn',
 'CMIP.CAS.CAS-ESM2-0.historical.Amon.gn',
 'CMIP.AWI.AWI-ESM-1-1-LR.historical.Amon.gn',
 'CMIP.AS-RCEC.TaiESM1.historical.Amon.gn',
 'CMIP.MOHC.UKESM1-0-LL.historical.Amon.gn',
 'CMIP.BCC.BCC-ESM1.historical.Amon.gn',
 'CMIP.CAMS.CAMS-CSM1-0.historical.Amon.gn']
print(dsets["CMIP.CAS.CAS-ESM2-0.historical.Amon.gn"])
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 128, lon: 256, member_id: 1, time: 1980)
Coordinates:
    height     float64 ...
  * lat        (lat) float64 -90.0 -88.58 -87.17 -85.75 ... 87.17 88.58 90.0
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
  * lon        (lon) float64 0.0 1.406 2.812 4.219 ... 354.4 355.8 357.2 358.6
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(256, 2), meta=np.ndarray>
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * member_id  (member_id) <U8 'r1i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    tas        (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 381, 128, 256), meta=np.ndarray>
Attributes:
    Conventions:             CF-1.7 CMIP-6.2
    activity_id:             CMIP
    branch_method:           standard
    branch_time_in_child:    0.0
    branch_time_in_parent:   0.0
    cmor_version:            3.5.0
    contact:                 Zhang He (zhanghe@mail.iap.ac.cn)
    creation_date:           2020-03-02T12:28:26Z
    data_specs_version:      01.00.31
    experiment:              all-forcing simulation of the recent past
    experiment_id:           historical
    external_variables:      areacella
    forcing_index:           1
    frequency:               mon
    further_info_url:        https://furtherinfo.es-doc.org/CMIP6.CAS.CAS-ESM...
    grid:                    native atmosphere regular grid (128x256 latxlon)
    grid_label:              gn
    history:                 2020-03-02T12:28:26Z ;rewrote data to be consist...
    initialization_index:    1
    institution:             Chinese Academy of Sciences, Beijing 100029, China
    institution_id:          CAS
    license:                 CMIP6 model data produced by Institute of Atmosp...
    mip_era:                 CMIP6
    nominal_resolution:      100 km
    parent_activity_id:      CMIP
    parent_experiment_id:    piControl
    parent_mip_era:          CMIP6
    parent_source_id:        CAS-ESM2-0
    parent_time_units:       days since 1850-01-01
    parent_variant_label:    r1i1p1f1
    physics_index:           1
    product:                 model-output
    realization_index:       1
    realm:                   atmos
    run_variant:             3rd realization
    source:                  CAS-ESM 2.0 (2019): \naerosol: IAP AACM\natmos: ...
    source_id:               CAS-ESM2-0
    source_type:             AOGCM
    status:                  2020-05-02;created; by gcs.cmip6.ldeo@gmail.com
    sub_experiment:          none
    sub_experiment_id:       none
    table_id:                Amon
    table_info:              Creation Date:(24 July 2019) MD5:b9834a2d0728c0d...
    title:                   CAS-ESM2-0 output prepared for CMIP6
    tracking_id:             hdl:21.14100/22e89a1b-f73e-45be-84dc-7d0aabbeea9d
    variable_id:             tas
    variant_label:           r1i1p1f1
    intake_esm_varname:      ['tas']
    intake_esm_dataset_key:  CMIP.CAS.CAS-ESM2-0.historical.Amon.gn