Manipulating DataFrame (in-memory catalog)¶

import warnings

warnings.filterwarnings("ignore")
import intake

The in-memory representation of an Earth System Model (ESM) catalog is a pandas dataframe, and is accessible via the .df property:

url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
col.df.head()

	activity_id	institution_id	source_id	experiment_id	member_id	table_id	variable_id	grid_label	zstore	dcpp_init_year	version
0	HighResMIP	CMCC	CMCC-CM2-HR4	highresSST-present	r1i1p1f1	Amon	hus	gn	gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/...	NaN	20170706
1	HighResMIP	CMCC	CMCC-CM2-HR4	highresSST-present	r1i1p1f1	Amon	rsdt	gn	gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/...	NaN	20170706
2	HighResMIP	CMCC	CMCC-CM2-HR4	highresSST-present	r1i1p1f1	Amon	prw	gn	gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/...	NaN	20170706
3	HighResMIP	CMCC	CMCC-CM2-HR4	highresSST-present	r1i1p1f1	Amon	rlus	gn	gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/...	NaN	20170706
4	HighResMIP	CMCC	CMCC-CM2-HR4	highresSST-present	r1i1p1f1	Amon	rlds	gn	gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/...	NaN	20170706

In this notebook we will go through some examples showing how to manipulate this dataframe outside of intake-esm.

Use Case 1: Complex Search Queries¶

Let’s say we are interested in datasets with the following attributes:

experiment_id=["historical"]
table_id="Amon"
variable_id="tas"
source_id=['TaiESM1', 'AWI-CM-1-1-MR', 'AWI-ESM-1-1-LR', 'BCC-CSM2-MR', 'BCC-ESM1', 'CAMS-CSM1-0', 'CAS-ESM2-0', 'UKESM1-0-LL']

In addition to these attributes, we are interested in the first ensemble member (member_id) of each model (source_id) only.

This can be achieved in two steps:

Step 1: Run a query against the catalog¶

We can run a query against the catalog:

col_subset = col.search(
    experiment_id=["historical"],
    table_id="Amon",
    variable_id="tas",
    source_id=[
        "TaiESM1",
        "AWI-CM-1-1-MR",
        "AWI-ESM-1-1-LR",
        "BCC-CSM2-MR",
        "BCC-ESM1",
        "CAMS-CSM1-0",
        "CAS-ESM2-0",
        "UKESM1-0-LL",
    ],
)
col_subset

pangeo-cmip6 catalog with 9 dataset(s) from 40 asset(s):

	unique
activity_id	1
institution_id	7
source_id	8
experiment_id	1
member_id	24
table_id	1
variable_id	1
grid_label	1
zstore	40
dcpp_init_year	0
version	26

Step 2: Select the first `member_id` for each `source_id`¶

The subsetted catalog contains source_id with the following number of member_id per source_id:

col_subset.df.groupby("source_id")["member_id"].nunique()

source_id
AWI-CM-1-1-MR      5
AWI-ESM-1-1-LR     1
BCC-CSM2-MR        3
BCC-ESM1           3
CAMS-CSM1-0        3
CAS-ESM2-0         4
TaiESM1            2
UKESM1-0-LL       19
Name: member_id, dtype: int64

To get the first member_id for each source_id, we group the dataframe by source_id and use the .first() function to retrieve the first member_id:

grouped = col_subset.df.groupby(["source_id"])
df = grouped.first().reset_index()

# Confirm that we have one ensemble member per source_id

df.groupby("source_id")["member_id"].nunique()

source_id
AWI-CM-1-1-MR     1
AWI-ESM-1-1-LR    1
BCC-CSM2-MR       1
BCC-ESM1          1
CAMS-CSM1-0       1
CAS-ESM2-0        1
TaiESM1           1
UKESM1-0-LL       1
Name: member_id, dtype: int64

Step 3: Attach the new dataframe to our catalog object¶

col_subset.df = df
col_subset

pangeo-cmip6 catalog with 8 dataset(s) from 8 asset(s):

	unique
source_id	8
activity_id	1
institution_id	6
experiment_id	1
member_id	4
table_id	1
variable_id	1
grid_label	1
zstore	8
dcpp_init_year	0
version	8

dsets = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
[key for key in dsets]

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

100.00% [8/8 00:01<00:00]

['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn',
 'CMIP.AS-RCEC.TaiESM1.historical.Amon.gn',
 'CMIP.BCC.BCC-ESM1.historical.Amon.gn',
 'CMIP.CAMS.CAMS-CSM1-0.historical.Amon.gn',
 'CMIP.AWI.AWI-CM-1-1-MR.historical.Amon.gn',
 'CMIP.MOHC.UKESM1-0-LL.historical.Amon.gn',
 'CMIP.AWI.AWI-ESM-1-1-LR.historical.Amon.gn',
 'CMIP.CAS.CAS-ESM2-0.historical.Amon.gn']

print(dsets["CMIP.CAS.CAS-ESM2-0.historical.Amon.gn"])

<xarray.Dataset>
Dimensions:    (lat: 128, bnds: 2, lon: 256, member_id: 1, time: 1980)
Coordinates:
    height     float64 ...
  * lat        (lat) float64 -90.0 -88.58 -87.17 -85.75 ... 87.17 88.58 90.0
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
  * lon        (lon) float64 0.0 1.406 2.812 4.219 ... 354.4 355.8 357.2 358.6
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(256, 2), meta=np.ndarray>
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * member_id  (member_id) <U8 'r1i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    tas        (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 381, 128, 256), meta=np.ndarray>
Attributes: (12/51)
    Conventions:             CF-1.7 CMIP-6.2
    activity_id:             CMIP
    branch_method:           standard
    branch_time_in_child:    0.0
    branch_time_in_parent:   0.0
    cmor_version:            3.5.0
    ...                      ...
    variable_id:             tas
    variant_label:           r1i1p1f1
    netcdf_tracking_ids:     hdl:21.14100/22e89a1b-f73e-45be-84dc-7d0aabbeea9d
    version_id:              v20200302
    intake_esm_varname:      ['tas']
    intake_esm_dataset_key:  CMIP.CAS.CAS-ESM2-0.historical.Amon.gn

import intake_esm  # just to display version information

intake_esm.show_versions()

INSTALLED VERSIONS
------------------

cftime: 1.5.0
dask: 2021.08.0
fastprogress: 0.2.7
fsspec: 2021.07.0
gcsfs: 2021.07.0
intake: 0.6.3
intake_esm: 0.0.0
netCDF4: 1.5.7
pandas: 1.3.2
requests: 2.26.0
s3fs: 2021.07.0
xarray: 0.19.0
zarr: 2.8.3

Manipulating DataFrame (in-memory catalog)¶

Use Case 1: Complex Search Queries¶

Step 1: Run a query against the catalog¶

Step 2: Select the first member_id for each source_id¶

Step 3: Attach the new dataframe to our catalog object¶

Step 2: Select the first `member_id` for each `source_id`¶