import warnings warnings.filterwarnings("ignore") import intake
The in-memory representation of an Earth System Model (ESM) catalog is a pandas dataframe, and is accessible via the .df property:
.df
url = ( "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json" ) col = intake.open_esm_datastore(url) col.df.head()
In this notebook we will go through some examples showing how to manipulate this dataframe outside of intake-esm.
Let’s say we are interested in datasets with the following attributes:
experiment_id=["historical"]
table_id="Amon"
variable_id="tas"
source_id=['TaiESM1', 'AWI-CM-1-1-MR', 'AWI-ESM-1-1-LR', 'BCC-CSM2-MR', 'BCC-ESM1', 'CAMS-CSM1-0', 'CAS-ESM2-0', 'UKESM1-0-LL']
In addition to these attributes, we are interested in the first ensemble member (member_id) of each model (source_id) only.
This can be achieved in two steps:
We can run a query against the catalog:
col_subset = col.search( experiment_id=["historical"], table_id="Amon", variable_id="tas", source_id=[ "TaiESM1", "AWI-CM-1-1-MR", "AWI-ESM-1-1-LR", "BCC-CSM2-MR", "BCC-ESM1", "CAMS-CSM1-0", "CAS-ESM2-0", "UKESM1-0-LL", ], ) col_subset
pangeo-cmip6 catalog with 9 dataset(s) from 38 asset(s):
member_id
source_id
The subsetted catalog contains source_id with the following number of member_id per source_id:
col_subset.df.groupby("source_id")["member_id"].nunique()
source_id AWI-CM-1-1-MR 5 AWI-ESM-1-1-LR 1 BCC-CSM2-MR 3 BCC-ESM1 3 CAMS-CSM1-0 3 CAS-ESM2-0 4 TaiESM1 1 UKESM1-0-LL 18 Name: member_id, dtype: int64
To get the first member_id for each source_id, we group the dataframe by source_id and use the .first() function to retrieve the first member_id:
.first()
grouped = col_subset.df.groupby(["source_id"]) df = grouped.first().reset_index() # Confirm that we have one ensemble member per source_id df.groupby("source_id")["member_id"].nunique()
source_id AWI-CM-1-1-MR 1 AWI-ESM-1-1-LR 1 BCC-CSM2-MR 1 BCC-ESM1 1 CAMS-CSM1-0 1 CAS-ESM2-0 1 TaiESM1 1 UKESM1-0-LL 1 Name: member_id, dtype: int64
col_subset.df = df col_subset
pangeo-cmip6 catalog with 8 dataset(s) from 8 asset(s):
dsets = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True}) [key for key in dsets]
--> The keys in the returned dictionary of datasets are constructed as follows: 'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'CMIP.AWI.AWI-CM-1-1-MR.historical.Amon.gn', 'CMIP.CAS.CAS-ESM2-0.historical.Amon.gn', 'CMIP.AWI.AWI-ESM-1-1-LR.historical.Amon.gn', 'CMIP.AS-RCEC.TaiESM1.historical.Amon.gn', 'CMIP.MOHC.UKESM1-0-LL.historical.Amon.gn', 'CMIP.BCC.BCC-ESM1.historical.Amon.gn', 'CMIP.CAMS.CAMS-CSM1-0.historical.Amon.gn']
print(dsets["CMIP.CAS.CAS-ESM2-0.historical.Amon.gn"])
<xarray.Dataset> Dimensions: (bnds: 2, lat: 128, lon: 256, member_id: 1, time: 1980) Coordinates: height float64 ... * lat (lat) float64 -90.0 -88.58 -87.17 -85.75 ... 87.17 88.58 90.0 lat_bnds (lat, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray> * lon (lon) float64 0.0 1.406 2.812 4.219 ... 354.4 355.8 357.2 358.6 lon_bnds (lon, bnds) float64 dask.array<chunksize=(256, 2), meta=np.ndarray> * time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00 time_bnds (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray> * member_id (member_id) <U8 'r1i1p1f1' Dimensions without coordinates: bnds Data variables: tas (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 381, 128, 256), meta=np.ndarray> Attributes: Conventions: CF-1.7 CMIP-6.2 activity_id: CMIP branch_method: standard branch_time_in_child: 0.0 branch_time_in_parent: 0.0 cmor_version: 3.5.0 contact: Zhang He (zhanghe@mail.iap.ac.cn) creation_date: 2020-03-02T12:28:26Z data_specs_version: 01.00.31 experiment: all-forcing simulation of the recent past experiment_id: historical external_variables: areacella forcing_index: 1 frequency: mon further_info_url: https://furtherinfo.es-doc.org/CMIP6.CAS.CAS-ESM... grid: native atmosphere regular grid (128x256 latxlon) grid_label: gn history: 2020-03-02T12:28:26Z ;rewrote data to be consist... initialization_index: 1 institution: Chinese Academy of Sciences, Beijing 100029, China institution_id: CAS license: CMIP6 model data produced by Institute of Atmosp... mip_era: CMIP6 nominal_resolution: 100 km parent_activity_id: CMIP parent_experiment_id: piControl parent_mip_era: CMIP6 parent_source_id: CAS-ESM2-0 parent_time_units: days since 1850-01-01 parent_variant_label: r1i1p1f1 physics_index: 1 product: model-output realization_index: 1 realm: atmos run_variant: 3rd realization source: CAS-ESM 2.0 (2019): \naerosol: IAP AACM\natmos: ... source_id: CAS-ESM2-0 source_type: AOGCM status: 2020-05-02;created; by gcs.cmip6.ldeo@gmail.com sub_experiment: none sub_experiment_id: none table_id: Amon table_info: Creation Date:(24 July 2019) MD5:b9834a2d0728c0d... title: CAS-ESM2-0 output prepared for CMIP6 tracking_id: hdl:21.14100/22e89a1b-f73e-45be-84dc-7d0aabbeea9d variable_id: tas variant_label: r1i1p1f1 intake_esm_varname: ['tas'] intake_esm_dataset_key: CMIP.CAS.CAS-ESM2-0.historical.Amon.gn