Manipulating DataFrame (in-memory catalog)¶
import warnings
warnings.filterwarnings("ignore")
import intake
The in-memory representation of an Earth System Model (ESM) catalog is a pandas
dataframe, and is accessible via the .df
property:
url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
col.df.head()
activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year | version | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | HighResMIP | CMCC | CMCC-CM2-HR4 | highresSST-present | r1i1p1f1 | Amon | hus | gn | gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... | NaN | 20170706 |
1 | HighResMIP | CMCC | CMCC-CM2-HR4 | highresSST-present | r1i1p1f1 | Amon | rsdt | gn | gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... | NaN | 20170706 |
2 | HighResMIP | CMCC | CMCC-CM2-HR4 | highresSST-present | r1i1p1f1 | Amon | prw | gn | gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... | NaN | 20170706 |
3 | HighResMIP | CMCC | CMCC-CM2-HR4 | highresSST-present | r1i1p1f1 | Amon | rlus | gn | gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... | NaN | 20170706 |
4 | HighResMIP | CMCC | CMCC-CM2-HR4 | highresSST-present | r1i1p1f1 | Amon | rlds | gn | gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... | NaN | 20170706 |
In this notebook we will go through some examples showing how to manipulate this dataframe outside of intake-esm.
Use Case 1: Complex Search Queries¶
Let’s say we are interested in datasets with the following attributes:
experiment_id=["historical"]
table_id="Amon"
variable_id="tas"
source_id=['TaiESM1', 'AWI-CM-1-1-MR', 'AWI-ESM-1-1-LR', 'BCC-CSM2-MR', 'BCC-ESM1', 'CAMS-CSM1-0', 'CAS-ESM2-0', 'UKESM1-0-LL']
In addition to these attributes, we are interested in the first ensemble member (member_id) of each model (source_id) only.
This can be achieved in two steps:
Step 1: Run a query against the catalog¶
We can run a query against the catalog:
col_subset = col.search(
experiment_id=["historical"],
table_id="Amon",
variable_id="tas",
source_id=[
"TaiESM1",
"AWI-CM-1-1-MR",
"AWI-ESM-1-1-LR",
"BCC-CSM2-MR",
"BCC-ESM1",
"CAMS-CSM1-0",
"CAS-ESM2-0",
"UKESM1-0-LL",
],
)
col_subset
pangeo-cmip6 catalog with 9 dataset(s) from 40 asset(s):
unique | |
---|---|
activity_id | 1 |
institution_id | 7 |
source_id | 8 |
experiment_id | 1 |
member_id | 24 |
table_id | 1 |
variable_id | 1 |
grid_label | 1 |
zstore | 40 |
dcpp_init_year | 0 |
version | 26 |
Step 2: Select the first member_id
for each source_id
¶
The subsetted catalog contains source_id
with the following number of
member_id
per source_id
:
col_subset.df.groupby("source_id")["member_id"].nunique()
source_id
AWI-CM-1-1-MR 5
AWI-ESM-1-1-LR 1
BCC-CSM2-MR 3
BCC-ESM1 3
CAMS-CSM1-0 3
CAS-ESM2-0 4
TaiESM1 2
UKESM1-0-LL 19
Name: member_id, dtype: int64
To get the first member_id
for each source_id
, we group the dataframe by
source_id
and use the .first()
function to retrieve the first member_id
:
grouped = col_subset.df.groupby(["source_id"])
df = grouped.first().reset_index()
# Confirm that we have one ensemble member per source_id
df.groupby("source_id")["member_id"].nunique()
source_id
AWI-CM-1-1-MR 1
AWI-ESM-1-1-LR 1
BCC-CSM2-MR 1
BCC-ESM1 1
CAMS-CSM1-0 1
CAS-ESM2-0 1
TaiESM1 1
UKESM1-0-LL 1
Name: member_id, dtype: int64
Step 3: Attach the new dataframe to our catalog object¶
col_subset.df = df
col_subset
pangeo-cmip6 catalog with 8 dataset(s) from 8 asset(s):
unique | |
---|---|
source_id | 8 |
activity_id | 1 |
institution_id | 6 |
experiment_id | 1 |
member_id | 4 |
table_id | 1 |
variable_id | 1 |
grid_label | 1 |
zstore | 8 |
dcpp_init_year | 0 |
version | 8 |
dsets = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
[key for key in dsets]
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn',
'CMIP.AS-RCEC.TaiESM1.historical.Amon.gn',
'CMIP.BCC.BCC-ESM1.historical.Amon.gn',
'CMIP.CAMS.CAMS-CSM1-0.historical.Amon.gn',
'CMIP.AWI.AWI-CM-1-1-MR.historical.Amon.gn',
'CMIP.MOHC.UKESM1-0-LL.historical.Amon.gn',
'CMIP.AWI.AWI-ESM-1-1-LR.historical.Amon.gn',
'CMIP.CAS.CAS-ESM2-0.historical.Amon.gn']
print(dsets["CMIP.CAS.CAS-ESM2-0.historical.Amon.gn"])
<xarray.Dataset>
Dimensions: (lat: 128, bnds: 2, lon: 256, member_id: 1, time: 1980)
Coordinates:
height float64 ...
* lat (lat) float64 -90.0 -88.58 -87.17 -85.75 ... 87.17 88.58 90.0
lat_bnds (lat, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
* lon (lon) float64 0.0 1.406 2.812 4.219 ... 354.4 355.8 357.2 358.6
lon_bnds (lon, bnds) float64 dask.array<chunksize=(256, 2), meta=np.ndarray>
* time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
time_bnds (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
* member_id (member_id) <U8 'r1i1p1f1'
Dimensions without coordinates: bnds
Data variables:
tas (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 381, 128, 256), meta=np.ndarray>
Attributes: (12/51)
Conventions: CF-1.7 CMIP-6.2
activity_id: CMIP
branch_method: standard
branch_time_in_child: 0.0
branch_time_in_parent: 0.0
cmor_version: 3.5.0
... ...
variable_id: tas
variant_label: r1i1p1f1
netcdf_tracking_ids: hdl:21.14100/22e89a1b-f73e-45be-84dc-7d0aabbeea9d
version_id: v20200302
intake_esm_varname: ['tas']
intake_esm_dataset_key: CMIP.CAS.CAS-ESM2-0.historical.Amon.gn
import intake_esm # just to display version information
intake_esm.show_versions()
INSTALLED VERSIONS
------------------
cftime: 1.5.0
dask: 2021.08.0
fastprogress: 0.2.7
fsspec: 2021.07.0
gcsfs: 2021.07.0
intake: 0.6.3
intake_esm: 0.0.0
netCDF4: 1.5.7
pandas: 1.3.2
requests: 2.26.0
s3fs: 2021.07.0
xarray: 0.19.0
zarr: 2.8.3