Load CMIP6 Data with Intake ESM — Intake-ESM 2020.12.18 documentation

Loading a catalog¶

import warnings

warnings.filterwarnings("ignore")
import intake

url = (
    "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
)
col = intake.open_esm_datastore(url)
col

pangeo-cmip6 catalog with 6539 dataset(s) from 402033 asset(s):

	unique
activity_id	17
institution_id	35
source_id	84
experiment_id	160
member_id	549
table_id	37
variable_id	707
grid_label	10
zstore	402033
dcpp_init_year	60
version	606

The summary above tells us that this catalog contains over 400,000 data assets. We can get more information on the individual data assets contained in the catalog by accessing the underlying dataframe, which is created when the catalog is initialized:

Catalog contents¶

col.df.head()

	activity_id	institution_id	source_id	experiment_id	member_id	table_id	variable_id	grid_label	zstore	dcpp_init_year	version
0	AerChemMIP	AS-RCEC	TaiESM1	histSST	r1i1p1f1	AERmon	od550aer	gn	gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/...	NaN	20200310
1	AerChemMIP	BCC	BCC-ESM1	histSST	r1i1p1f1	AERmon	mmrbc	gn	gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i...	NaN	20190718
2	AerChemMIP	BCC	BCC-ESM1	histSST	r1i1p1f1	AERmon	mmrdust	gn	gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i...	NaN	20191127
3	AerChemMIP	BCC	BCC-ESM1	histSST	r1i1p1f1	AERmon	mmroa	gn	gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i...	NaN	20190809
4	AerChemMIP	BCC	BCC-ESM1	histSST	r1i1p1f1	AERmon	mmrso4	gn	gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i...	NaN	20191127

The first data asset listed in the catalog contains:

the ambient aerosol optical thickness at 550nm (variable_id='od550aer'), as a function of latitude, longitude, time,
in an individual climate model experiment with the Taiwan Earth System Model 1.0 model (source_id='TaiESM1'),
forced by the Historical transient with SSTs prescribed from historical experiment (experiment_id='histSST'),
developed by the Taiwan Research Center for Environmental Changes (institution_id='AS-RCEC'),
run as part of the Aerosols and Chemistry Model Intercomparison Project (activity_id='AerChemMIP')

It is located in Google Cloud Storage at gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/.
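The store path mirrors the catalog columns, so the facets of an asset can be read back from it. The following is a minimal sketch in plain Python. The facet order is inferred from the example path above rather than taken from an official path specification, and `parse_zstore` is a hypothetical helper, not part of intake-esm.

```python
# Facet order assumed from the example path; not an official specification.
FACETS = [
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label",
]


def parse_zstore(zstore):
    """Split a gs://cmip6/... store path into a facet dictionary (hypothetical helper)."""
    prefix = "gs://cmip6/"
    if not zstore.startswith(prefix):
        raise ValueError(f"unexpected store path: {zstore}")
    parts = zstore[len(prefix):].strip("/").split("/")
    # zip stops at the shorter sequence, so extra trailing parts are ignored.
    return dict(zip(FACETS, parts))


facets = parse_zstore(
    "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/"
)
print(facets["source_id"], facets["variable_id"])  # TaiESM1 od550aer
```

In practice the catalog columns already hold these values, so parsing the path is mainly useful as a consistency check.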

Finding unique entries¶

Let’s query the data to see what models (source_id), experiments (experiment_id) and temporal frequencies (table_id) are available.

import pprint

uni_dict = col.unique(["source_id", "experiment_id", "table_id"])
pprint.pprint(uni_dict, compact=True)

{'experiment_id': {'count': 160,
                   'values': ['1pctCO2', '1pctCO2-bgc', '1pctCO2-cdr',
                              '1pctCO2-rad', 'abrupt-0p5xCO2', 'abrupt-2xCO2',
                              'abrupt-4xCO2', 'abrupt-solm4p', 'abrupt-solp4p',
                              'amip', 'amip-4xCO2', 'amip-future4K',
                              'amip-hist', 'amip-lwoff', 'amip-m4K', 'amip-p4K',
                              'amip-p4K-lwoff', 'aqua-4xCO2', 'aqua-control',
                              'aqua-control-lwoff', 'aqua-p4K',
                              'aqua-p4K-lwoff', 'control-1950', 'dcppA-assim',
                              'dcppA-hindcast', 'dcppC-amv-ExTrop-neg',
                              'dcppC-amv-ExTrop-pos', 'dcppC-amv-Trop-neg',
                              'dcppC-amv-Trop-pos', 'dcppC-amv-neg',
                              'dcppC-amv-pos', 'dcppC-atl-control',
                              'dcppC-atl-pacemaker', 'dcppC-hindcast-noAgung',
                              'dcppC-hindcast-noElChichon',
                              'dcppC-hindcast-noPinatubo',
                              'dcppC-ipv-NexTrop-neg', 'dcppC-ipv-NexTrop-pos',
                              'dcppC-ipv-neg', 'dcppC-ipv-pos',
                              'dcppC-pac-control', 'dcppC-pac-pacemaker',
                              'deforest-globe', 'esm-hist', 'esm-pi-CO2pulse',
                              'esm-pi-cdr-pulse', 'esm-piControl',
                              'esm-piControl-spinup', 'esm-ssp585',
                              'esm-ssp585-ssp126Lu', 'faf-all', 'faf-heat',
                              'faf-heat-NA0pct', 'faf-heat-NA50pct',
                              'faf-passiveheat', 'faf-stress', 'faf-water',
                              'futSST-pdSIC', 'highresSST-future',
                              'highresSST-present', 'hist-1950', 'hist-1950HC',
                              'hist-CO2', 'hist-GHG', 'hist-GHG-cmip5',
                              'hist-aer', 'hist-aer-cmip5', 'hist-bgc',
                              'hist-nat', 'hist-nat-cmip5', 'hist-noLu',
                              'hist-piAer', 'hist-piNTCF', 'hist-resIPO',
                              'hist-sol', 'hist-stratO3', 'hist-totalO3',
                              'hist-volc', 'histSST', 'histSST-1950HC',
                              'histSST-piAer', 'histSST-piCH4',
                              'histSST-piNTCF', 'histSST-piO3', 'historical',
                              'historical-cmip5', 'historical-ext', 'land-hist',
                              'land-hist-altStartYear', 'land-noLu', 'lgm',
                              'lig127k', 'midHolocene', 'omip1', 'pa-futArcSIC',
                              'pa-pdSIC', 'past1000', 'pdSST-futAntSIC',
                              'pdSST-futArcSIC', 'pdSST-pdSIC',
                              'pdSST-piAntSIC', 'pdSST-piArcSIC',
                              'piClim-2xDMS', 'piClim-2xNOx', 'piClim-2xVOC',
                              'piClim-2xdust', 'piClim-2xfire', 'piClim-2xss',
                              'piClim-4xCO2', 'piClim-BC', 'piClim-CH4',
                              'piClim-HC', 'piClim-N2O', 'piClim-NOx',
                              'piClim-NTCF', 'piClim-O3', 'piClim-OC',
                              'piClim-SO2', 'piClim-VOC', 'piClim-aer',
                              'piClim-anthro', 'piClim-control', 'piClim-ghg',
                              'piClim-histaer', 'piClim-histall',
                              'piClim-histghg', 'piClim-histnat', 'piClim-lu',
                              'piControl', 'piControl-cmip5',
                              'piControl-spinup', 'piSST-pdSIC', 'piSST-piSIC',
                              'rcp26-cmip5', 'rcp45-cmip5', 'rcp85-cmip5',
                              'ssp119', 'ssp126', 'ssp126-ssp370Lu', 'ssp245',
                              'ssp245-GHG', 'ssp245-aer', 'ssp245-cov-fossil',
                              'ssp245-cov-modgreen', 'ssp245-cov-strgreen',
                              'ssp245-covid', 'ssp245-nat', 'ssp245-stratO3',
                              'ssp370', 'ssp370-lowNTCF', 'ssp370-ssp126Lu',
                              'ssp370SST', 'ssp370SST-lowCH4',
                              'ssp370SST-lowNTCF', 'ssp370SST-ssp126Lu',
                              'ssp370pdSST', 'ssp434', 'ssp460', 'ssp534-over',
                              'ssp585']},
 'source_id': {'count': 84,
               'values': ['ACCESS-CM2', 'ACCESS-ESM1-5', 'AWI-CM-1-1-MR',
                          'AWI-ESM-1-1-LR', 'BCC-CSM2-HR', 'BCC-CSM2-MR',
                          'BCC-ESM1', 'CAMS-CSM1-0', 'CAS-ESM2-0',
                          'CESM1-1-CAM5-CMIP5', 'CESM2', 'CESM2-FV2',
                          'CESM2-WACCM', 'CESM2-WACCM-FV2', 'CIESM',
                          'CMCC-CM2-HR4', 'CMCC-CM2-SR5', 'CMCC-CM2-VHR4',
                          'CMCC-ESM2', 'CNRM-CM6-1', 'CNRM-CM6-1-HR',
                          'CNRM-ESM2-1', 'CanESM5', 'CanESM5-CanOE', 'E3SM-1-0',
                          'E3SM-1-1', 'E3SM-1-1-ECA', 'EC-Earth3',
                          'EC-Earth3-AerChem', 'EC-Earth3-CC', 'EC-Earth3-LR',
                          'EC-Earth3-Veg', 'EC-Earth3-Veg-LR', 'EC-Earth3P',
                          'EC-Earth3P-HR', 'EC-Earth3P-VHR', 'ECMWF-IFS-HR',
                          'ECMWF-IFS-LR', 'FGOALS-f3-H', 'FGOALS-f3-L',
                          'FGOALS-g3', 'FIO-ESM-2-0', 'GFDL-AM4', 'GFDL-CM4',
                          'GFDL-CM4C192', 'GFDL-ESM2M', 'GFDL-ESM4',
                          'GFDL-OM4p5B', 'GISS-E2-1-G', 'GISS-E2-1-G-CC',
                          'GISS-E2-1-H', 'GISS-E2-2-G', 'HadGEM3-GC31-HM',
                          'HadGEM3-GC31-LL', 'HadGEM3-GC31-LM',
                          'HadGEM3-GC31-MM', 'IITM-ESM', 'INM-CM4-8',
                          'INM-CM5-0', 'INM-CM5-H', 'IPSL-CM5A2-INCA',
                          'IPSL-CM6A-ATM-HR', 'IPSL-CM6A-LR',
                          'IPSL-CM6A-LR-INCA', 'KACE-1-0-G', 'KIOST-ESM',
                          'MCM-UA-1-0', 'MIROC-ES2L', 'MIROC6',
                          'MPI-ESM-1-2-HAM', 'MPI-ESM1-2-HR', 'MPI-ESM1-2-LR',
                          'MPI-ESM1-2-XR', 'MRI-AGCM3-2-H', 'MRI-AGCM3-2-S',
                          'MRI-ESM2-0', 'NESM3', 'NorCPM1', 'NorESM1-F',
                          'NorESM2-LM', 'NorESM2-MM', 'SAM0-UNICON', 'TaiESM1',
                          'UKESM1-0-LL']},
 'table_id': {'count': 37,
              'values': ['3hr', '6hrLev', '6hrPlev', '6hrPlevPt', 'AERday',
                         'AERhr', 'AERmon', 'AERmonZ', 'Aclim', 'Amon', 'CF3hr',
                         'CFday', 'CFmon', 'E1hrClimMon', 'E3hr', 'Eclim',
                         'Eday', 'EdayZ', 'Efx', 'Emon', 'EmonZ', 'Eyr',
                         'IfxGre', 'ImonGre', 'LImon', 'Lmon', 'Oclim', 'Oday',
                         'Odec', 'Ofx', 'Omon', 'Oyr', 'SIclim', 'SIday',
                         'SImon', 'day', 'fx']}}
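The returned dictionary can be filtered with ordinary Python. The sketch below assumes the structure printed above (one entry per column, each with 'count' and 'values'), which may differ in other intake-esm versions. `uni_excerpt` is a hand-copied excerpt of that output, and `values_with_prefix` is a hypothetical helper.

```python
# Hand-copied excerpt of the output above; 'count' here describes the excerpt,
# not the full catalog.
uni_excerpt = {
    "experiment_id": {
        "count": 7,
        "values": ["amip", "historical", "piControl",
                   "ssp126", "ssp245", "ssp370", "ssp585"],
    },
}


def values_with_prefix(uni_dict, column, prefix):
    """Return the unique values of `column` that start with `prefix`."""
    return [v for v in uni_dict[column]["values"] if v.startswith(prefix)]


ssp_experiments = values_with_prefix(uni_excerpt, "experiment_id", "ssp")
print(ssp_experiments)  # ['ssp126', 'ssp245', 'ssp370', 'ssp585']
```

With the real catalog, the same helper could be called on `uni_dict` directly.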

Searching for specific datasets¶

In the example below, we are going to search for the following:

variables: o2 which stands for mole_concentration_of_dissolved_molecular_oxygen_in_sea_water
experiments: ['historical', 'ssp585']:
- historical: all forcing of the recent past.
- ssp585: update of RCP8.5 based on SSP5.
table_id: Oyr which stands for annual mean variables on the ocean grid.
grid_label: gn which stands for data reported on a model’s native grid.

For more details on the CMIP6 vocabulary, please check this website and the Core Controlled Vocabularies (CVs) for use in CMIP6 GitHub repository.

cat = col.search(
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)

cat

pangeo-cmip6 catalog with 23 dataset(s) from 149 asset(s):

	unique
activity_id	2
institution_id	11
source_id	12
experiment_id	2
member_id	41
table_id	1
variable_id	1
grid_label	1
zstore	149
dcpp_init_year	0
version	20

cat.df.head()

	activity_id	institution_id	source_id	experiment_id	member_id	table_id	variable_id	grid_label	zstore	dcpp_init_year	version
0	CMIP	CCCma	CanESM5-CanOE	historical	r1i1p2f1	Oyr	o2	gn	gs://cmip6/CMIP/CCCma/CanESM5-CanOE/historical...	NaN	20190429
1	CMIP	CCCma	CanESM5-CanOE	historical	r2i1p2f1	Oyr	o2	gn	gs://cmip6/CMIP/CCCma/CanESM5-CanOE/historical...	NaN	20190429
2	CMIP	CCCma	CanESM5-CanOE	historical	r3i1p2f1	Oyr	o2	gn	gs://cmip6/CMIP/CCCma/CanESM5-CanOE/historical...	NaN	20190429
3	CMIP	CCCma	CanESM5	historical	r10i1p1f1	Oyr	o2	gn	gs://cmip6/CMIP/CCCma/CanESM5/historical/r10i1...	NaN	20190429
4	CMIP	CCCma	CanESM5	historical	r10i1p2f1	Oyr	o2	gn	gs://cmip6/CMIP/CCCma/CanESM5/historical/r10i1...	NaN	20190429
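The search combines its arguments in a simple way: a list of values matches any of its members, while separate keyword arguments must all match. The toy sketch below mimics that behaviour on a few rows taken from the tables and keys shown on this page. It only illustrates the matching rule and is not how intake-esm implements `search`; `matches` is a hypothetical helper.

```python
def matches(row, **query):
    """Return True if `row` satisfies every facet constraint in `query`."""
    for facet, wanted in query.items():
        # A scalar is treated as a one-element list of allowed values.
        allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        if row[facet] not in allowed:
            return False
    return True


rows = [
    {"source_id": "TaiESM1", "experiment_id": "histSST", "table_id": "AERmon",
     "variable_id": "od550aer", "grid_label": "gn"},
    {"source_id": "BCC-ESM1", "experiment_id": "histSST", "table_id": "AERmon",
     "variable_id": "mmrbc", "grid_label": "gn"},
    {"source_id": "CanESM5", "experiment_id": "historical", "table_id": "Oyr",
     "variable_id": "o2", "grid_label": "gn"},
    {"source_id": "CanESM5", "experiment_id": "ssp585", "table_id": "Oyr",
     "variable_id": "o2", "grid_label": "gn"},
]

hits = [
    r for r in rows
    if matches(r, experiment_id=["historical", "ssp585"], table_id="Oyr",
               variable_id="o2", grid_label="gn")
]
print([(r["source_id"], r["experiment_id"]) for r in hits])
# [('CanESM5', 'historical'), ('CanESM5', 'ssp585')]
```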

Loading datasets using `to_dataset_dict()`¶

dset_dict = cat.to_dataset_dict(
    zarr_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

100.00% [23/23 00:08<00:00]

list(dset_dict.keys())

['ScenarioMIP.DWD.MPI-ESM1-2-HR.ssp585.Oyr.gn',
 'CMIP.MRI.MRI-ESM2-0.historical.Oyr.gn',
 'ScenarioMIP.MRI.MRI-ESM2-0.ssp585.Oyr.gn',
 'CMIP.MPI-M.MPI-ESM1-2-HR.historical.Oyr.gn',
 'CMIP.HAMMOZ-Consortium.MPI-ESM-1-2-HAM.historical.Oyr.gn',
 'CMIP.CCCma.CanESM5.historical.Oyr.gn',
 'CMIP.NCC.NorESM2-MM.historical.Oyr.gn',
 'ScenarioMIP.MPI-M.MPI-ESM1-2-LR.ssp585.Oyr.gn',
 'ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Oyr.gn',
 'CMIP.MPI-M.MPI-ESM1-2-LR.historical.Oyr.gn',
 'CMIP.NCC.NorESM2-LM.historical.Oyr.gn',
 'ScenarioMIP.MIROC.MIROC-ES2L.ssp585.Oyr.gn',
 'ScenarioMIP.NCC.NorESM2-LM.ssp585.Oyr.gn',
 'ScenarioMIP.NCC.NorESM2-MM.ssp585.Oyr.gn',
 'CMIP.CCCma.CanESM5-CanOE.historical.Oyr.gn',
 'ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp585.Oyr.gn',
 'ScenarioMIP.NCAR.CESM2.ssp585.Oyr.gn',
 'CMIP.MIROC.MIROC-ES2L.historical.Oyr.gn',
 'CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn',
 'ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp585.Oyr.gn',
 'CMIP.CSIRO.ACCESS-ESM1-5.historical.Oyr.gn',
 'ScenarioMIP.CCCma.CanESM5-CanOE.ssp585.Oyr.gn',
 'ScenarioMIP.CCCma.CanESM5.ssp585.Oyr.gn']
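This is why 149 assets yield only 23 datasets: assets whose key facets agree end up in the same dataset. As a rough sketch, the keys can be rebuilt by joining those six columns with dots. The rows are the five shown in the `cat.df.head()` output of the search, `dataset_key` is a hypothetical helper, and the grouping is an illustration of the key template rather than intake-esm's own aggregation code.

```python
from collections import defaultdict

# Columns used in the key template printed by to_dataset_dict().
KEY_FACETS = ["activity_id", "institution_id", "source_id",
              "experiment_id", "table_id", "grid_label"]


def dataset_key(row):
    """Join the key facets of one catalog row with dots (hypothetical helper)."""
    return ".".join(row[f] for f in KEY_FACETS)


base = {"activity_id": "CMIP", "institution_id": "CCCma",
        "experiment_id": "historical", "table_id": "Oyr", "grid_label": "gn"}
rows = [
    {**base, "source_id": "CanESM5-CanOE", "member_id": "r1i1p2f1"},
    {**base, "source_id": "CanESM5-CanOE", "member_id": "r2i1p2f1"},
    {**base, "source_id": "CanESM5-CanOE", "member_id": "r3i1p2f1"},
    {**base, "source_id": "CanESM5", "member_id": "r10i1p1f1"},
    {**base, "source_id": "CanESM5", "member_id": "r10i1p2f1"},
]

# Group the member_id values of all assets that share a key.
members_by_key = defaultdict(list)
for row in rows:
    members_by_key[dataset_key(row)].append(row["member_id"])

for key, members in members_by_key.items():
    print(key, members)
```

Five assets collapse into two keys here, with the ensemble members collected under each key.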

We can access a particular dataset as follows:

ds = dset_dict["CMIP.CCCma.CanESM5.historical.Oyr.gn"]
print(ds)

<xarray.Dataset>
Dimensions:             (bnds: 2, i: 360, j: 291, lev: 45, member_id: 35, time: 165, vertices: 4)
Coordinates:
  * i                   (i) int32 0 1 2 3 4 5 6 ... 353 354 355 356 357 358 359
  * j                   (j) int32 0 1 2 3 4 5 6 ... 284 285 286 287 288 289 290
    latitude            (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * lev                 (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
    lev_bnds            (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
    longitude           (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * time                (time) object 1850-07-02 12:00:00 ... 2014-07-02 12:0...
    time_bnds           (time, bnds) object dask.array<chunksize=(165, 2), meta=np.ndarray>
  * member_id           (member_id) <U9 'r10i1p1f1' 'r10i1p2f1' ... 'r9i1p2f1'
Dimensions without coordinates: bnds, vertices
Data variables:
    o2                  (member_id, time, lev, j, i) float32 dask.array<chunksize=(1, 12, 45, 291, 360), meta=np.ndarray>
    vertices_latitude   (j, i, vertices) float64 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
    vertices_longitude  (j, i, vertices) float64 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
Attributes:
    table_id:                    Oyr
    variable_id:                 o2
    sub_experiment_id:           none
    activity_id:                 CMIP
    CCCma_runid:                 p2-his09
    CCCma_model_hash:            Unknown
    version:                     v20190429
    creation_date:               2019-05-30T08:58:45Z
    parent_time_units:           days since 1850-01-01 0:0:0.0
    references:                  Geophysical Model Development Special issue ...
    variant_label:               r9i1p2f1
    product:                     model-output
    cmor_version:                3.4.0
    source:                      CanESM5 (2019): \naerosol: interactive\natmo...
    source_id:                   CanESM5
    YMDH_branch_time_in_child:   1850:01:01:00
    tracking_id:                 hdl:21.14100/41426118-701c-482b-ae16-82932e4...
    Conventions:                 CF-1.7 CMIP-6.2
    external_variables:          areacello volcello
    parent_source_id:            CanESM5
    title:                       CanESM5 output prepared for CMIP6
    further_info_url:            https://furtherinfo.es-doc.org/CMIP6.CCCma.C...
    parent_mip_era:              CMIP6
    intake_esm_varname:          ['o2']
    parent_activity_id:          CMIP
    experiment_id:               historical
    realization_index:           9
    parent_experiment_id:        piControl
    source_type:                 AOGCM
    mip_era:                     CMIP6
    frequency:                   yr
    grid:                        ORCA1 tripolar grid, 1 deg with refinement t...
    CCCma_parent_runid:          p2-pictrl
    status:                      2019-10-25;created;by nhn2@columbia.edu
    license:                     CMIP6 model data produced by The Government ...
    YMDH_branch_time_in_parent:  5950:01:01:00
    branch_method:               Spin-up documentation
    contact:                     ec.cccma.info-info.ccmac.ec@canada.ca
    table_info:                  Creation Date:(20 February 2019) MD5:374fbe5...
    sub_experiment:              none
    forcing_index:               1
    branch_time_in_child:        0.0
    institution:                 Canadian Centre for Climate Modelling and An...
    realm:                       ocnBgchem
    branch_time_in_parent:       1496500.0
    experiment:                  all-forcing simulation of the recent past
    institution_id:              CCCma
    history:                     2019-05-02T13:53:53Z ;rewrote data to be con...
    data_specs_version:          01.00.29
    grid_label:                  gn
    nominal_resolution:          100 km
    initialization_index:        1
    intake_esm_dataset_key:      CMIP.CCCma.CanESM5.historical.Oyr.gn

Let’s create a quick plot for a slice of the data:

ds.o2.isel(time=0, lev=0, member_id=range(1, 24, 4)).plot(col="member_id", col_wrap=3, robust=True)

<xarray.plot.facetgrid.FacetGrid at 0x7ff83c2c7190>

Using custom preprocessing functions¶

When comparing many models, it is often necessary to preprocess the datasets (e.g. rename certain variables) before running an analysis step. The preprocess argument of to_dataset_dict() lets the user pass a function, which is executed for each loaded asset before aggregation.

cat_pp = col.search(
    experiment_id=["historical"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
    source_id=["IPSL-CM6A-LR", "CanESM5"],
    member_id="r10i1p1f1",
)
cat_pp.df

	activity_id	institution_id	source_id	experiment_id	member_id	table_id	variable_id	grid_label	zstore	dcpp_init_year	version
0	CMIP	CCCma	CanESM5	historical	r10i1p1f1	Oyr	o2	gn	gs://cmip6/CMIP/CCCma/CanESM5/historical/r10i1...	NaN	20190429
1	CMIP	IPSL	IPSL-CM6A-LR	historical	r10i1p1f1	Oyr	o2	gn	gs://cmip6/CMIP/IPSL/IPSL-CM6A-LR/historical/r...	NaN	20180803

# load the example
dset_dict_raw = cat_pp.to_dataset_dict(zarr_kwargs={"consolidated": True})

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

100.00% [2/2 00:00<00:00]

for k, ds in dset_dict_raw.items():
    print(f"dataset key={k}\n\tdimensions={sorted(ds.dims)}\n")

dataset key=CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn
	dimensions=['axis_nbounds', 'member_id', 'nvertex', 'olevel', 'time', 'x', 'y']

dataset key=CMIP.CCCma.CanESM5.historical.Oyr.gn
	dimensions=['bnds', 'i', 'j', 'lev', 'member_id', 'time', 'vertices']

Note

The two models follow different naming schemes. We can define a small helper function and pass it to .to_dataset_dict() to fix this. For demonstration purposes, we will focus on the vertical level dimension, which is called lev in CanESM5 and olevel in IPSL-CM6A-LR.

def helper_func(ds):
    """Rename `olevel` dim to `lev`"""
    ds = ds.copy()
    # a short example
    if "olevel" in ds.dims:
        ds = ds.rename({"olevel": "lev"})
    return ds

dset_dict_fixed = cat_pp.to_dataset_dict(zarr_kwargs={"consolidated": True}, preprocess=helper_func)

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

100.00% [2/2 00:00<00:00]

for k, ds in dset_dict_fixed.items():
    print(f"dataset key={k}\n\tdimensions={sorted(ds.dims)}\n")

dataset key=CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn
	dimensions=['axis_nbounds', 'lev', 'member_id', 'nvertex', 'time', 'x', 'y']

dataset key=CMIP.CCCma.CanESM5.historical.Oyr.gn
	dimensions=['bnds', 'i', 'j', 'lev', 'member_id', 'time', 'vertices']

This example handles only one dimension; the other dimension names (for example x and y versus i and j) still differ between the two models.
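The same idea extends to several dimensions by driving the rename from a lookup table. The sketch below is one possible way to do that. The alias table is an assumption read off the two dimension lists printed earlier (in particular, pairing x with i and y with j is a guess), and `build_rename_map` and `rename_dims` are hypothetical helpers.

```python
# Assumed aliases: IPSL-CM6A-LR dimension names -> CanESM5 dimension names.
DIM_ALIASES = {
    "olevel": "lev",
    "x": "i",
    "y": "j",
    "nvertex": "vertices",
    "axis_nbounds": "bnds",
}


def build_rename_map(dims, aliases=DIM_ALIASES):
    """Keep only aliases whose old name is present and whose new name is free."""
    dims = set(dims)
    return {old: new for old, new in aliases.items()
            if old in dims and new not in dims}


def rename_dims(ds):
    """Preprocess-style function; expects an object with .dims and .rename()."""
    return ds.rename(build_rename_map(ds.dims))


ipsl_dims = ["axis_nbounds", "member_id", "nvertex", "olevel", "time", "x", "y"]
canesm_dims = ["bnds", "i", "j", "lev", "member_id", "time", "vertices"]
print(build_rename_map(ipsl_dims))
print(build_rename_map(canesm_dims))  # {}
```

Passing `rename_dims` as `preprocess` would then be analogous to the `helper_func` call above, although it has not been checked against the real datasets here.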

Note

Check out the cmip6-preprocessing package for a full renaming function for all available CMIP6 models and some other utilities.