Load CMIP6 Data with Intake ESM

This notebook demonstrates how to access CMIP6 data hosted on Google Cloud using intake-esm.

Loading a catalog

import warnings

warnings.filterwarnings("ignore")
import intake
url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
col

pangeo-cmip6 catalog with 6539 dataset(s) from 402033 asset(s):

unique
activity_id 17
institution_id 35
source_id 84
experiment_id 160
member_id 549
table_id 37
variable_id 707
grid_label 10
zstore 402033
dcpp_init_year 60
version 606

The summary above tells us that this catalog contains over 400,000 data assets. We can get more information on the individual data assets by inspecting the underlying dataframe, which is created when the catalog is initialized:

Catalog Contents

col.df.head()
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 AerChemMIP AS-RCEC TaiESM1 histSST r1i1p1f1 AERmon od550aer gn gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/... NaN 20200310
1 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrbc gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20190718
2 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrdust gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20191127
3 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmroa gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20190809
4 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrso4 gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20191127

The first data asset listed in the catalog contains:

  • the ambient aerosol optical thickness at 550nm (variable_id='od550aer'), as a function of latitude, longitude, time,

  • in an individual climate model experiment with the Taiwan Earth System Model 1.0 model (source_id='TaiESM1'),

  • forced by the Historical transient with SSTs prescribed from historical experiment (experiment_id='histSST'),

  • developed by the Taiwan Research Center for Environmental Changes (institution_id='AS-RCEC'),

  • run as part of the Aerosols and Chemistry Model Intercomparison Project (activity_id='AerChemMIP')

And is located in Google Cloud Storage at gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/.
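The store path above appears to be assembled from the same fields that make up the catalog columns. As a minimal sketch, the hypothetical helper below (parse_zstore is not part of intake-esm) splits such a path back into named fields. The field order is an assumption inferred from this single example path.

```python
# Field order is an assumption inferred from the one path shown above.
FIELDS = (
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label",
)


def parse_zstore(path, prefix="gs://cmip6/"):
    """Split a store path into catalog-style fields (hypothetical helper)."""
    if not path.startswith(prefix):
        raise ValueError(f"unexpected store path: {path!r}")
    parts = path[len(prefix):].strip("/").split("/")
    if len(parts) < len(FIELDS):
        raise ValueError(f"too few path segments: {path!r}")
    # zip() stops at the shorter input, so extra trailing segments are ignored.
    return dict(zip(FIELDS, parts))


fields = parse_zstore(
    "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/"
)
print(fields["source_id"], fields["variable_id"])
```

In practice the catalog columns already carry this information, so parsing the path is mainly useful as a consistency check.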

Finding unique entries

Let’s query the data to see what models (source_id), experiments (experiment_id) and temporal frequencies (table_id) are available.

import pprint

uni_dict = col.unique(["source_id", "experiment_id", "table_id"])
pprint.pprint(uni_dict, compact=True)
{'experiment_id': {'count': 160,
                   'values': ['1pctCO2', '1pctCO2-bgc', '1pctCO2-cdr',
                              '1pctCO2-rad', 'abrupt-0p5xCO2', 'abrupt-2xCO2',
                              'abrupt-4xCO2', 'abrupt-solm4p', 'abrupt-solp4p',
                              'amip', 'amip-4xCO2', 'amip-future4K',
                              'amip-hist', 'amip-lwoff', 'amip-m4K', 'amip-p4K',
                              'amip-p4K-lwoff', 'aqua-4xCO2', 'aqua-control',
                              'aqua-control-lwoff', 'aqua-p4K',
                              'aqua-p4K-lwoff', 'control-1950', 'dcppA-assim',
                              'dcppA-hindcast', 'dcppC-amv-ExTrop-neg',
                              'dcppC-amv-ExTrop-pos', 'dcppC-amv-Trop-neg',
                              'dcppC-amv-Trop-pos', 'dcppC-amv-neg',
                              'dcppC-amv-pos', 'dcppC-atl-control',
                              'dcppC-atl-pacemaker', 'dcppC-hindcast-noAgung',
                              'dcppC-hindcast-noElChichon',
                              'dcppC-hindcast-noPinatubo',
                              'dcppC-ipv-NexTrop-neg', 'dcppC-ipv-NexTrop-pos',
                              'dcppC-ipv-neg', 'dcppC-ipv-pos',
                              'dcppC-pac-control', 'dcppC-pac-pacemaker',
                              'deforest-globe', 'esm-hist', 'esm-pi-CO2pulse',
                              'esm-pi-cdr-pulse', 'esm-piControl',
                              'esm-piControl-spinup', 'esm-ssp585',
                              'esm-ssp585-ssp126Lu', 'faf-all', 'faf-heat',
                              'faf-heat-NA0pct', 'faf-heat-NA50pct',
                              'faf-passiveheat', 'faf-stress', 'faf-water',
                              'futSST-pdSIC', 'highresSST-future',
                              'highresSST-present', 'hist-1950', 'hist-1950HC',
                              'hist-CO2', 'hist-GHG', 'hist-GHG-cmip5',
                              'hist-aer', 'hist-aer-cmip5', 'hist-bgc',
                              'hist-nat', 'hist-nat-cmip5', 'hist-noLu',
                              'hist-piAer', 'hist-piNTCF', 'hist-resIPO',
                              'hist-sol', 'hist-stratO3', 'hist-totalO3',
                              'hist-volc', 'histSST', 'histSST-1950HC',
                              'histSST-piAer', 'histSST-piCH4',
                              'histSST-piNTCF', 'histSST-piO3', 'historical',
                              'historical-cmip5', 'historical-ext', 'land-hist',
                              'land-hist-altStartYear', 'land-noLu', 'lgm',
                              'lig127k', 'midHolocene', 'omip1', 'pa-futArcSIC',
                              'pa-pdSIC', 'past1000', 'pdSST-futAntSIC',
                              'pdSST-futArcSIC', 'pdSST-pdSIC',
                              'pdSST-piAntSIC', 'pdSST-piArcSIC',
                              'piClim-2xDMS', 'piClim-2xNOx', 'piClim-2xVOC',
                              'piClim-2xdust', 'piClim-2xfire', 'piClim-2xss',
                              'piClim-4xCO2', 'piClim-BC', 'piClim-CH4',
                              'piClim-HC', 'piClim-N2O', 'piClim-NOx',
                              'piClim-NTCF', 'piClim-O3', 'piClim-OC',
                              'piClim-SO2', 'piClim-VOC', 'piClim-aer',
                              'piClim-anthro', 'piClim-control', 'piClim-ghg',
                              'piClim-histaer', 'piClim-histall',
                              'piClim-histghg', 'piClim-histnat', 'piClim-lu',
                              'piControl', 'piControl-cmip5',
                              'piControl-spinup', 'piSST-pdSIC', 'piSST-piSIC',
                              'rcp26-cmip5', 'rcp45-cmip5', 'rcp85-cmip5',
                              'ssp119', 'ssp126', 'ssp126-ssp370Lu', 'ssp245',
                              'ssp245-GHG', 'ssp245-aer', 'ssp245-cov-fossil',
                              'ssp245-cov-modgreen', 'ssp245-cov-strgreen',
                              'ssp245-covid', 'ssp245-nat', 'ssp245-stratO3',
                              'ssp370', 'ssp370-lowNTCF', 'ssp370-ssp126Lu',
                              'ssp370SST', 'ssp370SST-lowCH4',
                              'ssp370SST-lowNTCF', 'ssp370SST-ssp126Lu',
                              'ssp370pdSST', 'ssp434', 'ssp460', 'ssp534-over',
                              'ssp585']},
 'source_id': {'count': 84,
               'values': ['ACCESS-CM2', 'ACCESS-ESM1-5', 'AWI-CM-1-1-MR',
                          'AWI-ESM-1-1-LR', 'BCC-CSM2-HR', 'BCC-CSM2-MR',
                          'BCC-ESM1', 'CAMS-CSM1-0', 'CAS-ESM2-0',
                          'CESM1-1-CAM5-CMIP5', 'CESM2', 'CESM2-FV2',
                          'CESM2-WACCM', 'CESM2-WACCM-FV2', 'CIESM',
                          'CMCC-CM2-HR4', 'CMCC-CM2-SR5', 'CMCC-CM2-VHR4',
                          'CMCC-ESM2', 'CNRM-CM6-1', 'CNRM-CM6-1-HR',
                          'CNRM-ESM2-1', 'CanESM5', 'CanESM5-CanOE', 'E3SM-1-0',
                          'E3SM-1-1', 'E3SM-1-1-ECA', 'EC-Earth3',
                          'EC-Earth3-AerChem', 'EC-Earth3-CC', 'EC-Earth3-LR',
                          'EC-Earth3-Veg', 'EC-Earth3-Veg-LR', 'EC-Earth3P',
                          'EC-Earth3P-HR', 'EC-Earth3P-VHR', 'ECMWF-IFS-HR',
                          'ECMWF-IFS-LR', 'FGOALS-f3-H', 'FGOALS-f3-L',
                          'FGOALS-g3', 'FIO-ESM-2-0', 'GFDL-AM4', 'GFDL-CM4',
                          'GFDL-CM4C192', 'GFDL-ESM2M', 'GFDL-ESM4',
                          'GFDL-OM4p5B', 'GISS-E2-1-G', 'GISS-E2-1-G-CC',
                          'GISS-E2-1-H', 'GISS-E2-2-G', 'HadGEM3-GC31-HM',
                          'HadGEM3-GC31-LL', 'HadGEM3-GC31-LM',
                          'HadGEM3-GC31-MM', 'IITM-ESM', 'INM-CM4-8',
                          'INM-CM5-0', 'INM-CM5-H', 'IPSL-CM5A2-INCA',
                          'IPSL-CM6A-ATM-HR', 'IPSL-CM6A-LR',
                          'IPSL-CM6A-LR-INCA', 'KACE-1-0-G', 'KIOST-ESM',
                          'MCM-UA-1-0', 'MIROC-ES2L', 'MIROC6',
                          'MPI-ESM-1-2-HAM', 'MPI-ESM1-2-HR', 'MPI-ESM1-2-LR',
                          'MPI-ESM1-2-XR', 'MRI-AGCM3-2-H', 'MRI-AGCM3-2-S',
                          'MRI-ESM2-0', 'NESM3', 'NorCPM1', 'NorESM1-F',
                          'NorESM2-LM', 'NorESM2-MM', 'SAM0-UNICON', 'TaiESM1',
                          'UKESM1-0-LL']},
 'table_id': {'count': 37,
              'values': ['3hr', '6hrLev', '6hrPlev', '6hrPlevPt', 'AERday',
                         'AERhr', 'AERmon', 'AERmonZ', 'Aclim', 'Amon', 'CF3hr',
                         'CFday', 'CFmon', 'E1hrClimMon', 'E3hr', 'Eclim',
                         'Eday', 'EdayZ', 'Efx', 'Emon', 'EmonZ', 'Eyr',
                         'IfxGre', 'ImonGre', 'LImon', 'Lmon', 'Oclim', 'Oday',
                         'Odec', 'Ofx', 'Omon', 'Oyr', 'SIclim', 'SIday',
                         'SImon', 'day', 'fx']}}
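To make the shape of this result concrete, here is a small sketch that builds the same kind of count/values dictionary from a few rows. It uses plain Python dictionaries as a stand-in for rows of col.df; unique_summary is a hypothetical helper and is not how intake-esm computes the result internally.

```python
def unique_summary(rows, columns):
    """Return {column: {"count": n, "values": [...]}} for the given columns."""
    summary = {}
    for column in columns:
        values = sorted({row[column] for row in rows})
        summary[column] = {"count": len(values), "values": values}
    return summary


# Three rows taken from the catalog preview earlier in this notebook.
rows = [
    {"source_id": "TaiESM1", "experiment_id": "histSST", "table_id": "AERmon"},
    {"source_id": "BCC-ESM1", "experiment_id": "histSST", "table_id": "AERmon"},
    {"source_id": "BCC-ESM1", "experiment_id": "histSST", "table_id": "AERmon"},
]

result = unique_summary(rows, ["source_id", "experiment_id", "table_id"])
print(result["source_id"])
```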

Searching for specific datasets

In the example below, we are going to search for the following:

  • variables: o2 which stands for mole_concentration_of_dissolved_molecular_oxygen_in_sea_water

  • experiments: ['historical', 'ssp585']:

    • historical: all-forcing simulation of the recent past.

    • ssp585: update of RCP8.5 based on SSP5.

  • table_id: Oyr which stands for annual mean variables on the ocean grid.

  • grid_label: gn which stands for data reported on a model’s native grid.

For more details on the CMIP6 vocabulary, please check this website and the "Core Controlled Vocabularies (CVs) for use in CMIP6" GitHub repository.

cat = col.search(
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)

cat

pangeo-cmip6 catalog with 23 dataset(s) from 149 asset(s):

unique
activity_id 2
institution_id 11
source_id 12
experiment_id 2
member_id 41
table_id 1
variable_id 1
grid_label 1
zstore 149
dcpp_init_year 0
version 20
cat.df.head()
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 CMIP CCCma CanESM5-CanOE historical r1i1p2f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5-CanOE/historical... NaN 20190429
1 CMIP CCCma CanESM5-CanOE historical r2i1p2f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5-CanOE/historical... NaN 20190429
2 CMIP CCCma CanESM5-CanOE historical r3i1p2f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5-CanOE/historical... NaN 20190429
3 CMIP CCCma CanESM5 historical r10i1p1f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5/historical/r10i1... NaN 20190429
4 CMIP CCCma CanESM5 historical r10i1p2f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5/historical/r10i1... NaN 20190429
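The search above behaves like a row filter: a row is kept when it matches every queried column, and a list of values for one column means any of them is acceptable. The sketch below reproduces that logic on plain dictionaries. filter_rows is a hypothetical stand-in for illustration, not the intake-esm implementation.

```python
def filter_rows(rows, **query):
    """Keep rows that match every queried column (hypothetical helper)."""

    def matches(row):
        for column, wanted in query.items():
            # A single string is treated as a one-element list of allowed values.
            allowed = [wanted] if isinstance(wanted, str) else list(wanted)
            if row[column] not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


# A few rows shaped like the catalog; only the queried columns are included.
rows = [
    {"source_id": "CanESM5", "experiment_id": "historical",
     "table_id": "Oyr", "variable_id": "o2", "grid_label": "gn"},
    {"source_id": "CanESM5", "experiment_id": "ssp585",
     "table_id": "Oyr", "variable_id": "o2", "grid_label": "gn"},
    {"source_id": "TaiESM1", "experiment_id": "histSST",
     "table_id": "AERmon", "variable_id": "od550aer", "grid_label": "gn"},
]

hits = filter_rows(
    rows,
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)
print(len(hits))
```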

Loading datasets using to_dataset_dict()

dset_dict = cat.to_dataset_dict(
    zarr_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
100.00% [23/23 00:06<00:00]
list(dset_dict.keys())
['ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp585.Oyr.gn',
 'ScenarioMIP.MRI.MRI-ESM2-0.ssp585.Oyr.gn',
 'ScenarioMIP.MPI-M.MPI-ESM1-2-LR.ssp585.Oyr.gn',
 'ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Oyr.gn',
 'ScenarioMIP.NCC.NorESM2-MM.ssp585.Oyr.gn',
 'CMIP.HAMMOZ-Consortium.MPI-ESM-1-2-HAM.historical.Oyr.gn',
 'CMIP.NCC.NorESM2-MM.historical.Oyr.gn',
 'CMIP.CCCma.CanESM5-CanOE.historical.Oyr.gn',
 'ScenarioMIP.CCCma.CanESM5.ssp585.Oyr.gn',
 'ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp585.Oyr.gn',
 'ScenarioMIP.CCCma.CanESM5-CanOE.ssp585.Oyr.gn',
 'ScenarioMIP.NCAR.CESM2.ssp585.Oyr.gn',
 'ScenarioMIP.DWD.MPI-ESM1-2-HR.ssp585.Oyr.gn',
 'CMIP.MIROC.MIROC-ES2L.historical.Oyr.gn',
 'CMIP.MPI-M.MPI-ESM1-2-HR.historical.Oyr.gn',
 'CMIP.CSIRO.ACCESS-ESM1-5.historical.Oyr.gn',
 'ScenarioMIP.MIROC.MIROC-ES2L.ssp585.Oyr.gn',
 'CMIP.MPI-M.MPI-ESM1-2-LR.historical.Oyr.gn',
 'ScenarioMIP.NCC.NorESM2-LM.ssp585.Oyr.gn',
 'CMIP.NCC.NorESM2-LM.historical.Oyr.gn',
 'CMIP.CCCma.CanESM5.historical.Oyr.gn',
 'CMIP.MRI.MRI-ESM2-0.historical.Oyr.gn',
 'CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn']
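The key pattern printed above explains why 149 assets collapse into 23 datasets: assets that share the six key columns end up under the same key, while member_id is not part of the key. As a rough sketch of that grouping, assuming rows are plain dictionaries (dataset_key and group_by_key are hypothetical helpers, not intake-esm functions):

```python
KEY_COLUMNS = [
    "activity_id", "institution_id", "source_id",
    "experiment_id", "table_id", "grid_label",
]


def dataset_key(row, sep="."):
    """Join the key columns of one catalog row into a dataset key."""
    return sep.join(row[column] for column in KEY_COLUMNS)


def group_by_key(rows):
    """Map each dataset key to the member_ids of the assets it contains."""
    groups = {}
    for row in rows:
        groups.setdefault(dataset_key(row), []).append(row["member_id"])
    return groups


# Rows taken from the cat.df.head() preview above.
base = {"activity_id": "CMIP", "institution_id": "CCCma",
        "experiment_id": "historical", "table_id": "Oyr", "grid_label": "gn"}
rows = [
    {**base, "source_id": "CanESM5-CanOE", "member_id": "r1i1p2f1"},
    {**base, "source_id": "CanESM5", "member_id": "r10i1p1f1"},
    {**base, "source_id": "CanESM5", "member_id": "r10i1p2f1"},
]

groups = group_by_key(rows)
print(groups["CMIP.CCCma.CanESM5.historical.Oyr.gn"])
```

Here the two CanESM5 assets share one key, which matches the member_id dimension that appears in the loaded dataset.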

We can access a particular dataset as follows:

ds = dset_dict["CMIP.CCCma.CanESM5.historical.Oyr.gn"]
print(ds)
<xarray.Dataset>
Dimensions:    (i: 360, j: 291, lev: 45, member_id: 35, time: 165)
Coordinates:
  * i          (i) int32 0 1 2 3 4 5 6 7 8 ... 352 353 354 355 356 357 358 359
  * j          (j) int32 0 1 2 3 4 5 6 7 8 ... 283 284 285 286 287 288 289 290
    latitude   (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * lev        (lev) float64 3.047 9.454 16.36 ... 5.126e+03 5.375e+03 5.625e+03
    longitude  (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * time       (time) object 1850-07-02 12:00:00 ... 2014-07-02 12:00:00
  * member_id  (member_id) <U9 'r10i1p1f1' 'r10i1p2f1' ... 'r9i1p1f1' 'r9i1p2f1'
Data variables:
    o2         (member_id, time, lev, j, i) float32 dask.array<chunksize=(1, 12, 45, 291, 360), meta=np.ndarray>
Attributes:
    branch_time_in_child:        0.0
    realm:                       ocnBgchem
    parent_time_units:           days since 1850-01-01 0:0:0.0
    CCCma_runid:                 p2-his09
    title:                       CanESM5 output prepared for CMIP6
    Conventions:                 CF-1.7 CMIP-6.2
    version:                     v20190429
    CCCma_parent_runid:          p2-pictrl
    variant_label:               r9i1p2f1
    realization_index:           9
    status:                      2019-10-25;created;by nhn2@columbia.edu
    parent_experiment_id:        piControl
    institution_id:              CCCma
    branch_method:               Spin-up documentation
    experiment:                  all-forcing simulation of the recent past
    forcing_index:               1
    initialization_index:        1
    product:                     model-output
    YMDH_branch_time_in_child:   1850:01:01:00
    frequency:                   yr
    activity_id:                 CMIP
    references:                  Geophysical Model Development Special issue ...
    contact:                     ec.cccma.info-info.ccmac.ec@canada.ca
    source_id:                   CanESM5
    data_specs_version:          01.00.29
    cmor_version:                3.4.0
    external_variables:          areacello volcello
    CCCma_model_hash:            Unknown
    YMDH_branch_time_in_parent:  5950:01:01:00
    mip_era:                     CMIP6
    intake_esm_varname:          ['o2']
    variable_id:                 o2
    grid_label:                  gn
    license:                     CMIP6 model data produced by The Government ...
    table_id:                    Oyr
    nominal_resolution:          100 km
    grid:                        ORCA1 tripolar grid, 1 deg with refinement t...
    source_type:                 AOGCM
    parent_mip_era:              CMIP6
    sub_experiment:              none
    parent_activity_id:          CMIP
    experiment_id:               historical
    institution:                 Canadian Centre for Climate Modelling and An...
    table_info:                  Creation Date:(20 February 2019) MD5:374fbe5...
    further_info_url:            https://furtherinfo.es-doc.org/CMIP6.CCCma.C...
    creation_date:               2019-05-30T08:58:45Z
    parent_source_id:            CanESM5
    branch_time_in_parent:       1496500.0
    history:                     2019-05-02T13:53:53Z ;rewrote data to be con...
    source:                      CanESM5 (2019): \naerosol: interactive\natmo...
    tracking_id:                 hdl:21.14100/41426118-701c-482b-ae16-82932e4...
    sub_experiment_id:           none
    intake_esm_dataset_key:      CMIP.CCCma.CanESM5.historical.Oyr.gn

Let’s create a quick plot for a slice of the data:

ds.o2.isel(time=0, lev=0, member_id=range(1, 24, 4)).plot(
    col="member_id", col_wrap=3, robust=True
)
<xarray.plot.facetgrid.FacetGrid at 0x7fd599a2e0d0>
[Figure: o2 at the first time step and first vertical level, one panel per selected member_id]
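The member_id=range(1, 24, 4) argument in the plotting call selects members by position, not by label. The short sketch below only spells out which positions that range covers and how many rows col_wrap=3 produces; it does not touch the dataset itself.

```python
import math

# Positional indices picked by member_id=range(1, 24, 4) in the isel() call.
selected = list(range(1, 24, 4))
print(selected)  # [1, 5, 9, 13, 17, 21]

# With col_wrap=3, panels wrap onto a new row after every third panel.
n_rows = math.ceil(len(selected) / 3)
print(len(selected), "panels in", n_rows, "rows")
```

All six positions are valid here because the dataset has 35 entries along member_id.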

Using custom preprocessing functions

When comparing many models, it is often necessary to preprocess the datasets (e.g. rename certain variables) before running an analysis step. The preprocess argument lets the user pass a function that is executed on each loaded asset before any aggregation takes place.

cat_pp = col.search(
    experiment_id=["historical"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
    source_id=["IPSL-CM6A-LR", "CanESM5"],
    member_id="r10i1p1f1",
)
cat_pp.df
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 CMIP CCCma CanESM5 historical r10i1p1f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5/historical/r10i1... NaN 20190429
1 CMIP IPSL IPSL-CM6A-LR historical r10i1p1f1 Oyr o2 gn gs://cmip6/CMIP/IPSL/IPSL-CM6A-LR/historical/r... NaN 20180803
# load the example
dset_dict_raw = cat_pp.to_dataset_dict(zarr_kwargs={"consolidated": True})
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
100.00% [2/2 00:00<00:00]
for k, ds in dset_dict_raw.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")
dataset key=CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn
	dimensions=['member_id', 'olevel', 'time', 'x', 'y']

dataset key=CMIP.CCCma.CanESM5.historical.Oyr.gn
	dimensions=['i', 'j', 'lev', 'member_id', 'time']

Note

The two models use different naming schemes. We can define a small helper function and pass it to .to_dataset_dict() to fix this. For demonstration purposes we will focus on the vertical level dimension, which is called lev in CanESM5 and olevel in IPSL-CM6A-LR.

def helper_func(ds):
    """Rename `olevel` dim to `lev`"""
    ds = ds.copy()
    # a short example
    if "olevel" in ds.dims:
        ds = ds.rename({"olevel": "lev"})
    return ds
dset_dict_fixed = cat_pp.to_dataset_dict(
    zarr_kwargs={"consolidated": True}, preprocess=helper_func
)
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
100.00% [2/2 00:00<00:00]
for k, ds in dset_dict_fixed.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")
dataset key=CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn
	dimensions=['lev', 'member_id', 'time', 'x', 'y']

dataset key=CMIP.CCCma.CanESM5.historical.Oyr.gn
	dimensions=['i', 'j', 'lev', 'member_id', 'time']

This was just an example for one dimension.
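The same idea extends to several dimensions by driving the rename from a lookup table. The sketch below is one possible way to do that. Only the olevel to lev pair comes from this notebook; the x to i and y to j pairs are an illustrative assumption based on the dimension names printed above, and rename_map and harmonize_dims are hypothetical helpers.

```python
# Only "olevel" -> "lev" is taken from the example above. The horizontal
# pairs are an assumption made for illustration.
RENAMES = {"olevel": "lev", "x": "i", "y": "j"}


def rename_map(dims, renames=RENAMES):
    """Return the subset of `renames` that applies to the given dimensions."""
    return {old: new for old, new in renames.items() if old in dims}


def harmonize_dims(ds):
    """Preprocess function: rename known dimensions, leave the rest alone.

    `ds` is expected to be an xarray.Dataset (anything with `.dims` and
    `.rename(mapping)` works).
    """
    mapping = rename_map(ds.dims)
    return ds.rename(mapping) if mapping else ds


# Dimension lists as printed for the two models above.
ipsl_dims = ["member_id", "olevel", "time", "x", "y"]
canesm_dims = ["i", "j", "lev", "member_id", "time"]
print(rename_map(ipsl_dims))
print(rename_map(canesm_dims))
```

A function like harmonize_dims could then be passed as preprocess=harmonize_dims in the same way as helper_func.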

Note

Check out the cmip6-preprocessing package for a full renaming function that covers all available CMIP6 models, along with some other utilities.