This notebook demonstrates how to access Google Cloud CMIP6 data using intake-esm.
import warnings

warnings.filterwarnings("ignore")

import intake
url = (
    "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
)
col = intake.open_esm_datastore(url)
col
pangeo-cmip6 catalog with 6539 dataset(s) from 402033 asset(s):
The summary above tells us that this catalog contains over 400,000 data assets. We can get more information on the individual data assets by inspecting the underlying dataframe, which is created when the catalog is initialized:
col.df.head()
The first data asset listed in the catalog contains:
the ambient aerosol optical thickness at 550nm (variable_id='od550aer'), as a function of latitude, longitude, and time,
in an individual climate model experiment with the Taiwan Earth System Model 1.0 model (source_id='TaiESM1'),
forced by the "historical transient with SSTs prescribed from historical" experiment (experiment_id='histSST'),
developed by the Taiwan Research Center for Environmental Changes (institution_id='AS-RCEC'),
run as part of the Aerosols and Chemistry Model Intercomparison Project (activity_id='AerChemMIP').
It is located in Google Cloud Storage at gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/.
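The storage path above mirrors the catalog columns, so it can be read back into its facets. The following is a minimal sketch, assuming the segment order seen in this one example path; `parse_store_path` and `FACETS` are hypothetical names for illustration, not part of intake-esm.

```python
# Facet order assumed from the single example path above (not an official specification).
FACETS = (
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
    "variable_id",
    "grid_label",
)


def parse_store_path(path):
    """Split a gs://cmip6/... store path into a dict of facet values."""
    prefix = "gs://cmip6/"
    if not path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}")
    parts = path[len(prefix):].strip("/").split("/")
    if len(parts) != len(FACETS):
        raise ValueError(f"expected {len(FACETS)} path segments, got {len(parts)}")
    return dict(zip(FACETS, parts))


facets = parse_store_path(
    "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/"
)
print(facets["source_id"], facets["variable_id"])
```

In practice the catalog dataframe already exposes these values as columns, so parsing the path is only useful as a sanity check.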
Let’s query the data to see what models (source_id), experiments (experiment_id) and temporal frequencies (table_id) are available.
import pprint

uni_dict = col.unique(["source_id", "experiment_id", "table_id"])
pprint.pprint(uni_dict, compact=True)
{'experiment_id': {'count': 160, 'values': ['1pctCO2', '1pctCO2-bgc', '1pctCO2-cdr', '1pctCO2-rad', 'abrupt-0p5xCO2', 'abrupt-2xCO2', 'abrupt-4xCO2', 'abrupt-solm4p', 'abrupt-solp4p', 'amip', 'amip-4xCO2', 'amip-future4K', 'amip-hist', 'amip-lwoff', 'amip-m4K', 'amip-p4K', 'amip-p4K-lwoff', 'aqua-4xCO2', 'aqua-control', 'aqua-control-lwoff', 'aqua-p4K', 'aqua-p4K-lwoff', 'control-1950', 'dcppA-assim', 'dcppA-hindcast', 'dcppC-amv-ExTrop-neg', 'dcppC-amv-ExTrop-pos', 'dcppC-amv-Trop-neg', 'dcppC-amv-Trop-pos', 'dcppC-amv-neg', 'dcppC-amv-pos', 'dcppC-atl-control', 'dcppC-atl-pacemaker', 'dcppC-hindcast-noAgung', 'dcppC-hindcast-noElChichon', 'dcppC-hindcast-noPinatubo', 'dcppC-ipv-NexTrop-neg', 'dcppC-ipv-NexTrop-pos', 'dcppC-ipv-neg', 'dcppC-ipv-pos', 'dcppC-pac-control', 'dcppC-pac-pacemaker', 'deforest-globe', 'esm-hist', 'esm-pi-CO2pulse', 'esm-pi-cdr-pulse', 'esm-piControl', 'esm-piControl-spinup', 'esm-ssp585', 'esm-ssp585-ssp126Lu', 'faf-all', 'faf-heat', 'faf-heat-NA0pct', 'faf-heat-NA50pct', 'faf-passiveheat', 'faf-stress', 'faf-water', 'futSST-pdSIC', 'highresSST-future', 'highresSST-present', 'hist-1950', 'hist-1950HC', 'hist-CO2', 'hist-GHG', 'hist-GHG-cmip5', 'hist-aer', 'hist-aer-cmip5', 'hist-bgc', 'hist-nat', 'hist-nat-cmip5', 'hist-noLu', 'hist-piAer', 'hist-piNTCF', 'hist-resIPO', 'hist-sol', 'hist-stratO3', 'hist-totalO3', 'hist-volc', 'histSST', 'histSST-1950HC', 'histSST-piAer', 'histSST-piCH4', 'histSST-piNTCF', 'histSST-piO3', 'historical', 'historical-cmip5', 'historical-ext', 'land-hist', 'land-hist-altStartYear', 'land-noLu', 'lgm', 'lig127k', 'midHolocene', 'omip1', 'pa-futArcSIC', 'pa-pdSIC', 'past1000', 'pdSST-futAntSIC', 'pdSST-futArcSIC', 'pdSST-pdSIC', 'pdSST-piAntSIC', 'pdSST-piArcSIC', 'piClim-2xDMS', 'piClim-2xNOx', 'piClim-2xVOC', 'piClim-2xdust', 'piClim-2xfire', 'piClim-2xss', 'piClim-4xCO2', 'piClim-BC', 'piClim-CH4', 'piClim-HC', 'piClim-N2O', 'piClim-NOx', 'piClim-NTCF', 'piClim-O3', 'piClim-OC', 'piClim-SO2', 'piClim-VOC', 
'piClim-aer', 'piClim-anthro', 'piClim-control', 'piClim-ghg', 'piClim-histaer', 'piClim-histall', 'piClim-histghg', 'piClim-histnat', 'piClim-lu', 'piControl', 'piControl-cmip5', 'piControl-spinup', 'piSST-pdSIC', 'piSST-piSIC', 'rcp26-cmip5', 'rcp45-cmip5', 'rcp85-cmip5', 'ssp119', 'ssp126', 'ssp126-ssp370Lu', 'ssp245', 'ssp245-GHG', 'ssp245-aer', 'ssp245-cov-fossil', 'ssp245-cov-modgreen', 'ssp245-cov-strgreen', 'ssp245-covid', 'ssp245-nat', 'ssp245-stratO3', 'ssp370', 'ssp370-lowNTCF', 'ssp370-ssp126Lu', 'ssp370SST', 'ssp370SST-lowCH4', 'ssp370SST-lowNTCF', 'ssp370SST-ssp126Lu', 'ssp370pdSST', 'ssp434', 'ssp460', 'ssp534-over', 'ssp585']}, 'source_id': {'count': 84, 'values': ['ACCESS-CM2', 'ACCESS-ESM1-5', 'AWI-CM-1-1-MR', 'AWI-ESM-1-1-LR', 'BCC-CSM2-HR', 'BCC-CSM2-MR', 'BCC-ESM1', 'CAMS-CSM1-0', 'CAS-ESM2-0', 'CESM1-1-CAM5-CMIP5', 'CESM2', 'CESM2-FV2', 'CESM2-WACCM', 'CESM2-WACCM-FV2', 'CIESM', 'CMCC-CM2-HR4', 'CMCC-CM2-SR5', 'CMCC-CM2-VHR4', 'CMCC-ESM2', 'CNRM-CM6-1', 'CNRM-CM6-1-HR', 'CNRM-ESM2-1', 'CanESM5', 'CanESM5-CanOE', 'E3SM-1-0', 'E3SM-1-1', 'E3SM-1-1-ECA', 'EC-Earth3', 'EC-Earth3-AerChem', 'EC-Earth3-CC', 'EC-Earth3-LR', 'EC-Earth3-Veg', 'EC-Earth3-Veg-LR', 'EC-Earth3P', 'EC-Earth3P-HR', 'EC-Earth3P-VHR', 'ECMWF-IFS-HR', 'ECMWF-IFS-LR', 'FGOALS-f3-H', 'FGOALS-f3-L', 'FGOALS-g3', 'FIO-ESM-2-0', 'GFDL-AM4', 'GFDL-CM4', 'GFDL-CM4C192', 'GFDL-ESM2M', 'GFDL-ESM4', 'GFDL-OM4p5B', 'GISS-E2-1-G', 'GISS-E2-1-G-CC', 'GISS-E2-1-H', 'GISS-E2-2-G', 'HadGEM3-GC31-HM', 'HadGEM3-GC31-LL', 'HadGEM3-GC31-LM', 'HadGEM3-GC31-MM', 'IITM-ESM', 'INM-CM4-8', 'INM-CM5-0', 'INM-CM5-H', 'IPSL-CM5A2-INCA', 'IPSL-CM6A-ATM-HR', 'IPSL-CM6A-LR', 'IPSL-CM6A-LR-INCA', 'KACE-1-0-G', 'KIOST-ESM', 'MCM-UA-1-0', 'MIROC-ES2L', 'MIROC6', 'MPI-ESM-1-2-HAM', 'MPI-ESM1-2-HR', 'MPI-ESM1-2-LR', 'MPI-ESM1-2-XR', 'MRI-AGCM3-2-H', 'MRI-AGCM3-2-S', 'MRI-ESM2-0', 'NESM3', 'NorCPM1', 'NorESM1-F', 'NorESM2-LM', 'NorESM2-MM', 'SAM0-UNICON', 'TaiESM1', 'UKESM1-0-LL']}, 'table_id': {'count': 37, 
'values': ['3hr', '6hrLev', '6hrPlev', '6hrPlevPt', 'AERday', 'AERhr', 'AERmon', 'AERmonZ', 'Aclim', 'Amon', 'CF3hr', 'CFday', 'CFmon', 'E1hrClimMon', 'E3hr', 'Eclim', 'Eday', 'EdayZ', 'Efx', 'Emon', 'EmonZ', 'Eyr', 'IfxGre', 'ImonGre', 'LImon', 'Lmon', 'Oclim', 'Oday', 'Odec', 'Ofx', 'Omon', 'Oyr', 'SIclim', 'SIday', 'SImon', 'day', 'fx']}}
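The output above pairs each requested column with a count and a list of unique values. A rough sketch of that summary on a toy table could look like this; `unique_summary` is a hypothetical helper that only imitates the shape of the result and is not how intake-esm implements `unique`.

```python
def unique_summary(rows, columns):
    """Return {column: {"count": n, "values": sorted unique values}}."""
    summary = {}
    for column in columns:
        values = sorted({row[column] for row in rows})
        summary[column] = {"count": len(values), "values": values}
    return summary


# Toy rows for illustration; the individual values appear in the listing above.
rows = [
    {"source_id": "CanESM5", "experiment_id": "historical", "table_id": "Oyr"},
    {"source_id": "CanESM5", "experiment_id": "ssp585", "table_id": "Oyr"},
    {"source_id": "TaiESM1", "experiment_id": "histSST", "table_id": "AERmon"},
]

summary = unique_summary(rows, ["source_id", "experiment_id", "table_id"])
print(summary["source_id"])
```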
In the example below, we are going to search for the following:
variables: o2, which stands for mole_concentration_of_dissolved_molecular_oxygen_in_sea_water
experiments: ['historical', 'ssp585']:
historical: all forcing of the recent past.
ssp585: update of RCP8.5 based on SSP5 (the emission-driven variant is the separate esm-ssp585 experiment).
table_id: Oyr, which stands for annual mean variables on the ocean grid.
grid_label: gn, which stands for data reported on a model’s native grid.
For more details on the CMIP6 vocabulary, please check this website and the Core Controlled Vocabularies (CVs) for use in CMIP6 GitHub repository.
cat = col.search(
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)
cat
pangeo-cmip6 catalog with 23 dataset(s) from 149 asset(s):
cat.df.head()
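Conceptually, the search keeps only the catalog rows that satisfy every keyword, where a list of values means "any of these". The sketch below illustrates that filtering logic on a toy table; the `search` function here is a hypothetical stand-in and does not reproduce everything the real intake-esm search does.

```python
def search(rows, **query):
    """Keep rows matching every criterion; a list value matches any of its items."""

    def matches(row):
        for column, wanted in query.items():
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if row.get(column) not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


# Toy catalog rows for illustration only.
rows = [
    {"source_id": "CanESM5", "experiment_id": "historical", "table_id": "Oyr",
     "variable_id": "o2", "grid_label": "gn"},
    {"source_id": "CanESM5", "experiment_id": "ssp585", "table_id": "Oyr",
     "variable_id": "o2", "grid_label": "gn"},
    {"source_id": "CanESM5", "experiment_id": "piControl", "table_id": "Oyr",
     "variable_id": "o2", "grid_label": "gn"},
    {"source_id": "TaiESM1", "experiment_id": "histSST", "table_id": "AERmon",
     "variable_id": "od550aer", "grid_label": "gn"},
]

hits = search(
    rows,
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)
print([row["experiment_id"] for row in hits])
```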
Next, load the matching assets into a dictionary of xarray datasets with to_dataset_dict():
dset_dict = cat.to_dataset_dict(
    zarr_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)
--> The keys in the returned dictionary of datasets are constructed as follows: 'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
[key for key in dset_dict.keys()]
['ScenarioMIP.DWD.MPI-ESM1-2-HR.ssp585.Oyr.gn', 'CMIP.MRI.MRI-ESM2-0.historical.Oyr.gn', 'ScenarioMIP.MRI.MRI-ESM2-0.ssp585.Oyr.gn', 'CMIP.MPI-M.MPI-ESM1-2-HR.historical.Oyr.gn', 'CMIP.HAMMOZ-Consortium.MPI-ESM-1-2-HAM.historical.Oyr.gn', 'CMIP.CCCma.CanESM5.historical.Oyr.gn', 'CMIP.NCC.NorESM2-MM.historical.Oyr.gn', 'ScenarioMIP.MPI-M.MPI-ESM1-2-LR.ssp585.Oyr.gn', 'ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Oyr.gn', 'CMIP.MPI-M.MPI-ESM1-2-LR.historical.Oyr.gn', 'CMIP.NCC.NorESM2-LM.historical.Oyr.gn', 'ScenarioMIP.MIROC.MIROC-ES2L.ssp585.Oyr.gn', 'ScenarioMIP.NCC.NorESM2-LM.ssp585.Oyr.gn', 'ScenarioMIP.NCC.NorESM2-MM.ssp585.Oyr.gn', 'CMIP.CCCma.CanESM5-CanOE.historical.Oyr.gn', 'ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp585.Oyr.gn', 'ScenarioMIP.NCAR.CESM2.ssp585.Oyr.gn', 'CMIP.MIROC.MIROC-ES2L.historical.Oyr.gn', 'CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn', 'ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp585.Oyr.gn', 'CMIP.CSIRO.ACCESS-ESM1-5.historical.Oyr.gn', 'ScenarioMIP.CCCma.CanESM5-CanOE.ssp585.Oyr.gn', 'ScenarioMIP.CCCma.CanESM5.ssp585.Oyr.gn']
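Each key joins six catalog columns with dots, which is consistent with the 149 assets collapsing into 23 datasets: assets that differ only in columns outside the key share one entry. The sketch below illustrates that grouping on toy assets; `dataset_key` and `group_assets` are hypothetical helpers, and the assumption that `member_id` is the column that varies within a group is inferred from the `member_id` dimension of the loaded datasets.

```python
# Column order taken from the message printed by to_dataset_dict() above.
KEY_COLUMNS = (
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
)


def dataset_key(asset):
    """Join the key columns of one asset with dots."""
    return ".".join(asset[column] for column in KEY_COLUMNS)


def group_assets(assets):
    """Group assets that share the same dataset key."""
    groups = {}
    for asset in assets:
        groups.setdefault(dataset_key(asset), []).append(asset)
    return groups


# Toy assets: two ensemble members of one experiment plus one scenario run.
base = {"institution_id": "CCCma", "source_id": "CanESM5", "table_id": "Oyr", "grid_label": "gn"}
assets = [
    {**base, "activity_id": "CMIP", "experiment_id": "historical", "member_id": "r1i1p1f1"},
    {**base, "activity_id": "CMIP", "experiment_id": "historical", "member_id": "r10i1p1f1"},
    {**base, "activity_id": "ScenarioMIP", "experiment_id": "ssp585", "member_id": "r1i1p1f1"},
]

groups = group_assets(assets)
print({key: len(members) for key, members in groups.items()})
```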
We can access a particular dataset as follows:
ds = dset_dict["CMIP.CCCma.CanESM5.historical.Oyr.gn"]
print(ds)
<xarray.Dataset>
Dimensions:             (bnds: 2, i: 360, j: 291, lev: 45, member_id: 35, time: 165, vertices: 4)
Coordinates:
  * i                   (i) int32 0 1 2 3 4 5 6 ... 353 354 355 356 357 358 359
  * j                   (j) int32 0 1 2 3 4 5 6 ... 284 285 286 287 288 289 290
    latitude            (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * lev                 (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
    lev_bnds            (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
    longitude           (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * time                (time) object 1850-07-02 12:00:00 ... 2014-07-02 12:0...
    time_bnds           (time, bnds) object dask.array<chunksize=(165, 2), meta=np.ndarray>
  * member_id           (member_id) <U9 'r10i1p1f1' 'r10i1p2f1' ... 'r9i1p2f1'
Dimensions without coordinates: bnds, vertices
Data variables:
    o2                  (member_id, time, lev, j, i) float32 dask.array<chunksize=(1, 12, 45, 291, 360), meta=np.ndarray>
    vertices_latitude   (j, i, vertices) float64 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
    vertices_longitude  (j, i, vertices) float64 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
Attributes:
    table_id: Oyr
    variable_id: o2
    sub_experiment_id: none
    activity_id: CMIP
    CCCma_runid: p2-his09
    CCCma_model_hash: Unknown
    version: v20190429
    creation_date: 2019-05-30T08:58:45Z
    parent_time_units: days since 1850-01-01 0:0:0.0
    references: Geophysical Model Development Special issue ...
    variant_label: r9i1p2f1
    product: model-output
    cmor_version: 3.4.0
    source: CanESM5 (2019): \naerosol: interactive\natmo...
    source_id: CanESM5
    YMDH_branch_time_in_child: 1850:01:01:00
    tracking_id: hdl:21.14100/41426118-701c-482b-ae16-82932e4...
    Conventions: CF-1.7 CMIP-6.2
    external_variables: areacello volcello
    parent_source_id: CanESM5
    title: CanESM5 output prepared for CMIP6
    further_info_url: https://furtherinfo.es-doc.org/CMIP6.CCCma.C...
    parent_mip_era: CMIP6
    intake_esm_varname: ['o2']
    parent_activity_id: CMIP
    experiment_id: historical
    realization_index: 9
    parent_experiment_id: piControl
    source_type: AOGCM
    mip_era: CMIP6
    frequency: yr
    grid: ORCA1 tripolar grid, 1 deg with refinement t...
    CCCma_parent_runid: p2-pictrl
    status: 2019-10-25;created;by nhn2@columbia.edu
    license: CMIP6 model data produced by The Government ...
    YMDH_branch_time_in_parent: 5950:01:01:00
    branch_method: Spin-up documentation
    contact: ec.cccma.info-info.ccmac.ec@canada.ca
    table_info: Creation Date:(20 February 2019) MD5:374fbe5...
    sub_experiment: none
    forcing_index: 1
    branch_time_in_child: 0.0
    institution: Canadian Centre for Climate Modelling and An...
    realm: ocnBgchem
    branch_time_in_parent: 1496500.0
    experiment: all-forcing simulation of the recent past
    institution_id: CCCma
    history: 2019-05-02T13:53:53Z ;rewrote data to be con...
    data_specs_version: 01.00.29
    grid_label: gn
    nominal_resolution: 100 km
    initialization_index: 1
    intake_esm_dataset_key: CMIP.CCCma.CanESM5.historical.Oyr.gn
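The summary shows that o2 is backed by a dask array, so nothing is read until values are actually needed. A quick back-of-the-envelope estimate, using only the dimension sizes and chunk shape printed above, suggests why loading everything eagerly would be impractical on most machines. The variable names below are illustrative.

```python
from math import prod

# Dimension sizes and chunk shape copied from the dataset summary above.
o2_shape = {"member_id": 35, "time": 165, "lev": 45, "j": 291, "i": 360}
chunk_shape = (1, 12, 45, 291, 360)
bytes_per_value = 4  # o2 is stored as float32

n_values = prod(o2_shape.values())
total_gb = n_values * bytes_per_value / 1e9
chunk_mb = prod(chunk_shape) * bytes_per_value / 1e6

print(f"o2 holds {n_values:,} values, about {total_gb:.1f} GB if fully loaded")
print(f"one chunk is about {chunk_mb:.1f} MB")
```

Selecting a small slice before plotting or computing keeps the amount of data actually transferred close to a single chunk.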
Let’s create a quick plot for a slice of the data:
ds.o2.isel(time=0, lev=0, member_id=range(1, 24, 4)).plot(col="member_id", col_wrap=3, robust=True)
<xarray.plot.facetgrid.FacetGrid at 0x7ff83c2c7190>
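Because isel selects by integer position, range(1, 24, 4) picks ensemble members by their position along member_id rather than by name. The short sketch below works out which positions are selected and how many rows the facet grid needs with col_wrap=3; the variable names are illustrative.

```python
import math

# Positional indices passed to isel(member_id=...) in the plot call above.
member_positions = list(range(1, 24, 4))
col_wrap = 3

n_panels = len(member_positions)
n_rows = math.ceil(n_panels / col_wrap)

print(member_positions, n_panels, n_rows)
```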
When comparing many models, it is often necessary to preprocess the datasets (e.g. rename certain variables) before running an analysis step. The preprocess argument lets the user pass a function that is executed on each loaded asset before aggregation.
cat_pp = col.search(
    experiment_id=["historical"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
    source_id=["IPSL-CM6A-LR", "CanESM5"],
    member_id="r10i1p1f1",
)
cat_pp.df

# load the example
dset_dict_raw = cat_pp.to_dataset_dict(zarr_kwargs={"consolidated": True})

for k, ds in dset_dict_raw.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")

dataset key=CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn
	dimensions=['axis_nbounds', 'member_id', 'nvertex', 'olevel', 'time', 'x', 'y']

dataset key=CMIP.CCCma.CanESM5.historical.Oyr.gn
	dimensions=['bnds', 'i', 'j', 'lev', 'member_id', 'time', 'vertices']
Note
The two models use different naming schemes for their dimensions. We can define a small helper function and pass it to .to_dataset_dict() to fix this. For demonstration purposes, we will focus on the vertical level dimension, which is called lev in CanESM5 and olevel in IPSL-CM6A-LR.
def helper_func(ds):
    """Rename `olevel` dim to `lev`"""
    ds = ds.copy()
    # a short example
    if "olevel" in ds.dims:
        ds = ds.rename({"olevel": "lev"})
    return ds
dset_dict_fixed = cat_pp.to_dataset_dict(zarr_kwargs={"consolidated": True}, preprocess=helper_func)
for k, ds in dset_dict_fixed.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")

dataset key=CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn
	dimensions=['axis_nbounds', 'lev', 'member_id', 'nvertex', 'time', 'x', 'y']

dataset key=CMIP.CCCma.CanESM5.historical.Oyr.gn
	dimensions=['bnds', 'i', 'j', 'lev', 'member_id', 'time', 'vertices']
This was just an example for one dimension.
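The same idea could be extended to several dimensions with a lookup table. The sketch below is only an illustration: apart from olevel to lev, the pairs in `DIM_ALIASES` are assumptions made by matching the two dimension lists printed earlier, and `build_rename_map` is a hypothetical helper, not part of any package.

```python
# Hypothetical alias table mapping IPSL-CM6A-LR dimension names onto CanESM5 names.
# Only "olevel" -> "lev" comes from the text; the other pairs are assumptions.
DIM_ALIASES = {
    "olevel": "lev",
    "axis_nbounds": "bnds",
    "nvertex": "vertices",
    "x": "i",
    "y": "j",
}


def build_rename_map(dims, aliases=DIM_ALIASES):
    """Return the aliases that apply to `dims` without clashing with existing names."""
    dims = set(dims)
    return {old: new for old, new in aliases.items() if old in dims and new not in dims}


# Dimension lists copied from the printed output of the two raw datasets.
ipsl_dims = ["axis_nbounds", "member_id", "nvertex", "olevel", "time", "x", "y"]
canesm_dims = ["bnds", "i", "j", "lev", "member_id", "time", "vertices"]

rename_map = build_rename_map(ipsl_dims)
renamed = sorted(rename_map.get(dim, dim) for dim in ipsl_dims)
print(renamed)
```

Inside a preprocess function, the resulting mapping could be passed to ds.rename in the same way as the single-entry dictionary in helper_func.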
Check out the cmip6-preprocessing package for a full renaming function that covers all available CMIP6 models, along with some other utilities.