# Load CMIP6 Data with Intake ESM

This notebook demonstrates how to access CMIP6 data hosted on Google Cloud using
intake-esm.


## Loading a catalog


In [None]:
import warnings

import intake

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

In [None]:
url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
col

The summary above tells us that this catalog contains over 268,000 data assets.
We can get more information on the individual data assets contained in the
catalog by calling the underlying dataframe created when it is initialized:


### Catalog Contents


In [None]:
col.df.head()

The first data asset listed in the catalog contains:

- the ambient aerosol optical thickness at 550nm (`variable_id='od550aer'`), as
 a function of latitude, longitude, time,
- in an individual climate model experiment with the Taiwan Earth System Model
 1.0 model (`source_id='TaiESM1'`),
- forced by the _Historical transient with SSTs prescribed from historical_
 experiment (`experiment_id='histSST'`),
- developed by the Taiwan Research Center for Environmental Changes
 (`institution_id='AS-RCEC'`),
- run as part of the Aerosols and Chemistry Model Intercomparison Project
 (`activity_id='AerChemMIP'`)

It is located in Google Cloud Storage at
`gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/`.
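
The storage path mirrors the catalog columns described above. As a minimal sketch, assuming the path components follow the order seen in this one example, the hypothetical helper `parse_store_path` below splits such a path into named facets. The facet order is inferred from the example path only, and the helper is not part of intake-esm.

```python
# Facet order inferred from the example path above; treat it as an assumption.
FACETS = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
    "variable_id",
    "grid_label",
]


def parse_store_path(path, prefix="gs://cmip6/"):
    """Split a store path into a dict of facets (hypothetical helper)."""
    if not path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}")
    parts = path[len(prefix):].strip("/").split("/")
    if len(parts) < len(FACETS):
        raise ValueError("path has fewer components than expected")
    # zip() stops at the shortest input, so any trailing components are ignored.
    return dict(zip(FACETS, parts))


facets = parse_store_path(
    "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/"
)
print(facets["source_id"], facets["variable_id"])
```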


## Finding unique entries

Let's query the data to see what models (`source_id`), experiments
(`experiment_id`) and temporal frequencies (`table_id`) are available.


In [None]:
import pprint

uni_dict = col.unique(["source_id", "experiment_id", "table_id"])
pprint.pprint(uni_dict, compact=True)

## Searching for specific datasets

In the example below, we are going to search for the following:

- variables: `o2` which stands for
  `mole_concentration_of_dissolved_molecular_oxygen_in_sea_water`
- experiments: `['historical', 'ssp585']`:
  - `historical`: all forcing of the recent past.
  - `ssp585`: update of
    [RCP8.5](https://en.wikipedia.org/wiki/Representative_Concentration_Pathway)
    based on SSP5.
- table_id: `Oyr` which stands for annual mean variables on the ocean grid.
- grid_label: `gn` which stands for data reported on a model's native grid.

For more details on the CMIP6 vocabulary, see the
[CMIP6 data request website](http://clipc-services.ceda.ac.uk/dreq/index.html) and the
[Core Controlled Vocabularies (CVs) for use in CMIP6](https://github.com/WCRP-CMIP/CMIP6_CVs)
GitHub repository.


In [None]:
cat = col.search(
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)

cat

In [None]:
cat.df.head()
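
Conceptually, a query like the one above keeps a catalog row when every queried column matches one of the requested values. The sketch below mimics that behaviour on a small hand-made list of rows. Both the rows and the `search_rows` helper are hypothetical and only illustrate the matching logic; intake-esm performs the real filtering on its own dataframe.

```python
# Hand-made rows standing in for catalog entries (illustrative only).
rows = [
    {"source_id": "CanESM5", "experiment_id": "historical", "table_id": "Oyr", "variable_id": "o2"},
    {"source_id": "CanESM5", "experiment_id": "ssp585", "table_id": "Oyr", "variable_id": "o2"},
    {"source_id": "CanESM5", "experiment_id": "piControl", "table_id": "Oyr", "variable_id": "o2"},
    {"source_id": "TaiESM1", "experiment_id": "histSST", "table_id": "AERmon", "variable_id": "od550aer"},
]


def search_rows(rows, **query):
    """Keep rows where every queried column matches one of the given values."""
    # Accept either a single value or a collection of values per column.
    wanted = {
        col: set(val) if isinstance(val, (list, tuple, set)) else {val}
        for col, val in query.items()
    }
    return [
        row
        for row in rows
        if all(row.get(col) in vals for col, vals in wanted.items())
    ]


matches = search_rows(
    rows,
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
)
print([row["experiment_id"] for row in matches])
```

Values within one column are combined with "or", while the different columns are combined with "and", which is why the `piControl` row is dropped here.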

## Loading datasets using `to_dataset_dict()`


In [None]:
dset_dict = cat.to_dataset_dict(
    zarr_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)

In [None]:
list(dset_dict.keys())

We can access a particular dataset as follows:


In [None]:
ds = dset_dict["CMIP.CCCma.CanESM5.historical.Oyr.gn"]
print(ds)
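
The dictionary keys are dot-separated facet values. Assuming a key is built from `activity_id`, `institution_id`, `source_id`, `experiment_id`, `table_id` and `grid_label` in that order (this matches the key used above, but it depends on how the catalog defines its aggregation), the hypothetical helper `keys_for` below picks out the keys that match given facets. The second key in the example list is illustrative.

```python
# Assumed facet order for keys in this catalog; check the catalog's own
# aggregation settings before relying on it.
KEY_FACETS = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]


def keys_for(keys, **wanted):
    """Return the keys whose facets equal all requested values (hypothetical helper)."""
    selected = []
    for key in keys:
        facets = dict(zip(KEY_FACETS, key.split(".")))
        if all(facets.get(name) == value for name, value in wanted.items()):
            selected.append(key)
    return selected


# Example keys; in the notebook these would come from list(dset_dict).
keys = [
    "CMIP.CCCma.CanESM5.historical.Oyr.gn",
    "ScenarioMIP.CCCma.CanESM5.ssp585.Oyr.gn",
]
historical_keys = keys_for(keys, experiment_id="historical")
print(historical_keys)
```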

Let’s create a quick plot for a slice of the data:


In [None]:
ds.o2.isel(time=0, lev=0, member_id=range(1, 24, 4)).plot(col="member_id", col_wrap=3, robust=True)

## Using custom preprocessing functions

When comparing many models, it is often necessary to preprocess the datasets
(e.g. rename certain variables) before running an analysis step. The
`preprocess` argument lets the user pass a function that is executed on each
loaded asset before any aggregation.


In [None]:
cat_pp = col.search(
    experiment_id=["historical"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
    source_id=["IPSL-CM6A-LR", "CanESM5"],
    member_id="r10i1p1f1",
)
cat_pp.df

In [None]:
# load the example
dset_dict_raw = cat_pp.to_dataset_dict(zarr_kwargs={"consolidated": True})

In [None]:
for k, ds in dset_dict_raw.items():
    print(f"dataset key={k}\n\tdimensions={sorted(ds.dims)}\n")

```{note}
The two models use different naming schemes. We can define a small helper
function and pass it to `.to_dataset_dict()` to fix this. For demonstration
purposes we will focus on the vertical level dimension, which is called `lev`
in `CanESM5` and `olevel` in `IPSL-CM6A-LR`.
```


In [None]:
def helper_func(ds):
    """Rename `olevel` dim to `lev`"""
    ds = ds.copy()
    # a short example
    if "olevel" in ds.dims:
        ds = ds.rename({"olevel": "lev"})
    return ds

In [None]:
dset_dict_fixed = cat_pp.to_dataset_dict(zarr_kwargs={"consolidated": True}, preprocess=helper_func)

In [None]:
for k, ds in dset_dict_fixed.items():
    print(f"dataset key={k}\n\tdimensions={sorted(ds.dims)}\n")

This was just an example for one dimension.
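
To handle more than one name, one option is to keep the renames in a mapping and apply only those that are present in a given dataset. In the sketch below, the `RENAMES` mapping and the `build_rename_map` helper are hypothetical. Only the `olevel` to `lev` entry comes from the example above; `other_level_name` is a placeholder that shows where further entries would go.

```python
# Old name -> new name. Only "olevel" comes from the example above;
# "other_level_name" is a placeholder, not a real CMIP6 dimension name.
RENAMES = {"olevel": "lev", "other_level_name": "lev"}


def build_rename_map(dims, renames=RENAMES):
    """Return only the renames that apply to the given dimension names."""
    present = set(dims)
    return {
        old: new
        for old, new in renames.items()
        # Skip names that are absent, and targets that already exist.
        if old in present and new not in present
    }


needs_fix = build_rename_map(["olevel", "time", "x", "y"])
already_ok = build_rename_map(["lev", "time", "x", "y"])
print(needs_fix, already_ok)
```

Inside a preprocessing function this could be used as `ds = ds.rename(build_rename_map(ds.dims))`, since xarray's `Dataset.rename` accepts a dictionary of old and new names.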

```{note}
Check out the [cmip6-preprocessing package](https://github.com/jbusecke/cmip6_preprocessing)
for a full renaming function that covers all available CMIP6 models, along with
some other utilities.
```


In [None]:
import intake_esm  # just to display version information

intake_esm.show_versions()