{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load CMIP6 Data with Intake ESM\n",
    "\n",
    "This notebook demonstrates how to access CMIP6 data hosted on Google Cloud\n",
    "using intake-esm.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading a catalog\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "import intake"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://storage.googleapis.com/cmip6/pangeo-cmip6.json\"\n",
    "col = intake.open_esm_datastore(url)\n",
    "col"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The summary above tells us that this catalog contains over 268,000 data assets.\n",
    "We can get more information on the individual data assets contained in the\n",
    "catalog by calling the underlying dataframe created when it is initialized:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Catalog contents\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col.df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first data asset listed in the catalog contains:\n",
    "\n",
    "- the ambient aerosol optical thickness at 550nm (`variable_id='od550aer'`), as\n",
    "  a function of latitude, longitude, and time,\n",
    "- in an individual climate model experiment with the Taiwan Earth System Model\n",
    "  1.0 model (`source_id='TaiESM1'`),\n",
    "- forced by the _Historical transient with SSTs prescribed from historical_\n",
    "  experiment (`experiment_id='histSST'`),\n",
    "- developed by the Taiwan Research Center for Environmental Changes\n",
    "  (`institution_id='AS-RCEC'`),\n",
    "- run as part of the Aerosols and Chemistry Model Intercomparison Project\n",
    "  (`activity_id='AerChemMIP'`),\n",
    "\n",
    "and is located in Google Cloud Storage at\n",
    "`gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/`.\n"
   ]
  },
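  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The storage path above appears to list the catalog columns in a fixed order.\n",
    "As a minimal sketch, assuming every path is laid out as activity, institution,\n",
    "source, experiment, member, table, variable, grid label (an order read off this\n",
    "single example, not taken from a specification), the path can be split back\n",
    "into its parts with plain Python. The `FACETS` tuple and the `parse_zstore`\n",
    "helper are hypothetical names and not part of intake-esm:\n",
    "\n",
    "```python\n",
    "FACETS = (\n",
    "    \"activity_id\", \"institution_id\", \"source_id\", \"experiment_id\",\n",
    "    \"member_id\", \"table_id\", \"variable_id\", \"grid_label\",\n",
    ")  # assumed column order, read off the example path\n",
    "\n",
    "\n",
    "def parse_zstore(path, prefix=\"gs://cmip6/\"):\n",
    "    \"\"\"Split a store path into a dictionary of catalog columns (hypothetical helper).\"\"\"\n",
    "    if not path.startswith(prefix):\n",
    "        raise ValueError(f\"unexpected prefix in {path!r}\")\n",
    "    # Drop the bucket prefix and the trailing slash, then split on '/'.\n",
    "    parts = path[len(prefix):].strip(\"/\").split(\"/\")\n",
    "    if len(parts) != len(FACETS):\n",
    "        raise ValueError(f\"expected {len(FACETS)} path components, got {len(parts)}\")\n",
    "    return dict(zip(FACETS, parts))\n",
    "\n",
    "\n",
    "facets = parse_zstore(\"gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/\")\n",
    "print(facets[\"source_id\"], facets[\"variable_id\"])\n",
    "```\n",
    "\n",
    "For the example path this prints `TaiESM1 od550aer`. Paths with a different\n",
    "number of components raise a `ValueError` rather than being guessed at.\n"
   ]
  },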
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding unique entries\n",
    "\n",
    "Let's query the data to see what models (`source_id`), experiments\n",
    "(`experiment_id`) and temporal frequencies (`table_id`) are available.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pprint\n",
    "\n",
    "uni_dict = col.unique([\"source_id\", \"experiment_id\", \"table_id\"])\n",
    "pprint.pprint(uni_dict, compact=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Searching for specific datasets\n",
    "\n",
    "In the example below, we are going to search for the following:\n",
    "\n",
    "- variable_id: `o2`, which stands for\n",
    "  `mole_concentration_of_dissolved_molecular_oxygen_in_sea_water`\n",
    "- experiment_id: `['historical', 'ssp585']`:\n",
    "  - `historical`: all forcing of the recent past.\n",
    "  - `ssp585`: update of\n",
    "    [RCP8.5](https://en.wikipedia.org/wiki/Representative_Concentration_Pathway)\n",
    "    based on SSP5.\n",
    "- table_id: `Oyr`, which stands for annual mean variables on the ocean grid.\n",
    "- grid_label: `gn`, which stands for data reported on a model's native grid.\n",
    "\n",
    "For more details on the CMIP6 vocabulary, see this\n",
    "[website](http://clipc-services.ceda.ac.uk/dreq/index.html) and the\n",
    "[Core Controlled Vocabularies (CVs) for use in CMIP6](https://github.com/WCRP-CMIP/CMIP6_CVs)\n",
    "GitHub repository.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat = col.search(\n",
    "    experiment_id=[\"historical\", \"ssp585\"],\n",
    "    table_id=\"Oyr\",\n",
    "    variable_id=\"o2\",\n",
    "    grid_label=\"gn\",\n",
    ")\n",
    "\n",
    "cat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat.df.head()"
   ]
  },
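  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The query above can be read as follows: a row is kept when, for every keyword,\n",
    "its value equals the given value or is one of the given values. The\n",
    "plain-Python sketch below mimics that logic on a few hand-written rows. The\n",
    "rows and the `matches` helper are made up for illustration, and this is not how\n",
    "intake-esm implements `search` internally:\n",
    "\n",
    "```python\n",
    "rows = [\n",
    "    {\"source_id\": \"CanESM5\", \"experiment_id\": \"historical\", \"table_id\": \"Oyr\", \"variable_id\": \"o2\"},\n",
    "    {\"source_id\": \"CanESM5\", \"experiment_id\": \"ssp585\", \"table_id\": \"Oyr\", \"variable_id\": \"o2\"},\n",
    "    {\"source_id\": \"CanESM5\", \"experiment_id\": \"ssp126\", \"table_id\": \"Oyr\", \"variable_id\": \"o2\"},\n",
    "    {\"source_id\": \"CanESM5\", \"experiment_id\": \"historical\", \"table_id\": \"Omon\", \"variable_id\": \"o2\"},\n",
    "]  # hand-written rows for illustration only\n",
    "\n",
    "\n",
    "def matches(row, **query):\n",
    "    \"\"\"Return True if the row satisfies every keyword in the query.\"\"\"\n",
    "    for column, wanted in query.items():\n",
    "        # A single string is treated as a one-element list of allowed values.\n",
    "        allowed = [wanted] if isinstance(wanted, str) else list(wanted)\n",
    "        if row[column] not in allowed:\n",
    "            return False\n",
    "    return True\n",
    "\n",
    "\n",
    "hits = [row for row in rows if matches(row, experiment_id=[\"historical\", \"ssp585\"], table_id=\"Oyr\")]\n",
    "print([row[\"experiment_id\"] for row in hits])\n",
    "```\n",
    "\n",
    "Only the first two rows pass both conditions, so the sketch prints\n",
    "`['historical', 'ssp585']`.\n"
   ]
  },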
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading datasets using `to_dataset_dict()`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dset_dict = cat.to_dataset_dict(\n",
    "    zarr_kwargs={\"consolidated\": True, \"decode_times\": True, \"use_cftime\": True}\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "list(dset_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can access a particular dataset as follows:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = dset_dict[\"CMIP.CCCma.CanESM5.historical.Oyr.gn\"]\n",
    "print(ds)"
   ]
  },
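  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dictionary keys are built by joining catalog attributes with dots. As a\n",
    "sketch, assuming the key layout\n",
    "`activity_id.institution_id.source_id.experiment_id.table_id.grid_label` (which\n",
    "matches the key used above, but depends on how the catalog defines its\n",
    "aggregations), a key can be unpacked like this. The `KEY_FIELDS` tuple and the\n",
    "`split_key` helper are hypothetical names:\n",
    "\n",
    "```python\n",
    "KEY_FIELDS = (\n",
    "    \"activity_id\", \"institution_id\", \"source_id\",\n",
    "    \"experiment_id\", \"table_id\", \"grid_label\",\n",
    ")  # assumed key layout for this catalog\n",
    "\n",
    "\n",
    "def split_key(key, sep=\".\"):\n",
    "    \"\"\"Map each dot-separated part of a dataset key to its attribute name.\"\"\"\n",
    "    parts = key.split(sep)\n",
    "    if len(parts) != len(KEY_FIELDS):\n",
    "        raise ValueError(f\"expected {len(KEY_FIELDS)} parts, got {len(parts)}\")\n",
    "    return dict(zip(KEY_FIELDS, parts))\n",
    "\n",
    "\n",
    "info = split_key(\"CMIP.CCCma.CanESM5.historical.Oyr.gn\")\n",
    "print(info[\"source_id\"], info[\"experiment_id\"])\n",
    "```\n",
    "\n",
    "For the key above this prints `CanESM5 historical`, which can be handy when\n",
    "looping over many datasets and grouping them by model or experiment.\n"
   ]
  },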
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s create a quick plot for a slice of the data:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.o2.isel(time=0, lev=0, member_id=range(1, 24, 4)).plot(col=\"member_id\", col_wrap=3, robust=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using custom preprocessing functions\n",
    "\n",
    "When comparing many models, it is often necessary to preprocess them (e.g.\n",
    "rename certain variables) before running an analysis step. The `preprocess`\n",
    "argument lets the user pass a function, which is executed on each loaded asset\n",
    "before any aggregation takes place.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_pp = col.search(\n",
    "    experiment_id=[\"historical\"],\n",
    "    table_id=\"Oyr\",\n",
    "    variable_id=\"o2\",\n",
    "    grid_label=\"gn\",\n",
    "    source_id=[\"IPSL-CM6A-LR\", \"CanESM5\"],\n",
    "    member_id=\"r10i1p1f1\",\n",
    ")\n",
    "cat_pp.df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the example\n",
    "dset_dict_raw = cat_pp.to_dataset_dict(zarr_kwargs={\"consolidated\": True})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for k, ds in dset_dict_raw.items():\n",
    "    print(f\"dataset key={k}\\n\\tdimensions={sorted(ds.dims)}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{note}\n",
    "Note that the two models follow different naming schemes. We can define a small\n",
    "helper function and pass it to `.to_dataset_dict()` to fix this. For\n",
    "demonstration purposes, we will focus on the vertical level dimension, which is\n",
    "called `lev` in `CanESM5` and `olevel` in `IPSL-CM6A-LR`.\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def helper_func(ds):\n",
    "    \"\"\"Rename the `olevel` dimension to `lev`, if present.\"\"\"\n",
    "    ds = ds.copy()\n",
    "    # Datasets without an `olevel` dimension pass through unchanged.\n",
    "    if \"olevel\" in ds.dims:\n",
    "        ds = ds.rename({\"olevel\": \"lev\"})\n",
    "    return ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dset_dict_fixed = cat_pp.to_dataset_dict(zarr_kwargs={\"consolidated\": True}, preprocess=helper_func)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for k, ds in dset_dict_fixed.items():\n",
    "    print(f\"dataset key={k}\\n\\tdimensions={sorted(ds.dims)}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This was just an example for one dimension.\n",
    "\n",
    "```{note}\n",
    "Check out the [cmip6-preprocessing](https://github.com/jbusecke/cmip6_preprocessing)\n",
    "package for a full renaming function covering all available CMIP6 models, along\n",
    "with some other utilities.\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import intake_esm  # just to display version information\n",
    "\n",
    "intake_esm.show_versions()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}