{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Manipulating DataFrame (in-memory catalog)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "import intake"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The in-memory representation of an Earth System Model (ESM) catalog is a pandas\n",
    "dataframe, and is accessible via the `.df` property:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://storage.googleapis.com/cmip6/pangeo-cmip6.json\"\n",
    "col = intake.open_esm_datastore(url)\n",
    "col.df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we will go through some examples showing how to manipulate this\n",
    "dataframe outside of intake-esm.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use Case 1: Complex Search Queries\n",
    "\n",
    "Let's say we are interested in datasets with the following attributes:\n",
    "\n",
    "- `experiment_id=[\"historical\"]`\n",
    "- `table_id=\"Amon\"`\n",
    "- `variable_id=\"tas\"`\n",
    "- `source_id=['TaiESM1', 'AWI-CM-1-1-MR', 'AWI-ESM-1-1-LR', 'BCC-CSM2-MR', 'BCC-ESM1', 'CAMS-CSM1-0', 'CAS-ESM2-0', 'UKESM1-0-LL']`\n",
    "\n",
    "In addition to these attributes, **we are interested in the first ensemble\n",
    "member (member_id) of each model (source_id) only**.\n",
    "\n",
    "This can be achieved in two steps:\n",
    "\n",
    "### Step 1: Run a query against the catalog\n",
    "\n",
    "We can run a query against the catalog:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col_subset = col.search(\n",
    "    experiment_id=[\"historical\"],\n",
    "    table_id=\"Amon\",\n",
    "    variable_id=\"tas\",\n",
    "    source_id=[\n",
    "        \"TaiESM1\",\n",
    "        \"AWI-CM-1-1-MR\",\n",
    "        \"AWI-ESM-1-1-LR\",\n",
    "        \"BCC-CSM2-MR\",\n",
    "        \"BCC-ESM1\",\n",
    "        \"CAMS-CSM1-0\",\n",
    "        \"CAS-ESM2-0\",\n",
    "        \"UKESM1-0-LL\",\n",
    "    ],\n",
    ")\n",
    "col_subset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Select the first `member_id` for each `source_id`\n",
    "\n",
    "The subsetted catalog contains `source_id` with the following number of\n",
    "`member_id` per `source_id`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col_subset.df.groupby(\"source_id\")[\"member_id\"].nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get the first `member_id` for each `source_id`, we group the dataframe by\n",
    "`source_id` and use the `.first()` function to retrieve the first `member_id`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = col_subset.df.groupby([\"source_id\"])\n",
    "df = grouped.first().reset_index()\n",
    "\n",
    "# Confirm that we have one ensemble member per source_id\n",
    "\n",
    "df.groupby(\"source_id\")[\"member_id\"].nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Attach the new dataframe to our catalog object\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col_subset.df = df\n",
    "col_subset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dsets = col_subset.to_dataset_dict(zarr_kwargs={\"consolidated\": True})\n",
    "[key for key in dsets]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dsets[\"CMIP.CAS.CAS-ESM2-0.historical.Amon.gn\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import intake_esm  # just to display version information\n",
    "\n",
    "intake_esm.show_versions()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}