{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Overview\n",
    "\n",
    "Intake-esm is a data cataloging utility built on top of intake, pandas, and\n",
    "xarray. Intake-esm aims to facilitate:\n",
    "\n",
    "- the discovery of earth’s climate and weather datasets.\n",
    "- the ingestion of these datasets into xarray dataset containers.\n",
    "\n",
    "It's basic usage is shown below. To begin, let's import `intake`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import intake"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading a catalog\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At import time, intake-esm plugin is available in intake’s registry as\n",
    "`esm_datastore` and can be accessed with `intake.open_esm_datastore()` function.\n",
    "For demonstration purposes, we are going to use the catalog for Community Earth\n",
    "System Model Large ensemble (CESM LENS) dataset publicly available in Amazon S3.\n",
    "\n",
    "```{note}\n",
    "You can learn more about CESM LENS dataset in AWS S3 [here](https://registry.opendata.aws/ncar-cesm-lens/)\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can load data from an\n",
    "[ESM Catalog](https://github.com/NCAR/esm-collection-spec) by providing the URL\n",
    "to valid ESM Catalog:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "catalog_url = \"https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json\"\n",
    "col = intake.open_esm_datastore(catalog_url)\n",
    "col"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The summary above tells us that this catalog contains over 400 data assets. We\n",
    "can get more information on the individual data assets contained in the catalog\n",
    "by calling the underlying dataframe created when it is initialized:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col.df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding unique entries for individual columns\n",
    "\n",
    "To get unique values for given columns in the catalog, intake-esm provides a\n",
    "{py:meth}`~intake_esm.core.esm_datastore.unique` method. This method returns a\n",
    "dictionary containing count, and unique values:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col.unique(columns=[\"component\", \"frequency\", \"experiment\"])"
   ]
  },
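  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the shape of that result concrete, here is a minimal pandas sketch that\n",
    "builds a similar dictionary from a small, made-up dataframe. The helper name\n",
    "`summarize_unique` and the sample rows are assumptions for illustration only;\n",
    "this is not how intake-esm implements the method, and the exact return format\n",
    "may differ between versions:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-in for `col.df`; the rows are invented.\n",
    "toy_df = pd.DataFrame(\n",
    "    {\n",
    "        \"component\": [\"atm\", \"atm\", \"lnd\", \"ocn\"],\n",
    "        \"frequency\": [\"daily\", \"monthly\", \"monthly\", \"monthly\"],\n",
    "    }\n",
    ")\n",
    "\n",
    "\n",
    "def summarize_unique(df, columns):\n",
    "    # Map each column name to its number of unique values and the values themselves.\n",
    "    summary = {}\n",
    "    for name in columns:\n",
    "        values = df[name].dropna().unique().tolist()\n",
    "        summary[name] = {\"count\": len(values), \"values\": values}\n",
    "    return summary\n",
    "\n",
    "\n",
    "unique_summary = summarize_unique(toy_df, [\"component\", \"frequency\"])\n",
    "print(unique_summary)\n",
    "```\n"
   ]
  },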
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Search\n",
    "\n",
    "The {py:meth}`~intake_esm.core.esm_datastore.search` method allows the user to\n",
    "perform a query on a catalog using keyword arguments. The keyword argument names\n",
    "must be the names of the columns in the catalog. The search method returns a\n",
    "subset of the catalog with all the entries that match the provided query.\n",
    "\n",
    "### Exact Match Keywords\n",
    "\n",
    "By default, the {py:meth}`~intake_esm.core.esm_datastore.search` method looks\n",
    "for exact matches\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col_subset = col.search(\n",
    "    component=[\"ice_nh\", \"lnd\"],\n",
    "    frequency=[\"monthly\"],\n",
    "    experiment=[\"20C\", \"HIST\"],\n",
    ")\n",
    "col_subset.df"
   ]
  },
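  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to read the query above is as a logical AND across columns and a logical\n",
    "OR within each list of values. The sketch below expresses that reading with\n",
    "plain pandas on an invented dataframe. The helper `exact_search` and the sample\n",
    "rows are hypothetical, and intake-esm's own implementation may differ:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-in for `col.df`; the rows are invented.\n",
    "toy_df = pd.DataFrame(\n",
    "    {\n",
    "        \"component\": [\"atm\", \"lnd\", \"lnd\", \"ice_nh\"],\n",
    "        \"frequency\": [\"monthly\", \"monthly\", \"daily\", \"monthly\"],\n",
    "        \"experiment\": [\"20C\", \"20C\", \"HIST\", \"RCP85\"],\n",
    "    }\n",
    ")\n",
    "\n",
    "\n",
    "def exact_search(df, **query):\n",
    "    # Start with every row selected, then narrow column by column (AND).\n",
    "    mask = pd.Series(True, index=df.index)\n",
    "    for column, accepted in query.items():\n",
    "        # Within one column, any of the listed values is accepted (OR).\n",
    "        mask &= df[column].isin(accepted)\n",
    "    return df[mask]\n",
    "\n",
    "\n",
    "subset = exact_search(\n",
    "    toy_df,\n",
    "    component=[\"ice_nh\", \"lnd\"],\n",
    "    frequency=[\"monthly\"],\n",
    "    experiment=[\"20C\", \"HIST\"],\n",
    ")\n",
    "print(subset)\n",
    "```\n",
    "\n",
    "Only the row that satisfies all three conditions at once survives the filter.\n"
   ]
  },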
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Substring matches\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As pointed earlier, the search method looks for exact matches by default.\n",
    "However, with use of wildcards and/or regular expressions, we can find all items\n",
    "with a particular substring in a given column:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find all entries with `wind` in their variable long_name\n",
    "col.search(long_name=\"wind*\").df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find all entries whose variable long name starts with `wind`\n",
    "col.search(long_name=\"^wind\").df"
   ]
  },
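  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When a pattern is treated as a regular expression, `*` means zero or more\n",
    "repetitions of the preceding character rather than any text, so `wind*` and\n",
    "`^wind` select different things. The sketch below uses Python's standard `re`\n",
    "module on invented long names to show the difference. That intake-esm evaluates\n",
    "patterns exactly this way is an assumption here, so check the search\n",
    "documentation for your installed version:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Invented variable long names, for illustration only.\n",
    "long_names = [\n",
    "    \"wind stress\",\n",
    "    \"zonal wind\",\n",
    "    \"meridional surface wind\",\n",
    "    \"winter snowfall\",\n",
    "    \"sea ice thickness\",\n",
    "]\n",
    "\n",
    "# `wind*` as a regex: \"win\" followed by zero or more \"d\", anywhere in the string.\n",
    "contains_pattern = [name for name in long_names if re.search(\"wind*\", name)]\n",
    "\n",
    "# `^wind`: the string must start with \"wind\".\n",
    "starts_with_wind = [name for name in long_names if re.search(\"^wind\", name)]\n",
    "\n",
    "print(contains_pattern)\n",
    "print(starts_with_wind)\n",
    "```\n",
    "\n",
    "In this toy list, `wind*` also picks up `winter snowfall`; a plain `wind`\n",
    "pattern would be stricter.\n"
   ]
  },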
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading datasets\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Intake-esm implements convenience utilities for loading the query results into\n",
    "higher level xarray datasets. The logic for merging/concatenating the query\n",
    "results into higher level xarray datasets is provided in the input JSON file and\n",
    "is available under `.aggregation_info` property:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col.aggregation_info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col.aggregation_info.aggregations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dataframe columns used to determine groups of compatible datasets.\n",
    "col.aggregation_info.groupby_attrs  # or col.groupby_attrs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List of columns used to merge/concatenate compatible multiple Dataset into a single Dataset.\n",
    "col.aggregation_info.agg_columns  # or col.agg_columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To load data assets into xarray datasets, we need to use the\n",
    "{py:meth}`~intake_esm.core.esm_datastore.to_dataset_dict` method. This method\n",
    "returns a dictionary of aggregate xarray datasets as the name hints.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dset_dicts = col_subset.to_dataset_dict(zarr_kwargs={\"consolidated\": True})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[key for key in dset_dicts.keys()]"
   ]
  },
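  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each key identifies one group of compatible assets. Judging from keys such as\n",
    "`lnd.20C.monthly`, a key looks like the values of the grouping columns joined\n",
    "with a dot. The sketch below reproduces that idea with pandas on invented rows.\n",
    "The column list `groupby_attrs` is assumed here rather than read from the\n",
    "catalog, and the variable names are placeholders:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Invented rows; in the real catalog each row describes one data asset.\n",
    "toy_df = pd.DataFrame(\n",
    "    {\n",
    "        \"component\": [\"lnd\", \"lnd\", \"lnd\", \"ice_nh\"],\n",
    "        \"experiment\": [\"20C\", \"20C\", \"HIST\", \"20C\"],\n",
    "        \"frequency\": [\"monthly\", \"monthly\", \"monthly\", \"monthly\"],\n",
    "        \"variable\": [\"SNOW\", \"SOILWATER_10CM\", \"SNOW\", \"aice\"],\n",
    "    }\n",
    ")\n",
    "\n",
    "# Assumed grouping columns (compare with `col.groupby_attrs`).\n",
    "groupby_attrs = [\"component\", \"experiment\", \"frequency\"]\n",
    "\n",
    "# One entry per group: the key joins the group's values with \".\",\n",
    "# and the value lists the variables that belong to that group.\n",
    "groups = {\n",
    "    \".\".join(values): group[\"variable\"].tolist()\n",
    "    for values, group in toy_df.groupby(groupby_attrs)\n",
    "}\n",
    "print(groups)\n",
    "```\n"
   ]
  },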
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can access a particular dataset as follows:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = dset_dicts[\"lnd.20C.monthly\"]\n",
    "print(ds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s create a quick plot for a slice of the data:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.SNOW.isel(time=0, member_id=range(1, 24, 4)).plot(col=\"member_id\", col_wrap=3, robust=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import intake_esm  # just to display version information\n",
    "\n",
    "intake_esm.show_versions()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}