{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Search and Discovery\n",
    "\n",
    "Intake-esm provides functionality to execute queries against the catalog. This\n",
    "notebook provided a more in-depth treatment of the search API in intake-esm,\n",
    "with detailed information that you can refer to when needed.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "import intake"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "catalog_url = \"https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json\"\n",
    "col = intake.open_esm_datastore(catalog_url)\n",
    "col"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col.df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exact Match Keywords\n",
    "\n",
    "The {py:meth}`~intake_esm.core.esm_datastore.search` method allows the user to\n",
    "perform a query on a catalog using keyword arguments. The keyword argument names\n",
    "must be the names of the columns in the catalog. By default, the\n",
    "{py:meth}`~intake_esm.core.esm_datastore.search` method looks for exact matches,\n",
    "and is case sensitive:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col.search(experiment=\"20C\", long_name=\"wind\").df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the example above returns an empty catalog.\n"
   ]
  },
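  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When an exact-match query comes back empty, it can help to look at the values a\n",
    "column actually holds. The cell below is a minimal sketch that uses plain pandas\n",
    "on `col.df` (the dataframe displayed earlier) rather than a dedicated intake-esm\n",
    "helper. Showing only the first ten values is an arbitrary choice to keep the\n",
    "output short.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distinct values stored in the `long_name` column (missing values dropped),\n",
    "# sorted alphabetically and truncated to the first ten for display\n",
    "sorted(col.df[\"long_name\"].dropna().unique())[:10]"
   ]
  },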
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Substring Matches\n",
    "\n",
    "In some cases, you may not know the exact term to look for. For such cases,\n",
    "inkake-esm supports searching for substring matches. With use of wildcards\n",
    "and/or regular expressions, we can find all items with a particular substring in\n",
    "a given column. Let's search for:\n",
    "\n",
    "- entries from `experiment` = '20C'\n",
    "- all entries whose variable long name **contains** `wind`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col.search(experiment=\"20C\", long_name=\"wind*\").df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's search for:\n",
    "\n",
    "- entries from `experiment` = '20C'\n",
    "- all entries whose variable long name **starts** with `wind`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col.search(experiment=\"20C\", long_name=\"^wind\").df"
   ]
  },
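  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the difference between \"contains\" and \"starts with\" concrete, here is a\n",
    "minimal sketch that uses only Python's built-in `re` module. It assumes that\n",
    "pattern matching in the search behaves like a regular-expression search on each\n",
    "value, which is an assumption made for illustration and not a description of\n",
    "intake-esm's internals. The long names in the list are hypothetical and are not\n",
    "taken from the catalog.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical long names, for illustration only\n",
    "long_names = [\"wind stress\", \"zonal wind\", \"surface temperature\"]\n",
    "\n",
    "# An unanchored pattern matches anywhere in the string (\"contains\")\n",
    "contains_wind = [name for name in long_names if re.search(\"wind\", name)]\n",
    "\n",
    "# A leading `^` anchors the pattern to the start of the string (\"starts with\")\n",
    "starts_with_wind = [name for name in long_names if re.search(\"^wind\", name)]\n",
    "\n",
    "print(contains_wind)\n",
    "print(starts_with_wind)"
   ]
  },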
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Enforce Query Criteria via `require_all_on argument`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default intake-esm’s {py:meth}`~intake_esm.core.esm_datastore.search` method\n",
    "returns entries that fulfill **any** of the criteria specified in the query.\n",
    "Intake-esm can return entries that fulfill **all** query criteria when the user\n",
    "supplies the `require_all_on` argument. The `require_all_on` parameter can be a\n",
    "dataframe column or a list of dataframe columns across which all elements must\n",
    "satisfy the query criteria. The `require_all_on` argument is best explained with\n",
    "the following example.\n",
    "\n",
    "Let’s define a query for our collection that requests multiple variable_ids and\n",
    "multiple experiment_ids from the Omon table_id, all from 3 different source_ids:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "catalog_url = \"https://storage.googleapis.com/cmip6/pangeo-cmip6.json\"\n",
    "col = intake.open_esm_datastore(catalog_url)\n",
    "col"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define our query\n",
    "query = dict(\n",
    "    variable_id=[\"thetao\", \"o2\"],\n",
    "    experiment_id=[\"historical\", \"ssp245\", \"ssp585\"],\n",
    "    table_id=[\"Omon\"],\n",
    "    source_id=[\"ACCESS-ESM1-5\", \"AWI-CM-1-1-MR\", \"FGOALS-f3-L\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let’s use this query to search for all assets in the collection that\n",
    "satisfy any combination of these requests (i.e., with `require_all_on=None`,\n",
    "which is the default):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col_subset = col.search(**query)\n",
    "col_subset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Group by `source_id` and count unique values for a few columns\n",
    "col_subset.df.groupby(\"source_id\")[[\"experiment_id\", \"variable_id\", \"table_id\"]].nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the search results above include source_ids for which we only\n",
    "have one of the two variables, and one or two of the three experiments.\n",
    "\n",
    "We can tell intake-esm to discard any source_id that doesn’t have both variables\n",
    "`[\"thetao\", \"o2\"]` and all three experiments\n",
    "`[\"historical\", \"ssp245\", \"ssp585\"]` by passing `require_all_on=[\"source_id\"]`\n",
    "to the search method:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col_subset = col.search(require_all_on=[\"source_id\"], **query)\n",
    "col_subset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col_subset.df.groupby(\"source_id\")[[\"experiment_id\", \"variable_id\", \"table_id\"]].nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that with the `require_all_on=[\"source_id\"]` option, the only source_id\n",
    "that was returned by our query was the source_id for which all of the variables\n",
    "and experiments were found.\n"
   ]
  },
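  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The filtering idea behind `require_all_on` can be sketched in plain Python. This\n",
    "is an illustration of the behavior described above, not intake-esm's\n",
    "implementation: the rows, the model names `model-A` and `model-B`, and the\n",
    "helper variables are all hypothetical. A `source_id` is kept only if its rows\n",
    "cover every requested variable and every requested experiment.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "# Hypothetical catalog rows: (source_id, variable_id, experiment_id)\n",
    "rows = [\n",
    "    (\"model-A\", \"thetao\", \"historical\"),\n",
    "    (\"model-A\", \"o2\", \"historical\"),\n",
    "    (\"model-A\", \"thetao\", \"ssp585\"),\n",
    "    (\"model-A\", \"o2\", \"ssp585\"),\n",
    "    (\"model-B\", \"thetao\", \"historical\"),\n",
    "    (\"model-B\", \"thetao\", \"ssp585\"),\n",
    "]\n",
    "wanted_variables = {\"thetao\", \"o2\"}\n",
    "wanted_experiments = {\"historical\", \"ssp585\"}\n",
    "\n",
    "# Collect the variables and experiments available for each source_id\n",
    "found = defaultdict(lambda: (set(), set()))\n",
    "for source_id, variable_id, experiment_id in rows:\n",
    "    variables, experiments = found[source_id]\n",
    "    variables.add(variable_id)\n",
    "    experiments.add(experiment_id)\n",
    "\n",
    "# Keep only the source_ids that cover every requested value\n",
    "complete = sorted(\n",
    "    source_id\n",
    "    for source_id, (variables, experiments) in found.items()\n",
    "    if wanted_variables <= variables and wanted_experiments <= experiments\n",
    ")\n",
    "print(complete)  # model-B is dropped because it has no `o2` rows"
   ]
  },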
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import intake_esm  # just to display version information\n",
    "\n",
    "intake_esm.show_versions()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}