{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Search and Discovery\n", "\n", "Intake-esm provides functionality to execute queries against the catalog. This\n", "notebook provides a more in-depth treatment of the search API in intake-esm,\n", "with detailed information that you can refer to when needed.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "import intake" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "catalog_url = \"https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json\"\n", "col = intake.open_esm_datastore(catalog_url)\n", "col" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "col.df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exact Match Keywords\n", "\n", "The {py:meth}`~intake_esm.core.esm_datastore.search` method allows the user to\n", "perform a query on a catalog using keyword arguments. The keyword argument names\n", "must be the names of the columns in the catalog. By default, the\n", "{py:meth}`~intake_esm.core.esm_datastore.search` method looks for exact matches,\n", "and is case-sensitive:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "col.search(experiment=\"20C\", long_name=\"wind\").df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the example above returns an empty catalog.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Substring Matches\n", "\n", "In some cases, you may not know the exact term to look for. For such cases,\n", "intake-esm supports searching for substring matches. With the use of wildcards\n", "and/or regular expressions, we can find all items with a particular substring in\n", "a given column. 
Let's search for:\n", "\n", "- entries from `experiment` = '20C'\n", "- all entries whose variable long name **contains** `wind`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "col.search(experiment=\"20C\", long_name=\"wind*\").df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's search for:\n", "\n", "- entries from `experiment` = '20C'\n", "- all entries whose variable long name **starts** with `wind`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "col.search(experiment=\"20C\", long_name=\"^wind\").df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Enforce Query Criteria via the `require_all_on` Argument\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, intake-esm’s {py:meth}`~intake_esm.core.esm_datastore.search` method\n", "returns entries that fulfill **any** of the criteria specified in the query.\n", "Intake-esm can return entries that fulfill **all** query criteria when the user\n", "supplies the `require_all_on` argument. The `require_all_on` parameter can be a\n", "dataframe column or a list of dataframe columns across which all elements must\n", "satisfy the query criteria. 
The `require_all_on` argument is best explained with\n", "the following example.\n", "\n", "Let’s define a query for our collection that requests multiple variable_ids and\n", "multiple experiment_ids from the Omon table_id, all from 3 different source_ids:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "catalog_url = \"https://storage.googleapis.com/cmip6/pangeo-cmip6.json\"\n", "col = intake.open_esm_datastore(catalog_url)\n", "col" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define our query\n", "query = dict(\n", " variable_id=[\"thetao\", \"o2\"],\n", " experiment_id=[\"historical\", \"ssp245\", \"ssp585\"],\n", " table_id=[\"Omon\"],\n", " source_id=[\"ACCESS-ESM1-5\", \"AWI-CM-1-1-MR\", \"FGOALS-f3-L\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let’s use this query to search for all assets in the collection that\n", "satisfy any combination of these requests (i.e., with `require_all_on=None`,\n", "which is the default):\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "col_subset = col.search(**query)\n", "col_subset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Group by `source_id` and count unique values for a few columns\n", "col_subset.df.groupby(\"source_id\")[[\"experiment_id\", \"variable_id\", \"table_id\"]].nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the search results above include source_ids for which we only\n", "have one of the two variables, and one or two of the three experiments.\n", "\n", "We can tell intake-esm to discard any source_id that doesn’t have both variables\n", "`[\"thetao\", \"o2\"]` and all three experiments\n", "`[\"historical\", \"ssp245\", \"ssp585\"]` by passing `require_all_on=[\"source_id\"]`\n", "to the search method:\n" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "col_subset = col.search(require_all_on=[\"source_id\"], **query)\n", "col_subset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "col_subset.df.groupby(\"source_id\")[[\"experiment_id\", \"variable_id\", \"table_id\"]].nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that with the `require_all_on=[\"source_id\"]` option, the only source_id\n", "that was returned by our query was the source_id for which all of the variables\n", "and experiments were found.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import intake_esm # just to display version information\n", "\n", "intake_esm.show_versions()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }