# Search and Discovery

Intake-esm provides functionality to execute queries against the catalog. This
notebook provides a more in-depth treatment of the search API in intake-esm,
with detailed information that you can refer to when needed.


In [None]:
import warnings

warnings.filterwarnings("ignore")
import intake

In [None]:
catalog_url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
col = intake.open_esm_datastore(catalog_url)
col

In [None]:
col.df.head()

## Exact Match Keywords

The {py:meth}`~intake_esm.core.esm_datastore.search` method allows the user to
perform a query on a catalog using keyword arguments. The keyword argument names
must be the names of the columns in the catalog. By default, the
{py:meth}`~intake_esm.core.esm_datastore.search` method looks for exact matches,
and is case sensitive:


In [None]:
col.search(experiment="20C", long_name="wind").df

As you can see, the example above returns an empty catalog: no entry in the
`20C` experiment has a long name that is exactly `wind`.
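
To see why an exact-match query can come back empty, it may help to mimic the
matching rule on a few rows. The sketch below uses plain Python rather than
intake-esm; the rows and the `exact_match` helper are hypothetical and only
illustrate exact, case-sensitive matching as described above.

```python
# Hypothetical catalog rows; the values are made up for illustration.
rows = [
    {"experiment": "20C", "long_name": "Zonal wind"},
    {"experiment": "20C", "long_name": "wind stress"},
    {"experiment": "RCP85", "long_name": "wind"},
]


def exact_match(rows, **query):
    """Keep rows whose columns equal the query values exactly (case sensitive)."""
    return [
        row
        for row in rows
        if all(row[column] == value for column, value in query.items())
    ]


# No row has experiment "20C" together with a long_name that is exactly "wind".
print(exact_match(rows, experiment="20C", long_name="wind"))
# Matching is case sensitive, so "zonal wind" does not match "Zonal wind".
print(exact_match(rows, long_name="zonal wind"))
# An exact value in both columns does match.
print(exact_match(rows, experiment="RCP85", long_name="wind"))
```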


## Substring Matches

In some cases, you may not know the exact term to look for. For such cases,
intake-esm supports searching for substring matches. With the use of wildcards
and/or regular expressions, we can find all items with a particular substring in
a given column. Let's search for:

- entries from `experiment` = '20C'
- all entries whose variable long name **contains** `wind`


In [None]:
col.search(experiment="20C", long_name="wind*").df

Now, let's search for:

- entries from `experiment` = '20C'
- all entries whose variable long name **starts** with `wind`


In [None]:
col.search(experiment="20C", long_name="^wind").df
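
To get a feel for how the two patterns above differ, here is a small sketch
using Python's standard `re` module. It assumes that intake-esm treats the
pattern as a regular expression searched anywhere in the value, much like
`re.search`; that is an assumption for illustration, and the long names and the
`matching` helper are hypothetical.

```python
import re

# Hypothetical long names; not taken from the catalog above.
long_names = ["Zonal wind", "wind stress", "Temperature", "winter mean"]


def matching(pattern, values):
    """Return the values in which the pattern is found anywhere (case sensitive)."""
    return [value for value in values if re.search(pattern, value)]


# As a regex, "wind*" means "win" followed by zero or more "d",
# so it also matches "winter mean".
print(matching("wind*", long_names))
# "wind" on its own is found anywhere in the string.
print(matching("wind", long_names))
# "^wind" anchors the match at the start of the string.
print(matching("^wind", long_names))
```

Under this assumption, a plain substring such as `wind` is the tighter choice
when only values that contain the full word are wanted.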

## Enforce Query Criteria via `require_all_on` argument


By default, intake-esm’s {py:meth}`~intake_esm.core.esm_datastore.search` method
returns entries that fulfill **any** of the criteria specified in the query.
Intake-esm can return entries that fulfill **all** query criteria when the user
supplies the `require_all_on` argument. The `require_all_on` parameter can be a
dataframe column or a list of dataframe columns across which all elements must
satisfy the query criteria. The `require_all_on` argument is best explained with
the following example.

Let’s define a query for our collection that requests multiple variable_ids and
multiple experiment_ids from the Omon table_id, all from 3 different source_ids:


In [None]:
catalog_url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
col = intake.open_esm_datastore(catalog_url)
col

In [None]:
# Define our query
query = dict(
    variable_id=["thetao", "o2"],
    experiment_id=["historical", "ssp245", "ssp585"],
    table_id=["Omon"],
    source_id=["ACCESS-ESM1-5", "AWI-CM-1-1-MR", "FGOALS-f3-L"],
)

Now, let’s use this query to search for all assets in the collection that
satisfy any combination of these requests (i.e., with `require_all_on=None`,
which is the default):


In [None]:
col_subset = col.search(**query)
col_subset

In [None]:
# Group by `source_id` and count unique values for a few columns
col_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()

As you can see, the search results above include source_ids for which we only
have one of the two variables, and one or two of the three experiments.

We can tell intake-esm to discard any source_id that doesn’t have both variables
`["thetao", "o2"]` and all three experiments
`["historical", "ssp245", "ssp585"]` by passing `require_all_on=["source_id"]`
to the search method:


In [None]:
col_subset = col.search(require_all_on=["source_id"], **query)
col_subset

In [None]:
col_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()

Notice that with the `require_all_on=["source_id"]` option, the only source_id
that was returned by our query was the source_id for which all of the variables
and experiments were found.
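
The idea behind `require_all_on` can be sketched in plain Python: group the
matching rows by `source_id` and keep only the groups that cover every requested
value. This is a minimal illustration of the semantics, not intake-esm's
implementation; the rows and the `MODEL-A`/`MODEL-B` names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (source_id, experiment_id, variable_id) rows.
rows = [
    ("MODEL-A", "historical", "thetao"),
    ("MODEL-A", "historical", "o2"),
    ("MODEL-A", "ssp585", "thetao"),
    ("MODEL-A", "ssp585", "o2"),
    ("MODEL-B", "historical", "thetao"),
    ("MODEL-B", "ssp585", "thetao"),
]
wanted_experiments = {"historical", "ssp585"}
wanted_variables = {"thetao", "o2"}

# Collect the experiments and variables found for each source_id.
found = defaultdict(lambda: (set(), set()))
for source_id, experiment_id, variable_id in rows:
    experiments, variables = found[source_id]
    experiments.add(experiment_id)
    variables.add(variable_id)

# Keep only the source_ids that cover every requested value.
complete = sorted(
    source_id
    for source_id, (experiments, variables) in found.items()
    if wanted_experiments <= experiments and wanted_variables <= variables
)
print(complete)
```

Here `MODEL-B` is dropped because it has no `o2` rows, which mirrors how the
incomplete source_ids disappear from the search results above.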


In [None]:
import intake_esm # just to display version information

intake_esm.show_versions()