Enforce search query criteria via require_all_on argument

import intake

url = "https://raw.githubusercontent.com/intake/intake-esm/main/tutorial-catalogs/GOOGLE-CMIP6.json"
cat = intake.open_esm_datastore(url)
cat

GOOGLE-CMIP6 catalog with 4 dataset(s) from 261 asset(s):

unique
activity_id 1
institution_id 2
source_id 2
experiment_id 1
member_id 72
table_id 2
variable_id 3
grid_label 2
zstore 261
dcpp_init_year 0
version 4
derived_variable_id 0

By default, intake-esm’s search() method returns entries that fulfill any of the criteria specified in the query. To return only entries that fulfill all of the query criteria, pass the require_all_on argument. It accepts a dataframe column name or a list of column names; within each group defined by those columns, all elements of the query must be satisfied. The argument is best explained with the following example.

Let’s define a query for our catalog that requests multiple variable_ids and multiple experiment_ids from the Omon table_id, all from 3 different source_ids:

# Define our query
query = dict(
    variable_id=["tos", "o2"],
    experiment_id=["historical", "ssp585"],
    table_id=["Omon"],
    source_id=["ACCESS-ESM1-5", "AWI-CM-1-1-MR", "FGOALS-f3-L"],
)

Now, let’s use this query to search for all assets in the catalog that satisfy any combination of these requests (i.e., with require_all_on=None, which is the default):

cat_subset = cat.search(**query)
cat_subset

GOOGLE-CMIP6 catalog with 0 dataset(s) from 0 asset(s):

unique
activity_id 0
institution_id 0
source_id 0
experiment_id 0
member_id 0
table_id 0
variable_id 0
grid_label 0
zstore 0
dcpp_init_year 0
version 0
derived_variable_id 0

Let’s group by source_id and count unique values for a few columns:

cat_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()

experiment_id variable_id table_id
source_id

As you can see, the search results above include source_ids for which we only have one of the two variables, and one or two of the two experiments.
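To see which requested values a given source_id lacks, a small diagnostic over the result dataframe can help. The sketch below is only an illustration written with plain pandas: the helper name missing_values_per_group and the toy rows are hypothetical and are not part of intake-esm or of the catalog used here. In practice one would pass cat_subset.df and the query defined earlier.

```python
import pandas as pd


def missing_values_per_group(df, query, group_column):
    """Map each group to the requested values it lacks, per queried column.

    Hypothetical helper for illustration; not an intake-esm function.
    """
    report = {}
    for name, group in df.groupby(group_column):
        missing = {
            column: sorted(set(wanted) - set(group[column]))
            for column, wanted in query.items()
            if column != group_column
        }
        # Keep only the columns where something is actually missing.
        missing = {column: values for column, values in missing.items() if values}
        if missing:
            report[name] = missing
    return report


# Made-up rows standing in for cat_subset.df (assumption, not real catalog data).
example_df = pd.DataFrame(
    {
        "source_id": ["model-a", "model-a", "model-b"],
        "variable_id": ["tos", "o2", "tos"],
        "experiment_id": ["historical", "ssp585", "historical"],
    }
)
example_query = {
    "variable_id": ["tos", "o2"],
    "experiment_id": ["historical", "ssp585"],
}

report = missing_values_per_group(example_df, example_query, "source_id")
print(report)
```

Groups that appear in the report are the ones a stricter search would be expected to discard; groups that are absent from it contain every requested value.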

We can tell intake-esm to discard any source_id that doesn’t have both variables ["tos", "o2"] and both experiments ["historical", "ssp585"] by passing require_all_on=["source_id"] to the search method:

cat_subset = cat.search(require_all_on=["source_id"], **query)
cat_subset

GOOGLE-CMIP6 catalog with 0 dataset(s) from 0 asset(s):

unique
activity_id 0
institution_id 0
source_id 0
experiment_id 0
member_id 0
table_id 0
variable_id 0
grid_label 0
zstore 0
dcpp_init_year 0
version 0
derived_variable_id 0

cat_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()

experiment_id variable_id table_id
source_id

Notice that with the require_all_on=["source_id"] option, the only source_id returned by our query is the one for which all of the requested variables and experiments were found.
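The difference between the two search modes can be mimicked on a toy table with plain pandas. This is a simplified sketch, not intake-esm's implementation: it assumes that "all" means every requested value of every queried column appears somewhere within a group, and the library's actual matching rules may be stricter. The function names match_any and match_all_on, and the rows in toy_df, are hypothetical.

```python
import pandas as pd

# Toy stand-in for a catalog dataframe (made-up values, not real catalog contents).
toy_df = pd.DataFrame(
    {
        "source_id": ["model-a"] * 4 + ["model-b"] * 2 + ["model-c"],
        "variable_id": ["tos", "o2", "tos", "o2", "tos", "tos", "o2"],
        "experiment_id": [
            "historical", "historical", "ssp585", "ssp585",
            "historical", "ssp585", "historical",
        ],
    }
)

toy_query = {
    "variable_id": ["tos", "o2"],
    "experiment_id": ["historical", "ssp585"],
}


def match_any(df, query):
    """Keep rows whose value in each queried column is one of the requested values."""
    mask = pd.Series(True, index=df.index)
    for column, wanted in query.items():
        mask &= df[column].isin(wanted)
    return df[mask]


def match_all_on(df, query, group_column):
    """Keep only groups that contain every requested value of every queried column."""
    matched = match_any(df, query)

    def group_is_complete(group):
        return all(set(wanted) <= set(group[column]) for column, wanted in query.items())

    return matched.groupby(group_column).filter(group_is_complete)


any_sources = sorted(match_any(toy_df, toy_query)["source_id"].unique())
all_sources = sorted(match_all_on(toy_df, toy_query, "source_id")["source_id"].unique())
print(any_sources)  # ['model-a', 'model-b', 'model-c']
print(all_sources)  # ['model-a']
```

In this toy table, model-b has only one of the two variables and model-c has only one variable and one experiment, so both survive the default-style match but are dropped once completeness is required per source_id.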

import intake_esm  # just to display version information
intake_esm.show_versions()

INSTALLED VERSIONS
------------------

cftime: 1.6.2
dask: 2022.6.1
fastprogress: 1.0.3
fsspec: 2022.8.2
gcsfs: 2022.8.2
intake: 0.6.6
intake_esm: 2022.9.18.post4+dirty
netCDF4: 1.6.1
pandas: 1.5.0
requests: 2.28.1
s3fs: 2022.8.2
xarray: 2022.6.0
zarr: 2.12.0