Enforce search query criteria via `require_all_on` argument

import intake

url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
cat = intake.open_esm_datastore(url)
cat

pangeo-cmip6 catalog with 7674 dataset(s) from 514818 asset(s):

	unique
activity_id	18
institution_id	36
source_id	88
experiment_id	170
member_id	657
table_id	37
variable_id	700
grid_label	10
zstore	514818
dcpp_init_year	61
version	736
derived_variable_id	0

By default, intake-esm’s search() method returns entries that fulfill any of the criteria specified in the query. To return only entries that fulfill all of the query criteria, pass the require_all_on argument. It accepts a dataframe column or a list of dataframe columns; the results are grouped by those columns, and a group is kept only if it satisfies every criterion in the query. The argument is easiest to understand through an example.

Let’s define a query for our catalog that requests multiple variable_ids and multiple experiment_ids from the Omon table_id, restricted to three source_ids:

# Define our query
query = dict(
    variable_id=["thetao", "o2", "tos"],
    experiment_id=["historical", "ssp585"],
    table_id=["Omon"],
    source_id=["ACCESS-ESM1-5", "AWI-CM-1-1-MR", "FGOALS-f3-L"],
)

Now, let’s use this query to search for all assets in the catalog that satisfy any combination of these requests (i.e., with require_all_on=None, which is the default):

cat_subset = cat.search(**query)
cat_subset

pangeo-cmip6 catalog with 6 dataset(s) from 143 asset(s):

	unique
activity_id	2
institution_id	3
source_id	3
experiment_id	2
member_id	30
table_id	1
variable_id	3
grid_label	1
zstore	143
dcpp_init_year	1
version	14
derived_variable_id	0

Let’s group by source_id and count unique values for a few columns:

cat_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()

	experiment_id	variable_id	table_id
source_id
ACCESS-ESM1-5	2	3	1
AWI-CM-1-1-MR	2	2	1
FGOALS-f3-L	2	2	1

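As the table shows, the plain search keeps source_ids that cover only part of the request: AWI-CM-1-1-MR and FGOALS-f3-L each provide only two of the three variables. The counts are per source_id, so they also do not show whether every variable is available for both experiments.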
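To see which requested variables a given source_id lacks, one option is to compare the variables found per source_id against the requested set. The sketch below does this on a small hypothetical dataframe standing in for cat_subset.df; the MODEL-A and MODEL-B names and their rows are invented for illustration and are not taken from the catalog.

```python
import pandas as pd

# Variables requested in the query.
requested = {"thetao", "o2", "tos"}

# Hypothetical stand-in for cat_subset.df (invented rows, not catalog data).
toy_df = pd.DataFrame(
    {
        "source_id": ["MODEL-A", "MODEL-A", "MODEL-A", "MODEL-B", "MODEL-B"],
        "variable_id": ["thetao", "o2", "tos", "thetao", "tos"],
    }
)

# Collect the distinct variables available for each source_id.
available = toy_df.groupby("source_id")["variable_id"].unique()

# For each source_id, list the requested variables that are absent.
missing = {
    source: sorted(requested - set(found)) for source, found in available.items()
}
print(missing)
```

With the real catalog, the same two steps could be applied to cat_subset.df directly instead of toy_df.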

We can tell intake-esm to discard any source_id that doesn’t have all three variables ["thetao", "o2", "tos"] and both experiments ["historical", "ssp585"] by passing require_all_on=["source_id"] to the search method:

cat_subset = cat.search(require_all_on=["source_id"], **query)
cat_subset

pangeo-cmip6 catalog with 2 dataset(s) from 119 asset(s):

	unique
activity_id	2
institution_id	1
source_id	1
experiment_id	2
member_id	30
table_id	1
variable_id	3
grid_label	1
zstore	119
dcpp_init_year	1
version	9
derived_variable_id	0

cat_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()

	experiment_id	variable_id	table_id
source_id
ACCESS-ESM1-5	2	3	1

With require_all_on=["source_id"], the search returns only ACCESS-ESM1-5, the one source_id for which all of the requested variables and experiments were found.
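The filtering idea can be sketched in plain pandas. This is a conceptual illustration, not intake-esm's implementation: the helper filter_require_all_on and the toy rows are hypothetical, and the sketch assumes that a group is kept only when every combination of the remaining query values is present in it.

```python
import itertools

import pandas as pd


def filter_require_all_on(df, query, require_all_on):
    """Hypothetical helper: keep groups containing every queried combination."""
    # Ordinary search step: keep rows whose values match the query.
    mask = pd.Series(True, index=df.index)
    for column, values in query.items():
        mask &= df[column].isin(values)
    matched = df[mask]

    # Grouping columns are excluded from the completeness check.
    check = {k: v for k, v in query.items() if k not in require_all_on}
    keys = list(check)
    # Assumption: every combination of the remaining values must be present.
    required = set(itertools.product(*check.values()))

    kept = []
    for _, group in matched.groupby(require_all_on):
        present = set(group[keys].itertuples(index=False, name=None))
        if required <= present:
            kept.append(group)
    return pd.concat(kept) if kept else matched.iloc[0:0]


# Invented rows: MODEL-B lacks the (o2, historical) combination.
toy = pd.DataFrame(
    {
        "source_id": ["MODEL-A"] * 4 + ["MODEL-B"] * 3,
        "experiment_id": [
            "historical", "historical", "ssp585", "ssp585",
            "historical", "ssp585", "ssp585",
        ],
        "variable_id": ["tos", "o2", "tos", "o2", "tos", "tos", "o2"],
    }
)
toy_query = {"variable_id": ["tos", "o2"], "experiment_id": ["historical", "ssp585"]}

result = filter_require_all_on(toy, toy_query, ["source_id"])
print(sorted(result["source_id"].unique()))
```

In this toy data MODEL-B has both variables and both experiments overall, so a per-column nunique() count would not flag it; it is dropped here only because one variable/experiment combination is absent, which is the assumption stated above.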
