Enforce search query criteria via require_all_on argument
import intake
url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
cat = intake.open_esm_datastore(url)
cat
pangeo-cmip6 catalog with 7674 dataset(s) from 514818 asset(s):
| | unique |
|---|---|
| activity_id | 18 |
| institution_id | 36 |
| source_id | 88 |
| experiment_id | 170 |
| member_id | 657 |
| table_id | 37 |
| variable_id | 700 |
| grid_label | 10 |
| zstore | 514818 |
| dcpp_init_year | 61 |
| version | 736 |
| derived_variable_id | 0 |
By default, intake-esm’s search() method returns every entry that matches any of the
values requested for each queried column, regardless of whether the other requested
values are also available. When the user supplies the require_all_on argument,
intake-esm instead returns only the groups of entries that fulfill all of the query
criteria. The require_all_on parameter can be a dataframe column or a list of
dataframe columns; the catalog is grouped by these columns, and a group is kept only
if it satisfies every criterion in the query. The require_all_on argument is best
explained with the following example.
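Before turning to the real catalog, the difference between the two behaviours can be sketched on a toy dataframe with plain pandas. This is only a simplified model of the idea, not intake-esm’s actual implementation: the rows are made up, and the helper names `search_any` and `search_require_all` are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for a catalog dataframe (not real CMIP6 assets).
df = pd.DataFrame(
    [
        ("A", "historical", "thetao"),
        ("A", "historical", "tos"),
        ("A", "ssp585", "thetao"),
        ("A", "ssp585", "tos"),
        ("B", "historical", "thetao"),
        ("B", "ssp585", "thetao"),
        ("C", "historical", "tos"),
        ("C", "piControl", "tos"),
    ],
    columns=["source_id", "experiment_id", "variable_id"],
)

query = {
    "variable_id": ["thetao", "tos"],
    "experiment_id": ["historical", "ssp585"],
}


def search_any(df, query):
    """Keep rows whose value in each queried column is one of the requested values."""
    mask = pd.Series(True, index=df.index)
    for column, values in query.items():
        mask &= df[column].isin(values)
    return df[mask]


def search_require_all(df, query, require_all_on):
    """Keep only the groups that contain every requested value of every queried column."""
    matches = search_any(df, query)
    # Columns used for grouping are constant within a group, so skip them.
    checks = {c: set(v) for c, v in query.items() if c not in require_all_on}
    kept = [
        group
        for _, group in matches.groupby(require_all_on)
        if all(wanted <= set(group[column]) for column, wanted in checks.items())
    ]
    return pd.concat(kept) if kept else matches.iloc[0:0]


print(sorted(set(search_any(df, query)["source_id"])))  # ['A', 'B', 'C']
print(sorted(set(search_require_all(df, query, ["source_id"])["source_id"])))  # ['A']
```

In this toy data, source B lacks `tos` and source C lacks both `thetao` and `ssp585`, so only source A survives once every requested value is required per `source_id`.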
Let’s define a query for our catalog that requests three variable_ids and two experiment_ids from the Omon table_id, across three different source_ids:
# Define our query
query = dict(
    variable_id=["thetao", "o2", "tos"],
    experiment_id=["historical", "ssp585"],
    table_id=["Omon"],
    source_id=["ACCESS-ESM1-5", "AWI-CM-1-1-MR", "FGOALS-f3-L"],
)
Now, let’s use this query to search for all assets in the catalog that
satisfy any combination of these requests (i.e., with require_all_on=None,
which is the default):
cat_subset = cat.search(**query)
cat_subset
pangeo-cmip6 catalog with 6 dataset(s) from 143 asset(s):
| | unique |
|---|---|
| activity_id | 2 |
| institution_id | 3 |
| source_id | 3 |
| experiment_id | 2 |
| member_id | 30 |
| table_id | 1 |
| variable_id | 3 |
| grid_label | 1 |
| zstore | 143 |
| dcpp_init_year | 1 |
| version | 14 |
| derived_variable_id | 0 |
Let’s group by source_id and count unique values for a few columns:
cat_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()
| source_id | experiment_id | variable_id | table_id |
|---|---|---|---|
| ACCESS-ESM1-5 | 2 | 3 | 1 |
| AWI-CM-1-1-MR | 2 | 2 | 1 |
| FGOALS-f3-L | 2 | 2 | 1 |
As you can see, the search results above include source_ids that provide only two of the three requested variables (AWI-CM-1-1-MR and FGOALS-f3-L), even though every source_id provides both experiments.
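To see exactly which requested values each group lacks, rather than only how many it has, one option is a small pandas helper like the one below. It is a sketch, not part of intake-esm: `missing_per_group` is a hypothetical name, and the demonstration rows and model names are made up so that the block runs without network access.

```python
import pandas as pd


def missing_per_group(df, query, group_column):
    """Map each group to the requested values it lacks, per queried column."""
    report = {}
    for key, group in df.groupby(group_column):
        missing = {}
        for column, values in query.items():
            if column == group_column:
                continue  # the grouping column is constant within a group
            absent = sorted(set(values) - set(group[column]))
            if absent:
                missing[column] = absent
        if missing:
            report[key] = missing
    return report


# Hypothetical search result (illustrative model names, not real catalog entries).
toy = pd.DataFrame(
    [
        ("MODEL-A", "historical", "thetao"),
        ("MODEL-A", "historical", "o2"),
        ("MODEL-A", "ssp585", "thetao"),
        ("MODEL-A", "ssp585", "o2"),
        ("MODEL-B", "historical", "thetao"),
        ("MODEL-B", "ssp585", "thetao"),
    ],
    columns=["source_id", "experiment_id", "variable_id"],
)
toy_query = {
    "variable_id": ["thetao", "o2"],
    "experiment_id": ["historical", "ssp585"],
}

report = missing_per_group(toy, toy_query, "source_id")
print(report)  # {'MODEL-B': {'variable_id': ['o2']}}
```

With the catalog on this page, the equivalent call would be `missing_per_group(cat_subset.df, query, "source_id")`.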
We can tell intake-esm to discard any source_id that doesn’t have all three variables
["thetao", "o2", "tos"] and both experiments
["historical", "ssp585"] by passing require_all_on=["source_id"]
to the search method:
cat_subset = cat.search(require_all_on=["source_id"], **query)
cat_subset
pangeo-cmip6 catalog with 2 dataset(s) from 119 asset(s):
| | unique |
|---|---|
| activity_id | 2 |
| institution_id | 1 |
| source_id | 1 |
| experiment_id | 2 |
| member_id | 30 |
| table_id | 1 |
| variable_id | 3 |
| grid_label | 1 |
| zstore | 119 |
| dcpp_init_year | 1 |
| version | 9 |
| derived_variable_id | 0 |
cat_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()
| source_id | experiment_id | variable_id | table_id |
|---|---|---|---|
| ACCESS-ESM1-5 | 2 | 3 | 1 |
Notice that with the require_all_on=["source_id"] option, the query returns only
ACCESS-ESM1-5, the one source_id for which all three variables and both
experiments were found.