Enforce search query criteria via require_all_on argument
import intake
url = "https://raw.githubusercontent.com/intake/intake-esm/main/tutorial-catalogs/GOOGLE-CMIP6.json"
cat = intake.open_esm_datastore(url)
cat
GOOGLE-CMIP6 catalog with 4 dataset(s) from 261 asset(s):
unique | |
---|---|
activity_id | 1 |
institution_id | 2 |
source_id | 2 |
experiment_id | 1 |
member_id | 72 |
table_id | 2 |
variable_id | 3 |
grid_label | 2 |
zstore | 261 |
dcpp_init_year | 0 |
version | 4 |
derived_variable_id | 0 |
By default, intake-esm’s search() method returns entries that fulfill any of the criteria specified in the query. Intake-esm can return entries that fulfill all query criteria when the user supplies the require_all_on argument. The require_all_on parameter can be a dataframe column or a list of dataframe columns across which all elements must satisfy the query criteria. The require_all_on argument is best explained with the following example.
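Before turning to the real catalog, the idea can be sketched on a toy table in plain Python. This is a simplified conceptual model, not intake-esm's actual implementation: the rows, the model-a/model-b/model-c names, and the helper functions search_default and search_require_all are all hypothetical. It assumes that a group is kept only when every requested value of each remaining query column appears in it; the library itself may apply a stricter rule.

```python
from collections import defaultdict

# Hypothetical toy catalog: one dict per asset (not real CMIP6 data).
rows = [
    {"source_id": "model-a", "experiment_id": "historical", "variable_id": "tos"},
    {"source_id": "model-a", "experiment_id": "historical", "variable_id": "o2"},
    {"source_id": "model-a", "experiment_id": "ssp585", "variable_id": "tos"},
    {"source_id": "model-a", "experiment_id": "ssp585", "variable_id": "o2"},
    {"source_id": "model-b", "experiment_id": "historical", "variable_id": "tos"},
    {"source_id": "model-b", "experiment_id": "ssp585", "variable_id": "tos"},
    {"source_id": "model-c", "experiment_id": "historical", "variable_id": "pr"},
]

query = {
    "variable_id": ["tos", "o2"],
    "experiment_id": ["historical", "ssp585"],
}


def search_default(rows, query):
    """Keep rows whose value in each queried column is one of the requested values."""
    return [r for r in rows if all(r[col] in wanted for col, wanted in query.items())]


def search_require_all(rows, query, require_all_on):
    """Keep only groups (keyed by require_all_on) that cover every requested value."""
    groups = defaultdict(list)
    for r in search_default(rows, query):
        groups[tuple(r[c] for c in require_all_on)].append(r)

    kept = []
    for members in groups.values():
        # Assumption: a group is complete when each requested value of every
        # other queried column shows up at least once among its rows.
        complete = all(
            set(wanted) <= {m[col] for m in members}
            for col, wanted in query.items()
            if col not in require_all_on
        )
        if complete:
            kept.extend(members)
    return kept


default_sources = sorted({r["source_id"] for r in search_default(rows, query)})
strict_sources = sorted(
    {r["source_id"] for r in search_require_all(rows, query, ["source_id"])}
)
print(default_sources)  # model-a and model-b both have at least one matching row
print(strict_sources)   # only model-a has both variables and both experiments
```

In this toy table, model-b matches the default search because it has tos for both experiments, but it is dropped once completeness is enforced per source_id because it has no o2.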
Let’s define a query for our catalog that requests multiple variable_ids and multiple experiment_ids from the Omon table_id, all from 3 different source_ids:
# Define our query
query = dict(
variable_id=["tos", "o2"],
experiment_id=["historical", "ssp585"],
table_id=["Omon"],
source_id=["ACCESS-ESM1-5", "AWI-CM-1-1-MR", "FGOALS-f3-L"],
)
Now, let’s use this query to search for all assets in the catalog that satisfy any combination of these requests (i.e., with require_all_on=None, which is the default):
cat_subset = cat.search(**query)
cat_subset
GOOGLE-CMIP6 catalog with 0 dataset(s) from 0 asset(s):
unique | |
---|---|
activity_id | 0 |
institution_id | 0 |
source_id | 0 |
experiment_id | 0 |
member_id | 0 |
table_id | 0 |
variable_id | 0 |
grid_label | 0 |
zstore | 0 |
dcpp_init_year | 0 |
version | 0 |
derived_variable_id | 0 |
Let’s group by source_id and count unique values for a few columns:
cat_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()
experiment_id | variable_id | table_id | |
---|---|---|---|
source_id |
As you can see, the search results above include source_ids for which we only have one of the two variables, and one or two of the two experiments.
We can tell intake-esm to discard any source_id that doesn’t have both variables ["tos", "o2"] and both experiments ["historical", "ssp585"] by passing require_all_on=["source_id"] to the search method:
cat_subset = cat.search(require_all_on=["source_id"], **query)
cat_subset
GOOGLE-CMIP6 catalog with 0 dataset(s) from 0 asset(s):
unique | |
---|---|
activity_id | 0 |
institution_id | 0 |
source_id | 0 |
experiment_id | 0 |
member_id | 0 |
table_id | 0 |
variable_id | 0 |
grid_label | 0 |
zstore | 0 |
dcpp_init_year | 0 |
version | 0 |
derived_variable_id | 0 |
cat_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()
experiment_id | variable_id | table_id | |
---|---|---|---|
source_id |
Notice that with the require_all_on=["source_id"] option, the only source_id that was returned by our query was the source_id for which all of the variables and experiments were found.
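One way to double-check a result like this is to count distinct values per group and compare the counts with the number of values requested, which mirrors the groupby(...).nunique() call used earlier. The sketch below does this in plain Python on hypothetical rows; result_rows, the model-a name, and the nunique_by helper are illustrative assumptions and not part of intake-esm.

```python
from collections import defaultdict

# Hypothetical search result rows (illustrative only, not real catalog output).
result_rows = [
    {"source_id": "model-a", "experiment_id": "historical", "variable_id": "tos"},
    {"source_id": "model-a", "experiment_id": "historical", "variable_id": "o2"},
    {"source_id": "model-a", "experiment_id": "ssp585", "variable_id": "tos"},
    {"source_id": "model-a", "experiment_id": "ssp585", "variable_id": "o2"},
]


def nunique_by(rows, key, columns):
    """Count distinct values of each column within every group of `key`."""
    seen = defaultdict(lambda: {c: set() for c in columns})
    for row in rows:
        for c in columns:
            seen[row[key]][c].add(row[c])
    return {group: {c: len(v) for c, v in cols.items()} for group, cols in seen.items()}


counts = nunique_by(result_rows, "source_id", ["experiment_id", "variable_id"])

# Two experiments and two variables were requested, so every group should show 2 and 2.
expected = {"experiment_id": 2, "variable_id": 2}
incomplete = [group for group, c in counts.items() if c != expected]

print(counts)
print(incomplete)  # an empty list means every returned source_id is complete
```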
import intake_esm # just to display version information
intake_esm.show_versions()
INSTALLED VERSIONS
------------------
cftime: 1.6.2
dask: 2023.7.0
fastprogress: 1.0.3
fsspec: 2023.6.0
gcsfs: 2023.6.0
intake: 0.7.0
intake_esm: 2023.7.7.post2+dirty
netCDF4: 1.6.4
pandas: 2.0.3
requests: 2.31.0
s3fs: 2023.6.0
xarray: 2023.6.0
zarr: 2.15.0