Search and Discovery
Intake-esm provides functionality to execute queries against the catalog. This notebook provides a more in-depth treatment of the search API in intake-esm, with detailed information that you can refer to when needed.
import warnings
warnings.filterwarnings("ignore")
import intake
catalog_url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
col = intake.open_esm_datastore(catalog_url)
col
aws-cesm1-le catalog with 56 dataset(s) from 435 asset(s):
unique | |
---|---|
variable | 77 |
long_name | 74 |
component | 5 |
experiment | 4 |
frequency | 6 |
vertical_levels | 3 |
spatial_domain | 5 |
units | 25 |
start_time | 12 |
end_time | 13 |
path | 420 |
col.df.head()
variable | long_name | component | experiment | frequency | vertical_levels | spatial_domain | units | start_time | end_time | path | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | FLNS | net longwave flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.... |
1 | FLNSC | clearsky net longwave flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC... |
2 | FLUT | upwelling longwave flux at top of model | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT.... |
3 | FSNS | net solar flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS.... |
4 | FSNSC | clearsky net solar flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC... |
Exact Match Keywords
The search() method allows the user to perform a query on a catalog using keyword arguments. The keyword argument names must be the names of the columns in the catalog. By default, the search() method looks for exact matches, and is case sensitive:
col.search(experiment="20C", long_name="wind").df
variable | long_name | component | experiment | frequency | vertical_levels | spatial_domain | units | start_time | end_time | path |
---|
As you can see, the example above returns an empty catalog: no entry in the 20C experiment has a long_name that is exactly wind.
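To see why the query comes back empty, it can help to mimic the exact-match rule in plain Python. The sketch below is not intake-esm's implementation; `rows` and `exact_search` are hypothetical names introduced for illustration, and the three records are copied from catalog rows shown on this page.

```python
# Toy records copied from catalog rows shown on this page.
rows = [
    {"variable": "FLNS", "long_name": "net longwave flux at surface", "experiment": "20C"},
    {"variable": "U", "long_name": "zonal wind", "experiment": "20C"},
    {"variable": "TAUX", "long_name": "windstress in grid-x direction", "experiment": "20C"},
]


def exact_search(records, **query):
    """Hypothetical helper: keep records equal to the query value in every queried column."""
    return [
        r for r in records
        if all(r[column] == value for column, value in query.items())
    ]


# "wind" is not the full long_name of any record, so nothing matches.
no_hits = exact_search(rows, experiment="20C", long_name="wind")

# The complete long_name matches exactly one record.
one_hit = exact_search(rows, experiment="20C", long_name="zonal wind")

# Equality is case sensitive: "20c" is not "20C".
case_miss = exact_search(rows, experiment="20c", long_name="zonal wind")

print(len(no_hits), [r["variable"] for r in one_hit], len(case_miss))
```

Under this reading, an exact-match query only succeeds when the keyword value reproduces the column value in full, including its case.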
Substring Matches
In some cases, you may not know the exact term to look for. For such cases, intake-esm supports searching for substring matches. With the use of wildcards and/or regular expressions, we can find all items with a particular substring in a given column. Let’s search for:
- entries from experiment = ‘20C’
- all entries whose variable long name contains wind
col.search(experiment="20C", long_name="wind*").df
variable | long_name | component | experiment | frequency | vertical_levels | spatial_domain | units | start_time | end_time | path | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | UBOT | lowest model level zonal wind | atm | 20C | daily | 1.0 | global | m/s | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-UBOT.... |
1 | WSPDSRFAV | horizontal total wind speed average at the sur... | atm | 20C | daily | 1.0 | global | m/s | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-WSPDS... |
2 | U | zonal wind | atm | 20C | hourly6-1990-2005 | 30.0 | global | m/s | 1990-01-01 00:00:00 | 2006-01-01 00:00:00 | s3://ncar-cesm-lens/atm/hourly6-1990-2005/cesm... |
3 | U | zonal wind | atm | 20C | monthly | 30.0 | global | m/s | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-U.zarr |
4 | TAUX | windstress in grid-x direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne/centimeter^2 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
5 | TAUX2 | windstress**2 in grid-x direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne^2/centimeter^4 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
6 | TAUY | windstress in grid-y direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne/centimeter^2 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
7 | TAUY2 | windstress**2 in grid-y direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne^2/centimeter^4 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
Now, let’s search for:
- entries from experiment = ‘20C’
- all entries whose variable long name starts with wind
col.search(experiment="20C", long_name="^wind").df
variable | long_name | component | experiment | frequency | vertical_levels | spatial_domain | units | start_time | end_time | path | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | TAUX | windstress in grid-x direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne/centimeter^2 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
1 | TAUX2 | windstress**2 in grid-x direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne^2/centimeter^4 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
2 | TAUY | windstress in grid-y direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne/centimeter^2 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
3 | TAUY2 | windstress**2 in grid-y direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne^2/centimeter^4 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
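
The difference between the two patterns above can be reproduced with Python's standard `re` module. This is a sketch under the assumption that intake-esm evaluates such patterns as regular expressions searched anywhere in the column value; the helper `regex_search` is hypothetical, and the `long_name` values are copied from the results shown on this page.

```python
import re

# long_name values copied from the search results shown on this page.
long_names = [
    "lowest model level zonal wind",
    "zonal wind",
    "windstress in grid-x direction",
    "net longwave flux at surface",
]


def regex_search(values, pattern):
    """Hypothetical helper: keep values in which the pattern matches anywhere."""
    compiled = re.compile(pattern)
    return [v for v in values if compiled.search(v)]


# As a regex, "wind*" means "win" followed by zero or more "d",
# so it matches wherever that sequence occurs in the string.
contains_wind = regex_search(long_names, "wind*")

# "^" anchors the match to the start of the string.
starts_with_wind = regex_search(long_names, "^wind")

print(contains_wind)
print(starts_with_wind)
```

One consequence of the regex reading: because `d*` allows zero `d` characters, `wind*` would also match a value such as "winter mean". If that matters for a catalog, a stricter pattern may be worth considering.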
Enforce Query Criteria via require_all_on argument
By default, intake-esm’s search() method returns entries that fulfill any of the criteria specified in the query. Intake-esm can return entries that fulfill all query criteria when the user supplies the require_all_on argument. The require_all_on parameter can be a dataframe column or a list of dataframe columns across which all elements must satisfy the query criteria. The require_all_on argument is best explained with the following example.
Let’s load the pangeo-cmip6 catalog and define a query that requests multiple variable_ids and multiple experiment_ids from the Omon table_id, all from 3 different source_ids:
catalog_url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
col = intake.open_esm_datastore(catalog_url)
col
pangeo-cmip6 catalog with 7483 dataset(s) from 512699 asset(s):
unique | |
---|---|
activity_id | 18 |
institution_id | 37 |
source_id | 87 |
experiment_id | 172 |
member_id | 651 |
table_id | 38 |
variable_id | 710 |
grid_label | 11 |
zstore | 512699 |
dcpp_init_year | 60 |
version | 684 |
# Define our query
query = dict(
variable_id=["thetao", "o2"],
experiment_id=["historical", "ssp245", "ssp585"],
table_id=["Omon"],
source_id=["ACCESS-ESM1-5", "AWI-CM-1-1-MR", "FGOALS-f3-L"],
)
Now, let’s use this query to search for all assets in the collection that satisfy any combination of these requests (i.e., with require_all_on=None, which is the default):
col_subset = col.search(**query)
col_subset
pangeo-cmip6 catalog with 9 dataset(s) from 132 asset(s):
unique | |
---|---|
activity_id | 2 |
institution_id | 3 |
source_id | 3 |
experiment_id | 3 |
member_id | 30 |
table_id | 1 |
variable_id | 2 |
grid_label | 1 |
zstore | 132 |
dcpp_init_year | 0 |
version | 17 |
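
One way to read "any combination" is that a row is kept when, for each queried column, its value is one of the requested values. The sketch below expresses that reading in plain Python. It is not intake-esm's code: `match_any` is a hypothetical helper, and the rows are made-up toy data rather than real catalog entries.

```python
# Made-up toy rows for illustration; they are not real catalog entries.
rows = [
    {"source_id": "ACCESS-ESM1-5", "experiment_id": "historical", "variable_id": "thetao"},
    {"source_id": "ACCESS-ESM1-5", "experiment_id": "ssp245", "variable_id": "o2"},
    {"source_id": "AWI-CM-1-1-MR", "experiment_id": "historical", "variable_id": "thetao"},
    {"source_id": "AWI-CM-1-1-MR", "experiment_id": "historical", "variable_id": "tos"},
]

toy_query = {
    "variable_id": ["thetao", "o2"],
    "experiment_id": ["historical", "ssp245", "ssp585"],
}


def match_any(records, query):
    """Hypothetical helper: keep rows whose value is among the requested
    values for every queried column."""
    return [
        r for r in records
        if all(r[column] in wanted for column, wanted in query.items())
    ]


subset = match_any(rows, toy_query)

# The "tos" row is dropped; the other three rows each match one requested
# value per column, even though no source_id covers every combination.
print(len(subset))
```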
# Group by `source_id` and count unique values for a few columns
col_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()
source_id | experiment_id | variable_id | table_id |
---|---|---|---|
ACCESS-ESM1-5 | 3 | 2 | 1 |
AWI-CM-1-1-MR | 3 | 1 | 1 |
FGOALS-f3-L | 3 | 1 | 1 |
As the table shows, the search results above include source_ids that provide only one of the two requested variables: AWI-CM-1-1-MR and FGOALS-f3-L each have a single variable_id, even though all three experiments are present.
We can tell intake-esm to discard any source_id that doesn’t have both variables ["thetao", "o2"] and all three experiments ["historical", "ssp245", "ssp585"] by passing require_all_on=["source_id"] to the search method:
col_subset = col.search(require_all_on=["source_id"], **query)
col_subset
pangeo-cmip6 catalog with 3 dataset(s) from 117 asset(s):
unique | |
---|---|
activity_id | 2 |
institution_id | 1 |
source_id | 1 |
experiment_id | 3 |
member_id | 30 |
table_id | 1 |
variable_id | 2 |
grid_label | 1 |
zstore | 117 |
dcpp_init_year | 0 |
version | 11 |
col_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()
source_id | experiment_id | variable_id | table_id |
---|---|---|---|
ACCESS-ESM1-5 | 3 | 2 | 1 |
Notice that with the require_all_on=["source_id"] option, the only source_id that was returned by our query was the source_id for which all of the variables and experiments were found.
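The filtering step can be sketched in plain Python as "group the matching rows, then keep only the groups that contain every combination of the requested values". This is one plausible way to express the behavior described above, not necessarily how intake-esm implements it. The helper `require_all`, the model names, and the rows are all hypothetical toy data.

```python
from itertools import product

# Made-up toy rows for illustration; they are not real catalog entries.
rows = [
    {"source_id": "model-A", "experiment_id": "historical", "variable_id": "thetao"},
    {"source_id": "model-A", "experiment_id": "historical", "variable_id": "o2"},
    {"source_id": "model-A", "experiment_id": "ssp245", "variable_id": "thetao"},
    {"source_id": "model-A", "experiment_id": "ssp245", "variable_id": "o2"},
    {"source_id": "model-B", "experiment_id": "historical", "variable_id": "thetao"},
    {"source_id": "model-B", "experiment_id": "ssp245", "variable_id": "thetao"},
]

toy_query = {
    "variable_id": ["thetao", "o2"],
    "experiment_id": ["historical", "ssp245"],
}


def require_all(records, query, group_column):
    """Hypothetical helper: keep only groups that contain every combination
    of the requested values."""
    columns = list(query)
    required = set(product(*(query[c] for c in columns)))

    # Collect the requested combinations that each group actually provides.
    found = {}
    for r in records:
        combo = tuple(r[c] for c in columns)
        if combo in required:
            found.setdefault(r[group_column], set()).add(combo)

    # A group is complete only when it covers the full set of combinations.
    complete = {group for group, combos in found.items() if combos == required}
    return [
        r for r in records
        if r[group_column] in complete and tuple(r[c] for c in columns) in required
    ]


kept = require_all(rows, toy_query, "source_id")

# model-B lacks "o2", so only model-A survives.
print(sorted({r["source_id"] for r in kept}))
```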
import intake_esm # just to display version information
intake_esm.show_versions()
INSTALLED VERSIONS
------------------
cftime: 1.5.0
dask: 2021.08.0
fastprogress: 0.2.7
fsspec: 2021.07.0
gcsfs: 2021.07.0
intake: 0.6.3
intake_esm: 0.0.0
netCDF4: 1.5.7
pandas: 1.3.2
requests: 2.26.0
s3fs: 2021.07.0
xarray: 0.19.0
zarr: 2.8.3