Search and Discovery

Intake-esm provides functionality to execute queries against the catalog. This notebook provides a more in-depth treatment of the search API in intake-esm, with detailed information that you can refer to when needed.

import warnings

warnings.filterwarnings("ignore")
import intake

catalog_url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
col = intake.open_esm_datastore(catalog_url)
col

aws-cesm1-le catalog with 56 dataset(s) from 429 asset(s):

	unique
component	5
frequency	6
experiment	4
variable	73
path	414
variable_long_name	70
dim_per_tstep	3
start	12
end	13

col.df.head()

	component	frequency	experiment	variable	path	variable_long_name	dim_per_tstep	start	end
0	atm	daily	20C	FLNS	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS....	net longwave flux at surface	2.0	1920-01-01 12:00:00	2005-12-31 12:00:00
1	atm	daily	20C	FLNSC	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC...	clearsky net longwave flux at surface	2.0	1920-01-01 12:00:00	2005-12-31 12:00:00
2	atm	daily	20C	FLUT	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT....	upwelling longwave flux at top of model	2.0	1920-01-01 12:00:00	2005-12-31 12:00:00
3	atm	daily	20C	FSNS	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS....	net solar flux at surface	2.0	1920-01-01 12:00:00	2005-12-31 12:00:00
4	atm	daily	20C	FSNSC	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC...	clearsky net solar flux at surface	2.0	1920-01-01 12:00:00	2005-12-31 12:00:00

Exact Match Keywords

The search() method performs a query on a catalog using keyword arguments. Each keyword argument name must be the name of a column in the catalog. By default, search() looks for exact, case-sensitive matches:

col.search(experiment="20C", variable_long_name="wind")

aws-cesm1-le catalog with 0 dataset(s) from 0 asset(s):

	unique
component	0
frequency	0
experiment	0
variable	0
path	0
variable_long_name	0
dim_per_tstep	0
start	0
end	0

As you can see, the example above returns an empty catalog: no entry has a variable_long_name that is exactly "wind".
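The exact-match behaviour can be illustrated with plain pandas. This is a minimal sketch of the semantics, not intake-esm's implementation: the three rows are copied from the catalog listing above, and `exact_match` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Toy stand-in for the catalog table; rows copied from the listing above.
df = pd.DataFrame(
    {
        "experiment": ["20C", "20C", "20C"],
        "variable": ["FLNS", "U", "TAUX"],
        "variable_long_name": [
            "net longwave flux at surface",
            "zonal wind",
            "windstress in grid-x direction",
        ],
    }
)


def exact_match(frame, **query):
    """Hypothetical helper: keep rows whose values equal the query values exactly."""
    mask = pd.Series(True, index=frame.index)
    for column, value in query.items():
        # Plain equality: case-sensitive, no substring matching.
        mask &= frame[column] == value
    return frame[mask]


# "wind" is only a substring of the long names, so nothing matches.
no_hits = exact_match(df, experiment="20C", variable_long_name="wind")
# The full long name matches exactly one row.
one_hit = exact_match(df, experiment="20C", variable_long_name="zonal wind")
print(len(no_hits), len(one_hit))
```

Under exact matching, the query value has to equal the whole cell, which is why "wind" on its own finds nothing even though several long names contain it.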

Substring Matches

In some cases, you may not know the exact term to look for. For such cases, intake-esm supports searching for substring matches. Using wildcards and/or regular expressions, we can find all items with a particular substring in a given column. Let’s search for:

entries from experiment = ‘20C’
all entries whose variable long name contains wind

col.search(experiment="20C", variable_long_name="wind*").df

	component	frequency	experiment	variable	path	variable_long_name	dim_per_tstep	start	end
0	atm	daily	20C	UBOT	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-UBOT....	lowest model level zonal wind	2.0	1920-01-01 12:00:00	2005-12-31 12:00:00
1	atm	daily	20C	WSPDSRFAV	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-WSPDS...	horizontal total wind speed average at the sur...	2.0	1920-01-01 12:00:00	2005-12-31 12:00:00
2	atm	hourly6-1990-2005	20C	U	s3://ncar-cesm-lens/atm/hourly6-1990-2005/cesm...	zonal wind	3.0	1990-01-01 00:00:00	2006-01-01 00:00:00
3	atm	monthly	20C	U	s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-U.zarr	zonal wind	3.0	1920-01-16 12:00:00	2005-12-16 12:00:00
4	ocn	monthly	20C	TAUX	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...	windstress in grid-x direction	2.0	1920-01-16 12:00:00	2005-12-16 12:00:00
5	ocn	monthly	20C	TAUX2	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...	windstress**2 in grid-x direction	2.0	1920-01-16 12:00:00	2005-12-16 12:00:00
6	ocn	monthly	20C	TAUY	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...	windstress in grid-y direction	2.0	1920-01-16 12:00:00	2005-12-16 12:00:00
7	ocn	monthly	20C	TAUY2	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...	windstress**2 in grid-y direction	2.0	1920-01-16 12:00:00	2005-12-16 12:00:00

Now, let’s search for:

entries from experiment = ‘20C’
all entries whose variable long name starts with wind

col.search(experiment="20C", variable_long_name="^wind").df

	component	frequency	experiment	variable	path	variable_long_name	dim_per_tstep	start	end
0	ocn	monthly	20C	TAUX	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...	windstress in grid-x direction	2.0	1920-01-16 12:00:00	2005-12-16 12:00:00
1	ocn	monthly	20C	TAUX2	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...	windstress**2 in grid-x direction	2.0	1920-01-16 12:00:00	2005-12-16 12:00:00
2	ocn	monthly	20C	TAUY	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...	windstress in grid-y direction	2.0	1920-01-16 12:00:00	2005-12-16 12:00:00
3	ocn	monthly	20C	TAUY2	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...	windstress**2 in grid-y direction	2.0	1920-01-16 12:00:00	2005-12-16 12:00:00
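The difference between the two patterns above can be reproduced with pandas string matching. This is a sketch under the assumption that the patterns are interpreted as regular expressions, as the text suggests; it does not call intake-esm. The first three long names come from the tables above, and the string "winter mixed layer depth" is a made-up value used only to show a side effect of a trailing `*`.

```python
import re

import pandas as pd

long_names = pd.Series(
    [
        "lowest model level zonal wind",
        "zonal wind",
        "windstress in grid-x direction",
        "net longwave flux at surface",
    ]
)

# Unanchored pattern: matches "wind" anywhere in the string.
contains_wind = long_names[long_names.str.contains("wind", regex=True)]

# "^" anchors the pattern to the start of the string.
starts_with_wind = long_names[long_names.str.contains("^wind", regex=True)]

# As a regular expression, "wind*" means "win" followed by zero or more "d",
# so it would also match a string that merely contains "win".
loose = bool(re.search("wind*", "winter mixed layer depth"))

print(contains_wind.tolist())
print(starts_with_wind.tolist())
print(loose)
```

If only strings containing the full word are wanted, a plain "wind" pattern is the tighter choice when the value is treated as a regular expression.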

Enforce Query Criteria via the `require_all_on` Argument

By default, intake-esm’s search() method returns entries that fulfill any of the criteria specified in the query. When the user supplies the require_all_on argument, intake-esm instead returns only entries that fulfill all query criteria. The require_all_on parameter can be a dataframe column or a list of dataframe columns across which all elements must satisfy the query criteria. The require_all_on argument is best explained with the following example.

Let’s open the Pangeo CMIP6 catalog and define a query that requests multiple variable_ids and multiple experiment_ids from the Omon table_id, all from three different source_ids:

catalog_url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(catalog_url)
col

pangeo-cmip6 catalog with 6539 dataset(s) from 402033 asset(s):

	unique
activity_id	17
institution_id	35
source_id	84
experiment_id	160
member_id	549
table_id	37
variable_id	707
grid_label	10
zstore	402033
dcpp_init_year	60
version	606

# Define our query
query = dict(
    variable_id=["thetao", "o2"],
    experiment_id=["historical", "ssp245", "ssp585"],
    table_id=["Omon"],
    source_id=["ACCESS-ESM1-5", "AWI-CM-1-1-MR", "FGOALS-f3-L"],
)

Now, let’s use this query to search for all assets in the collection that satisfy any combination of these requests (i.e., with require_all_on=None, which is the default):

col_subset = col.search(**query)
col_subset

pangeo-cmip6 catalog with 8 dataset(s) from 76 asset(s):

	unique
activity_id	2
institution_id	3
source_id	3
experiment_id	3
member_id	20
table_id	1
variable_id	2
grid_label	1
zstore	76
dcpp_init_year	0
version	14

# Group by `source_id` and count unique values for a few columns
col_subset.df.groupby("source_id")[
    ["experiment_id", "variable_id", "table_id"]
].nunique()

	experiment_id	variable_id	table_id
source_id
ACCESS-ESM1-5	3	2	1
AWI-CM-1-1-MR	3	1	1
FGOALS-f3-L	2	1	1

As you can see, the search results above include source_ids that do not cover the full query: AWI-CM-1-1-MR has only one of the two variables, and FGOALS-f3-L has only one of the two variables and two of the three experiments.

We can tell intake-esm to discard any source_id that doesn’t have both variables ["thetao", "o2"] and all three experiments ["historical", "ssp245", "ssp585"] by passing require_all_on=["source_id"] to the search method:

col_subset = col.search(require_all_on=["source_id"], **query)
col_subset

pangeo-cmip6 catalog with 3 dataset(s) from 63 asset(s):

	unique
activity_id	2
institution_id	1
source_id	1
experiment_id	3
member_id	20
table_id	1
variable_id	2
grid_label	1
zstore	63
dcpp_init_year	0
version	9

col_subset.df.groupby("source_id")[
    ["experiment_id", "variable_id", "table_id"]
].nunique()

	experiment_id	variable_id	table_id
source_id
ACCESS-ESM1-5	3	2	1

Notice that with the require_all_on=["source_id"] option, the only source_id that was returned by our query was the source_id for which all of the variables and experiments were found.
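The filtering idea behind require_all_on can be sketched in plain pandas. This is an illustration of the semantics only, not intake-esm's actual code: the table is hypothetical toy data (model_a and model_b are made-up source_ids), and `require_all` is a helper defined here for the example. It assumes the frame has already been filtered to rows that match the query.

```python
import itertools

import pandas as pd

# Hypothetical toy catalog: model_a has every combination, model_b lacks "o2".
rows = [
    ("model_a", "historical", "thetao"),
    ("model_a", "historical", "o2"),
    ("model_a", "ssp585", "thetao"),
    ("model_a", "ssp585", "o2"),
    ("model_b", "historical", "thetao"),
    ("model_b", "ssp585", "thetao"),
]
df = pd.DataFrame(rows, columns=["source_id", "experiment_id", "variable_id"])

query = {
    "experiment_id": ["historical", "ssp585"],
    "variable_id": ["thetao", "o2"],
}


def require_all(frame, query, on):
    """Hypothetical helper: keep groups that contain every requested combination."""
    columns = list(query)
    # Every (experiment_id, variable_id) pair the query asks for.
    required = set(itertools.product(*query.values()))
    kept = []
    for _, group in frame.groupby(on):
        present = set(group[columns].itertuples(index=False, name=None))
        if required <= present:
            kept.append(group)
    # Return an empty frame with the same columns if no group qualifies.
    return pd.concat(kept) if kept else frame.iloc[0:0]


complete = require_all(df, query, on="source_id")
print(complete["source_id"].unique().tolist())
```

Grouping on source_id and comparing against the full set of requested combinations mirrors the outcome above, where only the source_id with all variables and experiments survives.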
