Search and Discovery¶

Intake-esm provides functionality to execute queries against the catalog. This notebook provides a more in-depth treatment of the search API in intake-esm, with detailed information that you can refer to when needed.

import warnings

warnings.filterwarnings("ignore")
import intake

catalog_url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
col = intake.open_esm_datastore(catalog_url)
col

aws-cesm1-le catalog with 56 dataset(s) from 435 asset(s):

	unique
variable	77
long_name	74
component	5
experiment	4
frequency	6
vertical_levels	3
spatial_domain	5
units	25
start_time	12
end_time	13
path	420

col.df.head()

	variable	long_name	component	experiment	frequency	vertical_levels	spatial_domain	units	start_time	end_time	path
0	FLNS	net longwave flux at surface	atm	20C	daily	1.0	global	W/m2	1920-01-01 12:00:00	2005-12-31 12:00:00	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS....
1	FLNSC	clearsky net longwave flux at surface	atm	20C	daily	1.0	global	W/m2	1920-01-01 12:00:00	2005-12-31 12:00:00	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC...
2	FLUT	upwelling longwave flux at top of model	atm	20C	daily	1.0	global	W/m2	1920-01-01 12:00:00	2005-12-31 12:00:00	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT....
3	FSNS	net solar flux at surface	atm	20C	daily	1.0	global	W/m2	1920-01-01 12:00:00	2005-12-31 12:00:00	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS....
4	FSNSC	clearsky net solar flux at surface	atm	20C	daily	1.0	global	W/m2	1920-01-01 12:00:00	2005-12-31 12:00:00	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC...

Exact Match Keywords¶

The search() method allows the user to perform a query on a catalog using keyword arguments. The keyword argument names must be the names of the columns in the catalog. By default, the search() method looks for exact matches, and is case sensitive:

col.search(experiment="20C", long_name="wind").df

	variable	long_name	component	experiment	frequency	vertical_levels	spatial_domain	units	start_time	end_time	path

As you can see, the example above returns an empty catalog: no entry has a long_name that is exactly "wind".
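To see why the exact-match query comes back empty, here is a minimal pure-Python sketch of exact, case-sensitive matching. It is only an illustration of the matching rule described above, not intake-esm's implementation; the helper name `exact_match` is made up for this example, and the long names are copied from the catalog rows displayed in this notebook.

```python
# Long names copied from catalog rows shown in this notebook.
long_names = [
    "net longwave flux at surface",
    "lowest model level zonal wind",
    "zonal wind",
    "windstress in grid-x direction",
]


def exact_match(values, term):
    """Hypothetical helper: keep values equal to `term` (case-sensitive)."""
    return [value for value in values if value == term]


no_hits = exact_match(long_names, "wind")          # nothing is exactly "wind"
one_hit = exact_match(long_names, "zonal wind")    # the full string matches
case_miss = exact_match(long_names, "Zonal Wind")  # differs only by case

print(no_hits)
print(one_hit)
print(case_miss)
```

Only the full, identically cased string counts as a match, which is why a partial term such as "wind" finds nothing.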

Substring Matches¶

In some cases, you may not know the exact term to look for. For such cases, intake-esm supports searching for substring matches. With the use of wildcards and/or regular expressions, we can find all items with a particular substring in a given column. Let’s search for:

entries from experiment = ‘20C’
all entries whose variable long name contains wind

col.search(experiment="20C", long_name="wind*").df

	variable	long_name	component	experiment	frequency	vertical_levels	spatial_domain	units	start_time	end_time	path
0	UBOT	lowest model level zonal wind	atm	20C	daily	1.0	global	m/s	1920-01-01 12:00:00	2005-12-31 12:00:00	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-UBOT....
1	WSPDSRFAV	horizontal total wind speed average at the sur...	atm	20C	daily	1.0	global	m/s	1920-01-01 12:00:00	2005-12-31 12:00:00	s3://ncar-cesm-lens/atm/daily/cesmLE-20C-WSPDS...
2	U	zonal wind	atm	20C	hourly6-1990-2005	30.0	global	m/s	1990-01-01 00:00:00	2006-01-01 00:00:00	s3://ncar-cesm-lens/atm/hourly6-1990-2005/cesm...
3	U	zonal wind	atm	20C	monthly	30.0	global	m/s	1920-01-16 12:00:00	2005-12-16 12:00:00	s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-U.zarr
4	TAUX	windstress in grid-x direction	ocn	20C	monthly	1.0	global_ocean	dyne/centimeter^2	1920-01-16 12:00:00	2005-12-16 12:00:00	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
5	TAUX2	windstress**2 in grid-x direction	ocn	20C	monthly	1.0	global_ocean	dyne^2/centimeter^4	1920-01-16 12:00:00	2005-12-16 12:00:00	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
6	TAUY	windstress in grid-y direction	ocn	20C	monthly	1.0	global_ocean	dyne/centimeter^2	1920-01-16 12:00:00	2005-12-16 12:00:00	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
7	TAUY2	windstress**2 in grid-y direction	ocn	20C	monthly	1.0	global_ocean	dyne^2/centimeter^4	1920-01-16 12:00:00	2005-12-16 12:00:00	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
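The pattern above can be explored offline with Python's standard `re` module. The sketch below assumes the pattern is applied as a regular-expression search over each value; that is an assumption made for illustration, not a statement about intake-esm internals. In standard regular-expression syntax, `wind*` means "win" followed by zero or more "d" characters, so it is slightly broader than the plain substring "wind". The entry "winter mean temperature" is a made-up value added to show the difference.

```python
import re

long_names = [
    "lowest model level zonal wind",
    "zonal wind",
    "windstress in grid-x direction",
    "windstress**2 in grid-x direction",
    "net solar flux at surface",
    "winter mean temperature",  # hypothetical value, not in the catalog
]


def regex_search(values, pattern):
    """Hypothetical helper: keep values in which `pattern` matches anywhere."""
    return [value for value in values if re.search(pattern, value)]


star_hits = regex_search(long_names, "wind*")  # "win" + zero or more "d"
plain_hits = regex_search(long_names, "wind")  # literal substring "wind"

print(star_hits)
print(plain_hits)
```

If you only want values that contain the literal text "wind", the plain pattern is the narrower choice under this regular-expression reading.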

Now, let’s search for:

entries from experiment = ‘20C’
all entries whose variable long name starts with wind

col.search(experiment="20C", long_name="^wind").df

	variable	long_name	component	experiment	frequency	vertical_levels	spatial_domain	units	start_time	end_time	path
0	TAUX	windstress in grid-x direction	ocn	20C	monthly	1.0	global_ocean	dyne/centimeter^2	1920-01-16 12:00:00	2005-12-16 12:00:00	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
1	TAUX2	windstress**2 in grid-x direction	ocn	20C	monthly	1.0	global_ocean	dyne^2/centimeter^4	1920-01-16 12:00:00	2005-12-16 12:00:00	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
2	TAUY	windstress in grid-y direction	ocn	20C	monthly	1.0	global_ocean	dyne/centimeter^2	1920-01-16 12:00:00	2005-12-16 12:00:00	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
3	TAUY2	windstress**2 in grid-y direction	ocn	20C	monthly	1.0	global_ocean	dyne^2/centimeter^4	1920-01-16 12:00:00	2005-12-16 12:00:00	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
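Anchors work the same way in Python's `re` module, so the following sketch can help when drafting a pattern before running it against a catalog. It assumes regular-expression search semantics, as above. `^` pins the match to the start of the value and `$` to the end. Some long names contain characters that are special in regular expressions, for example the `**` in "windstress**2"; `re.escape` turns such text into a literal pattern. Whether intake-esm accepts an escaped pattern in exactly this form is an assumption you should check against your installed version.

```python
import re

long_names = [
    "lowest model level zonal wind",
    "zonal wind",
    "windstress in grid-x direction",
    "windstress**2 in grid-x direction",
]

# Values that start with "wind".
starts_with = [v for v in long_names if re.search("^wind", v)]

# Values that end with "wind".
ends_with = [v for v in long_names if re.search("wind$", v)]

# Escape regex metacharacters so "**" is matched literally.
literal_pattern = re.escape("windstress**2")
squared = [v for v in long_names if re.search(literal_pattern, v)]

print(starts_with)
print(ends_with)
print(squared)
```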

Enforce Query Criteria via `require_all_on` argument¶

By default, intake-esm’s search() method returns entries that fulfill any of the criteria specified in the query. When the user supplies the require_all_on argument, intake-esm instead returns only entries that fulfill all query criteria. The require_all_on parameter can be a dataframe column or a list of dataframe columns across which all elements must satisfy the query criteria. The argument is best explained with the following example.

Let’s define a query for our collection that requests multiple variable_ids and multiple experiment_ids from the Omon table_id, all from 3 different source_ids:

catalog_url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
col = intake.open_esm_datastore(catalog_url)
col

pangeo-cmip6 catalog with 7483 dataset(s) from 512699 asset(s):

	unique
activity_id	18
institution_id	37
source_id	87
experiment_id	172
member_id	651
table_id	38
variable_id	710
grid_label	11
zstore	512699
dcpp_init_year	60
version	684

# Define our query
query = dict(
    variable_id=["thetao", "o2"],
    experiment_id=["historical", "ssp245", "ssp585"],
    table_id=["Omon"],
    source_id=["ACCESS-ESM1-5", "AWI-CM-1-1-MR", "FGOALS-f3-L"],
)

Now, let’s use this query to search for all assets in the collection that satisfy any combination of these requests (i.e., with require_all_on=None, which is the default):

col_subset = col.search(**query)
col_subset

pangeo-cmip6 catalog with 9 dataset(s) from 132 asset(s):

	unique
activity_id	2
institution_id	3
source_id	3
experiment_id	3
member_id	30
table_id	1
variable_id	2
grid_label	1
zstore	132
dcpp_init_year	0
version	17

# Group by `source_id` and count unique values for a few columns
col_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()

	experiment_id	variable_id	table_id
source_id
ACCESS-ESM1-5	3	2	1
AWI-CM-1-1-MR	3	1	1
FGOALS-f3-L	3	1	1

As you can see, the search results above include source_ids for which we only have one of the two variables: AWI-CM-1-1-MR and FGOALS-f3-L each provide a single variable_id, even though all three experiments are present.
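The default behaviour can be pictured with a small pure-Python sketch. It reads the query as "a row is kept when, for every queried column, its value is one of the requested values", which is consistent with the output above but is not taken from intake-esm's source code. The rows, the model names, and the helper `row_matches` are all made up for illustration.

```python
query = {
    "variable_id": ["thetao", "o2"],
    "experiment_id": ["historical", "ssp245", "ssp585"],
}

# Hypothetical toy catalog rows (not real pangeo-cmip6 entries).
rows = [
    {"source_id": "model-A", "variable_id": "thetao", "experiment_id": "historical"},
    {"source_id": "model-A", "variable_id": "o2", "experiment_id": "ssp585"},
    {"source_id": "model-B", "variable_id": "thetao", "experiment_id": "ssp245"},
    {"source_id": "model-B", "variable_id": "tos", "experiment_id": "historical"},
    {"source_id": "model-C", "variable_id": "o2", "experiment_id": "piControl"},
]


def row_matches(row, query):
    """Hypothetical helper: every queried column holds a requested value."""
    return all(row[column] in allowed for column, allowed in query.items())


subset = [row for row in rows if row_matches(row, query)]
sources = {row["source_id"] for row in subset}

print(len(subset))
print(sorted(sources))
```

Note that model-B stays in the result even though it has only one of the two requested variables, because each row is judged on its own.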

We can tell intake-esm to discard any source_id that doesn’t have both variables ["thetao", "o2"] and all three experiments ["historical", "ssp245", "ssp585"] by passing require_all_on=["source_id"] to the search method:

col_subset = col.search(require_all_on=["source_id"], **query)
col_subset

pangeo-cmip6 catalog with 3 dataset(s) from 117 asset(s):

	unique
activity_id	2
institution_id	1
source_id	1
experiment_id	3
member_id	30
table_id	1
variable_id	2
grid_label	1
zstore	117
dcpp_init_year	0
version	11

col_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()

	experiment_id	variable_id	table_id
source_id
ACCESS-ESM1-5	3	2	1

Notice that with the require_all_on=["source_id"] option, the only source_id that was returned by our query was the source_id for which all of the variables and experiments were found.
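One way to think about require_all_on is as a second filtering step: group the matching rows by the given column, then keep only the groups that contain every requested value for every queried column. The sketch below implements that reading in pure Python. It is a simplified illustration, not intake-esm's implementation: the function `toy_search`, the toy rows, and the model names are hypothetical, and `require_all_on` is a single column name here rather than a list.

```python
from collections import defaultdict


def toy_search(rows, query, require_all_on=None):
    """Hypothetical sketch of search() with an optional completeness check."""
    # Step 1: keep rows whose value is requested in every queried column.
    subset = [
        row for row in rows
        if all(row[column] in allowed for column, allowed in query.items())
    ]
    if require_all_on is None:
        return subset

    # Step 2: group the matches by the `require_all_on` column.
    groups = defaultdict(list)
    for row in subset:
        groups[row[require_all_on]].append(row)

    # Step 3: keep groups that cover all requested values in each column.
    kept = []
    for members in groups.values():
        complete = all(
            {member[column] for member in members} >= set(allowed)
            for column, allowed in query.items()
        )
        if complete:
            kept.extend(members)
    return kept


query = {
    "variable_id": ["thetao", "o2"],
    "experiment_id": ["historical", "ssp245", "ssp585"],
}

# Hypothetical toy rows: model-A covers everything, model-B lacks "o2".
rows = [
    {"source_id": "model-A", "variable_id": "thetao", "experiment_id": "historical"},
    {"source_id": "model-A", "variable_id": "o2", "experiment_id": "ssp245"},
    {"source_id": "model-A", "variable_id": "thetao", "experiment_id": "ssp585"},
    {"source_id": "model-B", "variable_id": "thetao", "experiment_id": "historical"},
    {"source_id": "model-B", "variable_id": "thetao", "experiment_id": "ssp245"},
    {"source_id": "model-B", "variable_id": "thetao", "experiment_id": "ssp585"},
]

default_result = toy_search(rows, query)
strict_result = toy_search(rows, query, require_all_on="source_id")

print(len(default_result))
print(sorted({row["source_id"] for row in strict_result}))
```

In this toy data, model-B is dropped by the strict search because it never provides "o2", mirroring how only ACCESS-ESM1-5 survives in the real query above.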

import intake_esm  # just to display version information

intake_esm.show_versions()

INSTALLED VERSIONS
------------------

cftime: 1.5.0
dask: 2021.08.0
fastprogress: 0.2.7
fsspec: 2021.07.0
gcsfs: 2021.07.0
intake: 0.6.3
intake_esm: 0.0.0
netCDF4: 1.5.7
pandas: 1.3.2
requests: 2.26.0
s3fs: 2021.07.0
xarray: 0.19.0
zarr: 2.8.3
