Search and Discovery

Intake-esm provides functionality to execute queries against the catalog. This notebook provides a more in-depth treatment of the search API in intake-esm, with detailed information that you can refer to when needed.

import warnings

import intake
catalog_url = ""
col = intake.open_esm_datastore(catalog_url)

aws-cesm1-le catalog with 56 dataset(s) from 429 asset(s):

component 5
frequency 6
experiment 4
variable 73
path 414
variable_long_name 70
dim_per_tstep 3
start 12
end 13
component frequency experiment variable path variable_long_name dim_per_tstep start end
0 atm daily 20C FLNS s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.... net longwave flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
1 atm daily 20C FLNSC s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC... clearsky net longwave flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
2 atm daily 20C FLUT s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT.... upwelling longwave flux at top of model 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
3 atm daily 20C FSNS s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS.... net solar flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
4 atm daily 20C FSNSC s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC... clearsky net solar flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00

Exact Match Keywords

The search() method allows the user to perform a query on a catalog using keyword arguments. The keyword argument names must be the names of the columns in the catalog. By default, the search() method looks for exact matches, and is case sensitive:

col.search(experiment="20C", variable_long_name="wind")
ValueError                                Traceback (most recent call last)
~/checkouts/ in __call__(self, obj)
    916             method = get_real_method(obj, self.print_method)
    917             if method is not None:
--> 918                 method()
    919                 return True

~/checkouts/ in _ipython_display_(self)
    535         from IPython.display import HTML, display
--> 537         contents = self._repr_html_()
    538         display(HTML(contents))

~/checkouts/ in _repr_html_(self)
    524         Mainly for IPython notebook
    525         """
--> 526         uniques = pd.DataFrame(self.nunique(), columns=['unique'])
    527         text = uniques._repr_html_()
    528         output = f'<p><strong>{self.esmcol_data["id"]} catalog with {len(self)} dataset(s) from {len(self.df)} asset(s)</strong>:</p> {text}'

~/checkouts/ in nunique(self)
    760         """
--> 762         uniques = self.unique(self.df.columns.tolist())
    763         nuniques = {}
    764         for key, val in uniques.items():

~/checkouts/ in unique(self, columns)
    819         """
--> 820         return _unique(self.df, columns)
    822     def to_dataset_dict(

~/checkouts/ in _unique(df, columns)
     19         return uniques
---> 21     x = df[columns].apply(_find_unique, result_type='reduce').to_dict()
     22     info = {}
     23     for col in x.keys():

~/checkouts/ in apply(self, func, axis, raw, result_type, args, **kwds)
   7763             kwds=kwds,
   7764         )
-> 7765         return op.get_result()
   7767     def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:

~/checkouts/ in get_result(self)
    177         # one axis empty
    178         elif not all(self.obj.shape):
--> 179             return self.apply_empty_result()
    181         # raw

~/checkouts/ in apply_empty_result(self)
    216                 r = np.nan
--> 218             return self.obj._constructor_sliced(r, index=self.agg_axis)
    219         else:
    220             return self.obj.copy()

~/checkouts/ in __init__(self, data, index, dtype, name, copy, fastpath)
    320                     if len(index) != len(data):
    321                         raise ValueError(
--> 322                             f"Length of passed values is {len(data)}, "
    323                             f"index implies {len(index)}."
    324                         )

ValueError: Length of passed values is 0, index implies 9.
<aws-cesm1-le catalog with 0 dataset(s) from 0 asset(s)>

As you can see, the example above returns an empty catalog.
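
The empty result follows from exact matching: no entry has a variable long name that is exactly "wind". A minimal sketch of that behaviour on a toy pandas DataFrame is shown here. The helper `exact_search` and the sample rows are hypothetical and only mimic the semantics described above; they are not intake-esm's implementation.

```python
import pandas as pd

# Toy stand-in for the catalog's underlying DataFrame (illustrative values only).
df = pd.DataFrame(
    {
        "experiment": ["20C", "20C", "RCP85"],
        "variable": ["U", "TAUX", "U"],
        "variable_long_name": [
            "zonal wind",
            "windstress in grid-x direction",
            "zonal wind",
        ],
    }
)


def exact_search(frame, **query):
    """Hypothetical helper: keep rows where every queried column equals a requested value."""
    mask = pd.Series(True, index=frame.index)
    for column, wanted in query.items():
        values = wanted if isinstance(wanted, list) else [wanted]
        mask &= frame[column].isin(values)  # exact, case-sensitive comparison
    return frame[mask]


# "wind" is only a substring of the long names, so an exact match finds nothing.
no_hits = exact_search(df, experiment="20C", variable_long_name="wind")
# The full long name matches one row of the 20C experiment.
hits = exact_search(df, experiment="20C", variable_long_name="zonal wind")
print(len(no_hits), len(hits))  # prints: 0 1
```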

Substring Matches

In some cases, you may not know the exact term to look for. For such cases, intake-esm supports searching for substring matches. With the use of wildcards and/or regular expressions, we can find all items with a particular substring in a given column. Let’s search for:

  • entries from experiment = ‘20C’

  • all entries whose variable long name contains wind

col.search(experiment="20C", variable_long_name="wind*").df
component frequency experiment variable path variable_long_name dim_per_tstep start end
0 atm daily 20C UBOT s3://ncar-cesm-lens/atm/daily/cesmLE-20C-UBOT.... lowest model level zonal wind 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
1 atm daily 20C WSPDSRFAV s3://ncar-cesm-lens/atm/daily/cesmLE-20C-WSPDS... horizontal total wind speed average at the sur... 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
2 atm hourly6-1990-2005 20C U s3://ncar-cesm-lens/atm/hourly6-1990-2005/cesm... zonal wind 3.0 1990-01-01 00:00:00 2006-01-01 00:00:00
3 atm monthly 20C U s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-U.zarr zonal wind 3.0 1920-01-16 12:00:00 2005-12-16 12:00:00
4 ocn monthly 20C TAUX s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress in grid-x direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
5 ocn monthly 20C TAUX2 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress**2 in grid-x direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
6 ocn monthly 20C TAUY s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress in grid-y direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
7 ocn monthly 20C TAUY2 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress**2 in grid-y direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00

Now, let’s search for:

  • entries from experiment = ‘20C’

  • all entries whose variable long name starts with wind

col.search(experiment="20C", variable_long_name="^wind").df
component frequency experiment variable path variable_long_name dim_per_tstep start end
0 ocn monthly 20C TAUX s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress in grid-x direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
1 ocn monthly 20C TAUX2 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress**2 in grid-x direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
2 ocn monthly 20C TAUY s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress in grid-y direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
3 ocn monthly 20C TAUY2 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress**2 in grid-y direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
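
The difference between the two patterns can be reproduced with pandas' `Series.str.contains`, which is used here as an assumed stand-in for the matching that search() performs; the actual implementation may differ. The sample long names are taken from the tables above.

```python
import pandas as pd

long_names = pd.Series(
    [
        "zonal wind",
        "windstress in grid-x direction",
        "net solar flux at surface",
    ]
)

# Unanchored pattern: matches "wind" anywhere in the string.
contains_wind = long_names[long_names.str.contains("wind", regex=True)]

# "^" anchors the pattern to the start of the string.
starts_with_wind = long_names[long_names.str.contains("^wind", regex=True)]

# Read as a regular expression, "wind*" means "win" followed by zero or more "d".
# On this toy series it selects the same rows as the plain "wind" pattern.
wildcard = long_names[long_names.str.contains("wind*", regex=True)]

print(contains_wind.tolist())
print(starts_with_wind.tolist())
```

Anchoring with `^` is what drops "zonal wind" from the second result, since there the substring appears at the end rather than the start.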

Enforce Query Criteria via require_all_on argument

By default intake-esm’s search() method returns entries that fulfill any of the criteria specified in the query. Intake-esm can return entries that fulfill all query criteria when the user supplies the require_all_on argument. The require_all_on parameter can be a dataframe column or a list of dataframe columns across which all elements must satisfy the query criteria. The require_all_on argument is best explained with the following example.

Let’s define a query for our collection that requests multiple variable_ids and multiple experiment_ids from the Omon table_id, all from 3 different source_ids:

catalog_url = (
col = intake.open_esm_datastore(catalog_url)

pangeo-cmip6 catalog with 6539 dataset(s) from 402033 asset(s):

activity_id 17
institution_id 35
source_id 84
experiment_id 160
member_id 549
table_id 37
variable_id 707
grid_label 10
zstore 402033
dcpp_init_year 60
version 802
# Define our query
query = dict(
    variable_id=["thetao", "o2"],
    experiment_id=["historical", "ssp245", "ssp585"],
    table_id=["Omon"],
    source_id=["ACCESS-ESM1-5", "AWI-CM-1-1-MR", "FGOALS-f3-L"],
)

Now, let’s use this query to search for all assets in the collection that satisfy any combination of these requests (i.e., with require_all_on=None, which is the default):

col_subset = col.search(**query)
col_subset

pangeo-cmip6 catalog with 8 dataset(s) from 76 asset(s):

activity_id 2
institution_id 3
source_id 3
experiment_id 3
member_id 20
table_id 1
variable_id 2
grid_label 1
zstore 76
dcpp_init_year 0
version 15
# Group by `source_id` and count unique values for a few columns
col_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()
experiment_id variable_id table_id
ACCESS-ESM1-5 3 2 1
AWI-CM-1-1-MR 3 1 1
FGOALS-f3-L 2 1 1

As you can see, the search results above include source_ids for which we only have one of the two variables, and one or two of the three experiments.

We can tell intake-esm to discard any source_id that doesn’t have both variables ["thetao", "o2"] and all three experiments ["historical", "ssp245", "ssp585"] by passing require_all_on=["source_id"] to the search method:

col_subset = col.search(require_all_on=["source_id"], **query)
col_subset

pangeo-cmip6 catalog with 3 dataset(s) from 63 asset(s):

activity_id 2
institution_id 1
source_id 1
experiment_id 3
member_id 20
table_id 1
variable_id 2
grid_label 1
zstore 63
dcpp_init_year 0
version 9
col_subset.df.groupby("source_id")[["experiment_id", "variable_id", "table_id"]].nunique()
experiment_id variable_id table_id
ACCESS-ESM1-5 3 2 1

Notice that with the require_all_on=["source_id"] option, the only source_id that was returned by our query was the source_id for which all of the variables and experiments were found.
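
The filtering idea behind require_all_on can be sketched with a pandas groupby. This is an illustration under assumptions, not intake-esm's code: the helper `require_all` is hypothetical, and the toy rows only mimic the counts in the tables above (which experiments FGOALS-f3-L provides is made up for the example).

```python
import itertools

import pandas as pd

experiments = ["historical", "ssp245", "ssp585"]
variables = ["thetao", "o2"]

# Toy catalog: only ACCESS-ESM1-5 has every experiment and both variables.
rows = [("ACCESS-ESM1-5", e, v) for e, v in itertools.product(experiments, variables)]
rows += [("AWI-CM-1-1-MR", e, "thetao") for e in experiments]
rows += [("FGOALS-f3-L", e, "thetao") for e in ["historical", "ssp585"]]
df = pd.DataFrame(rows, columns=["source_id", "experiment_id", "variable_id"])

query = {"experiment_id": experiments, "variable_id": variables}


def require_all(frame, query, on):
    """Hypothetical helper: keep groups (defined by `on`) that contain every requested value."""

    def is_complete(group):
        # A group passes only if each queried column covers all requested values.
        return all(set(wanted) <= set(group[column]) for column, wanted in query.items())

    return frame.groupby(on).filter(is_complete)


complete = require_all(df, query, on=["source_id"])
print(complete["source_id"].unique().tolist())  # prints: ['ACCESS-ESM1-5']
```

Passing more than one column in `on` would make the completeness check apply to each combination of those columns instead of to each source_id alone.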