Filter a catalog by substring and/or regular expression
Exact match keywords
import intake
url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
cat = intake.open_esm_datastore(url)
cat
aws-cesm1-le catalog with 56 dataset(s) from 442 asset(s):
| | unique |
|---|---|
| variable | 79 |
| long_name | 76 |
| component | 5 |
| experiment | 4 |
| frequency | 6 |
| vertical_levels | 4 |
| spatial_domain | 5 |
| units | 26 |
| start_time | 13 |
| end_time | 14 |
| path | 427 |
| derived_variable | 0 |
cat.df.head()
| | variable | long_name | component | experiment | frequency | vertical_levels | spatial_domain | units | start_time | end_time | path |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FLNS | net longwave flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.... |
| 1 | FLNSC | clearsky net longwave flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC... |
| 2 | FLUT | upwelling longwave flux at top of model | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT.... |
| 3 | FSNS | net solar flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS.... |
| 4 | FSNSC | clearsky net solar flux at surface | atm | 20C | daily | 1.0 | global | W/m2 | 1920-01-01 12:00:00 | 2005-12-31 12:00:00 | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC... |
By default, the search() method looks for exact matches and is case sensitive:
cat.search(experiment="20C", long_name="wind")
aws-cesm1-le catalog with 0 dataset(s) from 0 asset(s):
| | unique |
|---|---|
| variable | 0 |
| long_name | 0 |
| component | 0 |
| experiment | 0 |
| frequency | 0 |
| vertical_levels | 0 |
| spatial_domain | 0 |
| units | 0 |
| start_time | 0 |
| end_time | 0 |
| path | 0 |
| derived_variable | 0 |
The example above returns an empty catalog because no entry has a long_name that is exactly wind.
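To make the exact-match behaviour concrete, here is a minimal sketch in plain Python. It is not how intake-esm is implemented; it only mimics the comparison. The list `long_names` is a hypothetical stand-in for the catalog's long_name column, filled with values copied from the table above.

```python
# Hypothetical stand-in for the long_name column; the values are copied
# from rows shown in the cat.df.head() output above.
long_names = [
    "net longwave flux at surface",
    "windstress in grid-x direction",
    "windstress in grid-y direction",
]

# Exact, case-sensitive comparison: the whole value must equal the keyword.
exact_hits = [name for name in long_names if name == "wind"]
full_hits = [name for name in long_names if name == "windstress in grid-x direction"]
case_hits = [name for name in long_names if name == "Windstress in grid-x direction"]

print(exact_hits)  # []
print(full_hits)   # ['windstress in grid-x direction']
print(case_hits)   # []
```

Only a keyword equal to the full stored value, with the same capitalisation, produces a hit.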
Substring matches
In some cases, you may not know the exact term to look for. For such cases, intake-esm supports searching for substring matches. Using wildcards and/or regular expressions, we can find all items with a particular substring in a given column. Let’s search for:
- entries from experiment = ‘20C’
- all entries whose variable long name contains wind
cat.search(experiment="20C", long_name="wind*")
aws-cesm1-le catalog with 4 dataset(s) from 10 asset(s):
| | unique |
|---|---|
| variable | 8 |
| long_name | 8 |
| component | 2 |
| experiment | 1 |
| frequency | 3 |
| vertical_levels | 2 |
| spatial_domain | 2 |
| units | 3 |
| start_time | 3 |
| end_time | 3 |
| path | 10 |
| derived_variable | 0 |
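The sketch below shows how a pattern such as wind* behaves under Python's `re` module, assuming the search value is interpreted as a regular expression rather than a shell-style glob. It does not call intake-esm, and the strings marked hypothetical are illustrative and not taken from the catalog.

```python
import re

long_names = [
    "zonal wind",                      # hypothetical entry
    "windstress in grid-x direction",  # from the table on this page
    "net longwave flux at surface",    # from the table on this page
]

# In regular-expression syntax, "*" repeats the preceding "d" zero or more
# times, so this pattern means "win" followed by any number of "d".
pattern = re.compile(r"wind*")

# re.search looks for the pattern anywhere in the string (substring match).
contains = [name for name in long_names if pattern.search(name)]
print(contains)  # ['zonal wind', 'windstress in grid-x direction']

# Because the "d" is optional, a string containing only "win" also matches.
print(bool(pattern.search("winter mean")))  # True (hypothetical string)
```

Under that assumption, the pattern is slightly broader than "contains wind". A plain `wind` passed to `re.search` would express the substring test exactly.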
Now, let’s search for:
- entries from experiment = ‘20C’
- all entries whose variable long name starts with wind
cat_subset = cat.search(experiment="20C", long_name="^wind")
cat_subset
aws-cesm1-le catalog with 1 dataset(s) from 4 asset(s):
| | unique |
|---|---|
| variable | 4 |
| long_name | 4 |
| component | 1 |
| experiment | 1 |
| frequency | 1 |
| vertical_levels | 1 |
| spatial_domain | 1 |
| units | 2 |
| start_time | 1 |
| end_time | 1 |
| path | 4 |
| derived_variable | 0 |
cat_subset.df
| | variable | long_name | component | experiment | frequency | vertical_levels | spatial_domain | units | start_time | end_time | path |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TAUX | windstress in grid-x direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne/centimeter^2 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
| 1 | TAUX2 | windstress**2 in grid-x direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne^2/centimeter^4 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
| 2 | TAUY | windstress in grid-y direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne/centimeter^2 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
| 3 | TAUY2 | windstress**2 in grid-y direction | ocn | 20C | monthly | 1.0 | global_ocean | dyne^2/centimeter^4 | 1920-01-16 12:00:00 | 2005-12-16 12:00:00 | s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... |
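The effect of the ^ anchor can be checked with Python's `re` module alone. This is a small sketch, not a call into intake-esm. The first two long names come from the table above, and the third is a hypothetical entry added to show a non-match.

```python
import re

long_names = [
    "windstress in grid-x direction",     # from the table above
    "windstress**2 in grid-x direction",  # from the table above
    "zonal wind",                         # hypothetical entry
]

# "^" anchors the pattern to the start of the string, so only long names
# that begin with "wind" are kept.
starts_with_wind = [name for name in long_names if re.search(r"^wind", name)]

print(starts_with_wind)
# ['windstress in grid-x direction', 'windstress**2 in grid-x direction']
```

The hypothetical "zonal wind" contains the substring but does not start with it, so the anchored pattern excludes it.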