Filter a catalog by substring and/or regular expression

Exact match keywords

import intake

url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
cat = intake.open_esm_datastore(url)
cat

aws-cesm1-le catalog with 56 dataset(s) from 442 asset(s):

unique
variable 78
long_name 75
component 5
experiment 4
frequency 6
vertical_levels 3
spatial_domain 5
units 25
start_time 12
end_time 13
path 427
derived_variable 0

cat.df.head()

   variable  long_name                                component  experiment  frequency  vertical_levels  spatial_domain  units  start_time           end_time             path
0  FLNS      net longwave flux at surface             atm        20C         daily      1.0              global          W/m2   1920-01-01 12:00:00  2005-12-31 12:00:00  s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS....
1  FLNSC     clearsky net longwave flux at surface    atm        20C         daily      1.0              global          W/m2   1920-01-01 12:00:00  2005-12-31 12:00:00  s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC...
2  FLUT      upwelling longwave flux at top of model  atm        20C         daily      1.0              global          W/m2   1920-01-01 12:00:00  2005-12-31 12:00:00  s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT....
3  FSNS      net solar flux at surface                atm        20C         daily      1.0              global          W/m2   1920-01-01 12:00:00  2005-12-31 12:00:00  s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS....
4  FSNSC     clearsky net solar flux at surface       atm        20C         daily      1.0              global          W/m2   1920-01-01 12:00:00  2005-12-31 12:00:00  s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC...

By default, the search() method looks for exact matches, and is case sensitive:

cat.search(experiment="20C", long_name="wind")

aws-cesm1-le catalog with 0 dataset(s) from 0 asset(s):

unique
variable 0
long_name 0
component 0
experiment 0
frequency 0
vertical_levels 0
spatial_domain 0
units 0
start_time 0
end_time 0
path 0
derived_variable 0

The query above returns an empty catalog: no 20C entry has a long_name that is exactly "wind".
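To see why the query comes back empty, the exact-match rule can be imitated on a small pandas DataFrame. This is only a sketch of the behaviour described above, not intake-esm's implementation; the three-row table `toy` is a hypothetical stand-in for cat.df, built from long names that appear on this page.

```python
import pandas as pd

# Hypothetical stand-in for cat.df (not the real catalog).
toy = pd.DataFrame(
    {
        "variable": ["FLNS", "TAUX", "TAUY"],
        "long_name": [
            "net longwave flux at surface",
            "windstress in grid-x direction",
            "windstress in grid-y direction",
        ],
        "experiment": ["20C", "20C", "20C"],
    }
)

# Exact, case-sensitive comparison: the whole cell must equal the query string.
exact_wind = toy[(toy["experiment"] == "20C") & (toy["long_name"] == "wind")]
exact_full = toy[toy["long_name"] == "windstress in grid-x direction"]
wrong_case = toy[toy["long_name"] == "Windstress in grid-x direction"]

print(len(exact_wind), len(exact_full), len(wrong_case))
```

Only the query that spells out the full long name, in the same case, selects a row; "wind" on its own and the capitalised variant select nothing.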

Substring matches

In some cases, you may not know the exact term to look for. For such cases, intake-esm supports searching for substring matches. With wildcards and/or regular expressions, we can find all items with a particular substring in a given column. Let’s search for:

  • entries from experiment = ‘20C’

  • all entries whose variable long name contains wind

cat.search(experiment="20C", long_name="wind*")

aws-cesm1-le catalog with 4 dataset(s) from 10 asset(s):

unique
variable 8
long_name 8
component 2
experiment 1
frequency 3
vertical_levels 2
spatial_domain 2
units 3
start_time 3
end_time 3
path 10
derived_variable 0
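The sketch below imitates a substring search with pandas, assuming the pattern is treated as a regular expression that may match anywhere in the cell (the semantics of `Series.str.contains`; whether intake-esm makes exactly this call internally is an assumption here). The last two long names are hypothetical entries added for illustration.

```python
import pandas as pd

long_names = pd.Series(
    [
        "net longwave flux at surface",
        "windstress in grid-x direction",
        "zonal wind",               # hypothetical entry
        "winter mean temperature",  # hypothetical entry
    ]
)

# Plain substring: "wind" may appear anywhere in the cell.
contains_wind = long_names[long_names.str.contains("wind", regex=True)]

# As a regular expression, "wind*" means "win" followed by zero or more "d".
star_pattern = long_names[long_names.str.contains("wind*", regex=True)]

# ".*wind.*" requires the full substring "wind", with anything around it.
dot_star = long_names[long_names.str.contains(".*wind.*", regex=True)]

print(contains_wind.tolist())
print(star_pattern.tolist())
print(dot_star.tolist())
```

Under regular-expression rules the second pattern also selects the hypothetical "winter" row, because the trailing `*` makes the "d" optional. The first and third patterns select only the rows that contain the full word.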

Now, let’s search for:

  • entries from experiment = ‘20C’

  • all entries whose variable long name starts with wind

cat_subset = cat.search(experiment="20C", long_name="^wind")
cat_subset

aws-cesm1-le catalog with 1 dataset(s) from 4 asset(s):

unique
variable 4
long_name 4
component 1
experiment 1
frequency 1
vertical_levels 1
spatial_domain 1
units 2
start_time 1
end_time 1
path 4
derived_variable 0

cat_subset.df

   variable  long_name                          component  experiment  frequency  vertical_levels  spatial_domain  units                start_time           end_time             path
0  TAUX      windstress in grid-x direction     ocn        20C         monthly    1.0              global_ocean    dyne/centimeter^2    1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
1  TAUX2     windstress**2 in grid-x direction  ocn        20C         monthly    1.0              global_ocean    dyne^2/centimeter^4  1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
2  TAUY      windstress in grid-y direction     ocn        20C         monthly    1.0              global_ocean    dyne/centimeter^2    1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
3  TAUY2     windstress**2 in grid-y direction  ocn        20C         monthly    1.0              global_ocean    dyne^2/centimeter^4  1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU...
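The effect of the `^` anchor can be reproduced on a small pandas Series. This is a sketch using `Series.str.contains` as a stand-in for the catalog search, not intake-esm's own code. The "zonal wind" entry is hypothetical, and the escaped pattern is an extra illustration of plain regular-expression syntax rather than something shown in the query above.

```python
import pandas as pd

long_names = pd.Series(
    [
        "windstress in grid-x direction",
        "windstress**2 in grid-x direction",
        "zonal wind",  # hypothetical entry: contains "wind" but does not start with it
    ]
)

# "^" anchors the pattern to the start of the string.
starts_with_wind = long_names[long_names.str.contains("^wind", regex=True)]

# "*" is a regex metacharacter, so a literal "**2" has to be escaped.
squared_only = long_names[long_names.str.contains(r"^windstress\*\*2", regex=True)]

print(starts_with_wind.tolist())
print(squared_only.tolist())
```

The anchored pattern drops the "zonal wind" row, which a plain substring search would keep. The escaped pattern keeps only the squared windstress entry.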

import intake_esm  # just to display version information
intake_esm.show_versions()

INSTALLED VERSIONS
------------------

cftime: 1.6.2
dask: 2022.6.1
fastprogress: 1.0.3
fsspec: 2022.8.2
gcsfs: 2022.8.2
intake: 0.6.6
intake_esm: 2022.9.18.post4+dirty
netCDF4: 1.6.1
pandas: 1.5.0
requests: 2.28.1
s3fs: 2022.8.2
xarray: 2022.6.0
zarr: 2.12.0