Working with multi-variable assets

In addition to catalogs of data assets (files) in time-series (single-variable) format, intake-esm supports catalogs with data assets in time-slice (history) format and/or files with multiple variables. For intake-esm to properly work with multi-variable assets,

  • the variable_column of the catalog must contain iterables (list, tuple, set) of values.

  • the user must specify a dictionary of functions for converting values in certain columns into iterables. This is done via the csv_kwargs argument.
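As a rough illustration of what such a dictionary of converter functions does, the sketch below uses only the standard library to apply a per-column converter to each row of a small in-memory CSV. The helper name read_with_converters and the two-row sample are hypothetical. It also assumes that csv_kwargs is forwarded to the underlying CSV reader (pandas.read_csv accepts a converters dictionary of this shape).

```python
import ast
import csv
import io

# Hypothetical two-row catalog; the variable column holds a serialized list.
CSV_TEXT = '''variable,path
"['SHF', 'O2']",a.nc
"['SHF', 'PO4']",b.nc
'''


def read_with_converters(text, converters):
    """Parse CSV text and apply converters[column] to each cell of that column."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, convert in converters.items():
            row[column] = convert(row[column])
        rows.append(row)
    return rows


rows = read_with_converters(CSV_TEXT, {"variable": ast.literal_eval})
for row in rows:
    # Each variable cell is now a real list instead of a string.
    print(type(row["variable"]).__name__, row["variable"], row["path"])
```

Columns that are not named in the dictionary, such as path here, are left as plain strings.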

In the example below, we use the following catalog to demonstrate how to work with multi-variable assets:

# Look at the catalog on disk
!cat multi-variable-catalog.csv
experiment,case,component,stream,variable,member_id,path,time_range
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'TEMP', 'SiO3']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-TEMP-SiO3.050001-050012.nc,050001-050012

As you can see, the variable column contains a list of variables, and this list was serialized as a string: "['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']".
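To see what that serialized string becomes once it is parsed, here is a minimal sketch using ast.literal_eval from the standard library. The string is copied from the catalog above. The rejected flag is only an illustration that literal_eval refuses anything other than plain Python literals.

```python
import ast

serialized = "['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']"

# Without conversion, the cell is just a string of characters.
print(type(serialized).__name__)

# literal_eval parses Python literals only (lists, tuples, strings, numbers, ...).
variables = ast.literal_eval(serialized)
print(type(variables).__name__, variables)

# Anything that is not a plain literal, such as a function call, is rejected.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print("call rejected:", rejected)
```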

Loading a catalog

To load a catalog with multi-variable files, we must pass additional information to open_esm_datastore via the csv_kwargs argument. We specify a dictionary of functions that convert the values in the variable column into iterables. Here we use the literal_eval function from the standard ast module:

import intake
import ast
col = intake.open_esm_datastore(
    "multi-variable-collection.json",
    csv_kwargs={"converters": {"variable": ast.literal_eval}},
)
col

sample-multi-variable-cesm1-lens catalog with 1 dataset(s) from 5 asset(s):

unique
experiment 1
case 1
component 1
stream 1
variable 10
member_id 1
path 5
time_range 2

col.df.head()
experiment case component stream variable member_id path time_range
0 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012
1 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) 5 ../../../tests/sample_data/cesm-multi-variable... 050101-050112
2 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, PO4) 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012
3 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, PO4) 5 ../../../tests/sample_data/cesm-multi-variable... 050101-050112
4 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, TEMP, SiO3) 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012
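
The summary reports 10 unique values for variable even though only three distinct tuples appear in the catalog. That is consistent with counting unique variable names across all rows rather than unique tuples. The sketch below reproduces both counts from the tuples shown in the table. It is an illustration on plain Python data, not intake-esm's internal code.

```python
# Variable tuples for the five rows shown in the table.
common = ("SHF", "REGION_MASK", "ANGLE", "DXU", "KMT")
rows = [
    common + ("NO2", "O2"),
    common + ("NO2", "O2"),
    common + ("NO2", "PO4"),
    common + ("NO2", "PO4"),
    common + ("TEMP", "SiO3"),
]

# Flatten every tuple into one set to count distinct variable names.
unique_names = {name for variables in rows for name in variables}

# Counting distinct tuples instead gives a different, smaller number.
unique_tuples = set(rows)

print(len(unique_names), len(unique_tuples))
```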

In the in-memory representation of the catalog, each entry of the variable column is a tuple of values. To confirm that intake-esm has registered this catalog as containing multi-variable assets, we can check the ._multiple_variable_assets property:

col._multiple_variable_assets
True

Searching

The search functionality works the same way as for single-variable catalogs:

col_subset = col.search(variable=["O2", "SiO3"])
col_subset.df
experiment case component stream variable member_id path time_range
0 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012
1 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) 5 ../../../tests/sample_data/cesm-multi-variable... 050101-050112
2 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, TEMP, SiO3) 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012
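
Judging by the result above, a row is returned when its variable tuple contains at least one of the requested names: the two PO4 rows are dropped, while rows containing O2 or SiO3 are kept. The sketch below reproduces that matching rule on plain Python data. The function name matches_any is hypothetical, and this is not intake-esm's actual search implementation.

```python
def matches_any(row_variables, requested):
    """Return True if the row holds at least one of the requested variables."""
    return not set(row_variables).isdisjoint(requested)


# Variable tuples and time ranges copied from the catalog shown earlier.
common = ("SHF", "REGION_MASK", "ANGLE", "DXU", "KMT")
catalog = [
    {"variable": common + ("NO2", "O2"), "time_range": "050001-050012"},
    {"variable": common + ("NO2", "O2"), "time_range": "050101-050112"},
    {"variable": common + ("NO2", "PO4"), "time_range": "050001-050012"},
    {"variable": common + ("NO2", "PO4"), "time_range": "050101-050112"},
    {"variable": common + ("TEMP", "SiO3"), "time_range": "050001-050012"},
]

requested = ["O2", "SiO3"]
subset = [row for row in catalog if matches_any(row["variable"], requested)]

for row in subset:
    print(row["variable"][-1], row["time_range"])
```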

Loading assets into xarray datasets

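Loading data assets into xarray datasets also works the same way. In the output below, the resulting dataset contains only the requested variables (O2 and SiO3) as data variables: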

col_subset.to_dataset_dict(cdf_kwargs={})
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.stream'
100.00% [1/1 00:00<00:00]
{'ocn.CTRL.pop.h': <xarray.Dataset>
 Dimensions:    (member_id: 1, nlat: 2, nlon: 2, time: 24)
 Coordinates:
   * time       (time) object 0500-02-01 00:00:00 ... 0502-02-01 00:00:00
     TLAT       (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
     TLONG      (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
     ULAT       (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
     ULONG      (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
   * member_id  (member_id) int64 5
 Dimensions without coordinates: nlat, nlon
 Data variables:
     O2         (member_id, time, nlat, nlon) float32 dask.array<chunksize=(1, 12, 2, 2), meta=np.ndarray>
     SiO3       (member_id, time, nlat, nlon) float32 dask.array<chunksize=(1, 24, 2, 2), meta=np.ndarray>
 Attributes:
     calendar:                  All years have exactly  365 days.
     Conventions:               CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netc...
     tavg_sum:                  2678400.0
     nco_openmp_thread_number:  1
     contents:                  Diagnostic and Prognostic Variables
     cell_methods:              cell_methods = time: mean ==> the variable val...
     NCO:                       4.3.4
     start_time:                This dataset was created on 2013-05-28 at 02:4...
     tavg_sum_qflux:            2678400.0
     revision:                  $Id: tavg.F90 41939 2012-11-14 16:37:23Z mlevy...
     intake_esm_varname:        O2\nSiO3
     source:                    CCSM POP2, the CCSM Ocean Component
     history:                   Fri Oct 11 01:05:51 2013: /glade/apps/opt/nco/...
     title:                     b.e11.B1850C5CN.f09_g16.005
     nsteps_total:              1953500
     intake_esm_dataset_key:    ocn.CTRL.pop.h}
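
The dataset key 'ocn.CTRL.pop.h' follows from the pattern printed above: the values of component, experiment, and stream are joined with a dot. Here is a minimal sketch of that construction, assuming the grouping columns are exactly those three (the actual list is defined in the collection JSON file, which is not shown here). The helper name dataset_key is hypothetical.

```python
from collections import defaultdict

# Assumed grouping columns, taken from the key pattern printed in the output.
GROUPBY_COLUMNS = ["component", "experiment", "stream"]


def dataset_key(row, columns=GROUPBY_COLUMNS, sep="."):
    """Join the values of the grouping columns into a single key."""
    return sep.join(row[column] for column in columns)


# The three assets returned by the search share the same grouping values.
assets = [
    {"component": "ocn", "experiment": "CTRL", "stream": "pop.h", "time_range": "050001-050012"},
    {"component": "ocn", "experiment": "CTRL", "stream": "pop.h", "time_range": "050101-050112"},
    {"component": "ocn", "experiment": "CTRL", "stream": "pop.h", "time_range": "050001-050012"},
]

# Assets with the same key end up in the same dataset.
groups = defaultdict(list)
for asset in assets:
    groups[dataset_key(asset)].append(asset)

for key, members in groups.items():
    print(key, len(members))
```

Because all three assets map to the same key, they are combined into the single dataset shown in the output.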