Use catalogs with assets containing multiple variables#

By default, intake-esm assumes that each data asset (file) contains a single variable (e.g. temperature or precipitation). If your data files contain multiple variables, intake-esm requires the following:

  • the variable_column of the catalog must contain iterables (list, tuple, set) of values (e.g. ['temperature', 'precipitation']).

  • the user must provide converters with appropriate functions for parsing values in the variable_column (and/or any other column with iterables) into iterables when loading the catalog. There are two ways to do this with the open_esm_datastore function: either pass the converter functions directly through the read_csv_kwargs argument, or list the columns in the columns_with_iterables parameter. The latter is a shortcut for the former. Both are demonstrated below.
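The relationship between the two options can be sketched in plain Python. The helper `expand_columns_with_iterables` below is a hypothetical name used only for illustration; it is not part of intake-esm, and it assumes the shortcut simply maps each listed column to `ast.literal_eval`.

```python
import ast


def expand_columns_with_iterables(columns, read_csv_kwargs=None):
    """Hypothetical helper: turn column names into pandas-style converters."""
    kwargs = dict(read_csv_kwargs or {})
    converters = dict(kwargs.get("converters", {}))
    for column in columns:
        # Assumption: each listed column is parsed with ast.literal_eval,
        # unless a converter for that column was already supplied.
        converters.setdefault(column, ast.literal_eval)
    kwargs["converters"] = converters
    return kwargs


shortcut = expand_columns_with_iterables(["variable"])
explicit = {"converters": {"variable": ast.literal_eval}}
print(shortcut == explicit)
```

Under that assumption, listing a column name produces the same keyword arguments as writing the converter out by hand.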

Inspect the catalog#

In the example below, we are going to use the following catalog to demonstrate how to work with multi-variable assets:

# Look at the catalog on disk
!cat multi-variable-catalog.csv
experiment,case,component,stream,variable,member_id,path,time_range
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'TEMP', 'SiO3']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-TEMP-SiO3.050001-050012.nc,050001-050012

As you can see, the variable column contains a list of variables, and this list was serialized as a string: "['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']".
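A minimal sketch of what a converter has to do with such a value: `ast.literal_eval` from the standard library safely evaluates the string as a Python literal and returns a real list. The variable names `raw` and `parsed` are illustrative only.

```python
import ast

# The value exactly as it is stored in the CSV file: a plain string
raw = "['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']"

# literal_eval parses Python literals only; it does not execute arbitrary code
parsed = ast.literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[0], len(parsed))
```

Without a converter, the column would stay a string, and membership tests such as `"O2" in value` would do substring matching instead of element matching.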

Load the catalog#

import intake
import ast
import dask

# Make sure this is single-threaded
dask.config.set(scheduler='single-threaded')

cat = intake.open_esm_datastore(
    "multi-variable-catalog.json",
    read_csv_kwargs={"converters": {"variable": ast.literal_eval}},
)
cat

sample-multi-variable-cesm1-lens catalog with 1 dataset(s) from 5 asset(s):

unique
experiment 1
case 1
component 1
stream 1
variable 10
member_id 1
path 5
time_range 2
derived_variable 0

To confirm that intake-esm has loaded the catalog correctly, we can inspect the .has_multiple_variable_assets property:

cat.esmcat.has_multiple_variable_assets
True

Alternatively, we can specify the variable column name in the columns_with_iterables parameter:

cat = intake.open_esm_datastore(
    "multi-variable-catalog.json",
    columns_with_iterables=["variable"],
)
cat.esmcat.has_multiple_variable_assets
True

Search for datasets#

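The search functionality works the same way as it does for single-variable catalogs. In the result below, an asset is returned when its variable list contains at least one of the requested values: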

cat_subset = cat.search(variable=["O2", "SiO3"])
cat_subset.df
experiment case component stream variable member_id path time_range
0 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h [SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2] 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012
1 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h [SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2] 5 ../../../tests/sample_data/cesm-multi-variable... 050101-050112
2 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h [SHF, REGION_MASK, ANGLE, DXU, KMT, TEMP, SiO3] 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012
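The matching behaviour seen in this result can be imitated in plain Python. This is only a sketch of the idea, not intake-esm's implementation: the `matches_any` helper is hypothetical, and the rows reproduce the variable lists of the five assets in the catalog above.

```python
# Variables shared by every asset in the sample catalog
common = ["SHF", "REGION_MASK", "ANGLE", "DXU", "KMT"]

# One variable list per catalog row, in file order
rows = [
    common + ["NO2", "O2"],
    common + ["NO2", "O2"],
    common + ["NO2", "PO4"],
    common + ["NO2", "PO4"],
    common + ["TEMP", "SiO3"],
]


def matches_any(row_values, requested):
    """Hypothetical helper: True if the row holds at least one requested value."""
    return any(value in row_values for value in requested)


requested = ["O2", "SiO3"]
selected = [i for i, values in enumerate(rows) if matches_any(values, requested)]
print(selected)
```

The two assets that hold PO4 but neither O2 nor SiO3 are left out, which agrees with the three rows shown in `cat_subset.df`.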

Load assets into xarray datasets#

When loading the data files into xarray datasets, intake-esm loads only the data variables that were requested. For example, if a data file contains ten data variables and the user requests two of them, intake-esm will load the two requested variables plus the necessary coordinate information.

dsets = cat_subset.to_dataset_dict()
dsets
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.stream'
100.00% [1/1 00:00<00:00]
{'ocn.CTRL.pop.h': <xarray.Dataset> Size: 1kB
 Dimensions:             (time: 24, member_id: 1, nlat: 2, nlon: 2)
 Coordinates: (12/36)
   * time                (time) object 192B 0500-02-01 00:00:00 ... 0502-02-01...
   * member_id           (member_id) int64 8B 5
     T0_Kelvin           float64 8B 273.1
     TLAT                (nlat, nlon) float64 32B dask.array<chunksize=(2, 2), meta=np.ndarray>
     TLONG               (nlat, nlon) float64 32B dask.array<chunksize=(2, 2), meta=np.ndarray>
     ULAT                (nlat, nlon) float64 32B dask.array<chunksize=(2, 2), meta=np.ndarray>
     ...                  ...
     salt_to_ppt         float64 8B 1e+03
     sea_ice_salinity    float64 8B 4.0
     sflux_factor        float64 8B 0.1
     sound               float64 8B 1.5e+05
     stefan_boltzmann    float64 8B 5.67e-08
     vonkar              float64 8B 0.4
 Dimensions without coordinates: nlat, nlon
 Data variables:
     O2                  (member_id, time, nlat, nlon) float32 384B dask.array<chunksize=(1, 12, 2, 2), meta=np.ndarray>
     SiO3                (member_id, time, nlat, nlon) float32 384B dask.array<chunksize=(1, 12, 2, 2), meta=np.ndarray>
 Attributes: (12/23)
     title:                           b.e11.B1850C5CN.f09_g16.005
     history:                         Fri Oct 11 01:05:51 2013: /glade/apps/op...
     Conventions:                     CF-1.0; http://www.cgd.ucar.edu/cms/eato...
     contents:                        Diagnostic and Prognostic Variables
     source:                          CCSM POP2, the CCSM Ocean Component
     revision:                        $Id: tavg.F90 41939 2012-11-14 16:37:23Z...
     ...                              ...
     intake_esm_attrs:stream:         pop.h
     intake_esm_attrs:member_id:      5
     intake_esm_attrs:_data_format_:  netcdf
     intake_esm_attrs:path:           ../../../tests/sample_data/cesm-multi-va...
     intake_esm_attrs:time_range:     050001-050012
     intake_esm_dataset_key:          ocn.CTRL.pop.h}

Why does intake.open_esm_datastore need the columns_with_iterables parameter?#

Why does intake.open_esm_datastore need the columns_with_iterables argument when we can achieve the same functionality with just read_csv_kwargs? Intake facilitates writing YAML descriptions of catalogs that can be opened with intake.open_catalog. These YAML descriptions include the information required to open the catalog, such as the catalog driver (intake_esm.core.esm_datastore in our case) and the arguments to pass to the driver. They can also be included as entries in other catalogs, which enables features like catalog nesting. However, intake does not support Python functions as argument values, like the converter we passed through read_csv_kwargs above. So if we want a functional intake YAML description of an intake-esm catalog with multi-variable assets, we need to use the columns_with_iterables argument instead. You can return an intake YAML description of an esm_datastore as follows:

cat.name = "my-esm-catalog"
print(cat.yaml())
sources:
  my-esm-catalog:
    args:
      columns_with_iterables:
      - variable
      obj: multi-variable-catalog.json
    description: ''
    driver: intake_esm.core.esm_datastore
    metadata: {}
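The underlying limitation can be illustrated with the standard library. The sketch below uses `json` as a stand-in for YAML, which is an assumption made for illustration; intake's own serialization may differ in detail, but the point is the same: a function object has no plain-text form, while a list of column names does.

```python
import ast
import json

# Arguments that carry a Python function (the read_csv_kwargs style)
with_function = {"read_csv_kwargs": {"converters": {"variable": ast.literal_eval}}}

# Arguments that carry only column names (the columns_with_iterables style)
with_names = {"columns_with_iterables": ["variable"]}

try:
    json.dumps(with_function)
    function_ok = True
except TypeError:
    # Function objects cannot be written out as text-based configuration
    function_ok = False

names_text = json.dumps(with_names)
print(function_ok)
print(names_text)
```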
import intake_esm  # just to display version information
intake_esm.show_versions()
INSTALLED VERSIONS
------------------

cftime: 1.6.4
dask: 2024.9.0
fastprogress: 1.0.3
fsspec: 2024.9.0
gcsfs: 2024.9.0post1
intake: 0.7.0
intake_esm: 2024.2.6.post16+g6ba67e1.d20240918
netCDF4: 1.7.1
pandas: 2.2.2
requests: 2.32.3
s3fs: 2024.9.0
xarray: 2024.9.0
zarr: 2.18.3