Use catalogs with assets containing multiple variables#
By default, intake-esm assumes that each data asset (file) contains a single variable (e.g. temperature, precipitation, etc.). If your data files contain multiple variables, intake-esm requires the following:
- the variable_column of the catalog must contain iterables (list, tuple, set) of values (e.g. ['temperature', 'precipitation']).
- the user must provide converters with appropriate functions for parsing values in the variable_column (and/or any other column with iterables) into iterables when loading the catalog. There are two ways to do this with the open_esm_datastore function: either pass the converter functions directly through the read_csv_kwargs argument, or specify the columns in the columns_with_iterables parameter. The latter is a shortcut for the former. Both are demonstrated below.
Inspect the catalog#
In the example below, we are going to use the following catalog to demonstrate how to work with multi-variable assets:
# Look at the catalog on disk
!cat multi-variable-catalog.csv
experiment,case,component,stream,variable,member_id,path,time_range
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'TEMP', 'SiO3']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-TEMP-SiO3.050001-050012.nc,050001-050012
As you can see, the variable column contains a list of variables, and this list was serialized as a string: "['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']".
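To see why a converter is needed, here is a minimal sketch that reads a tiny CSV with pandas, first without and then with ast.literal_eval as a converter. The one-row CSV text and the names csv_text, raw and parsed are made up for this illustration; they are not part of the catalog above.

```python
import ast
import io

import pandas as pd

# Hypothetical one-row catalog; the list is stored as a quoted string.
csv_text = """variable,path
"['SHF', 'O2']",file1.nc
"""

# Without a converter, pandas keeps the cell as a plain string.
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw.loc[0, "variable"]).__name__)

# With ast.literal_eval, the string is parsed back into a Python list.
parsed = pd.read_csv(
    io.StringIO(csv_text),
    converters={"variable": ast.literal_eval},
)
print(parsed.loc[0, "variable"])
```

ast.literal_eval only evaluates Python literals (strings, numbers, lists, tuples, dicts, sets and so on), so it is a safer choice than eval for parsing values read from a file.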
Load the catalog#
import intake
import ast
import dask
# Make sure this is single-threaded
dask.config.set(scheduler='single-threaded')
cat = intake.open_esm_datastore(
"multi-variable-catalog.json",
read_csv_kwargs={"converters": {"variable": ast.literal_eval}},
)
cat
sample-multi-variable-cesm1-lens catalog with 1 dataset(s) from 5 asset(s):
unique | |
---|---|
experiment | 1 |
case | 1 |
component | 1 |
stream | 1 |
variable | 10 |
member_id | 1 |
path | 5 |
time_range | 2 |
derived_variable | 0 |
To confirm that intake-esm has loaded the catalog correctly, we can inspect the .has_multiple_variable_assets property:
cat.esmcat.has_multiple_variable_assets
True
Alternatively, we can specify the variable column name in the columns_with_iterables parameter:
cat = intake.open_esm_datastore(
"multi-variable-catalog.json",
columns_with_iterables=["variable"],
)
cat.esmcat.has_multiple_variable_assets
True
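One way to picture the shortcut is as a mapping from each listed column to a parsing function. The sketch below is an illustration under that assumption, not intake-esm's actual code; converters_from_columns is a hypothetical helper name.

```python
import ast


def converters_from_columns(columns_with_iterables):
    """Hypothetical helper: map each listed column to ast.literal_eval."""
    return {column: ast.literal_eval for column in columns_with_iterables}


# The resulting mapping has the same shape as the one passed by hand
# through read_csv_kwargs in the first example.
read_csv_kwargs = {"converters": converters_from_columns(["variable"])}
print(read_csv_kwargs["converters"]["variable"]("['SHF', 'O2']"))
```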
Search for datasets#
The search functionality works the same way as it does for catalogs with single-variable assets:
cat_subset = cat.search(variable=["O2", "SiO3"])
cat_subset.df
experiment | case | component | stream | variable | member_id | path | time_range | |
---|---|---|---|---|---|---|---|---|
0 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | [SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2] | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050001-050012 |
1 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | [SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2] | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050101-050112 |
2 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | [SHF, REGION_MASK, ANGLE, DXU, KMT, TEMP, SiO3] | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050001-050012 |
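The result above is consistent with a simple rule: an asset matches when its variable list contains at least one of the requested variables. The following plain-Python sketch reproduces that rule on the variable lists from the catalog; matching_rows is a hypothetical helper and does not reflect how intake-esm implements search internally.

```python
# Variable lists of the five assets in the catalog shown earlier.
asset_variables = [
    ["SHF", "REGION_MASK", "ANGLE", "DXU", "KMT", "NO2", "O2"],
    ["SHF", "REGION_MASK", "ANGLE", "DXU", "KMT", "NO2", "O2"],
    ["SHF", "REGION_MASK", "ANGLE", "DXU", "KMT", "NO2", "PO4"],
    ["SHF", "REGION_MASK", "ANGLE", "DXU", "KMT", "NO2", "PO4"],
    ["SHF", "REGION_MASK", "ANGLE", "DXU", "KMT", "TEMP", "SiO3"],
]


def matching_rows(rows, requested):
    """Return indices of rows sharing at least one variable with the query."""
    wanted = set(requested)
    return [i for i, variables in enumerate(rows) if wanted & set(variables)]


print(matching_rows(asset_variables, ["O2", "SiO3"]))
```

The indices refer to rows of the original catalog; the search result table above shows the same three assets with a fresh index.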
Load assets into xarray datasets#
When loading the data files into xarray datasets, intake-esm will load only the data variables that were requested. For example, if a data file contains ten data variables and the user requests two of them, intake-esm will load the two requested variables plus the necessary coordinate information.
dsets = cat_subset.to_dataset_dict()
dsets
--> The keys in the returned dictionary of datasets are constructed as follows:
'component.experiment.stream'
{'ocn.CTRL.pop.h': <xarray.Dataset> Size: 1kB
Dimensions: (time: 24, member_id: 1, nlat: 2, nlon: 2)
Coordinates: (12/36)
* time (time) object 192B 0500-02-01 00:00:00 ... 0502-02-01...
* member_id (member_id) int64 8B 5
T0_Kelvin float64 8B 273.1
TLAT (nlat, nlon) float64 32B dask.array<chunksize=(2, 2), meta=np.ndarray>
TLONG (nlat, nlon) float64 32B dask.array<chunksize=(2, 2), meta=np.ndarray>
ULAT (nlat, nlon) float64 32B dask.array<chunksize=(2, 2), meta=np.ndarray>
... ...
salt_to_ppt float64 8B 1e+03
sea_ice_salinity float64 8B 4.0
sflux_factor float64 8B 0.1
sound float64 8B 1.5e+05
stefan_boltzmann float64 8B 5.67e-08
vonkar float64 8B 0.4
Dimensions without coordinates: nlat, nlon
Data variables:
O2 (member_id, time, nlat, nlon) float32 384B dask.array<chunksize=(1, 12, 2, 2), meta=np.ndarray>
SiO3 (member_id, time, nlat, nlon) float32 384B dask.array<chunksize=(1, 12, 2, 2), meta=np.ndarray>
Attributes: (12/23)
title: b.e11.B1850C5CN.f09_g16.005
history: Fri Oct 11 01:05:51 2013: /glade/apps/op...
Conventions: CF-1.0; http://www.cgd.ucar.edu/cms/eato...
contents: Diagnostic and Prognostic Variables
source: CCSM POP2, the CCSM Ocean Component
revision: $Id: tavg.F90 41939 2012-11-14 16:37:23Z...
... ...
intake_esm_attrs:stream: pop.h
intake_esm_attrs:member_id: 5
intake_esm_attrs:_data_format_: netcdf
intake_esm_attrs:path: ../../../tests/sample_data/cesm-multi-va...
intake_esm_attrs:time_range: 050001-050012
intake_esm_dataset_key: ocn.CTRL.pop.h}
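The per-asset selection described above can be pictured as intersecting each asset's variable list with the requested variables. This is a minimal sketch of that idea using plain lists; variables_to_load is a hypothetical helper, and the real loading in intake-esm (including how coordinates are retained) is more involved.

```python
def variables_to_load(asset_variables, requested):
    """Hypothetical helper: keep only the requested variables the asset has."""
    wanted = set(requested)
    return [name for name in asset_variables if name in wanted]


requested = ["O2", "SiO3"]
o2_asset = ["SHF", "REGION_MASK", "ANGLE", "DXU", "KMT", "NO2", "O2"]
sio3_asset = ["SHF", "REGION_MASK", "ANGLE", "DXU", "KMT", "TEMP", "SiO3"]

print(variables_to_load(o2_asset, requested))
print(variables_to_load(sio3_asset, requested))
```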
Why does intake.open_esm_datastore need the columns_with_iterables parameter?#
Why does intake.open_esm_datastore need the columns_with_iterables argument when we can achieve the same functionality with just read_csv_kwargs? Intake facilitates writing YAML descriptions of catalogs that can be opened with intake.open_catalog. These YAML descriptions include the information required to open the catalog: things like the catalog driver (intake_esm.core.esm_datastore in our case) and the arguments to pass to the driver to open the catalog. They can be included as entries in other catalogs, enabling features like catalog nesting. However, intake does not support Python function arguments like those we provided to read_csv_kwargs above, so if we want a functional intake YAML description of an intake-esm catalog with multi-variable assets, we need to use the columns_with_iterables argument instead. You can return an intake YAML description of an esm_datastore as follows:
cat.name = "my-esm-catalog"
print(cat.yaml())
sources:
my-esm-catalog:
args:
columns_with_iterables:
- variable
obj: multi-variable-catalog.json
description: ''
driver: intake_esm.core.esm_datastore
metadata: {}
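The underlying limitation is that a function object is not plain data, whereas a list of column names is. The sketch below illustrates this with the standard-library json module as a stand-in for YAML (an assumption made to keep the example dependency-free; it does not call intake or any YAML library).

```python
import ast
import json

# Column names are plain data, so they serialize without trouble.
print(json.dumps({"columns_with_iterables": ["variable"]}))

# A function object is not plain data; json.dumps raises TypeError.
try:
    json.dumps({"read_csv_kwargs": {"converters": {"variable": ast.literal_eval}}})
    serializable = True
except TypeError:
    serializable = False

print(serializable)
```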
import intake_esm # just to display version information
intake_esm.show_versions()
INSTALLED VERSIONS
------------------
cftime: 1.6.4
dask: 2024.9.0
fastprogress: 1.0.3
fsspec: 2024.9.0
gcsfs: 2024.9.0post1
intake: 0.7.0
intake_esm: 2024.2.6.post16+g6ba67e1.d20240918
netCDF4: 1.7.1
pandas: 2.2.2
requests: 2.32.3
s3fs: 2024.9.0
xarray: 2024.9.0
zarr: 2.18.3