Working with multi-variable assets
In addition to catalogs of data assets (files) in time-series (single-variable) format, intake-esm supports catalogs with data assets in time-slice (history) format and/or files with multiple variables. For intake-esm to work properly with multi-variable assets, the variable_column of the catalog must contain iterables (list, tuple, set) of values. Because a CSV file stores these iterables as strings, the user must specify a dictionary of functions for converting values in certain columns into iterables. This is done via the csv_kwargs argument.
In the example below, we are going to use the following catalog to demonstrate how to work with multi-variable assets:
# Look at the catalog on disk
!cat multi-variable-catalog.csv
experiment,case,component,stream,variable,member_id,path,time_range
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'TEMP', 'SiO3']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-TEMP-SiO3.050001-050012.nc,050001-050012
As you can see, the variable column contains a list of variables, and this list was serialized as a string: "['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']".
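As a minimal sketch of what this serialization means in practice, the snippet below takes the string from the first catalog row and parses it back into a Python list with ast.literal_eval from the standard library. The variable names serialized and variables are chosen here for illustration only.

```python
import ast

# The value of the "variable" column exactly as it is stored in the CSV file.
serialized = "['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']"

# Before conversion this is a plain string, so membership tests operate on
# substrings rather than on variable names.
print(type(serialized).__name__)

# literal_eval safely evaluates a string containing a Python literal
# (here, a list of strings) without executing arbitrary code.
variables = ast.literal_eval(serialized)
print(type(variables).__name__)
print("O2" in variables)
```

Unlike eval, literal_eval only accepts literals such as strings, numbers, lists, tuples, sets and dicts, which makes it a reasonable choice for parsing values read from a file.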
Loading a catalog
To load a catalog with multi-variable files, we must pass additional information to open_esm_datastore via the csv_kwargs argument. We are going to specify a dictionary of functions for converting the values in the variable column into iterables. We use the literal_eval function from the standard ast module:
import ast
import intake
col = intake.open_esm_datastore(
    "multi-variable-collection.json",
    csv_kwargs={"converters": {"variable": ast.literal_eval}},
)
col
sample-multi-variable-cesm1-lens catalog with 1 dataset(s) from 5 asset(s):
unique | |
---|---|
experiment | 1 |
case | 1 |
component | 1 |
stream | 1 |
variable | 10 |
member_id | 1 |
path | 5 |
time_range | 2 |
col.df.head()
experiment | case | component | stream | variable | member_id | path | time_range | |
---|---|---|---|---|---|---|---|---|
0 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050001-050012 |
1 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050101-050112 |
2 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, PO4) | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050001-050012 |
3 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, PO4) | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050101-050112 |
4 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | (SHF, REGION_MASK, ANGLE, DXU, KMT, TEMP, SiO3) | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050001-050012 |
The in-memory representation of the catalog stores each entry of the variable column as a tuple of values. To confirm that intake-esm has registered this catalog as containing multi-variable assets, we can check the ._multiple_variable_assets property:
col._multiple_variable_assets
True
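To see the effect of the converters entry in isolation, the sketch below reads a small in-memory CSV twice with pandas.read_csv, once without and once with a converter for the variable column. It assumes that csv_kwargs is forwarded to pandas.read_csv; the CSV text is a shortened, hypothetical stand-in for the catalog above, and the final isinstance check is only an illustration, not how intake-esm sets ._multiple_variable_assets.

```python
import ast
import io

import pandas as pd

# Hypothetical, shortened stand-in for the catalog shown earlier.
csv_text = '''experiment,variable,time_range
CTRL,"['SHF', 'NO2', 'O2']",050001-050012
CTRL,"['SHF', 'TEMP', 'SiO3']",050001-050012
'''

# Without a converter, every entry of the column stays a string.
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw.loc[0, "variable"]).__name__)

# With a converter, each entry is parsed into a Python list.
parsed = pd.read_csv(
    io.StringIO(csv_text),
    converters={"variable": ast.literal_eval},
)
print(parsed.loc[0, "variable"])

# Illustrative check that every entry of the column is now an iterable.
is_iterable = parsed["variable"].map(lambda v: isinstance(v, (list, tuple, set)))
print(bool(is_iterable.all()))
```

Note that literal_eval returns lists here, whereas the catalog table above displays tuples. A converter such as lambda s: tuple(ast.literal_eval(s)) would produce tuples directly in plain pandas.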
Searching
The search functionality works in the same way as for single-variable catalogs:
col_subset = col.search(variable=["O2", "SiO3"])
col_subset.df
experiment | case | component | stream | variable | member_id | path | time_range | |
---|---|---|---|---|---|---|---|---|
0 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050001-050012 |
1 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050101-050112 |
2 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | (SHF, REGION_MASK, ANGLE, DXU, KMT, TEMP, SiO3) | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050001-050012 |
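The result above keeps every row whose variable tuple contains at least one of the requested names. The sketch below reproduces that matching rule in plain pandas on a small hypothetical DataFrame. It is not intake-esm's actual search implementation, only an illustration of the behavior visible in the output; the names df, requested and subset are assumptions made for this example.

```python
import pandas as pd

# Hypothetical, shortened stand-in for col.df with tuple-valued entries.
df = pd.DataFrame(
    {
        "variable": [
            ("SHF", "NO2", "O2"),
            ("SHF", "NO2", "PO4"),
            ("SHF", "TEMP", "SiO3"),
        ],
        "time_range": ["050001-050012", "050001-050012", "050001-050012"],
    }
)

requested = {"O2", "SiO3"}

# Keep a row when its tuple shares at least one name with the request.
mask = df["variable"].apply(lambda names: bool(requested.intersection(names)))
subset = df[mask]

print(subset["variable"].tolist())
```

The row containing only PO4 alongside the shared variables is dropped because none of its names was requested.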
Loading assets into xarray datasets
Loading data assets into xarray datasets works in the same way too:
col_subset.to_dataset_dict(cdf_kwargs={})
--> The keys in the returned dictionary of datasets are constructed as follows:
'component.experiment.stream'
{'ocn.CTRL.pop.h': <xarray.Dataset>
Dimensions: (time: 24, member_id: 1, nlat: 2, nlon: 2)
Coordinates:
* time (time) object 0500-02-01 00:00:00 ... 0502-02-01 00:00:00
TLAT (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
TLONG (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
ULAT (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
ULONG (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
* member_id (member_id) int64 5
Dimensions without coordinates: nlat, nlon
Data variables:
O2 (member_id, time, nlat, nlon) float32 dask.array<chunksize=(1, 12, 2, 2), meta=np.ndarray>
SiO3 (member_id, time, nlat, nlon) float32 dask.array<chunksize=(1, 24, 2, 2), meta=np.ndarray>
Attributes: (12/16)
start_time: This dataset was created on 2013-05-28 at 02:4...
revision: $Id: tavg.F90 41939 2012-11-14 16:37:23Z mlevy...
tavg_sum: 2678400.0
tavg_sum_qflux: 2678400.0
NCO: 4.3.4
title: b.e11.B1850C5CN.f09_g16.005
... ...
cell_methods: cell_methods = time: mean ==> the variable val...
nco_openmp_thread_number: 1
Conventions: CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netc...
intake_esm_varname: O2\nSiO3
calendar: All years have exactly 365 days.
intake_esm_dataset_key: ocn.CTRL.pop.h}
import intake_esm # just to display version information
intake_esm.show_versions()
INSTALLED VERSIONS
------------------
cftime: 1.5.0
dask: 2021.08.0
fastprogress: 0.2.7
fsspec: 2021.07.0
gcsfs: 2021.07.0
intake: 0.6.3
intake_esm: 0.0.0
netCDF4: 1.5.7
pandas: 1.3.2
requests: 2.26.0
s3fs: 2021.07.0
xarray: 0.19.0
zarr: 2.8.3