Use catalogs with assets containing multiple variables#
By default, intake-esm assumes that the data assets (files) contain a single variable (e.g. temperature, precipitation, etc.). If you have multiple variables in your data files, intake-esm requires the following:
- The variable_column of the catalog must contain iterables (list, tuple, set) of values (e.g. ['temperature', 'precipitation']).
- The user must provide a converters dictionary with appropriate functions for parsing values in the variable_column (and/or any other column with iterables) into iterables when loading the catalog. This is done via the read_csv_kwargs argument of the open_esm_datastore function.
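To make the converters mechanism concrete, here is a minimal sketch that uses pandas directly. It assumes that read_csv_kwargs is forwarded to pandas.read_csv, where converters is a standard parameter; the two-row CSV and its file names are made up for illustration.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row catalog; the "variable" column holds lists serialized as strings.
csv_text = '''path,variable
a.nc,"['temperature', 'precipitation']"
b.nc,"['temperature']"
'''

# Without converters, pandas keeps each serialized list as a plain string.
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw.loc[0, "variable"]).__name__)  # str

# With a converter, each cell in "variable" is parsed into a Python list.
parsed = pd.read_csv(
    io.StringIO(csv_text),
    converters={"variable": ast.literal_eval},
)
print(parsed.loc[0, "variable"])  # ['temperature', 'precipitation']
```

Any callable that turns the cell text into an iterable could serve as the converter; ast.literal_eval is used here because it only evaluates Python literals.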
Inspect the catalog#
In the example below, we use the following catalog to demonstrate how to work with multi-variable assets:
# Look at the catalog on disk
!cat multi-variable-catalog.csv
experiment,case,component,stream,variable,member_id,path,time_range
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'TEMP', 'SiO3']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-TEMP-SiO3.050001-050012.nc,050001-050012
As you can see, the variable column contains a list of variables, and this list was serialized as a string: "['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']".
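The short sketch below illustrates why the serialized string has to be parsed before use: membership tests on the raw string match substrings, while the parsed list matches whole variable names. It uses only the standard library, and ast.literal_eval is one possible parser among others.

```python
import ast

serialized = "['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']"
variables = ast.literal_eval(serialized)

print(type(serialized).__name__, type(variables).__name__)  # str list
print("MASK" in serialized)  # True: substring match on the raw string
print("MASK" in variables)   # False: no variable is named "MASK"
print("O2" in variables)     # True: exact match on a list element
print(len(variables))        # 7
```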
Load the catalog#
import intake
import ast
import dask
# Make sure this is single-threaded
dask.config.set(scheduler='single-threaded')
cat = intake.open_esm_datastore(
    "multi-variable-catalog.json",
    # Parse the serialized lists in the "variable" column back into Python lists
    read_csv_kwargs={"converters": {"variable": ast.literal_eval}},
)
cat
sample-multi-variable-cesm1-lens catalog with 1 dataset(s) from 5 asset(s):
 | unique |
---|---|
experiment | 1 |
case | 1 |
component | 1 |
stream | 1 |
variable | 10 |
member_id | 1 |
path | 5 |
time_range | 2 |
derived_variable | 0 |
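The count of 10 for variable is consistent with counting distinct variable names across all lists rather than distinct lists. The sketch below checks this with plain Python on the variable lists from the sample catalog; it is an illustration of the arithmetic, not intake-esm's own code.

```python
# Variable lists of the five assets in the sample catalog, in file order.
shared = ["SHF", "REGION_MASK", "ANGLE", "DXU", "KMT"]
rows = [
    shared + ["NO2", "O2"],
    shared + ["NO2", "O2"],
    shared + ["NO2", "PO4"],
    shared + ["NO2", "PO4"],
    shared + ["TEMP", "SiO3"],
]

# Distinct names after flattening all lists, versus distinct whole lists.
unique_names = {name for row in rows for name in row}
unique_lists = {tuple(row) for row in rows}

print(len(unique_names))  # 10
print(len(unique_lists))  # 3
```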
To confirm that intake-esm has loaded the catalog correctly, we can inspect the has_multiple_variable_assets property of cat.esmcat:
cat.esmcat.has_multiple_variable_assets
True
Search for datasets#
The search functionality works the same way as for catalogs with single-variable assets:
cat_subset = cat.search(variable=["O2", "SiO3"])
cat_subset.df
 | experiment | case | component | stream | variable | member_id | path | time_range |
---|---|---|---|---|---|---|---|---|
0 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | [SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2] | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050001-050012 |
1 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | [SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2] | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050101-050112 |
2 | CTRL | b.e11.B1850C5CN.f09_g16.005 | ocn | pop.h | [SHF, REGION_MASK, ANGLE, DXU, KMT, TEMP, SiO3] | 5 | ../../../tests/sample_data/cesm-multi-variable... | 050001-050012 |
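Judging from the result above, a row is returned when its variable list contains at least one of the requested names. The sketch below reproduces that behaviour with plain Python on the variable lists from the sample catalog; matches_any is a hypothetical helper written for illustration, not intake-esm's implementation.

```python
# Variable lists of the five assets in the sample catalog, in file order.
shared = ["SHF", "REGION_MASK", "ANGLE", "DXU", "KMT"]
rows = [
    shared + ["NO2", "O2"],
    shared + ["NO2", "O2"],
    shared + ["NO2", "PO4"],
    shared + ["NO2", "PO4"],
    shared + ["TEMP", "SiO3"],
]


def matches_any(asset_variables, requested):
    """Return True if the asset holds at least one requested variable."""
    return not set(requested).isdisjoint(asset_variables)


requested = ["O2", "SiO3"]
selected = [i for i, row in enumerate(rows) if matches_any(row, requested)]
print(selected)  # [0, 1, 4]
```

The printed positions refer to the original catalog file, whereas the search result shown above is re-indexed from 0. Because names are compared as whole list elements, the rows that contain NO2 but not O2 are not selected.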
Load assets into xarray datasets#
When loading the data files into xarray datasets, intake-esm loads only the data variables that were requested. For example, if a data file contains ten data variables and the user requests two of them, intake-esm loads those two variables plus the necessary coordinate information.
dsets = cat_subset.to_dataset_dict()
dsets
--> The keys in the returned dictionary of datasets are constructed as follows:
'component.experiment.stream'
{'ocn.CTRL.pop.h': <xarray.Dataset>
Dimensions: (time: 24, member_id: 1, nlat: 2, nlon: 2)
Coordinates: (12/36)
* time (time) object 0500-02-01 00:00:00 ... 0502-02-01 00:0...
* member_id (member_id) int64 5
T0_Kelvin float64 273.1
TLAT (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
TLONG (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
ULAT (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
... ...
salt_to_ppt float64 1e+03
sea_ice_salinity float64 4.0
sflux_factor float64 0.1
sound float64 1.5e+05
stefan_boltzmann float64 5.67e-08
vonkar float64 0.4
Dimensions without coordinates: nlat, nlon
Data variables:
O2 (member_id, time, nlat, nlon) float32 dask.array<chunksize=(1, 12, 2, 2), meta=np.ndarray>
SiO3 (member_id, time, nlat, nlon) float32 dask.array<chunksize=(1, 24, 2, 2), meta=np.ndarray>
Attributes: (12/23)
title: b.e11.B1850C5CN.f09_g16.005
history: Fri Oct 11 01:05:51 2013: /glade/apps/op...
Conventions: CF-1.0; http://www.cgd.ucar.edu/cms/eato...
contents: Diagnostic and Prognostic Variables
source: CCSM POP2, the CCSM Ocean Component
revision: $Id: tavg.F90 41939 2012-11-14 16:37:23Z...
... ...
intake_esm_attrs:stream: pop.h
intake_esm_attrs:member_id: 5
intake_esm_attrs:_data_format_: netcdf
intake_esm_attrs:path: ../../../tests/sample_data/cesm-multi-va...
intake_esm_attrs:time_range: 050001-050012
intake_esm_dataset_key: ocn.CTRL.pop.h}
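The effect of keeping only the requested variables can be approximated with plain xarray, assuming xarray and numpy are installed. The sketch builds a small synthetic dataset (its shapes and values are made up) and selects two data variables; it is an analogy, not a description of how intake-esm performs the selection internally.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a multi-variable file:
# three data variables that share one two-dimensional coordinate.
dims = ("time", "nlat", "nlon")
shape = (2, 2, 2)
ds = xr.Dataset(
    data_vars={
        name: (dims, np.zeros(shape, dtype="float32"))
        for name in ("O2", "SiO3", "PO4")
    },
    coords={"TLAT": (("nlat", "nlon"), np.zeros((2, 2)))},
)

# Indexing with a list of names keeps those data variables and their coordinates.
subset = ds[["O2", "SiO3"]]
print(list(subset.data_vars))   # ['O2', 'SiO3']
print("TLAT" in subset.coords)  # True
```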
import intake_esm # just to display version information
intake_esm.show_versions()
INSTALLED VERSIONS
------------------
cftime: 1.6.2
dask: 2023.3.1
fastprogress: 1.0.3
fsspec: 2023.3.0
gcsfs: 2023.3.0
intake: 0.6.8
intake_esm: 2022.9.18.post30+dirty
netCDF4: 1.6.3
pandas: 1.5.3
requests: 2.28.2
s3fs: 2023.3.0
xarray: 2023.2.0
zarr: 2.14.2