Define and use derived variable registry#

What is a derived variable?#

A derived variable is a variable that is not present in the original dataset, but is computed from one or more variables in the dataset. For example, a derived variable could be temperature in degrees Fahrenheit. Climate models often write temperature in Celsius or Kelvin, but the user may want degrees Fahrenheit! This is a really simple example; derived variables could also include more sophisticated diagnostic output, such as aggregations of terms in a tracer budget or gradients in a particular field.
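To make the Fahrenheit example concrete, here is a minimal sketch of such a derivation. The variable names `T` and `T_degF` are hypothetical, and a plain dict stands in for a dataset; the function only relies on `ds[name]` access and assignment, which an xarray dataset also supports.

```python
def kelvin_to_fahrenheit(t_kelvin):
    """Convert a temperature from Kelvin to degrees Fahrenheit."""
    return (t_kelvin - 273.15) * 9 / 5 + 32


def add_temperature_fahrenheit(ds, source='T', target='T_degF'):
    # `source` and `target` are hypothetical variable names for illustration.
    ds[target] = kelvin_to_fahrenheit(ds[source])
    return ds


# A plain dict stands in for a dataset in this toy example.
toy_ds = add_temperature_fahrenheit({'T': 273.15})
print(toy_ds['T_degF'])  # freezing point of water in degrees Fahrenheit
```

The important part is the shape of the function: it takes a dataset, adds one new variable computed from existing ones, and returns the dataset.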

Note

Currently, the derived variable implementation requires variables on the same grid, etc.; i.e., it assumes that all variables involved can be merged within the same dataset.

A traditional workflow for derived variables might consist of the following:

  • Load the data

  • Apply some function to the loaded datasets

  • Plot the output

But what if we could couple those first two steps? What if we could have a set of variable definitions, each consisting of the variable's requirements (such as the variables it depends on) and a function that derives the quantity? This is what the derived variable functionality in intake-esm offers! It also enables users to share a “registry” of derived variables across catalogs!

Let’s get started with an example!

import intake
from intake_esm import DerivedVariableRegistry

How to define a derived variable#

Let’s compute a derived variable: wind speed! It can be derived from the zonal (U) and meridional (V) components of the wind.

Step 1: define a function to compute wind speed#

import numpy as np

def calc_wind_speed(ds):
    ds['wind_speed'] = np.sqrt(ds.U ** 2 + ds.V ** 2)
    ds['wind_speed'].attrs = {'units': 'm/s',
                              'long_name': 'Wind Speed',
                              'derived_by': 'intake-esm'}
    return ds
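Before registering a function like this, it can help to try it on a tiny synthetic dataset. The sketch below assumes xarray is installed and repeats a trimmed copy of the function (without the attributes) so that it runs on its own; the dimension name `x` and the values are made up so the result is easy to check by hand.

```python
import numpy as np
import xarray as xr


def calc_wind_speed(ds):
    # Trimmed copy of the function above, without the attributes.
    ds['wind_speed'] = np.sqrt(ds.U ** 2 + ds.V ** 2)
    return ds


# Synthetic components chosen so the expected speeds are easy to verify.
toy_ds = xr.Dataset(
    {
        'U': ('x', np.array([3.0, 0.0, -6.0])),
        'V': ('x', np.array([4.0, 2.0, 8.0])),
    }
)

result = calc_wind_speed(toy_ds)
speeds = result['wind_speed'].values
print(speeds)
```

Note that the function modifies the dataset it receives and returns that same object, which is the pattern used throughout this page.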

Step 2: create our derived variable registry#

We need to instantiate our derived variable registry, which will store our derived variable information! We use the variable dvr for this (DerivedVariableRegistry).

dvr = DerivedVariableRegistry()

In order to add our derived variable to the registry, we need to add a decorator to our function. This allows us to define our derived variable, its dependent variables, and the function associated with the calculation.

Note

For more in-depth details about decorators, check this tutorial: Primer on Python Decorators

@dvr.register(variable='wind_speed', query={'variable': ['U', 'V']})
def calc_wind_speed(ds):
    ds['wind_speed'] = np.sqrt(ds.U ** 2 + ds.V ** 2)
    ds['wind_speed'].attrs = {'units': 'm/s',
                              'long_name': 'Wind Speed',
                              'derived_by': 'intake-esm'}
    return ds

The register function has two required arguments: variable and query. In this particular example, the derived variable wind_speed is derived from U and V. It is possible to specify additional required metadata in the query, e.g. U and V from monthly control runs (query={'variable': ['U', 'V'], 'experiment': 'CTRL', 'frequency': 'monthly'} in the case of CESM Large Ensemble).

You’ll notice dvr now has a registered variable, wind_speed, which was defined in the cell above!

dvr
DerivedVariableRegistry({'wind_speed': DerivedVariable(func=<function calc_wind_speed at 0x7efffe5920c0>, variable='wind_speed', query={'variable': ['U', 'V']}, prefer_derived=False)})
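The output above shows that each registry entry keeps the function together with its variable name and query. If decorators are new to you, the following toy sketch shows how a registering decorator can populate such a mapping. `ToyRegistry` is a hypothetical stand-in written for this illustration; it is not how intake-esm implements `DerivedVariableRegistry`, and a plain dict of floats stands in for a dataset.

```python
import math


class ToyRegistry:
    """Hypothetical stand-in for illustration; not intake-esm's implementation."""

    def __init__(self):
        self.entries = {}

    def register(self, variable, query):
        """Return a decorator that records the function under `variable`."""
        def decorator(func):
            self.entries[variable] = {'func': func, 'query': query}
            return func  # hand the function back unchanged
        return decorator


toy = ToyRegistry()


@toy.register(variable='wind_speed', query={'variable': ['U', 'V']})
def calc_wind_speed(ds):
    # A plain dict of floats stands in for a dataset in this toy example.
    ds['wind_speed'] = math.hypot(ds['U'], ds['V'])
    return ds


# Look the entry up by name and apply its function to some toy data.
entry = toy.entries['wind_speed']
derived = entry['func']({'U': 3.0, 'V': 4.0})
print(entry['query'], derived['wind_speed'])
```

Registration happens when the decorated function is defined, so nothing is computed until the stored function is actually called on data.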

Warning

All fields (keys) specified in the query argument when registering a derived variable must be present in the catalog; otherwise, you will get a validation error when connecting a derived variable registry to an intake-esm catalog.
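A quick way to catch this early is to compare the query keys against the catalog's field names before registering. The helper below is a hypothetical sketch, not part of intake-esm; the field list is copied by hand from the summary of the LENS catalog used in Step 3. With an open catalog, the same names could probably be taken from the columns of its underlying table (`data_catalog.df.columns`), assuming the catalog exposes it as a pandas DataFrame.

```python
def missing_query_fields(query, catalog_fields):
    """Return the query keys that are not catalog fields, sorted by name."""
    available = set(catalog_fields)
    return sorted(key for key in query if key not in available)


# Field names copied by hand from the aws-cesm1-le catalog summary in Step 3.
lens_fields = [
    'variable', 'long_name', 'component', 'experiment', 'frequency',
    'vertical_levels', 'spatial_domain', 'units', 'start_time', 'end_time',
    'path',
]

ok = missing_query_fields(
    {'variable': ['U', 'V'], 'experiment': 'CTRL', 'frequency': 'monthly'},
    lens_fields,
)
# 'member_id' is not a field of this catalog, so it is reported as missing.
bad = missing_query_fields({'variable': ['U', 'V'], 'member_id': 1}, lens_fields)
print(ok, bad)
```

An empty list means every key in the query corresponds to a field of the catalog.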

Step 3: connect our derived variable registry to an intake-esm catalog#

The derived variable registry is now ready to be used with an intake-esm catalog. To do this, we need to add the registry to the catalog. In this case, we will use data from the CESM Large Ensemble (LENS). This is a climate model ensemble, a subset of which is hosted on the AWS Cloud. If you are interested in learning more about this dataset, check out the LENS on AWS documentation page.

We connect our derived variable registry to a catalog by using the registry argument when instantiating the catalog:

data_catalog = intake.open_esm_datastore(
    'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json',
    registry=dvr,
)

You’ll notice we have a new field, derived_variable, which has 1 unique value. This is because we have only registered one derived variable, wind_speed.

data_catalog

aws-cesm1-le catalog with 56 dataset(s) from 442 asset(s):

unique
variable 78
long_name 75
component 5
experiment 4
frequency 6
vertical_levels 3
spatial_domain 5
units 25
start_time 12
end_time 13
path 427
derived_variable 1

Let’s also subset the catalog for our derived variable, wind_speed, at monthly frequency in the RCP 8.5 (RCP85) experiment.

catalog_subset = data_catalog.search(
    variable=['wind_speed'], frequency='monthly', experiment='RCP85'
)

catalog_subset

aws-cesm1-le catalog with 1 dataset(s) from 2 asset(s):

unique
variable 2
long_name 2
component 1
experiment 1
frequency 1
vertical_levels 1
spatial_domain 1
units 1
start_time 1
end_time 1
path 2
derived_variable 1

When loading in the data, intake-esm will lazily add our calculation for wind_speed to the appropriate datasets!

dsets = catalog_subset.to_dataset_dict(
    xarray_open_kwargs={'backend_kwargs': {'storage_options': {'anon': True}}}
)
dsets.keys()
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.frequency'
100.00% [1/1 00:01<00:00]
dict_keys(['atm.RCP85.monthly'])
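Since the keys follow the `component.experiment.frequency` pattern printed above, they can be split back into their parts when you need to loop over many datasets. This is a small hypothetical helper, not an intake-esm function, and it assumes that none of the field values contains the separator itself.

```python
def parse_dataset_key(key, fields=('component', 'experiment', 'frequency'), sep='.'):
    """Split a dataset key into a dict of field name -> value."""
    parts = key.split(sep)
    if len(parts) != len(fields):
        raise ValueError(f'expected {len(fields)} parts in {key!r}, got {len(parts)}')
    return dict(zip(fields, parts))


parsed = parse_dataset_key('atm.RCP85.monthly')
print(parsed)
```

The `fields` default mirrors the pattern shown in the output above; a catalog that builds its keys from other fields would need a different tuple.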

Let’s look at a single dataset from this dictionary of datasets, using the key atm.RCP85.monthly. You’ll notice upon reading in the dataset, we have three variables:

  • U

  • V

  • wind_speed

ds = dsets['atm.RCP85.monthly']
ds
<xarray.Dataset> Size: 908GB
Dimensions:     (member_id: 40, time: 1140, lev: 30, lat: 192, lon: 288, nbnd: 2)
Coordinates:
  * lat         (lat) float64 2kB -90.0 -89.06 -88.12 ... 88.12 89.06 90.0
  * lev         (lev) float64 240B 3.643 7.595 14.36 24.61 ... 957.5 976.3 992.6
  * lon         (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * member_id   (member_id) int64 320B 1 2 3 4 5 6 7 ... 35 101 102 103 104 105
  * time        (time) object 9kB 2006-01-16 12:00:00 ... 2100-12-16 12:00:00
    time_bnds   (time, nbnd) object 18kB dask.array<chunksize=(1140, 2), meta=np.ndarray>
Dimensions without coordinates: nbnd
Data variables:
    U           (member_id, time, lev, lat, lon) float32 303GB dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
    V           (member_id, time, lev, lat, lon) float32 303GB dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
    wind_speed  (member_id, time, lev, lat, lon) float32 303GB dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
Attributes: (12/22)
    Conventions:                       CF-1.0
    NCO:                               4.3.4
    Version:                           $Name$
    host:                              tcs-f02n07
    important_note:                    This data is part of the project 'Blin...
    initial_file:                      b.e11.B20TRC5CNBDRD.f09_g16.105.cam.i....
    ...                                ...
    intake_esm_attrs:spatial_domain:   global
    intake_esm_attrs:units:            m/s
    intake_esm_attrs:start_time:       2006-01-16 12:00:00
    intake_esm_attrs:end_time:         2100-12-16 12:00:00
    intake_esm_attrs:_data_format_:    zarr
    intake_esm_dataset_key:            atm.RCP85.monthly
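The sizes in the output above explain why the lazy evaluation matters: each variable is far too large to load at once. The arithmetic below reproduces the reported numbers from the dimensions and chunk shape shown in the output. The chunk count assumes regular chunking in which only the last chunk along a dimension may be smaller.

```python
import math

# Dimension sizes and chunk shape copied from the dataset output above.
dims = {'member_id': 40, 'time': 1140, 'lev': 30, 'lat': 192, 'lon': 288}
chunk_shape = (1, 18, 30, 192, 288)
bytes_per_value = 4  # float32

total_bytes = math.prod(dims.values()) * bytes_per_value
chunk_bytes = math.prod(chunk_shape) * bytes_per_value

# Assumes regular chunking where only the last chunk per dimension is smaller.
n_chunks = math.prod(
    math.ceil(size / chunk) for size, chunk in zip(dims.values(), chunk_shape)
)

print(f'{total_bytes / 1e9:.0f} GB per variable')
print(f'{chunk_bytes / 1e6:.0f} MB per chunk, {n_chunks} chunks')
```

In practice this suggests selecting a subset (for example a few ensemble members, time steps, or levels) before triggering any computation on wind_speed.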
import intake_esm  # just to display version information
intake_esm.show_versions()
INSTALLED VERSIONS
------------------

cftime: 1.6.3
dask: 2024.2.1
fastprogress: 1.0.3
fsspec: 2024.2.0
gcsfs: 2024.2.0
intake: 0.7.0
intake_esm: 2024.2.6.post2+dirty
netCDF4: 1.6.5
pandas: 2.2.1
requests: 2.31.0
s3fs: 2024.2.0
xarray: 2024.2.0
zarr: 2.17.0