# Define and use a derived variable registry

## What is a derived variable?

A derived variable is a variable that is not present in the original dataset, but is computed from one or more variables in the dataset. For example, a derived variable could be temperature in degrees Fahrenheit. Climate models often write temperature in Celsius or Kelvin, but the user may want degrees Fahrenheit! This is a really simple example; derived variables could include more sophisticated diagnostic output like aggregations of terms in a tracer budget or gradients in a particular field.
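To make the Fahrenheit example concrete, here is a minimal sketch using a tiny synthetic `xarray` dataset. The variable names `T` and `T_degF`, the helper `add_temperature_fahrenheit`, and the assumption that `T` is stored in Kelvin are all illustrative; they are not taken from any particular model output.

```python
import numpy as np
import xarray as xr


def add_temperature_fahrenheit(ds):
    """Add a derived variable `T_degF`, assuming `T` is stored in Kelvin."""
    ds = ds.copy()  # shallow copy, so the caller's dataset keeps its variable list
    ds['T_degF'] = (ds['T'] - 273.15) * 9.0 / 5.0 + 32.0
    ds['T_degF'].attrs = {'units': 'degF', 'long_name': 'Temperature'}
    return ds


# Synthetic stand-in for model output: freezing point, 300 K, boiling point
ds_original = xr.Dataset({'T': ('time', np.array([273.15, 300.0, 373.15]))})
ds_derived = add_temperature_fahrenheit(ds_original)
print(ds_derived['T_degF'].values)
```

The original dataset still holds only `T`; the derived dataset holds both `T` and `T_degF`.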

Note

Currently, the derived variable implementation requires variables on the same grid, etc.; i.e., it assumes that all variables involved can be merged within the same dataset.

A traditional workflow for derived variables might consist of the following:

• Load the data

• Apply some function to the loaded datasets

• Plot the output

But what if we could couple those first two steps? What if we could have a set of variable definitions, each consisting of the variable's requirements, such as its `dependent variables`, and a function which derives the quantity? This is what the `derived_variable` functionality offers in `intake-esm`! It also enables users to share a “registry” of derived variables across catalogs!
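The idea of storing a variable's requirements together with its function can be sketched in plain Python. The `ToyRegistry` class below is hypothetical and is not how `intake-esm` implements its registry; it only illustrates the coupling of "what do I need" and "how do I compute it".

```python
class ToyRegistry:
    """Hypothetical, minimal registry (not intake-esm's implementation)."""

    def __init__(self):
        self._entries = {}

    def register(self, variable, dependent_variables):
        """Return a decorator that stores `func` alongside its requirements."""
        def decorator(func):
            self._entries[variable] = {
                'func': func,
                'dependent_variables': list(dependent_variables),
            }
            return func  # the function itself is left unchanged
        return decorator

    def derive(self, variable, data):
        """Check that the required inputs exist in `data`, then apply the function."""
        entry = self._entries[variable]
        missing = [name for name in entry['dependent_variables'] if name not in data]
        if missing:
            raise KeyError(f'missing dependent variables: {missing}')
        return entry['func'](data)


registry = ToyRegistry()


@registry.register(variable='total', dependent_variables=['a', 'b'])
def calc_total(data):
    return {**data, 'total': data['a'] + data['b']}


result = registry.derive('total', {'a': 1.0, 'b': 2.0})
print(result['total'])  # 3.0
```

Because the requirements live next to the function, the registry can refuse to run a calculation whose inputs are missing instead of failing somewhere inside the function.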

Let’s get started with an example!

```python
import intake
from intake_esm import DerivedVariableRegistry
```

## How to define a derived variable

Let’s compute a derived variable: wind speed! It can be derived from the zonal (`U`) and meridional (`V`) components of the wind.

### Step 1: define a function to compute `wind_speed`

```python
import numpy as np


def calc_wind_speed(ds):
    # Wind speed is the magnitude of the horizontal wind vector (U, V)
    ds['wind_speed'] = np.sqrt(ds.U ** 2 + ds.V ** 2)
    ds['wind_speed'].attrs = {'units': 'm/s',
                              'long_name': 'Wind Speed',
                              'derived_by': 'intake-esm'}
    return ds
```
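Before registering the function, it can be worth checking it on a small synthetic dataset. The sketch below repeats `calc_wind_speed` so that it runs on its own; the three-element `U` and `V` arrays are made-up values chosen so the result is easy to verify by hand.

```python
import numpy as np
import xarray as xr


def calc_wind_speed(ds):
    # Same calculation as above: magnitude of the horizontal wind vector
    ds['wind_speed'] = np.sqrt(ds.U ** 2 + ds.V ** 2)
    ds['wind_speed'].attrs = {'units': 'm/s',
                              'long_name': 'Wind Speed',
                              'derived_by': 'intake-esm'}
    return ds


# Synthetic stand-in for model output (illustrative values only)
test_ds = xr.Dataset(
    {
        'U': ('time', np.array([3.0, 0.0, -6.0])),
        'V': ('time', np.array([4.0, 2.0, 8.0])),
    }
)
test_ds = calc_wind_speed(test_ds)
print(test_ds['wind_speed'].values)
```

The expected speeds are 5, 2 and 10 m/s. Note that the function adds `wind_speed` to the dataset it receives and returns that same dataset.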

### Step 2: create our derived variable registry

We need to instantiate our derived variable registry, which will store our derived variable information! We use the variable `dvr` for this (DerivedVariableRegistry).

```python
dvr = DerivedVariableRegistry()
```

In order to add our derived variable to the registry, we need to add a decorator to our function. This allows us to define our derived variable, dependent variables, and the function associated with the calculation.

Note

For more in-depth details about decorators, check this tutorial: Primer on Python Decorators
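As a quick reminder of what the `@` syntax does, here is a small sketch with a hypothetical decorator factory named `tag`. It is unrelated to `intake-esm`; it only shows that `@tag('wind')` above a function is shorthand for calling `tag('wind')` on that function after it is defined.

```python
def tag(label):
    """Hypothetical decorator factory: attach a label, return the function unchanged."""
    def decorator(func):
        func.label = label
        return func
    return decorator


@tag('wind')
def speed_a(u, v):
    return (u ** 2 + v ** 2) ** 0.5


def speed_b(u, v):
    return (u ** 2 + v ** 2) ** 0.5


# Same effect as writing @tag('wind') above the definition of speed_b
speed_b = tag('wind')(speed_b)

print(speed_a.label, speed_b.label)  # wind wind
print(speed_a(3, 4))  # 5.0
```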

```python
# Register calc_wind_speed as the recipe for 'wind_speed', which needs U and V
@dvr.register(variable='wind_speed', query={'variable': ['U', 'V']})
def calc_wind_speed(ds):
    ds['wind_speed'] = np.sqrt(ds.U ** 2 + ds.V ** 2)
    ds['wind_speed'].attrs = {'units': 'm/s',
                              'long_name': 'Wind Speed',
                              'derived_by': 'intake-esm'}
    return ds
```

The `register` function has two required arguments: `variable` and `query`. In this particular example, the derived variable `wind_speed` is derived from `U` and `V`. It is possible to specify additional required metadata in the query, e.g. `U` and `V` from monthly control runs (e.g. `query={'variable': ['U', 'V'], 'experiment': 'CTRL', 'frequency': 'monthly'}` in the case of CESM Large Ensemble).
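To illustrate how extra query fields narrow the set of inputs, the sketch below filters a handful of made-up catalog rows in plain Python. It assumes that a list value means "any of these" and that all fields must match at once; the `matches` helper and the rows are hypothetical and are not `intake-esm`'s search code.

```python
def matches(row, query):
    """Return True if `row` satisfies every field in `query`."""
    for field, wanted in query.items():
        # Treat a scalar as a one-element list so both forms are handled alike
        allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        if row.get(field) not in allowed:
            return False
    return True


# Made-up catalog rows, for illustration only
rows = [
    {'variable': 'U', 'experiment': 'CTRL', 'frequency': 'monthly'},
    {'variable': 'V', 'experiment': 'CTRL', 'frequency': 'monthly'},
    {'variable': 'U', 'experiment': 'RCP85', 'frequency': 'monthly'},
    {'variable': 'T', 'experiment': 'CTRL', 'frequency': 'daily'},
]

query = {'variable': ['U', 'V'], 'experiment': 'CTRL', 'frequency': 'monthly'}
selected = [row for row in rows if matches(row, query)]
print([row['variable'] for row in selected])  # ['U', 'V']
```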

You’ll notice `dvr` now has a registered variable, `wind_speed`, which was defined in the cell above!

```python
dvr
```
```
DerivedVariableRegistry({'wind_speed': DerivedVariable(func=<function calc_wind_speed at 0x7f4c4c5756c0>, variable='wind_speed', query={'variable': ['U', 'V']}, prefer_derived=False)})
```

Warning

All fields (keys) specified in the query argument when registering a derived variable must be present in the catalog; otherwise, you will get a validation error when connecting a derived variable registry to an intake-esm catalog.
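A check of this kind can be run by hand before connecting the registry. The sketch below uses a hypothetical helper, `check_query_fields`, and a hard-coded list of the column names of the catalog used on this page. It mimics the idea of the validation, not the library's actual error.

```python
def check_query_fields(query, catalog_columns):
    """Raise ValueError if `query` uses a field the catalog does not have."""
    unknown = sorted(set(query) - set(catalog_columns))
    if unknown:
        raise ValueError(f'query fields not in catalog: {unknown}')


# Column names of the catalog used on this page, hard-coded for the sketch
catalog_columns = [
    'variable', 'long_name', 'component', 'experiment', 'frequency',
    'vertical_levels', 'spatial_domain', 'units', 'start_time', 'end_time', 'path',
]

# Passes silently: every field is a catalog column
check_query_fields({'variable': ['U', 'V'], 'frequency': 'monthly'}, catalog_columns)

# Fails: 'realm' is not a column in this catalog
try:
    check_query_fields({'variable': ['U', 'V'], 'realm': 'atm'}, catalog_columns)
except ValueError as err:
    message = str(err)
    print(message)
```

With a real catalog, the column names could presumably be read from `data_catalog.df.columns` rather than typed out.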

### Step 3: connect our derived variable registry to an intake-esm catalog

The derived variable registry is now ready to be used with an intake-esm catalog. To do this, we need to add the registry to the catalog. In this case, we will use data from the CESM Large Ensemble (LENS). This is a climate model ensemble, a subset of which is hosted on the AWS Cloud. If you are interested in learning more about this dataset, check out the LENS on AWS documentation page.

We connect our derived variable registry to a catalog by using the `registry` argument when instantiating the catalog:

```python
data_catalog = intake.open_esm_datastore(
    'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json',
    registry=dvr,
)
```

You’ll notice we have a new field - `derived_variable` which has 1 unique value. This is because we have only registered one derived variable, `wind_speed`.

```python
data_catalog
```

aws-cesm1-le catalog with 56 dataset(s) from 442 asset(s):

| | unique |
|---|---|
| variable | 78 |
| long_name | 75 |
| component | 5 |
| experiment | 4 |
| frequency | 6 |
| vertical_levels | 3 |
| spatial_domain | 5 |
| units | 25 |
| start_time | 12 |
| end_time | 13 |
| path | 427 |
| derived_variable | 1 |

Let’s subset the catalog for `wind_speed` at monthly frequency in the RCP 8.5 (`RCP85`) experiment.

```python
catalog_subset = data_catalog.search(
    variable=['wind_speed'], frequency='monthly', experiment='RCP85'
)

catalog_subset
```

aws-cesm1-le catalog with 1 dataset(s) from 2 asset(s):

| | unique |
|---|---|
| variable | 2 |
| long_name | 2 |
| component | 1 |
| experiment | 1 |
| frequency | 1 |
| vertical_levels | 1 |
| spatial_domain | 1 |
| units | 1 |
| start_time | 1 |
| end_time | 1 |
| path | 2 |
| derived_variable | 1 |

When loading in the data, `intake-esm` will lazily add our calculation for `wind_speed` to the appropriate datasets!

```python
dsets = catalog_subset.to_dataset_dict(
    xarray_open_kwargs={'backend_kwargs': {'storage_options': {'anon': True}}}
)
dsets.keys()
```
```
--> The keys in the returned dictionary of datasets are constructed as follows:
    'component.experiment.frequency'
```
```
dict_keys(['atm.RCP85.monthly'])
```

Let’s look at a single dataset from this dictionary of datasets, using the key `atm.RCP85.monthly`. You’ll notice upon reading in the dataset, we have three variables:

• `U`

• `V`

• `wind_speed`

```python
ds = dsets['atm.RCP85.monthly']
ds
```
```
<xarray.Dataset>
Dimensions:     (member_id: 40, time: 1140, lev: 30, lat: 192, lon: 288, nbnd: 2)
Coordinates:
  * lat         (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lev         (lev) float64 3.643 7.595 14.36 24.61 ... 957.5 976.3 992.6
  * lon         (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
  * member_id   (member_id) int64 1 2 3 4 5 6 7 8 ... 34 35 101 102 103 104 105
  * time        (time) object 2006-01-16 12:00:00 ... 2100-12-16 12:00:00
    time_bnds   (time, nbnd) object dask.array<chunksize=(1140, 2), meta=np.ndarray>
Dimensions without coordinates: nbnd
Data variables:
    U           (member_id, time, lev, lat, lon) float32 dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
    V           (member_id, time, lev, lat, lon) float32 dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
    wind_speed  (member_id, time, lev, lat, lon) float32 dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
Attributes: (12/22)
    Conventions:                       CF-1.0
    NCO:                               4.3.4
    Version:                           $Name$
    host:                              tcs-f02n07
    important_note:                    This data is part of the project 'Blin...
    initial_file:                      b.e11.B20TRC5CNBDRD.f09_g16.105.cam.i....
    ...                                ...
    intake_esm_attrs:spatial_domain:   global
    intake_esm_attrs:units:            m/s
    intake_esm_attrs:start_time:       2006-01-16 12:00:00
    intake_esm_attrs:end_time:         2100-12-16 12:00:00
    intake_esm_attrs:_data_format_:    zarr
    intake_esm_dataset_key:            atm.RCP85.monthly
```
```python
import intake_esm  # just to display version information
intake_esm.show_versions()
```
```INSTALLED VERSIONS
------------------

cftime: 1.6.2