Overview

Intake-esm is a data cataloging utility built on top of intake, pandas, and xarray. Intake-esm aims to facilitate:

  • the discovery of Earth’s climate and weather datasets.

  • the ingestion of these datasets into xarray dataset containers.

Its basic usage is shown below. To begin, let’s import intake:

import intake

Loading a catalog

At import time, the intake-esm plugin is available in intake’s registry as esm_datastore and can be accessed with the intake.open_esm_datastore() function. For demonstration purposes, we are going to use the catalog for the Community Earth System Model Large Ensemble (CESM LENS) dataset, publicly available on Amazon S3.

Note

You can learn more about the CESM LENS dataset in AWS S3 here

You can load data from an ESM catalog by providing the URL to a valid ESM catalog file:

catalog_url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
col = intake.open_esm_datastore(catalog_url)
col

aws-cesm1-le catalog with 56 dataset(s) from 429 asset(s):

                    unique
component                5
frequency                6
experiment               4
variable                73
path                   414
variable_long_name      70
dim_per_tstep            3
start                   12

The summary above tells us that this catalog contains 429 data assets grouped into 56 datasets. We can get more information on the individual data assets by inspecting the underlying dataframe, which is created when the catalog is initialized:

col.df.head()
component frequency experiment variable path variable_long_name dim_per_tstep start end
0 atm daily 20C FLNS s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.... net longwave flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
1 atm daily 20C FLNSC s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC... clearsky net longwave flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
2 atm daily 20C FLUT s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT.... upwelling longwave flux at top of model 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
3 atm daily 20C FSNS s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS.... net solar flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
4 atm daily 20C FSNSC s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC... clearsky net solar flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
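
Because col.df is a regular pandas DataFrame, standard pandas operations also apply to it. For example (a minimal illustration, not part of the original walkthrough), we could count how many assets belong to each component:

# Count assets per component using plain pandas (illustrative only).
col.df["component"].value_counts()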

Finding unique entries for individual columns

To get unique values for given columns in the catalog, intake-esm provides a unique() method. This method returns a dictionary containing the count and the unique values for each requested column:

col.unique(columns=["component", "frequency", "experiment"])
{'component': {'count': 5,
  'values': ['atm', 'ice_nh', 'ice_sh', 'lnd', 'ocn']},
 'frequency': {'count': 6,
  'values': ['daily',
   'hourly6-1990-2005',
   'hourly6-2026-2035',
   'hourly6-2071-2080',
   'monthly',
   'static']},
 'experiment': {'count': 4, 'values': ['20C', 'CTRL', 'HIST', 'RCP85']}}
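
If you only need the counts, the catalog also exposes an nunique() method which, like pandas.DataFrame.nunique(), returns the number of unique values per column (a brief sketch):

# Number of unique values in each catalog column (illustrative).
col.nunique()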

Loading datasets

Intake-esm implements convenience utilities for loading query results into higher-level xarray datasets. The logic for merging/concatenating query results into these datasets is defined in the catalog’s input JSON file and is available under the .aggregation_info property:

col.aggregation_info
AggregationInfo(groupby_attrs=['component', 'experiment', 'frequency'], variable_column_name='variable', aggregations=[{'type': 'union', 'attribute_name': 'variable', 'options': {'compat': 'override'}}], agg_columns=['variable'], aggregation_dict={'variable': {'type': 'union', 'options': {'compat': 'override'}}})
col.aggregation_info.aggregations
[{'type': 'union',
  'attribute_name': 'variable',
  'options': {'compat': 'override'}}]
# Dataframe columns used to determine groups of compatible datasets.
col.aggregation_info.groupby_attrs  # or col.groupby_attrs
['component', 'experiment', 'frequency']
# Columns used to merge/concatenate multiple compatible datasets into a single dataset.
col.aggregation_info.agg_columns  # or col.agg_columns
['variable']

To load data assets into xarray datasets, we use the to_dataset_dict() method. As the name suggests, this method returns a dictionary of aggregate xarray datasets.
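
The example below operates on col_subset, a subset of the full catalog. The subsetting step is not shown in this walkthrough; a minimal sketch using the search() method is given here, with query values that are assumptions chosen to be consistent with the dataset keys shown in the output below (the original query may have used additional criteria, e.g. a variable filter):

# A sketch only: the query values below are assumptions, not part of the original walkthrough.
col_subset = col.search(
    component=["ice_nh", "lnd"],
    frequency="monthly",
    experiment=["20C", "HIST"],
)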

dset_dicts = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.frequency'
100.00% [4/4 00:00<00:00]
[key for key in dset_dicts.keys()]
['ice_nh.HIST.monthly',
 'ice_nh.20C.monthly',
 'lnd.HIST.monthly',
 'lnd.20C.monthly']

We can access a particular dataset as follows:

ds = dset_dicts["lnd.20C.monthly"]
print(ds)
<xarray.Dataset>
Dimensions:         (hist_interval: 2, lat: 192, levgrnd: 15, lon: 288, member_id: 40, time: 1032)
Coordinates:
  * lat             (lat) float64 -90.0 -89.06 -88.12 ... 88.12 89.06 90.0
  * lon             (lon) float32 0.0 1.25 2.5 3.75 ... 356.25 357.5 358.75
  * member_id       (member_id) int64 1 2 3 4 5 6 7 ... 35 101 102 103 104 105
  * time            (time) object 1920-01-16 12:00:00 ... 2005-12-16 12:00:00
    time_bounds     (time, hist_interval) object dask.array<chunksize=(1032, 2), meta=np.ndarray>
  * levgrnd         (levgrnd) float32 0.007100635 0.027925 ... 21.32647 35.17762
Dimensions without coordinates: hist_interval
Data variables:
    FSNO            (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
    H2OSNO          (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
    QRUNOFF         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
    RAIN            (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
    SNOW            (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
    SOILLIQ         (member_id, time, levgrnd, lat, lon) float32 dask.array<chunksize=(1, 40, 15, 192, 288), meta=np.ndarray>
    SOILWATER_10CM  (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
Attributes:
    hostname:                             tcs
    nco_openmp_thread_number:             1
    username:                             mudryk
    title:                                CLM History file information
    version:                              cesm1_1_1_alpha01g
    Conventions:                          CF-1.0
    case_title:                           UNSET
    revision_id:                          $Id: histFileMod.F90 40539 2012-09-...
    source:                               Community Land Model CLM4.0
    Surface_dataset:                      surfdata_0.9x1.25_simyr1850_c110921.nc
    Initial_conditions_dataset:           b.e11.B20TRC5CNBDRD.f09_g16.001.clm...
    comment:                              NOTE: None of the variables are wei...
    PFT_physiological_constants_dataset:  pft-physiology.c110425.nc
    NCO:                                  4.3.4
    intake_esm_varname:                   FSNO\nH2OSNO\nQRUNOFF\nRAIN\nSNOW\n...
    intake_esm_dataset_key:               lnd.20C.monthly

Let’s create a quick plot for a slice of the data:

ds.SNOW.isel(time=0, member_id=range(1, 24, 4)).plot(col="member_id", col_wrap=3, robust=True)
<xarray.plot.facetgrid.FacetGrid at 0x1657d0820>
[Figure: FacetGrid of SNOW at the first time step, one panel per selected member_id]