Intake-esm

Motivation

Computer simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on HPC systems or in the cloud across multiple data assets in a variety of formats (netCDF, Zarr, etc.). Finding, investigating, and loading these data assets into compute-ready data containers costs time and effort. The data user needs to know what data sets are available and which attributes describe each data set before loading and analyzing a specific one.

Finding, investigating, and loading these assets into data array containers such as xarray can be a daunting task due to the large number of files a user may be interested in. Intake-esm aims to address these issues by providing the necessary functionality for searching, discovering, and loading data assets.

Overview

intake-esm is a data cataloging utility built on top of intake, pandas, and xarray, and it’s pretty awesome!

  • Opening an ESM collection definition file: An ESM (Earth System Model) collection file is a JSON file that conforms to the ESM Collection Specification. When provided a link/path to an ESM collection file, intake-esm establishes a link to a database (CSV file) that contains data asset locations and associated metadata (i.e., which experiment and model they come from). The collection JSON file can be stored on a local filesystem or hosted on a remote server.

    
    In [1]: import intake
    
    In [2]: col_url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
    
    In [3]: col = intake.open_esm_datastore(col_url)
    
    In [4]: col
    Out[4]: <pangeo-cmip6 catalog with 4287 dataset(s) from 282905 asset(s)>
    
  • Search and Discovery: intake-esm provides functionality to execute queries against the catalog:

    In [5]: col_subset = col.search(
       ...:     experiment_id=["historical", "ssp585"],
       ...:     table_id="Oyr",
       ...:     variable_id="o2",
       ...:     grid_label="gn",
       ...: )
    
    In [6]: col_subset
    Out[6]: <pangeo-cmip6 catalog with 18 dataset(s) from 138 asset(s)>
    
  • Access: When the user is satisfied with the results of their query, they can ask intake-esm to load the data assets (netCDF/HDF files and/or Zarr stores) into xarray datasets:

    
      In [7]: dset_dict = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
    
      --> The keys in the returned dictionary of datasets are constructed as follows:
              'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
      |███████████████████████████████████████████████████████████████| 100.00% [18/18 00:10<00:00]
    

See documentation for more information.

Installation

Intake-esm can be installed from PyPI with pip:

python -m pip install intake-esm

It is also available from conda-forge for conda installations:

conda install -c conda-forge intake-esm

Feedback

If you encounter any errors or problems with intake-esm, please open an issue at the GitHub main repository.

Documentation Contents

Installation

Intake-esm can be installed from PyPI with pip:

python -m pip install intake-esm

It is also available from conda-forge for conda installations:

conda install -c conda-forge intake-esm
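
To verify the installation, a quick check is to print the package version, which intake-esm exposes as intake_esm.__version__:

python -c "import intake_esm; print(intake_esm.__version__)"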

User Guide

The intake-esm user guide introduces the main concepts required for accessing Earth System Model (ESM) data catalogs and loading data assets into xarray containers. This guide gives an overview of the available functionality, and is split into core and tutorials & examples sections.

Overview

Intake-esm is a data cataloging utility built on top of intake, pandas, and xarray. Intake-esm aims to facilitate:

  • the discovery of earth’s climate and weather datasets.

  • the ingestion of these datasets into xarray dataset containers.

Its basic usage is shown below. To begin, let’s import intake:

import intake
Loading a catalog

At import time, the intake-esm plugin is available in intake’s registry as esm_datastore and can be accessed with the intake.open_esm_datastore() function. For demonstration purposes, we are going to use the catalog for the Community Earth System Model Large Ensemble (CESM LENS) dataset, publicly available on Amazon S3.

Note

You can learn more about the CESM LENS dataset in AWS S3 here.

You can load data from an ESM catalog by providing the URL to a valid ESM catalog file:

catalog_url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
col = intake.open_esm_datastore(catalog_url)
col

aws-cesm1-le catalog with 56 dataset(s) from 429 asset(s):

unique
component 5
frequency 6
experiment 4
variable 73
path 414
variable_long_name 70
dim_per_tstep 3
start 12
end 13

The summary above tells us that this catalog contains over 400 data assets. We can get more information on the individual data assets by inspecting the underlying dataframe, which is created when the catalog is initialized:

col.df.head()
component frequency experiment variable path variable_long_name dim_per_tstep start end
0 atm daily 20C FLNS s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.... net longwave flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
1 atm daily 20C FLNSC s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC... clearsky net longwave flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
2 atm daily 20C FLUT s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT.... upwelling longwave flux at top of model 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
3 atm daily 20C FSNS s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS.... net solar flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
4 atm daily 20C FSNSC s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC... clearsky net solar flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
Finding unique entries for individual columns

To get unique values for given columns in the catalog, intake-esm provides a unique() method. This method returns a dictionary containing the count and unique values for each requested column:

col.unique(columns=["component", "frequency", "experiment"])
{'component': {'count': 5,
  'values': ['atm', 'ice_nh', 'ice_sh', 'lnd', 'ocn']},
 'frequency': {'count': 6,
  'values': ['daily',
   'hourly6-1990-2005',
   'hourly6-2026-2035',
   'hourly6-2071-2080',
   'monthly',
   'static']},
 'experiment': {'count': 4, 'values': ['20C', 'CTRL', 'HIST', 'RCP85']}}
Loading datasets

Intake-esm implements convenience utilities for loading query results into higher-level xarray datasets. The logic for merging/concatenating query results into higher-level xarray datasets is provided in the input JSON file and is available under the .aggregation_info property:

col.aggregation_info
AggregationInfo(groupby_attrs=['component', 'experiment', 'frequency'], variable_column_name='variable', aggregations=[{'type': 'union', 'attribute_name': 'variable', 'options': {'compat': 'override'}}], agg_columns=['variable'], aggregation_dict={'variable': {'type': 'union', 'options': {'compat': 'override'}}})
col.aggregation_info.aggregations
[{'type': 'union',
  'attribute_name': 'variable',
  'options': {'compat': 'override'}}]
# Dataframe columns used to determine groups of compatible datasets.
col.aggregation_info.groupby_attrs  # or col.groupby_attrs
['component', 'experiment', 'frequency']
# List of columns used to merge/concatenate compatible multiple Dataset into a single Dataset.
col.aggregation_info.agg_columns  # or col.agg_columns
['variable']

To load data assets into xarray datasets, we use the to_dataset_dict() method. As its name hints, this method returns a dictionary of aggregated xarray datasets.
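
The example below operates on col_subset, a subset of the catalog. The query that produced this subset is not shown on this page; a minimal sketch that is consistent with the four dataset keys printed below (the exact columns and values queried are an assumption) would be:

# Hypothetical query; the original notebook's exact search call is not shown here.
col_subset = col.search(
    component=["ice_nh", "lnd"],
    experiment=["20C", "HIST"],
    frequency="monthly",
)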

dset_dicts = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.frequency'
100.00% [4/4 00:00<00:00]
[key for key in dset_dicts.keys()]
['ice_nh.HIST.monthly',
 'ice_nh.20C.monthly',
 'lnd.HIST.monthly',
 'lnd.20C.monthly']

We can access a particular dataset as follows:

ds = dset_dicts["lnd.20C.monthly"]
print(ds)
<xarray.Dataset>
Dimensions:         (hist_interval: 2, lat: 192, levgrnd: 15, lon: 288, member_id: 40, time: 1032)
Coordinates:
  * lat             (lat) float64 -90.0 -89.06 -88.12 ... 88.12 89.06 90.0
  * lon             (lon) float32 0.0 1.25 2.5 3.75 ... 356.25 357.5 358.75
  * member_id       (member_id) int64 1 2 3 4 5 6 7 ... 35 101 102 103 104 105
  * time            (time) object 1920-01-16 12:00:00 ... 2005-12-16 12:00:00
    time_bounds     (time, hist_interval) object dask.array<chunksize=(1032, 2), meta=np.ndarray>
  * levgrnd         (levgrnd) float32 0.007100635 0.027925 ... 21.32647 35.17762
Dimensions without coordinates: hist_interval
Data variables:
    FSNO            (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
    H2OSNO          (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
    QRUNOFF         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
    RAIN            (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
    SNOW            (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
    SOILLIQ         (member_id, time, levgrnd, lat, lon) float32 dask.array<chunksize=(1, 40, 15, 192, 288), meta=np.ndarray>
    SOILWATER_10CM  (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 192, 288), meta=np.ndarray>
Attributes:
    hostname:                             tcs
    nco_openmp_thread_number:             1
    username:                             mudryk
    title:                                CLM History file information
    version:                              cesm1_1_1_alpha01g
    Conventions:                          CF-1.0
    case_title:                           UNSET
    revision_id:                          $Id: histFileMod.F90 40539 2012-09-...
    source:                               Community Land Model CLM4.0
    Surface_dataset:                      surfdata_0.9x1.25_simyr1850_c110921.nc
    Initial_conditions_dataset:           b.e11.B20TRC5CNBDRD.f09_g16.001.clm...
    comment:                              NOTE: None of the variables are wei...
    PFT_physiological_constants_dataset:  pft-physiology.c110425.nc
    NCO:                                  4.3.4
    intake_esm_varname:                   FSNO\nH2OSNO\nQRUNOFF\nRAIN\nSNOW\n...
    intake_esm_dataset_key:               lnd.20C.monthly

Let’s create a quick plot for a slice of the data:

ds.SNOW.isel(time=0, member_id=range(1, 24, 4)).plot(
    col="member_id", col_wrap=3, robust=True
)
<xarray.plot.facetgrid.FacetGrid at 0x1657d0820>
[Figure: faceted plot of SNOW at the first time step for the selected ensemble members]

Search and Discovery

Intake-esm provides functionality to execute queries against the catalog. This notebook provides a more in-depth treatment of the search API in intake-esm, with detailed information that you can refer to when needed.

import warnings

warnings.filterwarnings("ignore")
import intake
catalog_url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
col = intake.open_esm_datastore(catalog_url)
col

aws-cesm1-le catalog with 56 dataset(s) from 429 asset(s):

unique
component 5
frequency 6
experiment 4
variable 73
path 414
variable_long_name 70
dim_per_tstep 3
start 12
end 13
col.df.head()
component frequency experiment variable path variable_long_name dim_per_tstep start end
0 atm daily 20C FLNS s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.... net longwave flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
1 atm daily 20C FLNSC s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC... clearsky net longwave flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
2 atm daily 20C FLUT s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT.... upwelling longwave flux at top of model 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
3 atm daily 20C FSNS s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS.... net solar flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
4 atm daily 20C FSNSC s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC... clearsky net solar flux at surface 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
Exact Match Keywords

The search() method allows the user to perform a query on a catalog using keyword arguments. The keyword argument names must be the names of the columns in the catalog. By default, the search() method looks for exact matches, and is case sensitive:

col.search(experiment="20C", variable_long_name="wind")

aws-cesm1-le catalog with 0 dataset(s) from 0 asset(s):

unique
component 0
frequency 0
experiment 0
variable 0
path 0
variable_long_name 0
dim_per_tstep 0
start 0
end 0

As you can see, the example above returns an empty catalog.

Substring Matches

In some cases, you may not know the exact term to look for. For such cases, intake-esm supports searching for substring matches. With the use of wildcards and/or regular expressions, we can find all items with a particular substring in a given column. Let’s search for:

  • entries from experiment = ‘20C’

  • all entries whose variable long name contains wind

col.search(experiment="20C", variable_long_name="wind*").df
component frequency experiment variable path variable_long_name dim_per_tstep start end
0 atm daily 20C UBOT s3://ncar-cesm-lens/atm/daily/cesmLE-20C-UBOT.... lowest model level zonal wind 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
1 atm daily 20C WSPDSRFAV s3://ncar-cesm-lens/atm/daily/cesmLE-20C-WSPDS... horizontal total wind speed average at the sur... 2.0 1920-01-01 12:00:00 2005-12-31 12:00:00
2 atm hourly6-1990-2005 20C U s3://ncar-cesm-lens/atm/hourly6-1990-2005/cesm... zonal wind 3.0 1990-01-01 00:00:00 2006-01-01 00:00:00
3 atm monthly 20C U s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-U.zarr zonal wind 3.0 1920-01-16 12:00:00 2005-12-16 12:00:00
4 ocn monthly 20C TAUX s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress in grid-x direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
5 ocn monthly 20C TAUX2 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress**2 in grid-x direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
6 ocn monthly 20C TAUY s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress in grid-y direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
7 ocn monthly 20C TAUY2 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress**2 in grid-y direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00

Now, let’s search for:

  • entries from experiment = ‘20C’

  • all entries whose variable long name starts with wind

col.search(experiment="20C", variable_long_name="^wind").df
component frequency experiment variable path variable_long_name dim_per_tstep start end
0 ocn monthly 20C TAUX s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress in grid-x direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
1 ocn monthly 20C TAUX2 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress**2 in grid-x direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
2 ocn monthly 20C TAUY s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress in grid-y direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
3 ocn monthly 20C TAUY2 s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-TAU... windstress**2 in grid-y direction 2.0 1920-01-16 12:00:00 2005-12-16 12:00:00
Enforce Query Criteria via the require_all_on argument

By default, intake-esm’s search() method returns entries that fulfill any of the criteria specified in the query. Intake-esm can return entries that fulfill all query criteria when the user supplies the require_all_on argument. This argument can be a dataframe column or a list of dataframe columns across which all elements must satisfy the query criteria. The require_all_on argument is best explained with the following example.

Let’s define a query for our collection that requests multiple variable_ids and multiple experiment_ids from the Omon table_id, all from 3 different source_ids:

catalog_url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(catalog_url)
col

pangeo-cmip6 catalog with 6539 dataset(s) from 402033 asset(s):

unique
activity_id 17
institution_id 35
source_id 84
experiment_id 160
member_id 549
table_id 37
variable_id 707
grid_label 10
zstore 402033
dcpp_init_year 60
version 606
# Define our query
query = dict(
    variable_id=["thetao", "o2"],
    experiment_id=["historical", "ssp245", "ssp585"],
    table_id=["Omon"],
    source_id=["ACCESS-ESM1-5", "AWI-CM-1-1-MR", "FGOALS-f3-L"],
)

Now, let’s use this query to search for all assets in the collection that satisfy any combination of these requests (i.e., with require_all_on=None, which is the default):

col_subset = col.search(**query)
col_subset

pangeo-cmip6 catalog with 8 dataset(s) from 76 asset(s):

unique
activity_id 2
institution_id 3
source_id 3
experiment_id 3
member_id 20
table_id 1
variable_id 2
grid_label 1
zstore 76
dcpp_init_year 0
version 14
# Group by `source_id` and count unique values for a few columns
col_subset.df.groupby("source_id")[
    ["experiment_id", "variable_id", "table_id"]
].nunique()
experiment_id variable_id table_id
source_id
ACCESS-ESM1-5 3 2 1
AWI-CM-1-1-MR 3 1 1
FGOALS-f3-L 2 1 1

As you can see, the search results above include source_ids for which we only have one of the two variables, and one or two of the three experiments.

We can tell intake-esm to discard any source_id that doesn’t have both variables ["thetao", "o2"] and all three experiments ["historical", "ssp245", "ssp585"] by passing require_all_on=["source_id"] to the search method:

col_subset = col.search(require_all_on=["source_id"], **query)
col_subset

pangeo-cmip6 catalog with 3 dataset(s) from 63 asset(s):

unique
activity_id 2
institution_id 1
source_id 1
experiment_id 3
member_id 20
table_id 1
variable_id 2
grid_label 1
zstore 63
dcpp_init_year 0
version 9
col_subset.df.groupby("source_id")[
    ["experiment_id", "variable_id", "table_id"]
].nunique()
experiment_id variable_id table_id
source_id
ACCESS-ESM1-5 3 2 1

Notice that with the require_all_on=["source_id"] option, the only source_id that was returned by our query was the source_id for which all of the variables and experiments were found.

Working with multi-variable assets

In addition to catalogs of data assets (files) in time-series (single-variable) format, intake-esm supports catalogs with data assets in time-slice (history) format and/or files containing multiple variables. For intake-esm to work properly with multi-variable assets:

  • the variable_column of the catalog must contain iterables (list, tuple, set) of values.

  • the user must specify a dictionary of functions for converting values in certain columns into iterables. This is done via the csv_kwargs argument.

In the example below, we are going to use the following catalog to demonstrate how to work with multi-variable assets:

# Look at the catalog on disk
!cat multi-variable-catalog.csv
experiment,case,component,stream,variable,member_id,path,time_range
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-O2.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050001-050012.nc,050001-050012
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'PO4']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-NO2-PO4.050101-050112.nc,050101-050112
CTRL,b.e11.B1850C5CN.f09_g16.005,ocn,pop.h,"['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'TEMP', 'SiO3']",5,../../../tests/sample_data/cesm-multi-variables/b.e11.B1850C5CN.f09_g16.005.pop.h.SHF-TEMP-SiO3.050001-050012.nc,050001-050012

As you can see, the variable column contains a list of variables, and this list was serialized as a string: "['SHF', 'REGION_MASK', 'ANGLE', 'DXU', 'KMT', 'NO2', 'O2']".

Loading a catalog

To load a catalog whose assets contain multiple variables, we must pass additional information to open_esm_datastore via the csv_kwargs argument. We are going to specify a dictionary of functions for converting values in the variable column into iterables. We use the literal_eval function from the standard ast module:

import intake
import ast
col = intake.open_esm_datastore(
    "multi-variable-collection.json",
    csv_kwargs={"converters": {"variable": ast.literal_eval}},
)
col

sample-multi-variable-cesm1-lens catalog with 1 dataset(s) from 5 asset(s):

unique
experiment 1
case 1
component 1
stream 1
variable 10
member_id 1
path 5
time_range 2
col.df.head()
experiment case component stream variable member_id path time_range
0 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012
1 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) 5 ../../../tests/sample_data/cesm-multi-variable... 050101-050112
2 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, PO4) 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012
3 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, PO4) 5 ../../../tests/sample_data/cesm-multi-variable... 050101-050112
4 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, TEMP, SiO3) 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012

The in-memory representation of the catalog now stores the variable column as tuples of values. To confirm that intake-esm has registered this catalog as one with multi-variable assets, we can check the ._multiple_variable_assets property:

col._multiple_variable_assets
True
Searching

The search functionality works in the same way:

col_subset = col.search(variable=["O2", "SiO3"])
col_subset.df
experiment case component stream variable member_id path time_range
0 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012
1 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, NO2, O2) 5 ../../../tests/sample_data/cesm-multi-variable... 050101-050112
2 CTRL b.e11.B1850C5CN.f09_g16.005 ocn pop.h (SHF, REGION_MASK, ANGLE, DXU, KMT, TEMP, SiO3) 5 ../../../tests/sample_data/cesm-multi-variable... 050001-050012
Loading assets into xarray datasets

Loading data assets into xarray datasets works in the same way too:

col_subset.to_dataset_dict(cdf_kwargs={})
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.stream'
100.00% [1/1 00:00<00:00]
{'ocn.CTRL.pop.h': <xarray.Dataset>
 Dimensions:    (member_id: 1, nlat: 2, nlon: 2, time: 24)
 Coordinates:
   * time       (time) object 0500-02-01 00:00:00 ... 0502-02-01 00:00:00
     TLAT       (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
     TLONG      (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
     ULAT       (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
     ULONG      (nlat, nlon) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
   * member_id  (member_id) int64 5
 Dimensions without coordinates: nlat, nlon
 Data variables:
     O2         (member_id, time, nlat, nlon) float32 dask.array<chunksize=(1, 12, 2, 2), meta=np.ndarray>
     SiO3       (member_id, time, nlat, nlon) float32 dask.array<chunksize=(1, 24, 2, 2), meta=np.ndarray>
 Attributes:
     calendar:                  All years have exactly  365 days.
     Conventions:               CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netc...
     tavg_sum:                  2678400.0
     nco_openmp_thread_number:  1
     contents:                  Diagnostic and Prognostic Variables
     cell_methods:              cell_methods = time: mean ==> the variable val...
     NCO:                       4.3.4
     start_time:                This dataset was created on 2013-05-28 at 02:4...
     tavg_sum_qflux:            2678400.0
     revision:                  $Id: tavg.F90 41939 2012-11-14 16:37:23Z mlevy...
     intake_esm_varname:        O2\nSiO3
     source:                    CCSM POP2, the CCSM Ocean Component
     history:                   Fri Oct 11 01:05:51 2013: /glade/apps/opt/nco/...
     title:                     b.e11.B1850C5CN.f09_g16.005
     nsteps_total:              1953500
     intake_esm_dataset_key:    ocn.CTRL.pop.h}

Load CMIP6 Data with Intake ESM

This notebook demonstrates how to access Google Cloud CMIP6 data using intake-esm.

Loading a catalog
import warnings

warnings.filterwarnings("ignore")
import intake
url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
col

pangeo-cmip6 catalog with 6539 dataset(s) from 402033 asset(s):

unique
activity_id 17
institution_id 35
source_id 84
experiment_id 160
member_id 549
table_id 37
variable_id 707
grid_label 10
zstore 402033
dcpp_init_year 60
version 606

The summary above tells us that this catalog contains over 400,000 data assets. We can get more information on the individual data assets by inspecting the underlying dataframe, which is created when the catalog is initialized:

Catalog Contents
col.df.head()
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 AerChemMIP AS-RCEC TaiESM1 histSST r1i1p1f1 AERmon od550aer gn gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/... NaN 20200310
1 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrbc gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20190718
2 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrdust gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20191127
3 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmroa gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20190809
4 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrso4 gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20191127

The first data asset listed in the catalog contains:

  • the ambient aerosol optical thickness at 550nm (variable_id='od550aer'), as a function of latitude, longitude, time,

  • in an individual climate model experiment with the Taiwan Earth System Model 1.0 model (source_id='TaiESM1'),

  • forced by the Historical transient with SSTs prescribed from historical experiment (experiment_id='histSST'),

  • developed by the Taiwan Research Center for Environmental Changes (institution_id='AS-RCEC'),

  • run as part of the Aerosols and Chemistry Model Intercomparison Project (activity_id='AerChemMIP')

And is located in Google Cloud Storage at gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/od550aer/gn/.

Finding unique entries

Let’s query the data to see what models (source_id), experiments (experiment_id) and temporal frequencies (table_id) are available.

import pprint

uni_dict = col.unique(["source_id", "experiment_id", "table_id"])
pprint.pprint(uni_dict, compact=True)
{'experiment_id': {'count': 160,
                   'values': ['1pctCO2', '1pctCO2-bgc', '1pctCO2-cdr',
                              '1pctCO2-rad', 'abrupt-0p5xCO2', 'abrupt-2xCO2',
                              'abrupt-4xCO2', 'abrupt-solm4p', 'abrupt-solp4p',
                              'amip', 'amip-4xCO2', 'amip-future4K',
                              'amip-hist', 'amip-lwoff', 'amip-m4K', 'amip-p4K',
                              'amip-p4K-lwoff', 'aqua-4xCO2', 'aqua-control',
                              'aqua-control-lwoff', 'aqua-p4K',
                              'aqua-p4K-lwoff', 'control-1950', 'dcppA-assim',
                              'dcppA-hindcast', 'dcppC-amv-ExTrop-neg',
                              'dcppC-amv-ExTrop-pos', 'dcppC-amv-Trop-neg',
                              'dcppC-amv-Trop-pos', 'dcppC-amv-neg',
                              'dcppC-amv-pos', 'dcppC-atl-control',
                              'dcppC-atl-pacemaker', 'dcppC-hindcast-noAgung',
                              'dcppC-hindcast-noElChichon',
                              'dcppC-hindcast-noPinatubo',
                              'dcppC-ipv-NexTrop-neg', 'dcppC-ipv-NexTrop-pos',
                              'dcppC-ipv-neg', 'dcppC-ipv-pos',
                              'dcppC-pac-control', 'dcppC-pac-pacemaker',
                              'deforest-globe', 'esm-hist', 'esm-pi-CO2pulse',
                              'esm-pi-cdr-pulse', 'esm-piControl',
                              'esm-piControl-spinup', 'esm-ssp585',
                              'esm-ssp585-ssp126Lu', 'faf-all', 'faf-heat',
                              'faf-heat-NA0pct', 'faf-heat-NA50pct',
                              'faf-passiveheat', 'faf-stress', 'faf-water',
                              'futSST-pdSIC', 'highresSST-future',
                              'highresSST-present', 'hist-1950', 'hist-1950HC',
                              'hist-CO2', 'hist-GHG', 'hist-GHG-cmip5',
                              'hist-aer', 'hist-aer-cmip5', 'hist-bgc',
                              'hist-nat', 'hist-nat-cmip5', 'hist-noLu',
                              'hist-piAer', 'hist-piNTCF', 'hist-resIPO',
                              'hist-sol', 'hist-stratO3', 'hist-totalO3',
                              'hist-volc', 'histSST', 'histSST-1950HC',
                              'histSST-piAer', 'histSST-piCH4',
                              'histSST-piNTCF', 'histSST-piO3', 'historical',
                              'historical-cmip5', 'historical-ext', 'land-hist',
                              'land-hist-altStartYear', 'land-noLu', 'lgm',
                              'lig127k', 'midHolocene', 'omip1', 'pa-futArcSIC',
                              'pa-pdSIC', 'past1000', 'pdSST-futAntSIC',
                              'pdSST-futArcSIC', 'pdSST-pdSIC',
                              'pdSST-piAntSIC', 'pdSST-piArcSIC',
                              'piClim-2xDMS', 'piClim-2xNOx', 'piClim-2xVOC',
                              'piClim-2xdust', 'piClim-2xfire', 'piClim-2xss',
                              'piClim-4xCO2', 'piClim-BC', 'piClim-CH4',
                              'piClim-HC', 'piClim-N2O', 'piClim-NOx',
                              'piClim-NTCF', 'piClim-O3', 'piClim-OC',
                              'piClim-SO2', 'piClim-VOC', 'piClim-aer',
                              'piClim-anthro', 'piClim-control', 'piClim-ghg',
                              'piClim-histaer', 'piClim-histall',
                              'piClim-histghg', 'piClim-histnat', 'piClim-lu',
                              'piControl', 'piControl-cmip5',
                              'piControl-spinup', 'piSST-pdSIC', 'piSST-piSIC',
                              'rcp26-cmip5', 'rcp45-cmip5', 'rcp85-cmip5',
                              'ssp119', 'ssp126', 'ssp126-ssp370Lu', 'ssp245',
                              'ssp245-GHG', 'ssp245-aer', 'ssp245-cov-fossil',
                              'ssp245-cov-modgreen', 'ssp245-cov-strgreen',
                              'ssp245-covid', 'ssp245-nat', 'ssp245-stratO3',
                              'ssp370', 'ssp370-lowNTCF', 'ssp370-ssp126Lu',
                              'ssp370SST', 'ssp370SST-lowCH4',
                              'ssp370SST-lowNTCF', 'ssp370SST-ssp126Lu',
                              'ssp370pdSST', 'ssp434', 'ssp460', 'ssp534-over',
                              'ssp585']},
 'source_id': {'count': 84,
               'values': ['ACCESS-CM2', 'ACCESS-ESM1-5', 'AWI-CM-1-1-MR',
                          'AWI-ESM-1-1-LR', 'BCC-CSM2-HR', 'BCC-CSM2-MR',
                          'BCC-ESM1', 'CAMS-CSM1-0', 'CAS-ESM2-0',
                          'CESM1-1-CAM5-CMIP5', 'CESM2', 'CESM2-FV2',
                          'CESM2-WACCM', 'CESM2-WACCM-FV2', 'CIESM',
                          'CMCC-CM2-HR4', 'CMCC-CM2-SR5', 'CMCC-CM2-VHR4',
                          'CMCC-ESM2', 'CNRM-CM6-1', 'CNRM-CM6-1-HR',
                          'CNRM-ESM2-1', 'CanESM5', 'CanESM5-CanOE', 'E3SM-1-0',
                          'E3SM-1-1', 'E3SM-1-1-ECA', 'EC-Earth3',
                          'EC-Earth3-AerChem', 'EC-Earth3-CC', 'EC-Earth3-LR',
                          'EC-Earth3-Veg', 'EC-Earth3-Veg-LR', 'EC-Earth3P',
                          'EC-Earth3P-HR', 'EC-Earth3P-VHR', 'ECMWF-IFS-HR',
                          'ECMWF-IFS-LR', 'FGOALS-f3-H', 'FGOALS-f3-L',
                          'FGOALS-g3', 'FIO-ESM-2-0', 'GFDL-AM4', 'GFDL-CM4',
                          'GFDL-CM4C192', 'GFDL-ESM2M', 'GFDL-ESM4',
                          'GFDL-OM4p5B', 'GISS-E2-1-G', 'GISS-E2-1-G-CC',
                          'GISS-E2-1-H', 'GISS-E2-2-G', 'HadGEM3-GC31-HM',
                          'HadGEM3-GC31-LL', 'HadGEM3-GC31-LM',
                          'HadGEM3-GC31-MM', 'IITM-ESM', 'INM-CM4-8',
                          'INM-CM5-0', 'INM-CM5-H', 'IPSL-CM5A2-INCA',
                          'IPSL-CM6A-ATM-HR', 'IPSL-CM6A-LR',
                          'IPSL-CM6A-LR-INCA', 'KACE-1-0-G', 'KIOST-ESM',
                          'MCM-UA-1-0', 'MIROC-ES2L', 'MIROC6',
                          'MPI-ESM-1-2-HAM', 'MPI-ESM1-2-HR', 'MPI-ESM1-2-LR',
                          'MPI-ESM1-2-XR', 'MRI-AGCM3-2-H', 'MRI-AGCM3-2-S',
                          'MRI-ESM2-0', 'NESM3', 'NorCPM1', 'NorESM1-F',
                          'NorESM2-LM', 'NorESM2-MM', 'SAM0-UNICON', 'TaiESM1',
                          'UKESM1-0-LL']},
 'table_id': {'count': 37,
              'values': ['3hr', '6hrLev', '6hrPlev', '6hrPlevPt', 'AERday',
                         'AERhr', 'AERmon', 'AERmonZ', 'Aclim', 'Amon', 'CF3hr',
                         'CFday', 'CFmon', 'E1hrClimMon', 'E3hr', 'Eclim',
                         'Eday', 'EdayZ', 'Efx', 'Emon', 'EmonZ', 'Eyr',
                         'IfxGre', 'ImonGre', 'LImon', 'Lmon', 'Oclim', 'Oday',
                         'Odec', 'Ofx', 'Omon', 'Oyr', 'SIclim', 'SIday',
                         'SImon', 'day', 'fx']}}
Searching for specific datasets

In the example below, we are going to search for the following:

  • variables: o2 which stands for mole_concentration_of_dissolved_molecular_oxygen_in_sea_water

  • experiments: ['historical', 'ssp585']:

    • historical: all forcing of the recent past.

    • ssp585: emission-driven RCP8.5 based on SSP5.

  • table_id: Oyr which stands for annual mean variables on the ocean grid.

  • grid_label: gn which stands for data reported on a model’s native grid.

For more details on the CMIP6 vocabulary, please check this website, and Core Controlled Vocabularies (CVs) for use in CMIP6 GitHub repository.

cat = col.search(
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)

cat

pangeo-cmip6 catalog with 23 dataset(s) from 149 asset(s):

unique
activity_id 2
institution_id 11
source_id 12
experiment_id 2
member_id 41
table_id 1
variable_id 1
grid_label 1
zstore 149
dcpp_init_year 0
version 20
cat.df.head()
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 CMIP CCCma CanESM5-CanOE historical r1i1p2f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5-CanOE/historical... NaN 20190429
1 CMIP CCCma CanESM5-CanOE historical r2i1p2f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5-CanOE/historical... NaN 20190429
2 CMIP CCCma CanESM5-CanOE historical r3i1p2f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5-CanOE/historical... NaN 20190429
3 CMIP CCCma CanESM5 historical r10i1p1f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5/historical/r10i1... NaN 20190429
4 CMIP CCCma CanESM5 historical r10i1p2f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5/historical/r10i1... NaN 20190429
Loading datasets using to_dataset_dict()
dset_dict = cat.to_dataset_dict(
    zarr_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
100.00% [23/23 00:06<00:00]
[key for key in dset_dict.keys()]
['ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp585.Oyr.gn',
 'ScenarioMIP.MRI.MRI-ESM2-0.ssp585.Oyr.gn',
 'ScenarioMIP.MPI-M.MPI-ESM1-2-LR.ssp585.Oyr.gn',
 'ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Oyr.gn',
 'ScenarioMIP.NCC.NorESM2-MM.ssp585.Oyr.gn',
 'CMIP.HAMMOZ-Consortium.MPI-ESM-1-2-HAM.historical.Oyr.gn',
 'CMIP.NCC.NorESM2-MM.historical.Oyr.gn',
 'CMIP.CCCma.CanESM5-CanOE.historical.Oyr.gn',
 'ScenarioMIP.CCCma.CanESM5.ssp585.Oyr.gn',
 'ScenarioMIP.CSIRO.ACCESS-ESM1-5.ssp585.Oyr.gn',
 'ScenarioMIP.CCCma.CanESM5-CanOE.ssp585.Oyr.gn',
 'ScenarioMIP.NCAR.CESM2.ssp585.Oyr.gn',
 'ScenarioMIP.DWD.MPI-ESM1-2-HR.ssp585.Oyr.gn',
 'CMIP.MIROC.MIROC-ES2L.historical.Oyr.gn',
 'CMIP.MPI-M.MPI-ESM1-2-HR.historical.Oyr.gn',
 'CMIP.CSIRO.ACCESS-ESM1-5.historical.Oyr.gn',
 'ScenarioMIP.MIROC.MIROC-ES2L.ssp585.Oyr.gn',
 'CMIP.MPI-M.MPI-ESM1-2-LR.historical.Oyr.gn',
 'ScenarioMIP.NCC.NorESM2-LM.ssp585.Oyr.gn',
 'CMIP.NCC.NorESM2-LM.historical.Oyr.gn',
 'CMIP.CCCma.CanESM5.historical.Oyr.gn',
 'CMIP.MRI.MRI-ESM2-0.historical.Oyr.gn',
 'CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn']

We can access a particular dataset as follows:

ds = dset_dict["CMIP.CCCma.CanESM5.historical.Oyr.gn"]
print(ds)
<xarray.Dataset>
Dimensions:    (i: 360, j: 291, lev: 45, member_id: 35, time: 165)
Coordinates:
  * i          (i) int32 0 1 2 3 4 5 6 7 8 ... 352 353 354 355 356 357 358 359
  * j          (j) int32 0 1 2 3 4 5 6 7 8 ... 283 284 285 286 287 288 289 290
    latitude   (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * lev        (lev) float64 3.047 9.454 16.36 ... 5.126e+03 5.375e+03 5.625e+03
    longitude  (j, i) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * time       (time) object 1850-07-02 12:00:00 ... 2014-07-02 12:00:00
  * member_id  (member_id) <U9 'r10i1p1f1' 'r10i1p2f1' ... 'r9i1p1f1' 'r9i1p2f1'
Data variables:
    o2         (member_id, time, lev, j, i) float32 dask.array<chunksize=(1, 12, 45, 291, 360), meta=np.ndarray>
Attributes:
    branch_time_in_child:        0.0
    realm:                       ocnBgchem
    parent_time_units:           days since 1850-01-01 0:0:0.0
    CCCma_runid:                 p2-his09
    title:                       CanESM5 output prepared for CMIP6
    Conventions:                 CF-1.7 CMIP-6.2
    version:                     v20190429
    CCCma_parent_runid:          p2-pictrl
    variant_label:               r9i1p2f1
    realization_index:           9
    status:                      2019-10-25;created;by nhn2@columbia.edu
    parent_experiment_id:        piControl
    institution_id:              CCCma
    branch_method:               Spin-up documentation
    experiment:                  all-forcing simulation of the recent past
    forcing_index:               1
    initialization_index:        1
    product:                     model-output
    YMDH_branch_time_in_child:   1850:01:01:00
    frequency:                   yr
    activity_id:                 CMIP
    references:                  Geophysical Model Development Special issue ...
    contact:                     ec.cccma.info-info.ccmac.ec@canada.ca
    source_id:                   CanESM5
    data_specs_version:          01.00.29
    cmor_version:                3.4.0
    external_variables:          areacello volcello
    CCCma_model_hash:            Unknown
    YMDH_branch_time_in_parent:  5950:01:01:00
    mip_era:                     CMIP6
    intake_esm_varname:          ['o2']
    variable_id:                 o2
    grid_label:                  gn
    license:                     CMIP6 model data produced by The Government ...
    table_id:                    Oyr
    nominal_resolution:          100 km
    grid:                        ORCA1 tripolar grid, 1 deg with refinement t...
    source_type:                 AOGCM
    parent_mip_era:              CMIP6
    sub_experiment:              none
    parent_activity_id:          CMIP
    experiment_id:               historical
    institution:                 Canadian Centre for Climate Modelling and An...
    table_info:                  Creation Date:(20 February 2019) MD5:374fbe5...
    further_info_url:            https://furtherinfo.es-doc.org/CMIP6.CCCma.C...
    creation_date:               2019-05-30T08:58:45Z
    parent_source_id:            CanESM5
    branch_time_in_parent:       1496500.0
    history:                     2019-05-02T13:53:53Z ;rewrote data to be con...
    source:                      CanESM5 (2019): \naerosol: interactive\natmo...
    tracking_id:                 hdl:21.14100/41426118-701c-482b-ae16-82932e4...
    sub_experiment_id:           none
    intake_esm_dataset_key:      CMIP.CCCma.CanESM5.historical.Oyr.gn

Let’s create a quick plot for a slice of the data:

ds.o2.isel(time=0, lev=0, member_id=range(1, 24, 4)).plot(
    col="member_id", col_wrap=3, robust=True
)
<xarray.plot.facetgrid.FacetGrid at 0x7fd599a2e0d0>
[Figure: faceted plot of surface-level o2 at the first time step for the selected ensemble members]
Using custom preprocessing functions

When comparing many models it is often necessary to preprocess them (e.g., rename certain variables) before running some analysis step. The preprocess argument lets the user pass a function, which is executed for each loaded asset before aggregations.

cat_pp = col.search(
    experiment_id=["historical"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
    source_id=["IPSL-CM6A-LR", "CanESM5"],
    member_id="r10i1p1f1",
)
cat_pp.df
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 CMIP CCCma CanESM5 historical r10i1p1f1 Oyr o2 gn gs://cmip6/CMIP/CCCma/CanESM5/historical/r10i1... NaN 20190429
1 CMIP IPSL IPSL-CM6A-LR historical r10i1p1f1 Oyr o2 gn gs://cmip6/CMIP/IPSL/IPSL-CM6A-LR/historical/r... NaN 20180803
# load the example
dset_dict_raw = cat_pp.to_dataset_dict(zarr_kwargs={"consolidated": True})
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
100.00% [2/2 00:00<00:00]
for k, ds in dset_dict_raw.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")
dataset key=CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn
	dimensions=['member_id', 'olevel', 'time', 'x', 'y']

dataset key=CMIP.CCCma.CanESM5.historical.Oyr.gn
	dimensions=['i', 'j', 'lev', 'member_id', 'time']

Note

Note that the two models follow different naming schemes. We can define a little helper function and pass it to .to_dataset_dict() to fix this. For demonstration purposes we will focus on the vertical level dimension, which is called lev in CanESM5 and olevel in IPSL-CM6A-LR.

def helper_func(ds):
    """Rename `olevel` dim to `lev`"""
    ds = ds.copy()
    # a short example
    if "olevel" in ds.dims:
        ds = ds.rename({"olevel": "lev"})
    return ds
dset_dict_fixed = cat_pp.to_dataset_dict(
    zarr_kwargs={"consolidated": True}, preprocess=helper_func
)
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
100.00% [2/2 00:00<00:00]
for k, ds in dset_dict_fixed.items():
    print(f"dataset key={k}\n\tdimensions={sorted(list(ds.dims))}\n")
dataset key=CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn
	dimensions=['lev', 'member_id', 'time', 'x', 'y']

dataset key=CMIP.CCCma.CanESM5.historical.Oyr.gn
	dimensions=['i', 'j', 'lev', 'member_id', 'time']

This was just an example for one dimension.

Note

Check out cmip6-preprocessing package for a full renaming function for all available CMIP6 models and some other utilities.

Manipulating the DataFrame (in-memory catalog)

import warnings

warnings.filterwarnings("ignore")
import intake

The in-memory representation of an Earth System Model (ESM) catalog is a pandas dataframe, and is accessible via the .df property:

url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
col.df.head()
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 AerChemMIP AS-RCEC TaiESM1 histSST r1i1p1f1 AERmon od550aer gn gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/... NaN 20200310
1 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrbc gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20190718
2 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrdust gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20191127
3 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmroa gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20190809
4 AerChemMIP BCC BCC-ESM1 histSST r1i1p1f1 AERmon mmrso4 gn gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... NaN 20191127

In this notebook we will go through some examples showing how to manipulate this dataframe outside of intake-esm.

Use Case 1: Complex Search Queries

Let’s say we are interested in datasets with the following attributes:

  • experiment_id=["historical"]

  • table_id="Amon"

  • variable_id="tas"

  • source_id=['TaiESM1', 'AWI-CM-1-1-MR', 'AWI-ESM-1-1-LR', 'BCC-CSM2-MR', 'BCC-ESM1', 'CAMS-CSM1-0', 'CAS-ESM2-0', 'UKESM1-0-LL']

In addition to these attributes, we are interested in the first ensemble member (member_id) of each model (source_id) only.

This can be achieved in two steps:

Step 1: Run a query against the catalog

We can run a query against the catalog:

col_subset = col.search(
    experiment_id=["historical"],
    table_id="Amon",
    variable_id="tas",
    source_id=[
        "TaiESM1",
        "AWI-CM-1-1-MR",
        "AWI-ESM-1-1-LR",
        "BCC-CSM2-MR",
        "BCC-ESM1",
        "CAMS-CSM1-0",
        "CAS-ESM2-0",
        "UKESM1-0-LL",
    ],
)
col_subset

pangeo-cmip6 catalog with 9 dataset(s) from 38 asset(s):

unique
activity_id 1
institution_id 7
source_id 8
experiment_id 1
member_id 23
table_id 1
variable_id 1
grid_label 1
zstore 38
dcpp_init_year 0
version 24
Step 2: Select the first member_id for each source_id

The subsetted catalog contains the following number of member_id values per source_id:

col_subset.df.groupby("source_id")["member_id"].nunique()
source_id
AWI-CM-1-1-MR      5
AWI-ESM-1-1-LR     1
BCC-CSM2-MR        3
BCC-ESM1           3
CAMS-CSM1-0        3
CAS-ESM2-0         4
TaiESM1            1
UKESM1-0-LL       18
Name: member_id, dtype: int64

To get the first member_id for each source_id, we group the dataframe by source_id and use the .first() function to retrieve the first member_id:

grouped = col_subset.df.groupby(["source_id"])
df = grouped.first().reset_index()

# Confirm that we have one ensemble member per source_id

df.groupby("source_id")["member_id"].nunique()
source_id
AWI-CM-1-1-MR     1
AWI-ESM-1-1-LR    1
BCC-CSM2-MR       1
BCC-ESM1          1
CAMS-CSM1-0       1
CAS-ESM2-0        1
TaiESM1           1
UKESM1-0-LL       1
Name: member_id, dtype: int64
Step 3: Attach the new dataframe to our catalog object
col_subset.df = df
col_subset

pangeo-cmip6 catalog with 8 dataset(s) from 8 asset(s):

unique
source_id 8
activity_id 1
institution_id 6
experiment_id 1
member_id 2
table_id 1
variable_id 1
grid_label 1
zstore 8
dcpp_init_year 0
version 8
dsets = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
[key for key in dsets]
--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
100.00% [8/8 00:00<00:00]
['CMIP.BCC.BCC-ESM1.historical.Amon.gn',
 'CMIP.AWI.AWI-CM-1-1-MR.historical.Amon.gn',
 'CMIP.CAMS.CAMS-CSM1-0.historical.Amon.gn',
 'CMIP.AWI.AWI-ESM-1-1-LR.historical.Amon.gn',
 'CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn',
 'CMIP.MOHC.UKESM1-0-LL.historical.Amon.gn',
 'CMIP.CAS.CAS-ESM2-0.historical.Amon.gn',
 'CMIP.AS-RCEC.TaiESM1.historical.Amon.gn']
print(dsets["CMIP.CAS.CAS-ESM2-0.historical.Amon.gn"])
<xarray.Dataset>
Dimensions:    (lat: 128, lon: 256, member_id: 1, time: 1980)
Coordinates:
    height     float64 ...
  * lat        (lat) float64 -90.0 -88.58 -87.17 -85.75 ... 87.17 88.58 90.0
  * lon        (lon) float64 0.0 1.406 2.812 4.219 ... 354.4 355.8 357.2 358.6
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * member_id  (member_id) <U8 'r1i1p1f1'
Data variables:
    tas        (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 381, 128, 256), meta=np.ndarray>
Attributes:
    Conventions:             CF-1.7 CMIP-6.2
    activity_id:             CMIP
    branch_method:           standard
    branch_time_in_child:    0.0
    branch_time_in_parent:   0.0
    cmor_version:            3.5.0
    contact:                 Zhang He (zhanghe@mail.iap.ac.cn)
    creation_date:           2020-03-02T12:28:26Z
    data_specs_version:      01.00.31
    experiment:              all-forcing simulation of the recent past
    experiment_id:           historical
    external_variables:      areacella
    forcing_index:           1
    frequency:               mon
    further_info_url:        https://furtherinfo.es-doc.org/CMIP6.CAS.CAS-ESM...
    grid:                    native atmosphere regular grid (128x256 latxlon)
    grid_label:              gn
    history:                 2020-03-02T12:28:26Z ;rewrote data to be consist...
    initialization_index:    1
    institution:             Chinese Academy of Sciences, Beijing 100029, China
    institution_id:          CAS
    license:                 CMIP6 model data produced by Institute of Atmosp...
    mip_era:                 CMIP6
    nominal_resolution:      100 km
    parent_activity_id:      CMIP
    parent_experiment_id:    piControl
    parent_mip_era:          CMIP6
    parent_source_id:        CAS-ESM2-0
    parent_time_units:       days since 1850-01-01
    parent_variant_label:    r1i1p1f1
    physics_index:           1
    product:                 model-output
    realization_index:       1
    realm:                   atmos
    run_variant:             3rd realization
    source:                  CAS-ESM 2.0 (2019): \naerosol: IAP AACM\natmos: ...
    source_id:               CAS-ESM2-0
    source_type:             AOGCM
    status:                  2020-05-02;created; by gcs.cmip6.ldeo@gmail.com
    sub_experiment:          none
    sub_experiment_id:       none
    table_id:                Amon
    table_info:              Creation Date:(24 July 2019) MD5:b9834a2d0728c0d...
    title:                   CAS-ESM2-0 output prepared for CMIP6
    tracking_id:             hdl:21.14100/22e89a1b-f73e-45be-84dc-7d0aabbeea9d
    variable_id:             tas
    variant_label:           r1i1p1f1
    intake_esm_varname:      ['tas']
    intake_esm_dataset_key:  CMIP.CAS.CAS-ESM2-0.historical.Amon.gn

Supplemental Guide

Frequently Asked Questions

How do I create my own catalog?

Intake-esm catalogs include two pieces:

  1. An ESM-Collection file: an ESM-Collection file is a simple JSON file that provides metadata about the catalog. The specification for this JSON file is found in the esm-collection-spec repository. A minimal sketch of such a file is shown after the catalog-file example below.

  2. A catalog file: the catalog file is a CSV file that lists the catalog contents. This file includes one row per dataset granule (e.g. a netCDF file or Zarr dataset). The columns in this CSV must match the attributes and assets listed in the ESM-Collection file. A short example of a catalog file is shown below:

    activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore,dcpp_init_year
    AerChemMIP,BCC,BCC-ESM1,piClim-CH4,r1i1p1f1,Amon,ch4,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/piClim-CH4/r1i1p1f1/Amon/ch4/gn/,
    AerChemMIP,BCC,BCC-ESM1,piClim-CH4,r1i1p1f1,Amon,clt,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/piClim-CH4/r1i1p1f1/Amon/clt/gn/,
    AerChemMIP,BCC,BCC-ESM1,piClim-CH4,r1i1p1f1,Amon,co2,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/piClim-CH4/r1i1p1f1/Amon/co2/gn/,
    AerChemMIP,BCC,BCC-ESM1,piClim-CH4,r1i1p1f1,Amon,evspsbl,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/piClim-CH4/r1i1p1f1/Amon/evspsbl/gn/,
    ...
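
For completeness, below is a minimal sketch of the companion ESM-Collection JSON file, built as a Python dictionary and written out with the standard json module. The field names follow the esm-collection-spec; the id, description, file names, and attribute entries are illustrative placeholders, not a definitive template:

import json

# Minimal, illustrative ESM-Collection file describing the CSV above.
# Field names follow the esm-collection-spec; the values are placeholders.
collection = {
    "esmcat_version": "0.1.0",
    "id": "my-cmip6-catalog",
    "description": "An example ESM collection",
    "catalog_file": "my-cmip6-catalog.csv",
    "attributes": [
        {"column_name": "activity_id", "vocabulary": ""},
        {"column_name": "source_id", "vocabulary": ""},
        {"column_name": "variable_id", "vocabulary": ""},
    ],
    "assets": {"column_name": "zstore", "format": "zarr"},
}

with open("my-cmip6-catalog.json", "w") as f:
    json.dump(collection, f, indent=2)

Once both files exist, the catalog can be opened with intake.open_esm_datastore("my-cmip6-catalog.json").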
    
Is there a list of existing catalogs?

Below is an incomplete list of existing catalogs. Please feel free to add to this list or raise an issue on GitHub.

CMIP6-GLADE

  • Description: CMIP6 data accessible on the NCAR’s GLADE disk storage system

  • Platform: NCAR-GLADE

  • Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json

  • Data Format: netCDF

  • Documentation Page: https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html

CMIP6-CESM2-Timeseries

  • Description: CESM2 raw output (non-cmorized) that went into CMIP6 data

  • Platform: NCAR-CAMPAIGN

  • Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/campaign-cesm2-cmip6-timeseries.json

  • Data Format: netCDF

  • Documentation Page: http://www.cesm.ucar.edu/models/cesm2/

CMIP5-GLADE

  • Description: CMIP5 data accessible on the NCAR’s GLADE disk storage system

  • Platform: NCAR-GLADE

  • Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip5.json

  • Data Format: netCDF

  • Documentation Page: https://pcmdi.llnl.gov/mips/cmip5/guide.html

CESM1-LENS-AWS

  • Description: CESM1 Large Ensemble data publicly available on Amazon S3

  • Platform: AWS S3 (us-west-2 region)

  • Catalog path or url: https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json

  • Data Format: Zarr

  • Documentation Page: https://doi.org/10.26024/wt24-5j82

CESM1-LENS-GLADE

  • Description: CESM1 Large Ensemble data stored on NCAR’s GLADE disk storage system

  • Platform: NCAR-GLADE

  • Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm1-le.json

  • Data Format: netCDF

  • Documentation Page: https://doi.org/10.5065/d6j101d1

CMIP6-GCP

  • Description: CMIP6 Zarr data residing in Pangeo’s Google Storage

  • Platform: Google Cloud Platform

  • Catalog path or url: https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json

  • Data Format: Zarr

  • Documentation Page: https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html

CMIP6-MISTRAL

  • Description: CMIP6 data accessible on the DKRZ’s MISTRAL disk storage system

  • Platform: DKRZ (German Climate Computing Centre)-MISTRAL

  • Catalog path or url: /home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip6.json

  • Data Format: netCDF

  • Documentation Page: https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html

CMIP5-MISTRAL

  • Description: CMIP5 data accessible on the DKRZ’s MISTRAL disk storage system

  • Platform: DKRZ (German Climate Computing Centre)-MISTRAL

  • Catalog path or url: /home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip5.json

  • Data Format: netCDF

  • Documentation Page: https://pcmdi.llnl.gov/mips/cmip5/guide.html

MiKlip-MISTRAL

  • Description: Data from MiKlip projects at the Max Planck Institute for Meteorology (MPI-M)

  • Platform: DKRZ (German Climate Computing Centre)-MISTRAL

  • Catalog path or url: /home/mpim/m300524/intake-esm-datastore/catalogs/mistral-miklip.json

  • Data Format: netCDF

  • Documentation Page: https://www.fona-miklip.de/

MPI-GE-MISTRAL

  • Description: Max Planck Institute Grand Ensemble cmorized by CMIP5-standards

  • Platform: DKRZ (German Climate Computing Centre)-MISTRAL

  • Catalog path or url: /home/mpim/m300524/intake-esm-datastore/catalogs/mistral-MPI-GE.json

  • Data Format: netCDF

  • Documentation Page: https://doi.org/10/gf3kgt

CMIP6-LDEO-OpenDAP

  • Description: CMIP6 data accessible via Hyrax OpenDAP Server at Lamont-Doherty Earth Observatory

  • Platform: LDEO-OpenDAP

  • Catalog path or url: http://haden.ldeo.columbia.edu/catalogs/hyrax_cmip6.json

  • Data Format: netCDF

  • Documentation Page: https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html

Note

Some of these catalogs are also stored in intake-esm-datastore GitHub repository at https://github.com/NCAR/intake-esm-datastore/tree/master/catalogs

NCAR CMIP Analysis Platform

NCAR’s CMIP Analysis Platform (CMIP AP) includes a large collection of CMIP5 and CMIP6 data sets.

Requesting data sets

Use this form to request that new data sets be added to the CMIP AP. Requests are typically fulfilled within two weeks; contact CISL if you have further questions. Intake-esm catalogs are regularly updated after data sets are added to (or removed from) the platform.

Available catalogs at NCAR

NCAR maintains multiple intake-esm catalogs for datasets stored on GLADE. These catalogs are listed below:

CMIP6-GLADE

  • Description: CMIP6 data accessible on NCAR’s GLADE disk storage system

  • Platform: NCAR-GLADE

  • Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json

  • Data Format: netCDF

  • Documentation Page: https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html

CMIP6-CESM2-Timeseries

  • Description: CESM2 raw output (non-cmorized) that went into CMIP6

  • Platform: NCAR-CAMPAIGN

  • Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/campaign-cesm2-cmip6-timeseries.json

  • Data Format: netCDF

  • Documentation Page: http://www.cesm.ucar.edu/models/cesm2/

CMIP5-GLADE

  • Description: CMIP5 data accessible on NCAR’s GLADE disk storage system

  • Platform: NCAR-GLADE

  • Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip5.json

  • Data Format: netCDF

  • Documentation Page: https://pcmdi.llnl.gov/mips/cmip5/guide.html

CESM1-LENS-GLADE

  • Description: CESM1 Large Ensemble data stored on NCAR’s GLADE disk storage system

  • Platform: NCAR-GLADE

  • Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm1-le.json

  • Data Format: netCDF

  • Documentation Page: https://doi.org/10.5065/d6j101d1
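
On a machine with access to GLADE, these catalogs can be opened by the paths listed above; a minimal sketch using the CMIP6-GLADE catalog:

>>> import intake
>>> col = intake.open_esm_datastore(
...     "/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json"
... )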

API Reference

This page provides an auto-generated summary of intake-esm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.

ESM Datastore (intake.open_esm_datastore)

class intake_esm.core.esm_datastore(*args, **kwargs)[source]

An intake plugin for parsing an ESM (Earth System Model) Collection/catalog and loading assets (netCDF files and/or Zarr stores) into xarray datasets. The in-memory representation for the catalog is a Pandas DataFrame.

Parameters
  • esmcol_obj (str, pandas.DataFrame) – If string, this must be a path or URL to an ESM collection JSON file. If pandas.DataFrame, this must be the catalog content that would otherwise be in a CSV file.

  • esmcol_data (dict, optional) – ESM collection spec information, by default None

  • progressbar (bool, optional) – Will print a progress bar to standard error (stderr) when loading assets into Dataset, by default True

  • sep (str, optional) – Delimiter to use when constructing a key for a query, by default ‘.’

  • csv_kwargs (dict, optional) – Additional keyword arguments passed through to the read_csv() function.

  • **kwargs – Additional keyword arguments are passed through to the Catalog base class.

Examples

At import time, this plugin is available in intake’s registry as esm_datastore and can be accessed with intake.open_esm_datastore():

>>> import intake
>>> url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
>>> col = intake.open_esm_datastore(url)
>>> col.df.head()
activity_id institution_id source_id experiment_id  ... variable_id grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1        ssp370  ...          pr         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1        ssp370  ...        prsn         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1        ssp370  ...         tas         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
3  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmax         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
4  AerChemMIP            BCC  BCC-ESM1        ssp370  ...      tasmin         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
classmethod from_df(df: pandas.core.frame.DataFrame, esmcol_data: Optional[Dict[str, Any]] = None, progressbar: bool = True, sep: str = '.', **kwargs) → intake_esm.core.esm_datastore[source]

Create catalog from the given dataframe

Parameters
  • df (pandas.DataFrame) – catalog content that would otherwise be in a CSV file.

  • esmcol_data (dict, optional) – ESM collection spec information, by default None

  • progressbar (bool, optional) – Will print a progress bar to standard error (stderr) when loading assets into Dataset, by default True

  • sep (str, optional) – Delimiter to use when constructing a key for a query, by default ‘.’

Returns

esm_datastore – Catalog object
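
A minimal sketch, assuming a local CSV copy of a catalog (the filename is hypothetical):

>>> import pandas as pd
>>> from intake_esm.core import esm_datastore
>>> df = pd.read_csv("pangeo-cmip6.csv")  # hypothetical local copy of the catalog CSV
>>> col = esm_datastore.from_df(df)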

keys() → List[source]

Get keys for the catalog entries

Returns

list – keys for the catalog entries
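
A short sketch (the printed keys are illustrative; each key follows the catalog’s key template):

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.keys()[:2]
['AerChemMIP.BCC.BCC-ESM1.ssp370.Amon.gn', 'CMIP.BCC.BCC-ESM1.historical.Amon.gn']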

nunique() → pandas.core.series.Series[source]

Count distinct observations across dataframe columns in the catalog.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.nunique()
activity_id          10
institution_id       23
source_id            48
experiment_id        29
member_id            86
table_id             19
variable_id         187
grid_label            7
zstore            27437
dcpp_init_year       59
dtype: int64
search(require_all_on: Optional[Union[str, List]] = None, **query)[source]

Search for entries in the catalog.

Parameters
  • require_all_on (list, str, optional) – A dataframe column or a list of dataframe columns across which all entries must satisfy the query criteria. If None, return entries that fulfill any of the criteria specified in the query, by default None.

  • **query – keyword arguments corresponding to user’s query to execute against the dataframe.

Returns

cat (esm_datastore) – A new Catalog with a subset of the entries in this Catalog.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col.df.head(3)
activity_id institution_id source_id  ... grid_label                                             zstore dcpp_init_year
0  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
1  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
2  AerChemMIP            BCC  BCC-ESM1  ...         gn  gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...            NaN
>>> cat = col.search(
...     source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
...     grid_label="gn",
... )
>>> cat.df.head(3)
    activity_id institution_id    source_id  ... grid_label                                             zstore dcpp_init_year
260        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i...            NaN
346        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r2i...            NaN
401        CMIP            BCC  BCC-CSM2-MR  ...         gn  gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r3i...            NaN
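
The require_all_on argument narrows such a query further; a sketch (values are illustrative) that keeps only source_ids providing pr for both experiments:

>>> cat = col.search(
...     require_all_on=["source_id"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
... )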

The search method also accepts compiled regular expression objects from re.compile() as patterns.

>>> import re
>>> # Let's search for variables containing "Frac" in their name
>>> pat = re.compile(r"Frac")  # Define a regular expression
>>> cat = col.search(variable_id=pat)
>>> cat.df.head().variable_id
0     residualFrac
1    landCoverFrac
2    landCoverFrac
3     residualFrac
4    landCoverFrac
serialize(name: str, directory: Optional[str] = None, catalog_type: str = 'dict') → None[source]

Serialize collection/catalog to corresponding json and csv files.

Parameters
  • name (str) – name to use when creating ESM collection json file and csv catalog.

  • directory (str, PathLike, default None) – The path to the local directory. If None, use the current directory

  • catalog_type (str, default 'dict') – Whether to save the catalog table as a dictionary in the JSON file or as a separate CSV file.

Notes

Large catalogs can result in large JSON files. To keep the JSON file size manageable, call with catalog_type=’file’ to save catalog as a separate CSV file.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> col_subset = col.search(
...     source_id="BCC-ESM1",
...     grid_label="gn",
...     table_id="Amon",
...     experiment_id="historical",
... )
>>> col_subset.serialize(name="cmip6_bcc_esm1", catalog_type="file")
Writing csv catalog to: cmip6_bcc_esm1.csv.gz
Writing ESM collection json file to: cmip6_bcc_esm1.json
to_dataset_dict(zarr_kwargs: Optional[Dict[str, Any]] = None, cdf_kwargs: Optional[Dict[str, Any]] = None, preprocess: Optional[Dict[str, Any]] = None, storage_options: Optional[Dict[str, Any]] = None, progressbar: Optional[bool] = None, aggregate: Optional[bool] = None) → Dict[str, xarray.core.dataset.Dataset][source]

Load catalog entries into a dictionary of xarray datasets.

Parameters
  • zarr_kwargs (dict) – Keyword arguments to pass to open_zarr() function

  • cdf_kwargs (dict) – Keyword arguments to pass to open_dataset() function. If specifying chunks, the chunking is applied to each netcdf file. Therefore, chunks must refer to dimensions that are present in each netcdf file, or chunking will fail.

  • preprocess (callable, optional) – If provided, call this function on each dataset prior to aggregation.

  • storage_options (dict, optional) – Parameters passed to the backend file-system such as Google Cloud Storage, Amazon Web Service S3.

  • progressbar (bool) – If True, will print a progress bar to standard error (stderr) when loading assets into Dataset.

  • aggregate (bool, optional) – If False, no aggregation will be done.

Returns

dsets (dict) – A dictionary of xarray Dataset.

Examples

>>> import intake
>>> col = intake.open_esm_datastore("glade-cmip6.json")
>>> cat = col.search(
...     source_id=["BCC-CSM2-MR", "CNRM-CM6-1", "CNRM-ESM2-1"],
...     experiment_id=["historical", "ssp585"],
...     variable_id="pr",
...     table_id="Amon",
...     grid_label="gn",
... )
>>> dsets = cat.to_dataset_dict()
>>> dsets.keys()
dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn'])
>>> dsets["CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn"]
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, member_id: 3, time: 1980)
Coordinates:
* lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
* lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
* time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
* member_id  (member_id) <U8 'r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(160, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(320, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 600, 160, 320), meta=np.ndarray>
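
The preprocess argument is applied to each dataset before aggregation; a minimal sketch with a hypothetical helper that keeps only the pr variable:

>>> def keep_pr(ds):
...     return ds[["pr"]]  # hypothetical helper: drop all variables except pr
>>> dsets = cat.to_dataset_dict(preprocess=keep_pr)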
unique(columns: Optional[Union[str, List]] = None) → Dict[str, Any][source]

Return unique values for given columns in the catalog.

Parameters

columns (str, list) – name of columns for which to get unique values

Returns

info (dict) – dictionary containing counts and unique values

Examples

>>> import intake
>>> import pprint
>>> col = intake.open_esm_datastore("pangeo-cmip6.json")
>>> uniques = col.unique(columns=["activity_id", "source_id"])
>>> pprint.pprint(uniques)
{'activity_id': {'count': 10,
                'values': ['AerChemMIP',
                            'C4MIP',
                            'CMIP',
                            'DAMIP',
                            'DCPP',
                            'HighResMIP',
                            'LUMIP',
                            'OMIP',
                            'PMIP',
                            'ScenarioMIP']},
'source_id': {'count': 17,
            'values': ['BCC-ESM1',
                        'CNRM-ESM2-1',
                        'E3SM-1-0',
                        'MIROC6',
                        'HadGEM3-GC31-LL',
                        'MRI-ESM2-0',
                        'GISS-E2-1-G-CC',
                        'CESM2-WACCM',
                        'NorCPM1',
                        'GFDL-AM4',
                        'GFDL-CM4',
                        'NESM3',
                        'ECMWF-IFS-LR',
                        'IPSL-CM6A-ATM-HR',
                        'NICAM16-7S',
                        'GFDL-CM4C192',
                        'MPI-ESM1-2-HR']}}
update_aggregation(attribute_name: str, agg_type: Optional[str] = None, options: Optional[dict] = None, delete=False)[source]

Updates aggregation operations info.

Parameters
  • attribute_name (str) – Name of attribute (column) across which to aggregate.

  • agg_type (str, optional) – Type of aggregation operation to apply. Valid values include: join_new, join_existing, union, by default None

  • options (dict, optional) – Aggregation settings that are passed as keyword arguments to concat() or merge(). For join_existing, options must contain the name of the existing dimension to use (e.g., {'dim': 'time'}), by default None

  • delete (bool, optional) – Whether to delete/remove/disable aggregation operations for a particular attribute, by default False
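
A minimal sketch, assuming col is an open catalog whose aggregations include a member_id attribute, e.g. disabling concatenation along member_id:

>>> col.update_aggregation("member_id", delete=True)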

property agg_columns

List of columns used to merge/concatenate multiple compatible datasets into a single dataset.

property data_format

The data format. Valid values are netcdf and zarr. If specified, it means that all data assets in the catalog use the same data format.

property df

Return pandas DataFrame.

property format_column_name

Name of the column which contains the data format.

property groupby_attrs

Dataframe columns used to determine groups of compatible datasets.

Returns

list – Columns used to determine groups of compatible datasets.

property key_template

Return string template used to create catalog entry keys

Returns

str – string template used to create catalog entry keys

property path_column_name

The name of the column containing the path to the asset.

property variable_column_name

Name of the column that contains the variable name.
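
For a CMIP6-style catalog such as pangeo-cmip6, these properties read roughly as follows (values are illustrative):

>>> col.path_column_name
'zstore'
>>> col.key_template
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'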

Contribution Guide

Interested in helping build intake-esm? Have code from your work that you believe others will find useful? Have a few minutes to tackle an issue?

Contributions are highly welcomed and appreciated. Every little help counts, so do not hesitate!

The following sections cover some general guidelines regarding development in intake-esm for maintainers and contributors. Nothing here is set in stone; feel free to suggest improvements or changes to the workflow.

Feature requests and feedback

We’d also like to hear your proposals and suggestions. Feel free to submit them as issues on intake-esm’s GitHub issue tracker and:

  • Explain in detail how they should work.

  • Keep the scope as narrow as possible. This will make it easier to implement.

Report bugs

Report bugs for intake-esm in the issue tracker.

If you are reporting a bug, please include:

  • Your operating system name and version.

  • Any details about your local setup that might be helpful in troubleshooting, specifically the Python interpreter version, installed libraries, and intake-esm version.

  • Detailed steps to reproduce the bug.

If you can write a demonstration test that currently fails but should pass (xfail), that is a very useful commit to make as well, even if you cannot fix the bug itself.

Fix bugs

Look through the GitHub issues for bugs.

Talk to developers to find out how you can fix specific bugs.

Write documentation

intake-esm could always use more documentation. What exactly is needed?

  • More complementary documentation. Have you perhaps found something unclear?

  • Docstrings. There can never be too many of them.

  • Blog posts, articles and such – they’re all very appreciated.

You can also edit documentation files directly in the GitHub web interface, without using a local copy. This can be convenient for small fixes.

Build the documentation locally with the following command:

$ make docs

Preparing Pull Requests

  1. Fork the intake-esm GitHub repository.

  2. Clone your fork locally using git, connect your repository to the upstream (main project), and create a branch::

    $ git clone git@github.com:YOUR_GITHUB_USERNAME/intake-esm.git
    $ cd intake-esm
    $ git remote add upstream git@github.com:intake/intake-esm.git
    

    Now, to fix a bug or add a feature, create your own branch off “master”:

    $ git checkout -b your-bugfix-feature-branch-name master
    

    If you need some help with Git, follow this quick start guide: https://git.wiki.kernel.org/index.php/QuickStart

  3. Install dependencies into a new conda environment::

    $ conda env update -f ci/environment.yml
    $ conda activate intake-esm-dev
    
  4. Make an editable install of intake-esm by running::

    $ python -m pip install -e .
    
  5. Install pre-commit (https://pre-commit.com) hooks on the intake-esm repo::

    $ pre-commit install
    

    Afterwards, pre-commit will run whenever you commit.

    pre-commit is a framework for managing and maintaining multi-language pre-commit hooks to ensure consistent code style and formatting.

    You now have an environment called intake-esm-dev to work in. Remember to activate it again in any new terminal session.

  6. (Optional) Run all the tests

    Now running tests is as simple as issuing this command::

    $ pytest --cov=./
    

    This command will run tests via the pytest tool.

  7. Commit and push once your tests pass and you are happy with your change(s). When committing, pre-commit will re-format the files if necessary::

    $ git commit -a -m "<commit message>"
    $ git push -u
    
  8. Finally, submit a pull request through the GitHub website using this data::

    head-fork: YOUR_GITHUB_USERNAME/intake-esm
    compare: your-branch-name
    
    base-fork: intake/intake-esm
    base: master # if it's a bugfix or feature
    

Changelog

Intake-esm v2020.11.4

Features
Breaking Changes
Bug Fixes
Documentation
Internal Changes

Intake-esm v2020.8.15

Features
Documentation
Internal Changes
Contributors to this release

(GitHub contributors page for this release)

Intake-esm v2020.6.11

Features
Documentation
Internal Changes
Contributors to this release

(GitHub contributors page for this release)

Intake-esm v2020.5.21

Features

Intake-esm v2020.5.01

Features
Bug Fixes
  • Revert back to using concurrent.futures to address failures due to dask’s distributed scheduler. (GH#225) & (GH#226)

Internal Changes
Contributors to this release

(GitHub contributors page for this release)

Intake-esm v2020.3.16

Features
Bug Fixes
Internal Changes
Contributors to this release

(GitHub contributors page for this release)

Intake-esm v2019.12.13

Features
Bug Fixes
  • Remove the caching option (GH#158) @matt-long

  • Preserve encoding when aggregating datasets (GH#161) @matt-long

  • Sort aggregations to make sure join_existing is always done before join_new (GH#171) @andersy005

Documentation
Internal Changes
Contributors to this release

(GitHub contributors page for this release)

Intake-esm v2019.10.15

Features
Breaking changes
  • Replaced esm_metadatastore with esm_datastore; see the API reference for more details.

  • intake-esm won’t build collection catalogs anymore. intake-esm now expects an ESM collection JSON file as input. This JSON should conform to the Earth System Model Collection specification.

Contributors to this release

(GitHub contributors page for this release)

Intake-esm v2019.8.23

Features
  • Add mistral data holdings to intake-esm-datastore (GH#133) @aaronspring

  • Add support for NA-CORDEX data holdings. (GH#115) @jukent

  • Replace .csv with netCDF as serialization format when saving the built collection to disk. With netCDF, we can record very useful information into the global attributes of the netCDF dataset. (GH#119) @andersy005

  • Add string representation of ESMMetadataStoreCatalog object (GH#122) @andersy005

  • Automatically build missing collections by calling esm_metadatastore(collection_name="GLADE-CMIP5") when the specified collection is part of the curated collections in intake-esm-datastore. (GH#124) @andersy005

    
    In [1]: import intake
    
    In [2]: col = intake.open_esm_metadatastore(collection_name="GLADE-CMIP5")
    
    In [3]: # if "GLADE-CMIP5" collection isn't built already, the above is equivalent to:
    
    In [4]: col = intake.open_esm_metadatastore(collection_input_definition="GLADE-CMIP5")
    
  • Revert back to using official DRS attributes when building CMIP5 and CMIP6 collections. (GH#126) @andersy005

  • Add .df property for interfacing with the built collection via a dataframe, to maintain backwards compatibility. (GH#127) @andersy005

  • Add unique() and nunique() methods for summarizing count and unique values in a collection. (GH#128) @andersy005

    
    In [1]: import intake

    In [2]: col = intake.open_esm_metadatastore(collection_name="GLADE-CMIP5")

    In [3]: col
    Out[3]: GLADE-CMIP5 collection catalogue with 615853 entries:
              > 3 resource(s)
              > 1 resource_type(s)
              > 1 direct_access(s)
              > 1 activity(s)
              > 218 ensemble_member(s)
              > 51 experiment(s)
              > 312093 file_basename(s)
              > 615853 file_fullpath(s)
              > 6 frequency(s)
              > 25 institute(s)
              > 15 mip_table(s)
              > 53 model(s)
              > 7 modeling_realm(s)
              > 3 product(s)
              > 9121 temporal_subset(s)
              > 454 variable(s)
              > 489 version(s)

    In [4]: col.nunique()
    resource                3
    resource_type           1
    direct_access           1
    activity                1
    ensemble_member       218
    experiment             51
    file_basename      312093
    file_fullpath      615853
    frequency               6
    institute              25
    mip_table              15
    model                  53
    modeling_realm          7
    product                 3
    temporal_subset      9121
    variable              454
    version               489
    dtype: int64

    In [5]: col.unique(columns=['frequency', 'modeling_realm'])
    {'frequency': {'count': 6, 'values': ['mon', 'day', '6hr', 'yr', '3hr', 'fx']},
     'modeling_realm': {'count': 7, 'values': ['atmos', 'land', 'ocean', 'seaIce',
                                               'ocnBgchem', 'landIce', 'aerosol']}}
    
    
Bug Fixes
  • For CMIP6, extract grid_label from directory path instead of file name. (GH#127) @andersy005

Contributors to this release

(GitHub contributors page for this release)

Intake-esm v2019.8.5

Features
  • Support building collections using inputs from intake-esm-datastore repository. (GH#79) @andersy005

  • Ensure that requested files are available locally before loading data into xarray datasets. (GH#82) @andersy005 and @matt-long

  • Split collection definitions out of config. (GH#83) @matt-long

  • Add intake-esm-builder, a CLI tool for building collection from the command line. (GH#89) @andersy005

  • Add support for CESM-LENS data holdings residing in AWS S3. (GH#98) @andersy005

  • Sort collection upon creation according to order-by columns; pass urlpath through the stack for use in parsing collection filenames. (GH#100) @pbranson

Bug Fixes
Internal Changes
  • Refactor existing functionality to make intake-esm robust and extensible. (GH#77) @andersy005

  • Add aggregate._override_coords function to override dimension coordinates (except time) in cases where there are floating-point precision differences. (GH#108) @andersy005

  • Fix CESM-LE ice component peculiarities that caused intake-esm to load data improperly. The fix separates variables for the ice component into two separate components:

    • ice_sh: for southern hemisphere

    • ice_nh: for northern hemisphere

    (GH#114) @andersy005

Contributors to this release

(GitHub contributors page for this release)

Intake-esm v2019.5.11

Features
Contributors to this release

(GitHub contributors page for this release)

Intake-esm v2019.4.26

Features
Bug Fixes
Contributors to this release

(GitHub contributors page for this release)

Intake-esm v2019.2.28

Features
Bug Fixes
  • Fix a bug in catalog building and move exclude_dirs to locations (GH#33) @matt-long

Internal Changes