Manifest format (datamanifest.toml)¶
This page describes the structure of the manifest file — the fields you write by
hand or that the CLI writes for you. It is a practical reference for this Python
implementation. The format is shared across languages and defined normatively in
its own specification: the
datamanifest.toml spec. For the
cross-tool details — which features this implementation supports and which spec
version it targets — see Conformance.
The canonical filename is datamanifest.toml; datasets.toml and
Datasets.toml are also recognized when walking up from the working directory.
Top-level layout¶
A manifest is a TOML document containing:
- one table per dataset, keyed by the dataset name — the common case;
[_LANG.python.loaders]— an optionalformat → loadermap of project-wide default loaders (see Language bindings);[_STORAGE]— optional configuration of where data lives on disk (see Storage model);[_META]— optional schema marker (schema = 1).
Keys beginning with _ are structural — they are not datasets. Unknown _*
tables (and any foreign-language _LANG.* subtrees) are preserved verbatim on
write, so a manifest shared with another tool round-trips losslessly.
A minimal manifest is just dataset tables:
[co2]
uri = "https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv"
sha256 = "…"
format = "csv"
Dataset fields¶
Every field is optional. Types are TOML types; defaults are the empty
string / empty list / false unless noted.
| Field | Type | Default | Meaning |
|---|---|---|---|
uri |
string | "" |
Single source URI: HTTP(S), git/*.git, ssh/sshfs/rsync, object store (s3://, gs://, …), or file://. Mutually exclusive with uris. |
uris |
array of string | [] |
Several source URIs materialized into one dataset folder. Mutually exclusive with uri. |
sha256 |
string | "" |
Expected SHA-256 of the download. Auto-filled on first successful fetch and verified at fetch time. |
format |
string | "" |
Format hint used to pick a loader (csv, nc, parquet, json, zip, tar.gz, …). Inferred from the URI when absent. |
version |
string | "" |
Dataset version; participates in the storage key, so multiple versions coexist on disk. |
branch |
string | "" |
For git sources: the branch or tag to clone. |
doi |
string | "" |
DOI of the dataset; also usable as a lookup/search key. |
extract |
bool | false |
After download, unpack the archive (zip / tar / tar.gz) and use the extracted directory as the dataset path. |
aliases |
array of string | [] |
Alternative names the dataset can be looked up by. |
description |
string | "" |
Human-readable description. |
requires |
array of string | [] |
Names of datasets fetched before this one; defines a dependency graph resolved in topological order. |
skip_download |
bool | false |
Treat the dataset as externally provided — return the uri/path and attempt no download. |
skip_checksum |
bool | false |
Disable checksum verification for this dataset. |
storage_path |
string | "" |
Per-dataset location override (a path expression; default $datasets_dir/$key). See Storage model. |
delegate |
bool | true |
Allow cross-language fetch delegation to a peer datamanifest CLI. |
host, path, and scheme are derived from the URI; the tools omit them on
write.
Language bindings¶
By default a dataset is downloaded from its uri and loaded by the built-in
loader for its format. Both steps can be overridden with module:function
references (never inline code):
- bare
fetcher— a Pythonmodule:functioncalled to produce the bytes instead of (or in addition to) downloading theuri; - bare
loader— a Pythonmodule:functioncalled to load the dataset into memory, overriding the format default; - bare
shell— a command template run to fetch the data (with$download_path,$uri,$key, … substitutions); language-agnostic, no loader.
[regridded]
format = "nc"
fetcher = "mypkg.build:regrid" # produce the bytes in Python
loader = "mypkg.io:open_regridded" # override the default .nc loader
A binding may instead be a { ref, args, kwargs } table so one function is reused
across datasets that differ only in arguments, with $var substitution in string
values:
Bindings can also be written per-language under [<dataset>._LANG.<lang>], and
project-wide format defaults under [_LANG.python.loaders]. The bare forms above
are the common single-language case; the
Language bindings page covers the full _LANG namespace,
the loaders map, and the fetch/load resolution ladders.
Storage¶
Where data lives is configured in [_STORAGE] (two folder fields,
datasets_dir and datacache_dir, both repo-local by default) and per dataset
via storage_path. This is its own topic — see the
Storage model page.
A full example¶
The manifest below declares four real datasets fetched from PANGAEA, a CMIP6 output over SSH, and a GitHub release archive:
[CMIP6_lgm_tos]
uri = "ssh://albedo1.dmawi.de:/albedo/work/projects/p_forclima/preproc_data_esmvaltool/LGM/recipe_cmip6_lgm_tos_20241114_151009/preproc/lgm/tos_CLIM"
sha256 = "7f28454b5c399226f923be60393ecbb2983c15538f9733864e5ce0f9f4b83601"
[herzschuh2023]
uri = "https://doi.pangaea.de/10.1594/PANGAEA.930512?format=zip"
format = "zip"
sha256 = "4e40e43ac0f1ddea125cb5314eee46e332aacbcb18aff7efbf59f1d8b1d84a13"
doi = "10.1594/PANGAEA.930512"
extract = true
[jonkers2024]
uri = "https://download.pangaea.de/dataset/962852/files/LGM_foraminifera_assemblages_20240110.csv"
sha256 = "35ee6e8b94dc355973276da609fb18846ee1156d0fa848c1a9b221edd3715513"
doi = "10.1594/PANGAEA.962852"
["jesstierney/lgmDA"]
uri = "https://github.com/jesstierney/lgmDA/archive/refs/tags/v2.1.zip"
sha256 = "da5f85235baf7f858f1b52ed73405f5d4ed28a8f6da92e16070f86b724d8bb25"
extract = true
The normative specification¶
This page is a practical summary, not the contract. The authoritative,
cross-language definition — the full field list, the resolution ladders, the
preservation rules, and the conformance fixtures — lives in the
datamanifest.toml specification.