Declaring datasets
A dataset table holds language-agnostic contract fields. All are optional; the common ones:
| Field | Meaning |
|---|---|
| `uri` | Source to download (https, git/GitHub, ssh, an object store `s3://`/`gs://`/`az://`/…, …). `uris` for mirrors. |
| `sha256` | Expected digest; auto-filled on first download, verified at fetch. |
| `format` | Format hint (`csv`, `nc`, `parquet`, `zip`, …) that picks a default loader; inferred from the URI when absent. |
| `extract` | After download, unpack the archive and use the extracted directory as the path. |
| `doi` | DOI of the dataset (also a lookup key). |
| `aliases` | Alternative names to look the dataset up by. |
| `version` | Dataset version; part of the storage key, so versions coexist on disk. |
| `requires` | Names of datasets to fetch first (a dependency graph, resolved in order). |
| `description` | Human-readable note (replaces TOML comments). |
| `storage_path` | Where the dataset lives on disk (overrides the default `$datasets_dir/$key`); see Storage. |
| `skip_checksum` | Disable checksum verification. |
| `skip_download` | Management mode: a passive dependency you maintain yourself (e.g. a large shared archive); it is not downloaded, not verified, and never touched by maintenance. |
| `lazy_access` | Access mode: open the `uri` in place via a loader (e.g. a remote object store), with no local copy, no checksum, and no record; a loader is required. The mechanism (stream/mount) is implementation-defined. |
| `fetcher` / `loader` / `shell` | How to obtain or load it; see Language bindings. |
```toml
# A DOI archive: downloaded, checksum-verified, then unpacked.
[herzschuh2023]
uri = "https://doi.pangaea.de/10.1594/PANGAEA.930512?format=zip"
sha256 = "4e40e43ac0f1ddea125cb5314eee46e332aacbcb18aff7efbf59f1d8b1d84a13"
doi = "10.1594/PANGAEA.930512"
format = "zip"
extract = true
description = "Pollen-based climate reconstructions (Herzschuh et al., 2023)"
```
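The `sha256` field is verified at fetch. A minimal sketch of such a check follows, assuming the digest covers the downloaded file as-is, before any extraction (the text above does not say so explicitly). The helper names `sha256_of` and `verify_sha256` and the sample file are hypothetical:

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: Path, expected: str) -> bool:
    """Compare the file's digest with the declared one, ignoring letter case."""
    return sha256_of(path) == expected.strip().lower()


# Demonstration on a throwaway file standing in for a downloaded archive.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.zip"
    sample.write_bytes(b"hello")
    ok = verify_sha256(sample, hashlib.sha256(b"hello").hexdigest())
    bad = verify_sha256(sample, "0" * 64)

print(ok, bad)  # True False
```

In a real fetch, `expected` would come from the dataset's `sha256` field, and the check would be skipped when `skip_checksum` is set.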
A dataset key may contain a slash if quoted: `["jesstierney/lgmDA"]`.
Normative: SCHEMA.md §Language-agnostic contract.