datamanifest.toml: manifest schema specification
This document is the normative description of the TOML manifest format shared by the
DataManifest.jl (Julia) and
datamanifest (Python) tools. A manifest
declares the data dependencies of a project: each dataset's source URI, checksum,
version, format, and how to fetch and load it. Either implementation can read and write a
conforming file; each reads the language-agnostic contract fields plus its own
_LANG-namespaced bindings, and preserves the rest verbatim.
Versioning
Two independent version axes govern this format:
- _META.schema (integer, stored inside the file) is the data-model compatibility version. It increments only on a breaking structural change. A file without [_META] is treated as schema v0 (legacy flat) and read leniently. Current value: 1.
- Spec-document version (a git tag such as spec-v1.0) versions the prose, examples, and fixture suite. An implementation conforming to "schema 1, spec ≥ v1.0" pins to a spec tag; the spec may advance without retroactively breaking a pinned implementation. The spec is never forked per language package: one normative document, one fixture suite, multiple implementations at varying capability levels.
Structural keys
Keys beginning with _ are structural — they are not dataset tables. At the top
level, the defined structural tables are _META, _LANG, _STORAGE, and _LOADERS.
Within a dataset table, the only defined structural sub-table is _LANG. Readers MUST
preserve unknown _* keys verbatim and MUST NOT treat them as datasets or drop them on
write.
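The preservation rule above can be sketched in Python as follows. This is a minimal illustration, not either tool's implementation: the manifest is shown as an already-parsed dictionary, and _FUTURE, split_manifest, and merge_for_write are hypothetical names introduced only for this example.

```python
# A parsed manifest, as any TOML reader would return it.
# "_FUTURE" is a hypothetical structural table this tool does not know about.
doc = {
    "_META": {"schema": 1},
    "_FUTURE": {"note": "keep me"},
    "foo": {"uri": "https://example.com/foo.csv", "format": "csv"},
}

def split_manifest(doc: dict) -> tuple[dict, dict]:
    """Separate structural (_-prefixed) tables from dataset tables."""
    structural = {k: v for k, v in doc.items() if k.startswith("_")}
    datasets = {k: v for k, v in doc.items() if not k.startswith("_")}
    return structural, datasets

def merge_for_write(structural: dict, datasets: dict) -> dict:
    """Recombine both halves; unknown structural tables are written back untouched."""
    return {**structural, **datasets}

structural, datasets = split_manifest(doc)
roundtrip = merge_for_write(structural, datasets)
print(sorted(datasets))    # dataset names only
print(sorted(structural))  # known and unknown structural keys
```

The unknown table is neither listed as a dataset nor lost when the two halves are merged back for writing.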
Top-level layout
A v1 manifest is a TOML document with:
- [_META]: schema metadata (schema = 1).
- [_LANG.<lang>]: project-wide execution-context configuration for language <lang>. The sub-key loaders is a format → ref map of default loaders for that language.
- One table per dataset, keyed by the dataset name. Dataset tables hold the language-agnostic contract fields, an optional _LANG sub-table for per-dataset bindings, and optional bare (language-implicit) fetcher / loader bindings (see Language-implicit bindings).
- [_STORAGE] (optional): storage configuration, comprising the two folder fields (datasets_dir / datacache_dir), optional read-pool lists (datasets_pools / datacache_pools), reusable $-symbols (predefined $user_data_dir / $user_cache_dir / $repo plus user-defined keys), and the _HOST host-override sub-table. See Storage.
- [_LOADERS]: a language-implicit format → binding loaders map (tolerated; the bare counterpart of [_LANG.<self>.loaders]). See Language-implicit bindings.
Example:
[_META]
schema = 1
[_LANG.python.loaders]
csv = "pandas.io.parsers:read_csv"
nc = "xarray:open_dataset"
[_LANG.julia.loaders]
csv = "CSV:read"
nc = "NCDatasets:Dataset"
[foo]
uri = "https://example.com/foo.csv"
sha256 = "abc123"
format = "csv"
[bar]
sha256 = "def456"
format = "nc"
shell = "make-bar -o $download_path" # language-agnostic shell fetcher
[bar._LANG.julia]
fetcher = "MyPkg:build_bar"
loader = "MyPkg:load_bar"
[bar._LANG.python]
fetcher = "mypkg.build:bar"
loader = "mypkg.load:bar"
Language-agnostic contract (common fields)
Every field is optional and defaults to the empty string / empty list / false shown.
Types are TOML types (string, array of string, bool).
| Field | Type | Default | Semantics |
|---|---|---|---|
| uri | string | "" | Single source URI. HTTP(S), git/ssh+git/*.git, ssh/sshfs/rsync, file://, or an object-store scheme (s3://, gs://, gcs://, az://, abfs://, abfss://, adl://, gdrive://; see Download schemes). Mutually exclusive with uris. |
| uris | array of string | [] | Batch of source URIs written into a single dataset folder under disambiguated relative paths. Mutually exclusive with uri. |
| host | string | "" | Parsed from the URI (derived; tools omit it on write). |
| path | string | "" | Parsed from the URI (derived; tools omit it on write). |
| scheme | string | "" | Parsed from the URI (derived; tools omit it on write). |
| version | string | "" | Dataset version; participates in the storage key so multiple versions coexist on disk. |
| branch | string | "" | For git sources: branch/tag to clone (--branch). |
| doi | string | "" | DOI of the dataset; also usable as a search key. |
| aliases | array of string | [] | Alternative names this dataset can be looked up by. |
| description | string | "" | Human-readable description (replaces TOML comments). |
| key | string | "" | Storage key (relative path under the datasets folder). Derived from host + path + version when absent. |
| storage_path | string | $datasets_dir/$key | Path expression for where this dataset lives on disk, overriding the default. May interpolate $-symbols ($datasets_dir, $key, $user_data_dir, $scratch, …), $USER/env, and ~; relative ⇒ resolved against the project root. Containing $key ⇒ a tool-managed keyed location; an exact path without $key ⇒ a user-managed location used verbatim that maintenance never touches. Generalizes the former local_path and subsumes the former store. See Storage. Honored under the storage capability; other tools preserve it verbatim. |
| sha256 | string | "" | Expected SHA-256 of the downloaded file/folder. Auto-filled on first successful download and verified at fetch time; not re-verified on every load (re-verification is opt-in). |
| skip_checksum | bool | false | Disable checksum verification for this dataset. |
| skip_download | bool | false | Management mode: treat the dataset as a passive, externally-managed dependency. It is not downloaded, not checksum-verified, and never moved or deleted by maintenance; the documented uri/path is returned as-is. For data the user provides and maintains (e.g. a large shared archive that should not be fetched over the network). Distinct from lazy_access: this is about who manages the bytes, not how they are read. |
| lazy_access | bool | false | Access mode: access the dataset in place instead of materializing a local copy. The uri is handed to a loader that opens it where it lives (typically a remote object store), with no local copy, no checksum, and no state-file record. Requires a loader (a bare lazy_access with no loader is an error). The access mechanism (streaming, mount, FUSE, …) is implementation-defined; the spec fixes only that the bytes are not materialized. Distinct from skip_download (a management mode); the two are independent and not meant to combine. |
| delegate | bool | (run default) | Force the cross-language fetch rung (rung 3) on (true) or off (false) for this dataset. When omitted, the tool's run-level default applies (--delegate / configuration). Honored under the delegation capability; other tools preserve it verbatim. See Cross-language fetch. |
| extract | bool | false | After download, extract the archive (zip / tar / tar.gz) and use the extracted directory as the dataset path. |
| format | string | "" | Data format hint used to pick a default loader (csv, parquet, nc, json, yaml, toml, md, txt, zip, tar, tar.gz, …). Inferred from the URI when absent. |
| requires | array of string | [] | Names of datasets that must be downloaded before this one; defines a dependency graph resolved in topological order. |
| fetcher | string or table | "" | Language-implicit fetcher binding, read as the running tool's own language (see Language-implicit bindings). Equivalent to [<dataset>._LANG.<self>].fetcher. |
| loader | string or table | "" | Language-implicit loader binding, read as the running tool's own language. Equivalent to [<dataset>._LANG.<self>].loader. |
| shell | string | "" | Language-agnostic shell fetcher: a command template run as a subprocess (the same command for every tool). Fetcher only; see shell fetcher. |
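As an illustration of the requires field, the sketch below orders a few hypothetical datasets so that each dependency is fetched before its dependents. It uses the standard-library graphlib module; the dataset names and the fetch_order helper are assumptions made for this example, not part of the spec.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# Hypothetical dataset tables, reduced to their `requires` field.
datasets = {
    "raw": {"requires": []},
    "grid": {},
    "merged": {"requires": ["raw", "grid"]},
    "report": {"requires": ["merged"]},
}

def fetch_order(datasets: dict) -> list[str]:
    """Return dataset names so that every dependency precedes its dependents."""
    graph = {name: set(table.get("requires", [])) for name, table in datasets.items()}
    # static_order() raises graphlib.CycleError on a circular dependency.
    return list(TopologicalSorter(graph).static_order())

order = fetch_order(datasets)
print(order)
```

The relative order of independent datasets (here raw and grid) is not fixed by the dependency graph, so a tool is free to fetch them in any order or in parallel.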
Identifier resolution is exact-or-error. A dataset is looked up by its name, an
alias, or its doi. When an operation must resolve to a single dataset, an
identifier matching more than one dataset is a fail-loud error that names the
candidates — never a silent first-match. This matters because a doi may be shared by
several datasets (e.g. one archive split into parts), and acting on an arbitrary one of N
is a correctness footgun. (The same rule governs sync addressing, where an ambiguous id
requires an explicit --batch; see Cross-machine sync.)
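A minimal sketch of exact-or-error resolution, under the assumption that names, aliases, and DOIs are all compared by exact string equality and with equal priority. The resolve function, the AmbiguousIdentifier exception, and the sample DOI are hypothetical.

```python
class AmbiguousIdentifier(LookupError):
    """Raised when an identifier matches more than one dataset."""

def resolve(datasets: dict, ident: str) -> str:
    """Resolve a name, alias, or doi to exactly one dataset name, or raise."""
    matches = [
        name for name, table in datasets.items()
        if ident == name
        or ident in table.get("aliases", [])
        or (table.get("doi", "") != "" and ident == table["doi"])
    ]
    if not matches:
        raise KeyError(f"no dataset matches {ident!r}")
    if len(matches) > 1:
        # Fail loud and name the candidates; never pick the first match silently.
        raise AmbiguousIdentifier(f"{ident!r} is ambiguous; candidates: {sorted(matches)}")
    return matches[0]

# Hypothetical manifest: one archive split into two parts that share a doi.
datasets = {
    "archive_part1": {"doi": "10.1234/example", "aliases": ["p1"]},
    "archive_part2": {"doi": "10.1234/example"},
}
print(resolve(datasets, "p1"))
```

Calling resolve(datasets, "10.1234/example") raises AmbiguousIdentifier and lists both parts, which is the behavior the rule asks for.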
Language bindings (_LANG)
Executable bindings live under a structural _LANG namespace, keyed by language tag
(python, julia, r, …). The dataset table itself stays fully agnostic. (The
language-agnostic shell fetcher is a bare shell field, not a _LANG tag — see
shell fetcher.)
All executable references are module:function references — never inline code, in
any language. A local module is importable because the manifest's directory (the project
root) is on the language tool's import path by convention. There are no includes or
modules fields in v1. A binding may additionally carry arguments as data
(args / kwargs); these are passed to the referenced function and are never
interpreted as code (see Binding forms).
Binding forms (string or table)
A binding is the single, unified concept used at every executable site — a
per-dataset fetcher or loader ([<dataset>._LANG.<lang>]) and every entry in a
project-wide [_LANG.<lang>.loaders] format map. It takes one of two interchangeable
forms:
- string: a bare module:function reference; or
- table: { ref = "module:function", args = [...], kwargs = {...} }, with args / kwargs optional (see Parameterized bindings).
The string is an alias for the ref-only table — "M:f" ≡ { ref = "M:f" } — so a
reader MUST accept either form anywhere a binding is allowed. Call semantics follow the
arguments, not the syntax: with no args/kwargs the tool makes its conventional
call (a loader receives the dataset path; a fetcher the standard fetch context); with
args/kwargs the call is explicit — ref(*args; kwargs...), nothing auto-injected
— and runtime values are passed via $var substitution ($path, …).
Canonical writing. A binding with no args and no kwargs MUST be written as the
string; the bare { ref = … } table is accepted on read but normalized to the string
on write. A binding that carries args/kwargs is written as a table.
The shell fetcher is not a module:function binding — it is a language-agnostic
command-template string (see shell fetcher) — so it is always a string, never a table.
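The read and write rules for the two forms might be sketched as below. parse_binding and canonical_binding are hypothetical helper names, and the sketch assumes "carries args/kwargs" means the key is present in the table; how an empty args list should be treated is not settled by the text above.

```python
def parse_binding(value) -> dict:
    """Accept either binding form and return the table form."""
    if isinstance(value, str):
        return {"ref": value}  # "M:f" is an alias for { ref = "M:f" }
    if isinstance(value, dict) and isinstance(value.get("ref"), str):
        return dict(value)
    raise ValueError(f"not a binding: {value!r}")

def canonical_binding(value):
    """Return the form a writer emits: the string unless args/kwargs are present."""
    table = parse_binding(value)
    if "args" not in table and "kwargs" not in table:
        return table["ref"]
    return table

print(canonical_binding({"ref": "MyPkg:load_bar"}))
print(canonical_binding({"ref": "MyPkg:load_esm", "kwargs": {"grid": "5x5"}}))
```

A reader built this way accepts both forms everywhere, while a writer normalizes the ref-only table back to the bare string.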
Per-dataset bindings
[<dataset>._LANG.<lang>] holds singular bindings for a specific dataset in language
<lang>. Both keys are optional, and each is a binding in either form (see Binding
forms).
| Key | Type | Semantics |
|---|---|---|
| fetcher | string or table | module:function ref (or { ref, args, kwargs } table) called to produce the dataset bytes, instead of (or in addition to) downloading the uri. |
| loader | string or table | module:function ref (or { ref, args, kwargs } table) called to load the dataset into memory, overriding the format default. |
Parameterized bindings (ref + args / kwargs)
A binding may be written as a table so one function is reused across datasets that differ
only in arguments — the same loader called with grid = "5x5" for one dataset and
grid = "10x10" for another:
[esm_5x5._LANG.julia.loader]
ref = "MyPkg:load_esm"
args = ["$path"] # positional, in order
kwargs = { grid = "5x5", skip_models = ["CESM.*"] } # keyword
[esm_10x10._LANG.julia.loader]
ref = "MyPkg:load_esm"
args = ["$path"]
kwargs = { grid = "10x10" }
| Key | Type | Semantics |
|---|---|---|
| ref | string | The module:function reference (required). |
| args | array | Positional arguments, in order (optional). |
| kwargs | table | Keyword arguments (optional). |
- args and kwargs are plain data (string, number, bool, array, table), never code. args is an ordered list of positional values; kwargs keys become keyword parameters. Values map to each language's native types.
- A binding carrying args / kwargs is called explicitly: the tool calls ref(*args; kwargs...) and does not auto-inject any standard value. Runtime values are referenced by $var substitution in string values, using the same variables the shell fetcher exposes ($key, $version, $doi, $format, $branch, $uri, $project_root; $download_path for fetchers; $path, the resolved dataset path, for loaders). The ref-only form (the bare string, equivalently { ref = … }) instead makes the tool's conventional call and is written as the string (see Binding forms).
- Type mapping is language-neutral. A value with no TOML type (e.g. a Julia Symbol) is written as its plain string form (weighting_method = "model" for :model); the target function accepts the string (or coerces it at its boundary). A binding's arguments MUST be representable as TOML data.
- A tool that executes <lang> bindings but does not implement the binding-args capability MUST error when it encounters args / kwargs, rather than silently calling the function without them (which would change results). The bare-string form requires no such capability.
- For canonical serialization, kwargs keys are emitted in lexicographic order like all other keys (including inside an inline { } table); args is an ordered array, so its element order is preserved as data (arrays are never reordered). Both therefore carry the same key order and element order across tools: semantically identical (and byte-identical via the canonical reference form; see the byte-identity capability).
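The explicit-call rule could look roughly like this in Python. It is a sketch under stated assumptions: $var expansion is done with the standard-library string.Template (so an unknown variable raises instead of passing through), resolving the module:function reference is left out, and load_esm here is a stand-in function, not the real MyPkg loader.

```python
from string import Template

def substitute(value, variables: dict):
    """Recursively expand $var in string values; other data passes through unchanged."""
    if isinstance(value, str):
        return Template(value).substitute(variables)  # KeyError on an unknown $var
    if isinstance(value, list):
        return [substitute(v, variables) for v in value]
    if isinstance(value, dict):
        return {k: substitute(v, variables) for k, v in value.items()}
    return value

def call_binding(func, binding: dict, variables: dict):
    """Explicit call: only the declared args/kwargs are passed, nothing is injected."""
    args = substitute(binding.get("args", []), variables)
    kwargs = substitute(binding.get("kwargs", {}), variables)
    return func(*args, **kwargs)

# Hypothetical stand-in for the function behind "MyPkg:load_esm".
def load_esm(path, grid="1x1", skip_models=()):
    return (path, grid, list(skip_models))

binding = {
    "ref": "MyPkg:load_esm",
    "args": ["$path"],
    "kwargs": {"grid": "5x5", "skip_models": ["CESM.*"]},
}
result = call_binding(load_esm, binding, {"path": "/data/esm_5x5", "key": "esm_5x5"})
print(result)
```

Because the dataset path reaches the function only through the "$path" entry in args, dropping that entry would call the loader without a path at all, which is what "nothing auto-injected" implies.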
shell fetcher
shell is a bare, language-agnostic dataset field: a command template run as a
subprocess to fetch the dataset. Unlike a bare fetcher/loader (language-implicit —
the running tool's own language), shell is the same command for every tool, so it
belongs on the dataset table, not under the language namespace. It is a fetcher only — a
subprocess cannot return a live in-memory object, so there is no shell loader. The value
is a command template supporting variable substitutions: $download_path, $project_root,
$uri, $key, $version, $doi, $format, $branch, $path_<ref>, $path_<i>,
$requires_paths.
The bare shell field is the canonical (and only) form; the former
[<dataset>._LANG.shell].fetcher is not part of the spec.
Project-wide loaders
[_LANG.<lang>.loaders] is a format → binding map of project-wide default loaders for
language <lang>: each value is a binding in either form (a bare module:function
string, or a { ref, args, kwargs } table — see Binding forms), so a format default may
be parameterized exactly like a per-dataset loader. It applies when a dataset has no
per-dataset loader for that language. Note the singular loader key per dataset vs. the
plural loaders format map at the top level.
Language-implicit bindings (bare fetcher / loader)
For a single-language project the [<dataset>._LANG.<lang>] wrapper is needless ceremony.
A dataset table MAY therefore carry a bare fetcher and/or loader directly (a
binding in either form), and a top-level [_LOADERS] table MAY carry a bare
format → binding map. "Bare" means language-implicit: a reading tool interprets these
as bindings in its own language, exactly as if they appeared under
[<dataset>._LANG.<self>] / [_LANG.<self>.loaders].
- Precedence: explicit wins. An explicit own-language binding overrides the bare one: [<dataset>._LANG.<self>].loader > bare loader, and [_LANG.<self>.loaders][fmt] > [_LOADERS][fmt] (likewise for fetcher).
- Strict: fail loud. A bare binding is present for the running language (bare = the running language), so it is treated exactly like an explicit [<dataset>._LANG.<self>] binding: if it fails to resolve it is an error, and if it resolves and then raises at run time the error propagates. There is never a silent fall-through to a different loader/fetcher (which could hand a program wrong-shaped data behind only a warning). The ladder falls through only for bindings that are absent for the running language. A manifest meant to be read by more than one language uses explicit [<dataset>._LANG.<lang>] bindings (absent, and so correctly skipped, in the other languages); sharing a bare binding across languages and expecting the others to ignore it is unsupported. (A tool-wide best-effort mode, e.g. "fetch everything that succeeds, skip the rest", is a separate concern, out of scope for this rule and not introduced here.)
- Preserve verbatim (round-trip). A writer MUST keep a bare binding bare: it MUST NOT promote loader = … into [<dataset>._LANG.<self>].loader. A tool writes under _LANG.<self> only for bindings it generates itself, so hand-authored bare bindings survive a read-write round-trip unchanged and one tool never rewrites another language's view.
[_LOADERS] was previously a deprecated back-compat table; it is now tolerated as the
language-implicit counterpart of [_LANG.<self>.loaders] — read as the running tool's
format-default loaders and preserved verbatim on write. The shell fetcher is the
language-agnostic sibling of these language-implicit bindings: a bare dataset field
carrying the same command for every tool (see shell fetcher).
Resolution semantics
At runtime, each language tool collapses the _LANG tree to a single effective
fetcher and loader for each dataset. The full _LANG tree is retained
internally for lossless round-trip.
Fetch ladder
The tool tries each rung in order, using the first that applies:
1. [<dataset>._LANG.<self>].fetcher, else the bare [<dataset>].fetcher: in-process call (own language, fastest);
2. the dataset's shell command: run the command template (cheap subprocess);
3. cross-language fetch, the rare case: run a fetcher defined in another language (mechanism implementation-defined; the Python CLI can serve as a fallback), controlled by delegate / --delegate; see Cross-language fetch below;
4. plain uri download (if uri is set), dispatched by scheme (see Download schemes);
5. else error.
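The ladder can be read as a small decision function. The sketch below only names the rung that would apply to a parsed dataset table; it does not run anything. pick_fetch_rung is a hypothetical helper, and treating a disabled delegate as "skip rung 3 and continue to the uri rung" is an assumption about how the gate behaves. Batch uris are not handled here.

```python
def pick_fetch_rung(dataset: dict, self_lang: str, delegate: bool = False) -> str:
    """Return the name of the first fetch rung that applies to a dataset table."""
    lang = dataset.get("_LANG", {})
    # Rung 1: explicit own-language fetcher wins over the bare one.
    if lang.get(self_lang, {}).get("fetcher") or dataset.get("fetcher"):
        return "own-fetcher"
    # Rung 2: the language-agnostic shell command template.
    if dataset.get("shell"):
        return "shell"
    # Rung 3: a fetcher defined in another language, gated by delegate.
    foreign = any(t.get("fetcher") for name, t in lang.items() if name != self_lang)
    if foreign and dataset.get("delegate", delegate):
        return "cross-language"
    # Rung 4: plain download of the single uri.
    if dataset.get("uri"):
        return "uri"
    # Rung 5: nothing applies.
    raise LookupError("no applicable fetch rung")

# The `bar` dataset from the example manifest, as a parsed table.
bar = {
    "sha256": "def456",
    "format": "nc",
    "shell": "make-bar -o $download_path",
    "_LANG": {
        "julia": {"fetcher": "MyPkg:build_bar"},
        "python": {"fetcher": "mypkg.build:bar"},
    },
}
print(pick_fetch_rung(bar, "python"))  # own fetcher is present
print(pick_fetch_rung(bar, "r"))       # no R fetcher, so the shell command applies
```

The second call shows why shell sits on the dataset table: a tool for a language with no binding of its own still finds a usable rung before any cross-language step.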
Download schemes
The plain-uri rung dispatches on the URI scheme. The spec fixes the scheme set and
its semantics — "fetch the named object, then verify sha256 as usual" — but not the
mechanism: each implementation fetches with whatever backend fits the language.
| Scheme(s) | Fetch |
|---|---|
| http / https | streaming GET |
| git / ssh+git / https://*.git | shallow clone (--branch honors branch) |
| ssh / sshfs / rsync | rsync over SSH |
| file:// | copy (or rsync from a remote host) |
| object stores: s3://, gs://, gcs://, az://, abfs://, abfss://, adl://, gdrive:// | fetch the object from the named store, then verify sha256 |
- Object-store schemes are normative, but mechanism-agnostic: a tool MAY implement them with any backend (the Python tool uses fsspec behind an optional extra; a peer tool uses its own packages). A tool that cannot serve a scheme delegates it (cross-language fetch) or errors with unsupported scheme; it MUST NOT silently skip it.
- HTTP/HTTPS are deliberately not in the object-store set: they keep their own dedicated GET path.
- A uri fetched by any scheme is sha256-verified like any other download; an object-store URI is just another source of bytes. (To open such a URI without downloading, set lazy_access; see the field table.)
Load ladder
The tool tries each rung in order:
1. [<dataset>._LANG.<self>].loader, else the bare [<dataset>].loader;
2. [_LANG.<self>.loaders][<dataset>.format], else [_LOADERS][<dataset>.format]: manifest-configured format default;
3. the tool's built-in default loader for <dataset>.format;
4. else error.
At each own-language rung the explicit _LANG.<self> binding takes precedence over the
bare one. A binding that is present for the running language (bare, or explicit
_LANG.<self>) but fails to resolve is an error; one that resolves and then raises
propagates. The ladder falls through only to skip rungs that are absent (another
language's _LANG.<other> fetcher, or no own loader), never to paper over a broken present
binding (see Language-implicit bindings).
Load never delegates. A loader returns a live in-memory native object, which cannot cross a process boundary. Cross-language data preparation is modeled as one language's fetcher writing a normalized artifact (Arrow/parquet/netcdf) that another language then loads with its own format default.
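A sketch of how a tool might collapse the load ladder to one effective loader binding. pick_loader and the BUILTIN_LOADERS table are hypothetical; the sketch only selects a binding and leaves resolving and calling it (where the fail-loud rule applies) to the caller.

```python
# Hypothetical built-in defaults of the running tool.
BUILTIN_LOADERS = {"csv": "builtin:read_csv"}

def pick_loader(manifest: dict, name: str, self_lang: str):
    """Return the effective loader binding for a dataset, following the load ladder."""
    dataset = manifest[name]
    fmt = dataset.get("format", "")
    project = manifest.get("_LANG", {}).get(self_lang, {}).get("loaders", {})
    candidates = [
        dataset.get("_LANG", {}).get(self_lang, {}).get("loader"),  # rung 1, explicit
        dataset.get("loader"),                                      # rung 1, bare
        project.get(fmt),                                           # rung 2, explicit
        manifest.get("_LOADERS", {}).get(fmt),                      # rung 2, bare
        BUILTIN_LOADERS.get(fmt),                                   # rung 3
    ]
    for binding in candidates:
        if binding:  # only absent rungs are skipped
            return binding
    raise LookupError(f"no loader for dataset {name!r} (format {fmt!r})")

# Parsed form of part of the example manifest.
manifest = {
    "_LANG": {"python": {"loaders": {"csv": "pandas.io.parsers:read_csv",
                                     "nc": "xarray:open_dataset"}}},
    "foo": {"format": "csv"},
    "bar": {"format": "nc", "_LANG": {"python": {"loader": "mypkg.load:bar"}}},
}
print(pick_loader(manifest, "bar", "python"))
print(pick_loader(manifest, "foo", "python"))
```

Note that bindings under another language's _LANG table never enter the candidate list, which is how they end up "absent" for the running tool.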
Cross-language fetch (rung 3)
Reached only in the rare case that a dataset has no fetcher in the running tool's own
language, no shell command, and no uri — its bytes can be produced only by a fetcher
defined in another language ([<ds>._LANG.<other>].fetcher). Native / shell / plain
uri cases never reach here, so each implementation is self-sufficient for nearly all
datasets.
How a tool runs a foreign fetcher is implementation-defined. It MAY invoke that
language's runtime directly (e.g. julia --project=<env> -e '…', writing to
$download_path and materializing the result itself), MAY delegate to a peer-language
datamanifest CLI (see Peer-CLI contract), or MAY skip the rung. The Python
implementation is the reference and aims to cover every language, so a tool with no
native way to run a foreign fetcher can simply call the Python CLI as a fallback.
Either way it moves bytes on disk only (load never crosses languages); a tool MUST fall
through to uri when the needed toolchain is absent; and it applies to fetched datasets
only — produced (@cached) datasets are not cross-language. Gated by the delegation
capability; the delegate field / --delegate toggles it.
Storage
Storage reduces to two paths: where fetched datasets go, and where the produced cache
goes. Both are set in [_STORAGE] and default to local, repo-relative folders, so a
casual user gets visible ./datasets/ and ./cached/ with zero configuration and nothing
derived:
[_STORAGE]
datasets_dir = "datasets" # fetched datasets (default; relative => <repo>/datasets/)
datacache_dir = "cached" # produced cache (default; relative => <repo>/cached/)
- Relative ⇒ relative to the project root (the manifest's directory, $repo). Absolute, ~-, or $symbol-rooted paths are used as written.
- Resulting paths are flat: a fetched dataset lands at <datasets_dir>/<key>; a produced artifact at <datacache_dir>/<cachetype>/[<version>/]<hash>/ (see Produced datasets and caching). No partition, no prefix, no derived name in between: the folder you set is the location.
Storage is a portable location layer: the symbols below resolve to the same place in every
implementation, so peer tools (Python / Julia) share an on-disk location without
configuration. The core knows locations only — no lifetime policy; disposability and GC of
produced datasets are the cache layer's concern (see Produced datasets and caching), not the
core fetch engine's. [_STORAGE] and its _HOST sub-table are defined structural keys: a
conforming tool parses them identically and preserves them verbatim (a tool without the
storage capability treats the whole table as a preserved unknown).
Read pools (reuse what is already on the machine)
Two optional [_STORAGE] list fields name read-only locations to reuse before
materializing, so an object another project already has is not fetched or
recomputed again:
[_STORAGE]
datasets_pools = ["~/.cache/Datasets", "$user_data_dir/datamanifest/datasets"]
datacache_pools = ["$team/cache"] # opt-in; no default
- datasets_pools: fetched datasets. Resolution probes each pool (at the dataset's keyed sub-path) after the recorded and directive-derived locations and before downloading. On a hit the declared sha256 is verified (a mismatch is skipped, never trusted), the location is recorded in the state file, and the bytes are used in place, with no copy. A genuine download still goes to datasets_dir (the directive; gold standard). Default when the field is absent: the built-in well-known pools (~/.cache/Datasets, $user_data_dir/datamanifest/datasets); an explicit list is used verbatim; an empty list disables pools.
- datacache_pools: produced artifacts. Symmetric: the produced-dataset hit-search also probes each pool at <pool>/<cachetype>[/<version>]/<hash>, gated by the usual config.toml validation, and self-heals the record on a hit. It is opt-in: absent means no pools (and no built-in default). There is no de-facto shared compute location, and a produced artifact carries no content checksum (only its cachetype / version / hash identity), so cross-project adoption must be deliberate.
- Read-only and host-local. Pools are never written to and never edit datasets.toml; they only add candidate locations to read-resolution. They are resolved like any other path expression ($-symbols, ~, env) and are host-composable via [_STORAGE._HOST] (a pool list set per hostname glob); environment overrides are DATAMANIFEST_DATASETS_POOLS / DATAMANIFEST_DATACACHE_POOLS. Honored under the storage capability; other tools preserve the fields verbatim.
Symbols
A folder path may interpolate $-symbols: $NAME / ${NAME} expands to a defined symbol,
else to the environment variable NAME; ~ expands to home. A symbol is defined by a bare
key and referenced with $.
Predefined (platform-resolved; the names map directly to platformdirs):
| Symbol | Resolves to |
|---|---|
| $user_data_dir | platformdirs.user_data_dir(): the machine's user data dir (persistent) |
| $user_cache_dir | platformdirs.user_cache_dir(): the machine's user cache dir (reclaimable) |
| $repo | the project root (the manifest's directory); the base for relative paths |
They are bare — no datamanifest/app name is appended; you namespace explicitly
(datasets_dir = "$user_data_dir/myproj"). Every implementation MUST resolve them to the
identical path and MUST NOT substitute a language-native location (e.g. a package depot), so
Python and Julia agree. $USER and any other environment variable are available too.
User-defined — any other bare key under [_STORAGE] is a reusable symbol, and may be made
host-specific in [_STORAGE._HOST."<glob>"] (matched against the hostname). The two
fields themselves may also be host-specific:
[_STORAGE]
datacache_dir = "$scratch/cache" # reference a custom symbol
scratch = "/scratch/$USER" # its default definition
[_STORAGE._HOST."login*.hpc.edu"] # hostname glob / regex
scratch = "/work/$USER" # host-specific value
datasets_dir = "$user_data_dir/shared" # a field, host-specific
Resolution ladder (symbol or field; first match wins): DATAMANIFEST_<NAME> environment
variable → matching [_STORAGE._HOST.<glob>].<name> → base [_STORAGE].<name> → (predefined
symbols only) the platformdirs / project-root default. Host-specificity always lives in the
symbol's resolution, never a per-dataset host map.
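The resolution ladder might be sketched as follows. Several points are assumptions: resolve_name is a hypothetical helper, the environment variable name is formed by upper-casing the symbol name, host keys are matched as shell-style globs with fnmatch (regex keys are not handled), and the returned value is left unexpanded (it may still contain $USER or other symbols).

```python
import fnmatch
import os

def resolve_name(name: str, storage: dict, hostname: str, env=os.environ, predefined=None):
    """Resolve one [_STORAGE] symbol or field; the first matching rung wins."""
    env_key = f"DATAMANIFEST_{name.upper()}"
    if env_key in env:                                  # 1. environment override
        return env[env_key]
    for glob, table in storage.get("_HOST", {}).items():
        if fnmatch.fnmatch(hostname, glob) and name in table:
            return table[name]                          # 2. host-specific value
    if name in storage:                                 # 3. base [_STORAGE] value
        return storage[name]
    if predefined and name in predefined:               # 4. predefined default
        return predefined[name]
    raise KeyError(name)

# Parsed form of the [_STORAGE] example above.
storage = {
    "datacache_dir": "$scratch/cache",
    "scratch": "/scratch/$USER",
    "_HOST": {
        "login*.hpc.edu": {"scratch": "/work/$USER", "datasets_dir": "$user_data_dir/shared"},
    },
}
print(resolve_name("scratch", storage, "login1.hpc.edu", env={}))
print(resolve_name("scratch", storage, "laptop", env={}))
```

Passing env explicitly keeps the sketch deterministic; a real tool would read the process environment and then expand the remaining symbols in the returned string.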
To centralize and share across clones, branches, or projects, point the two fields at a machine dir under a name you choose (for example datasets_dir = "$user_data_dir/myproj"): one self-documenting edit, nothing derived.
Environment
Exactly two environment variables override the fields — for HPC / CI / containers where
editing the manifest is inconvenient: DATAMANIFEST_DATASETS_DIR and
DATAMANIFEST_DATACACHE_DIR. (User-defined symbols override as DATAMANIFEST_<NAME>.)
Per-dataset path (storage_path)
A dataset MAY override where it lives with the storage_path field — a path expression
that defaults to $datasets_dir/$key ($key is the dataset's storage key, see key). It
generalizes the former local_path (which was exact-only) and subsumes the former store; the
only distinction was whether the key is appended:
- contains $key ⇒ a tool-managed, keyed location; the dataset is materialized there and maintenance MAY act on it. storage_path = "$scratch/$key" parks one heavy dataset on scratch.
- an exact path without $key ⇒ a user-managed location, used verbatim, bypassing the keyed layout, and maintenance never touches it. storage_path = "$cmip/AMIP/tas.nc" points at a file you manage; with $cmip resolved host-specifically, that is the heavy-archive-per-host pattern (host-specificity in the symbol, never a per-dataset host map).
(The name path is not used for this — it is already the URI's parsed path component.)
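The managed versus user-managed distinction reduces to one check on the path expression. A minimal sketch, assuming both the $key and ${key} spellings count as a reference and that a longer symbol such as $keys does not; is_tool_managed is a hypothetical name.

```python
import re

# Matches "$key" as a whole symbol, or the braced form "${key}".
_KEY_REF = re.compile(r"\$(?:key\b|\{key\})")

def is_tool_managed(storage_path: str) -> bool:
    """True when the expression references the storage key, i.e. a keyed location."""
    return _KEY_REF.search(storage_path) is not None

print(is_tool_managed("$scratch/$key"))       # keyed: maintenance may act on it
print(is_tool_managed("$cmip/AMIP/tas.nc"))   # exact path: left alone
```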
In-memory and multiple manifests (library use)
A manifest is a logical structure; a datasets.toml file is its canonical serialized
form, but a tool MAY construct and hold one in memory, and several MAY be live at once
in one process — each resolving its own datasets, bindings, and storage independently (an
in-memory manifest resolves identically to a file-backed one). The construction surface is
per-language and non-normative (Python's Database, Julia's Database); whether and where
such a manifest is persisted is the author's choice.
This is how a library owns its own data without touching the end user's datasets.toml:
it builds its own manifest, declares its datasets via the language API, and sets its
datasets_dir / datacache_dir explicitly (e.g. under $user_data_dir/<library>) — a
relative default would otherwise resolve against the end user's project, never the installed
library's. Recommended: a library shipping data dependencies should own them this way (its
own manifest + explicit folders); without explicit folders, its data falls back, as intended,
to the end user's project, and that user owns the location.
Concurrent access and completeness
A folder may be shared between tools and between concurrent processes (e.g. HPC jobs), so materialization MUST be safe under concurrency, and peer tools sharing a folder MUST agree on these conventions:
- Atomic publish. Materialize into a temporary path within the same store partition as the final path (e.g. <key>.tmp beside <key>) and atomically rename it into place, so a killed process never leaves a partial entry that looks complete. Staging is never a central/global cache: a download (or produce) destined for $scratch stages on $scratch. This is required for the rename to be atomic (same filesystem) and essential for voluminous data, which must never transit a small ~/.cache first.
- Completion marker. An entry is complete iff its marker exists: <key>/.complete for a directory, <key>.complete for a file. Readers MUST treat an entry without its marker as absent (re-fetch); a writer MUST create the marker only after a successful, verified materialization.
- Lock. A writer SHOULD hold an exclusive lock <key>.lock (a pidfile; a lock whose PID is dead and older than a grace period MAY be reclaimed) while materializing, so concurrent workers neither recompute nor clobber the same entry.
Produced datasets and caching (companion layer)
Spec-v2.1: a companion layer, not a core capability. The produce-or-load (@cached) layer sits outside the core fetch engine as a distinct capability layer built on the shared substrate it reuses (safe-materialization, folder resolution, loaders). This section is the cross-tool format spec for that layer; it stays in this document so both languages agree on the on-disk shape.
Packaging is not constrained by this spec. Whether an implementation ships the layer as a separate package that depends on the core, or as an optional module of the same package, is the implementation's choice (spec-v2 over-specified this as a separate package; spec-v2.1 relaxes it). What the spec fixes is the boundary: the cache-produce / inspect capabilities are never declared by the core fetch capability, and the core fetch engine keeps no garbage collection and no disposability policy.
The format is additive over a datasets.toml: it adds no field to the hand-authored datasets.toml and does not change its _META.schema (still 1). A produced dataset reuses the existing engine (the storage model, the safe-materialization primitive, and the load ladder) but is not declared in datasets.toml; its only on-disk record is the machine-generated config.toml / metadata.toml sidecars (each _META.schema = 1) and the project's state file (_META.schema = 5). The format is gated by two independent capabilities, cache-produce and inspect, so a companion may ship neither, one, or both.
A produced dataset is one whose bytes come from running a project function rather
than downloading a uri. It is the same "recipe + key + store + policy" object as a
fetched dataset; the distinction is purely two slots:
- the recipe: a uri / shell / git recipe with a source-identity key, versus a function recipe with a parameter-hash key;
- the authorship: a fetched dataset is hand-authored in datasets.toml; a produced dataset is machine-generated.
Everything else — the storage folders, the safe-materialization primitive, loaders, the preservation contract — is recipe-agnostic and applies unchanged. The only new normative axis is parameter-hash keying and its on-disk bookkeeping.
A produced dataset has no entry in datasets.toml. It originates from a function
that a tool exposes through its produce-or-load surface (the @cached decorator /
macro — per-language and non-normative; see below), and is recorded only after it
runs. cachetype is therefore not a datasets.toml field; it is a namespace that
appears solely in the machine-generated records — the state file's datacache entry,
the config.toml [_META] block, and the on-disk path. A conforming fetch path
(download_dataset and the fetch ladder) never encounters a produced dataset: the
two concerns share the engine, not the manifest. This keeps datasets.toml clean
(the hand-authored, git-committed spec) and confines produced,
parameter-hash-keyed churn to the git-ignored state file (regenerable ground
truth).
A produced dataset is identified by its keyword parameters, not by content: its
storage key is <cachetype>/<param-hash>, its parameters are the hash inputs,
and the config.toml sidecar is the re-checkable record of those inputs (it is not
content-pinned by a manifest sha256). It is materialized under datacache_dir
(default cached, relative ⇒ local ./cached/), the producing layer's folder.
Keyword-only. Because the parameters double as identity, the producing function is keyword-only for hashing: an ordered positional argument list has no stable name→value identity to hash, so a cache-produce tool MUST derive the key table from keyword parameters only. This is a property of the produce-or-load surface, not a datasets.toml rule — fetched datasets keep their positional args per spec-v1.1.
Parameter-hash keying¶
A produced dataset's identity is (cachetype, hash-of-its-parameters). The
hash inputs are the producing function's hash-affecting keyword parameters as a
key table: a mapping parameter name → value. Parameters split three ways
(normative — the split, not the exact source mapping):
| Class | In the hash? | Stored where | Example |
|---|---|---|---|
| hash-affecting params | yes | config.toml sidecar | grid = "5x5" |
| runtime knobs (_-prefixed keys) | no | nowhere (transient) | _parallel = true |
| audit-only extras | no | metadata.toml sidecar | producing git commit |
How a tool derives the key table from a function's declared keyword parameters
(signature introspection, an explicit key selector like LGMIO's key=(args -> (;…)),
etc.) is implementation-defined; the serialization and hash are normative so
the same parameters yield the same key everywhere:
- Build the key table from the hash-affecting keyword parameters, excluding every key whose name begins with _ (those are runtime knobs).
- Serialize it to canonical JSON (JCS, RFC 8785): object members sorted by Unicode code point at every nesting level, no insignificant whitespace (member separator ,, name separator :), UTF-8 output with minimal JSON string escaping. To keep canonicalization unambiguous, hash-input values are restricted to strings, integers, finite floats, booleans, and arrays/objects composed of those. A finite float serializes through this same canonical-JSON projection — the Python reference json.dumps float form is normative (1.0 → 1.0, 0.1 → 0.1) and a non-Python tool MUST reproduce it byte-for-byte. NaN and ±Inf are disallowed (no JSON representation), as are nulls (an absent parameter is omitted, not encoded as null). A float-valued knob MAY still be passed as a string when maximal cross-tool hash stability matters, since float formatting is the most implementation-sensitive case. Array element order is significant (arrays are data); object key order is not (sorted).
- The parameter hash is the lowercase hex SHA-256 of those canonical UTF-8 bytes.
- The storage key is "<cachetype>/<hash>" — or "<cachetype>/<version>/<hash>" when an optional recipe version is set (see Produced-artifact location). (Tools MAY display a short hash prefix, but the on-disk directory and all references use the full 64-hex digest.)
Canonical JSON (rather than TOML) is the hash input precisely because it has a
fully-pinned byte form that Python (json.dumps(obj, sort_keys=True,
separators=(",", ":"), ensure_ascii=False)) and Julia produce identically
today for strings, integers, booleans, and their arrays/objects, independent of
the cross-tool TOML byte-identity work (for finite floats the Python form is the
normative reference a non-Python tool reproduces). So a produced
dataset resolves to the same <cache_root>/<cachetype>/<hash> path under either
tool (even though the artifact bytes a given tool writes there may be
language-specific; cross-tool loading of a produced artifact is not implied,
only cross-tool addressing and maintenance). The config.toml sidecar
stores the same key table in human-readable TOML; the hash is over its canonical
JSON projection, not over the TOML bytes.
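The serialization and hash steps above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the helper names (`canonical_json`, `param_hash`, `_check_value`) are hypothetical, and only the `json.dumps` arguments are taken from the text.

```python
import hashlib
import json
import math


def _check_value(value):
    """Reject anything outside the restricted hash-input types (hypothetical helper)."""
    if isinstance(value, (str, bool, int)):
        return
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("NaN and infinities are not valid hash inputs")
        return
    if isinstance(value, (list, tuple)):
        for item in value:
            _check_value(item)
        return
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                raise TypeError("object keys must be strings")
            _check_value(item)
        return
    # None lands here: an absent parameter is omitted, not encoded as null.
    raise TypeError(f"unsupported hash-input type: {type(value).__name__}")


def canonical_json(params):
    """Serialize the key table: drop _-prefixed runtime knobs, sort keys, no whitespace."""
    key_table = {k: v for k, v in params.items() if not k.startswith("_")}
    _check_value(key_table)
    return json.dumps(
        key_table,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
        allow_nan=False,
    )


def param_hash(params):
    """Lowercase hex SHA-256 of the canonical UTF-8 bytes."""
    return hashlib.sha256(canonical_json(params).encode("utf-8")).hexdigest()
```

Calling `param_hash({"grid": "5x5", "skip_models": ["CESM.*", "FGOALS.*"], "_parallel": True})` hashes only `grid` and `skip_models`; a conforming tool is expected to reproduce the spec's reference vector for that key table, which makes it a convenient self-check.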
Identity: the cachetype namespace¶
cachetype is the disambiguation namespace that, paired with the parameter
hash, identifies a produced artifact. It has a default and an override:
- Default — the producing function's canonical importable name. Absent an explicit cachetype, a tool MUST derive it from the function's fully-qualified, importable name in the host language (Python: module.qualname, e.g. mypkg.analysis.produce; the cross-language rule is "the canonical name by which the runtime re-imports that callable"). An explicit cachetype = "<name>" override remains — for a stable hand-chosen name, or to deliberately group several functions under one namespace. The auto and explicit forms share one namespace: an explicit cachetype equal to the derived name denotes the same identity. (So the default cachetype coincides with the state file's datacache entry ref — by default the namespace is the producer's identity.)
- Why unique-per-function is the right default. The worst cache failure is silently mixing unrelated results under one key, so the default must be unique per producing function. The accepted, prominently documented consequence: renaming or moving the function (or restructuring its package) changes its cachetype and orphans the prior artifacts — the correct behavior, since the code identity that produced them is gone. version remains the tool for deliberate busting; re-pinning an explicit cachetype is the tool for deliberate continuity across a rename.
- No stable identity ⇒ require an explicit cachetype. When the producing function has no stable importable identity — a top-level script, a REPL, an eval-string, a notebook — a tool MUST NOT synthesize an ambiguous cachetype; it MUST require an explicit cachetype and error otherwise. This is the same constraint native serialization already imposes (an object defined in the entry-point module has no portable qualified name). (Python reference, non-normative: a __main__ function is resolved via the launch's recorded module identity — __main__.__spec__.name, which Python sets for python -m pkg.mod → pkg.mod but leaves None for a loose script, -c, the REPL, and notebooks — so those require an explicit cachetype.)
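A rough Python sketch of the default-plus-override rule follows. The function name `resolve_cachetype` is hypothetical and the exact error type is an assumption; the `module.qualname` derivation, the `__main__.__spec__.name` lookup, and the ban on `@` come from the text.

```python
import sys


def resolve_cachetype(func, explicit=None):
    """Return the cachetype for func, or raise when no stable identity exists."""
    if explicit is not None:
        # '@' is reserved as the version separator in state-file recipe keys.
        if "@" in explicit:
            raise ValueError("cachetype must not contain '@'")
        return explicit

    module = getattr(func, "__module__", None)
    qualname = getattr(func, "__qualname__", "")

    if module == "__main__":
        # `python -m pkg.mod` records "pkg.mod" in __main__.__spec__.name;
        # a loose script, -c, the REPL, and notebooks have no usable spec.
        spec = getattr(sys.modules.get("__main__"), "__spec__", None)
        module = spec.name if spec is not None else None

    # Closures and lambdas cannot be re-imported by name either.
    if not module or not qualname or "<locals>" in qualname or "<lambda>" in qualname:
        raise ValueError("no stable importable identity; pass an explicit cachetype")

    return f"{module}.{qualname}"
```

For an ordinary module-level function this yields names such as `json.dumps`; for a lambda or a function defined in a loose script it raises instead of guessing.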
Identity conflicts (same (cachetype, version), same process)¶
Because cachetype can be set explicitly, two distinct producing functions can
be made to claim the same namespace. A tool SHOULD guard the dangerous case:
if two distinct functions claim the same (cachetype, version) pair while
simultaneously live in one process, it SHOULD raise immediately and name both,
rather than let them silently share a key.
- The key is the pair. The same cachetype with different versions is a valid, tolerated case (e.g. two functions running calibration v1 and v2 at once). Storage location is irrelevant to the check — a cachetype must be unique regardless of where its artifacts are written.
- The guard is intentionally same-process / same-time. Equally-named functions used in separate runs simply share the slot, which is permitted — a user may engineer that, or split modules / set explicit cachetypes to keep them apart. There is no static cross-process check, and none is wanted: "live in one process at the same time" is exactly the boundary where a collision is unambiguously a mistake.
- Transient functions are exempt. Nested, local, or anonymous functions (a closure, or a function defined inside another) are dynamic and short-lived, so they are exempt from the guard — and they typically lack a stable importable identity, so they already require an explicit cachetype (above) to be cached at all.
- The mechanism is implementation-defined. (Reference, non-normative: a process-local registry populated at decoration/registration time — keyed by the function's ref so a re-import overwrites rather than duplicates — with no disk writes; a tool that never imports the user's code never sees the recipes, and separate processes keep separate registries, so the guard never entangles caches across projects.)
When and how to detect is the implementer's call. A tool SHOULD run the check at the earliest reasonable and practical point for how its host language loads and runs code. In a dynamically-loaded language (Python) that is decoration / import time. In a language that mixes ahead-of-time precompilation with live execution (notably Julia) the right point is less obvious — registration done at top level may run during precompilation rather than in the user's session — so a faithful, low-surprise guard might instead belong at first call or a runtime-init hook, or, if precompilation makes a sound guard impractical, be narrowed or omitted. The spec fixes only the semantics — two distinct functions, the same (cachetype, version), simultaneously live — never the timing or mechanism; that is left to each language's implementer to realize (or to judge infeasible), which is why the guard is a SHOULD.
Produced-artifact location¶
A produced artifact composes its path under the datacache_dir (see Storage):
<datacache_dir>/<cachetype>/[<version>/]<hash>/
The produce surface resolves datacache_dir from the same [_STORAGE] as fetched data
— it MUST read the nearest discovered manifest's [_STORAGE] (the same upward walk used to
find the project root; a plain TOML read, no fetch layer), so produced and fetched data share
one storage configuration. By default datacache_dir = "cached" (relative ⇒ local
./cached/), so produced artifacts are local and visible like everything else;
centralizing/sharing is the explicit datacache_dir = "$user_cache_dir/<name>" edit
(Storage). There is no scope, prefix, or partition in the path — the folder you set is the
location.
- version — an optional recipe/code version segment (below).
- An explicit per-call location — given per @cached call (cache_dir = …) — is used verbatim (<cache_dir>/<cachetype>/[<version>/]<hash>), the experiment-folder workflow.
The chosen folder determines only where an artifact lives, never whether a hit is valid: validity rests on the key (cachetype + hash) alone.
Recipe version (optional). A producing call MAY carry a short version string
(e.g. "v3", a date). When set it becomes a path segment between cachetype and hash
(<cachetype>/<version>/<hash>) and is recorded in the config.toml sidecar and the
state file's datacache entry (in the recipe key, <cachetype>@<version>). version does not enter the parameter hash; it is an explicit,
human-set recipe/code version, orthogonal to the parameter key and distinct from
_META.schema (the on-disk format version). Its purpose is correctness under sharing: a
change to a function's logic that leaves its parameters unchanged would otherwise read a
stale hit from another branch or clone — bumping version forces a miss. When unset, the key
stays <cachetype>/<hash>.
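The path composition described in this section can be sketched as below. The function name `artifact_dir` and its signature are hypothetical, `$`-symbol expansion of `datacache_dir` is assumed to have happened already, and `PurePosixPath` is used only to keep the illustration platform-independent.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

_FULL_DIGEST = re.compile(r"[0-9a-f]{64}")


def artifact_dir(
    datacache_dir: str,
    cachetype: str,
    param_hash: str,
    version: Optional[str] = None,
    cache_dir: Optional[str] = None,
) -> PurePosixPath:
    """Compose <root>/<cachetype>/[<version>/]<hash> for a produced artifact."""
    if not _FULL_DIGEST.fullmatch(param_hash):
        raise ValueError("on-disk paths use the full 64-hex digest")
    # An explicit per-call cache_dir is used verbatim in place of datacache_dir.
    root = PurePosixPath(cache_dir if cache_dir is not None else datacache_dir)
    parts = [cachetype]
    if version:
        # The version is a path segment only; it never enters the hash.
        parts.append(version)
    parts.append(param_hash)
    return root.joinpath(*parts)
```

With the default `datacache_dir = "cached"`, an unversioned recipe lands at `cached/<cachetype>/<hash>`, and setting `version="v3"` inserts the extra segment without changing the hash.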
Cache layout and sidecars¶
A produced artifact is materialized at the composed path above —
<datacache_dir>/<cachetype>/[<version>/]<hash>/ (see
Produced-artifact location) — via the same safe-materialization primitive (atomic publish,
.complete marker, .lock pidfile) as any other store write. The directory is
self-describing through two sidecars written next to the artifact:
…/<cachetype>/[<version>/]<hash>/
├── <basename>.<ext> # the produced artifact (format-determined)
├── config.toml # the re-hashable hash inputs (the key table)
├── metadata.toml # provenance / audit (never hashed)
└── .complete # completion marker (file form: <hash>.complete alongside)
config.toml (cache-produce) — the key table verbatim plus a [_META]
block, so any tool can recompute the hash and confirm the directory's identity. The
key table is written at the root and first (TOML requires root-table keys to
precede any table header), so [_META] comes last; reading back, the key table is
every root key except the [_META] block:
# --- hash-affecting parameters (the key table) ---
grid = "5x5"
skip_models = ["CESM.*", "FGOALS.*"]
[_META]
schema = 1
cachetype = "esm_20c_anomaly"
# version = "v3" # optional recipe/code version (becomes a path segment when set)
# hash = SHA-256( {"grid":"5x5","skip_models":["CESM.*","FGOALS.*"]} ), canonical JSON:
hash = "83425a30d111562d46c1fce9de7618ea7f1f54e1be72e086cba0ac63c6f2ce9b"
(83425a3… is a verifiable reference vector: it is the SHA-256 of the canonical
JSON {"grid":"5x5","skip_models":["CESM.*","FGOALS.*"]}. Every conforming
cache-produce implementation MUST reproduce it.)
A tool with cache-produce MUST be able to recompute the hash from config.toml's
key table and MUST treat a directory whose recomputed hash ≠ _META.hash as
not a valid cache hit (re-produce).
Because format is a serialization choice and not a hash input, several formats of
the same computation share one <cachetype>/[<version>/]<hash> directory (a data.<ext>
per format). A hit is therefore valid only when the data file for the requested format
is present: a complete, hash-valid directory whose data.<ext> for this format is absent
recomputes (writing that format) rather than failing — so two recipes that share a
cachetype and hash to the same key but emit different formats coexist instead of
colliding.
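The two validity conditions above (recomputed hash equals _META.hash, and the requested format's data file is present) might be checked as in this sketch. It assumes `config` is the already-parsed `config.toml` mapping (for example from `tomllib` on Python 3.11+), that the directory form of the `.complete` marker is in use, and that `recompute_hash` and `is_hit` are hypothetical helper names.

```python
import hashlib
import json
from pathlib import Path


def recompute_hash(config: dict) -> str:
    """Hash the key table: every root key of config.toml except the [_META] block."""
    key_table = {k: v for k, v in config.items() if k != "_META"}
    canonical = json.dumps(
        key_table,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
        allow_nan=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_hit(artifact_dir: Path, config: dict, fmt: str) -> bool:
    """True only for a complete, hash-valid directory that holds data.<fmt>."""
    if not (artifact_dir / ".complete").exists():
        return False
    if config.get("_META", {}).get("hash") != recompute_hash(config):
        return False  # identity mismatch: not a valid hit, re-produce
    # Another format of the same computation may be present; this one must be too.
    return (artifact_dir / f"data.{fmt}").is_file()
```

A `False` result for a missing `data.<fmt>` in an otherwise valid directory means "recompute and write that format", not an error.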
metadata.toml (cache-produce) — provenance only, never an input to the
hash and never an authority for cache validity:
[_META]
schema = 1
created = "2026-06-02T15:04:05Z" # RFC 3339 UTC
tool = "datamanifestpy 0.17.0" # producing tool + version
host = "login3.hpc.edu"
user = "mahe"
[git]
commit = "1f8839c…"
branch = "main"
dirty = false
[origin]
state_file = "/home/mahe/proj/.datamanifest-state.toml" # the state file that inventories this artifact
Default serialization format (per language)¶
A produced dataset MAY omit format. When it does, a tool serializes the returned value
with its language-native default format and reads it back with the matching built-in
loader — so a bare return value round-trips with no configuration. This default is per
language and RECOMMENDED, not normative (native serialization is language-private and
version-sensitive; the spec pins cross-tool addressing, not the blob), but each
conforming cache-produce tool SHOULD define one and SHOULD ship both the saver (value
→ bytes) and the matching built-in loader (bytes → value):
| Language | RECOMMENDED default format | saver / loader |
|---|---|---|
| Python | pickle (data.pickle) | pickle.dump / pickle.load |
| Julia | jld2 (data.jld2) | JLD2.save / JLD2.load |
An explicit format always overrides. Bytes in a language-native default format are not
cross-language-loadable by construction (pickle is Python-only, jld2 Julia-only) — which
is consistent with the spec pinning only cross-tool addressing and maintenance, never the
blob format.
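For the Python row of the table, the saver/loader pair could look like the following sketch. The helper names `save_default` and `load_default` are hypothetical, and the write here is deliberately plain: a real tool would route it through the safe-materialization primitive (atomic publish, `.complete` marker, lock), which is out of scope for this illustration.

```python
import pickle
from pathlib import Path


def save_default(value, artifact_dir: Path) -> Path:
    """Serialize a returned value with the RECOMMENDED Python default format."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    target = artifact_dir / "data.pickle"
    with target.open("wb") as fh:
        pickle.dump(value, fh)
    return target


def load_default(artifact_dir: Path):
    """Read the value back with the matching built-in loader."""
    with (artifact_dir / "data.pickle").open("rb") as fh:
        return pickle.load(fh)
```

Because unpickling can execute arbitrary code, such a loader should only be pointed at artifacts the user or a trusted collaborator produced.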
The state file (.datamanifest-state.toml)¶
The hand-authored datasets.toml stays clean — it is the spec: what to
track and how to obtain it (a dataset's uri/fetcher/shell, a @cached
function's code), hand-authored and git-committed, the source of intent. What
it records is the expectation — a per-dataset directive storage_path (where
bytes should go) and a contract sha256 (what they should hash to).
Where each object actually landed on this machine is recorded separately, in a
sibling .datamanifest-state.toml — the state file: a tool-maintained,
per-object inventory of resolved on-disk locations. It is the machine analogue of
a lockfile, except it is not a committed reproducibility lock — it says
nothing about how to re-obtain a resource (that lives in the spec), so it is no
help to a fresh clone; it is regenerable local state, and therefore
git-ignored by default. (Produced artifacts and many fetched datasets live on
one machine, often outside the repo, and cannot be pulled from the internet — so
committing the inventory would only record paths no other clone can use. A project
whose data sits on a shared drive every clone can reach MAY choose to track it,
but that is a user's setup, not the design intent.) The leading dot marks it as a
CLI-read dotfile, not a hand-edited document.
The state file inventories both kinds of materialized object — fetched
datasets and produced artifacts — under two top-level namespaces (datasets and
datacache) parallel to the two storage folders, so the two never collide and
each is greppable on its own. _META.schema = 5:
[_META]
schema = 5
# --- fetched datasets: storage key → resolved location (+ actual checksum) ---
[datasets."example.com/foo.nc"]
storage_path = "datasets/example.com/foo.nc" # where the bytes actually are
sha256 = "abc123…" # actual digest; omitted under skip_checksum
# --- produced artifacts: cachetype[@version] → instances{hash → location} ---
[datacache."lgmpre.data.load_20c@v3"] # bare cachetype when unversioned
ref = "lgmpre.data:load_20c" # producing module:function (refreshed across a refactor)
format = "nc"
[datacache."lgmpre.data.load_20c@v3".instances]
"83425a30d111562d46c1fce9de7618ea7f1f54e1be72e086cba0ac63c6f2ce9b" = "cached/lgmpre.data.load_20c/v3/83425a30…"
- datasets — fetched. Keyed by the dataset's existing storage key (host/path[#version], its machine-independent identity — no new id). Each entry records the resolved storage_path (where the bytes actually are) and the actual sha256 of what is on disk (omitted when the dataset sets skip_checksum, so a very large file need never be hashed).
- datacache — produced. One table per recipe, keyed by <cachetype>[@<version>] — @ is a reserved version separator, so a bare key is the unversioned recipe and two versions of one cachetype never collide (a cachetype MUST NOT contain @, which module.qualname never does). It carries recipe-level ref / format and an instances table mapping each produced variation's parameter hash to the full artifact directory it was written to. The params are not stored here — they live in each artifact's config.toml sidecar (read at enumeration), so a single artifact's recorded location is its own complete record.
- Spec vs. state — the same two fields, two meanings. A dataset's storage_path and sha256 appear in both files on purpose: in datasets.toml they are the expectation (the directive for where bytes go / the contract digest); in the state file they are ground truth (the resolved location / the actual digest). The duplication is intentional and harmless — the state file is derived and disposable. (Fully separating directive-from-resolved and expected-from-actual is a future cleanup; see ROADMAP.md.)
- The state file is a defined structural sibling format with its own _META.schema. A tool that implements neither inspect nor produced caching need not read it. Earlier shapes (_META.schema 1–4: the produced-only flat and nested cached.toml forms) are still read and migrated forward, and cached.toml is the recognized legacy name. A produced artifact's metadata.toml carries a state_file back-pointer to the file that inventories it (audit only).
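The datacache keying scheme can be illustrated with a few small helpers operating on the parsed state file. All three function names are hypothetical, and the state file is assumed to have been read into a plain nested dict by a TOML parser.

```python
from typing import Optional, Tuple


def recipe_key(cachetype: str, version: Optional[str] = None) -> str:
    """Build the datacache table key: <cachetype>[@<version>]."""
    if "@" in cachetype:
        raise ValueError("cachetype must not contain '@' (reserved separator)")
    return f"{cachetype}@{version}" if version else cachetype


def split_recipe_key(key: str) -> Tuple[str, Optional[str]]:
    """Invert recipe_key; a bare key is the unversioned recipe."""
    cachetype, sep, version = key.partition("@")
    return cachetype, (version if sep else None)


def recorded_location(state: dict, cachetype: str, version: Optional[str],
                      param_hash: str) -> Optional[str]:
    """Look up where one produced instance was written, or None if unrecorded."""
    recipe = state.get("datacache", {}).get(recipe_key(cachetype, version), {})
    return recipe.get("instances", {}).get(param_hash)
```

Since a cachetype never contains `@`, splitting at the first `@` is unambiguous even if a version string happens to include one.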
The state file is read-only inventory (the gold standard)¶
The guiding invariant: the state file records where things are and is consulted
to find an existing object — it never directs a write. Every
(re)materialization follows the current directive — datasets_dir /
datacache_dir, a per-dataset storage_path, an explicit
@cached(storage_path=…) — which is the gold standard; the recorded location
only short-circuits the lookup.
- Read-resolution checks the state file first. Resolving where an object lives consults the recorded storage_path ahead of any derivation rule: if the bytes are actually there (and, for a dataset, checksum-valid when a digest is recorded), that is a hit — no re-derive, no re-download, no recompute. Only on a miss does resolution fall back to the directive-derived path, then any read pools (§Storage — reuse a copy another project already has, recording it on a hit), and only then fetch or produce. This is how a moved object is still found at its new home.
- Self-heal is additive, never destructive. Active resolution refreshes the record to match reality: a relocated object (bytes at the derived path, record stale) has its recorded location refreshed; an untracked object (bytes present, no entry) is registered; a missing one is re-materialized at the current directive and recorded. Because any access that consults the state file and finds nothing proceeds to fetch/produce, resolution can never leave a stale record. The relocate-refresh is the only automatic mutation; active resolution never deletes.
- A deleted or missing state file repopulates as objects are accessed — it is regenerable by construction.
- Concurrency. Every write re-reads the file, merges (additive union, last-writer-wins per object), then writes via a temp file + atomic rename, so parallel downloads / produces cannot clobber each other; additive-only updates make the merge conflict-free.
- Garbage vs. dirty. A malformed entry that roots nothing (e.g. an instance-less residue from a format change) is corruption, not a tracked-but-missing object — it is cleaned silently on read. A well-formed entry whose bytes are merely absent is a dirty state (below), surfaced rather than silently dropped.
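The lookup order in the first bullet (recorded location, then the directive-derived path, then read pools, then fetch or produce) can be sketched as a pure function. `resolve_existing` is a hypothetical name, `pool_candidates` stands for the object's would-be path inside each read pool, and checksum validation of a recorded digest is omitted for brevity.

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple


def resolve_existing(
    recorded: Optional[Path],
    derived: Path,
    pool_candidates: Iterable[Path] = (),
) -> Tuple[Optional[Path], str]:
    """Return (path, source) for the first place the bytes exist, else (None, "miss")."""
    # 1. The state file's recorded location short-circuits everything else.
    if recorded is not None and recorded.exists():
        return recorded, "state"
    # 2. The path the current directive would produce (a relocated object is found here).
    if derived.exists():
        return derived, "derived"
    # 3. A copy another project already holds in a read pool.
    for candidate in pool_candidates:
        if candidate.exists():
            return candidate, "pool"
    # 4. Nothing found: the caller fetches or produces at the derived path.
    return None, "miss"
```

On a `"derived"` or `"pool"` result the caller would refresh or add the state-file record; on `"miss"` it materializes at the current directive, never at the stale recorded location.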
Dirty states and explicit reconciliation¶
The state file is a first-order source of truth for where objects are, kept
non-destructively (git-style). A tool MAY classify each object's
state-vs-disk as clean, missing (recorded bytes gone), relocated
(recorded at L, bytes at the derived path D), untracked (bytes present,
no entry — an orphan for produced artifacts), or modified (recorded
sha256 ≠ actual; the full expected-vs-actual digest treatment is deferred).
Passive listing only labels these states — it never mutates.
Two explicit, user-invoked actions reconcile:
- --refresh — fix the state file only (no downloads, no file moves): re-point relocated entries to where the bytes actually are, and drop stale / missing entries. A pure state↔disk reconcile. (Untracked artifacts are picked up by active access, not here.)
- --delete — remove the selected objects' bytes and their entries; the only byte-removing action.
Removal is therefore explicit-only: passive listing and active resolution
never delete, --refresh only edits the record, and --delete is the sole byte
remover.
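The five state labels can be expressed as a small pure classifier, sketched below. The function name and boolean inputs are hypothetical; the caller is assumed to have already checked the filesystem and, where available, computed the digests.

```python
from typing import Optional


def classify(
    has_entry: bool,
    bytes_at_recorded: bool,
    bytes_at_derived: bool,
    recorded_sha: Optional[str] = None,
    actual_sha: Optional[str] = None,
) -> Optional[str]:
    """Label one object's state-vs-disk status; passive, never mutates anything."""
    if not has_entry:
        # Bytes with no entry are untracked (an orphan for produced artifacts).
        return "untracked" if bytes_at_derived else None
    if bytes_at_recorded:
        if recorded_sha and actual_sha and recorded_sha != actual_sha:
            return "modified"
        return "clean"
    if bytes_at_derived:
        return "relocated"  # recorded at L, bytes found at the derived path D
    return "missing"        # well-formed entry whose bytes are gone
```

Under this sketch, `--refresh` would re-point `relocated` entries and drop `missing` ones, while a listing command would only display the label.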
Maintenance (inspect, filter, delete)¶
Both fetched datasets and produced artifacts accumulate, so a tool MAY offer maintenance
(the inspect capability): enumerate the physical store, filter it, and delete an explicit
selection. Maintenance is user-driven, never automatic: it surfaces what is stored and
deletes only what the user explicitly selects. There is no automatic garbage collector —
see Why not automatic reachability below.
Reference CLI (non-normative). A tool exposes this however fits — like the @cached surface, the command shape is per-tool, not part of the spec. The reference design folds it into the existing datamanifest list: a default summary view with --field to pick columns; filter flags over the object fields (--kind, --folder, --orphan, --dirty, --older-than, --format, size); and action flags on the selected set — --delete, --refresh (reconcile the state file only), and optionally --move (dry-run/confirm by default). There is no separate gc command.
- Object fields. Each stored object — a fetched dataset or a produced artifact — exposes a common set of inspectable fields:
  - kind — data (fetched) or cached (produced);
  - key — <key> (fetched) or <cachetype>[/<version>]/<hash> (produced); hash for produced;
  - location — the resolved absolute path on disk;
  - referenced — whether a still-present local .toml roots it or it is an orphan. For a produced artifact the match is the full (cachetype, version, hash) tuple against a state-file datacache instance, so another project's artifact (in a different folder) is not mistaken for referenced; for a fetched dataset, its key listed in a datasets.toml;
  - format, size, created, and a best-effort, filesystem-derived last-access time (read from stat, never written on read; MAY be unknown).
These fields are the cross-tool inspectable surface; a tool with the inspect capability
MUST be able to report them.
- View. A tool presents a default summary (the most useful columns) and SHOULD let
the user choose which fields to show.
- Filter. Any field is a filter predicate — kind = cached, folder = <path>,
referenced = false (orphans), last-access older than 90d, format, size — so the
user can target, e.g., "orphaned produced artifacts in this folder not accessed in 90 days."
- Act on the selection. Actions operate on exactly the filtered set, uniformly
across fetched datasets and produced artifacts: delete (remove the bytes and
prune the object's state-file entry) and optionally move (relocate the bytes and
repoint the recorded storage_path — datasets.toml is not edited, so a later
re-fetch still follows the datasets_dir directive; gold standard). A tool MUST NOT
delete everything by default and MUST NOT delete as a side effect of any other
command. The explicit filter + action is itself the selection (typing the action over a
filtered set is the confirmation), so a tool MAY apply directly; it SHOULD offer a
--dry-run preview. Deletion is always of a user-chosen set, never an automatic sweep.
- Protections (the rule is unchanged, generalized). Maintenance never touches data the
user owns: a fetched dataset whose storage_path is a user-managed exact path (no
$key) or that is skip_download (a passive, externally-managed dependency) is reported
as skipped, never moved or deleted — the same guard already used for deletion, applied to
both kinds. A lazy_access dataset has no local copy to touch in the first place.
Tool-managed (keyed) objects under datasets_dir / datacache_dir are fair game.
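The filter-then-act model with its protections might be sketched as follows. `StoredObject`, `select`, and `plan_delete` are hypothetical names, only a subset of the inspectable fields is modelled, and falling back to `created` when last-access is unknown is an assumption of this sketch rather than a rule stated by the spec.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class StoredObject:
    kind: str                                # "data" (fetched) or "cached" (produced)
    key: str
    referenced: bool
    created: datetime
    last_access: Optional[datetime] = None   # MAY be unknown
    user_managed: bool = False               # exact storage_path or skip_download


def select(objects: Iterable[StoredObject], kind: Optional[str] = None,
           referenced: Optional[bool] = None, older_than: Optional[timedelta] = None,
           now: Optional[datetime] = None) -> List[StoredObject]:
    """Apply the user's filter predicates; nothing is selected implicitly."""
    now = now or datetime.now(timezone.utc)
    chosen = []
    for obj in objects:
        if kind is not None and obj.kind != kind:
            continue
        if referenced is not None and obj.referenced != referenced:
            continue
        if older_than is not None:
            seen = obj.last_access or obj.created  # assumption: fall back to created
            if now - seen < older_than:
                continue
        chosen.append(obj)
    return chosen


def plan_delete(selection: Iterable[StoredObject]) -> Tuple[List[StoredObject], List[StoredObject]]:
    """Split a selection into (to_delete, skipped); user-managed data is never touched."""
    selection = list(selection)
    to_delete = [o for o in selection if not o.user_managed]
    skipped = [o for o in selection if o.user_managed]
    return to_delete, skipped
```

`plan_delete` only builds the plan, which is what a `--dry-run` preview would print; removing bytes and pruning state-file entries remains a separate, explicit step.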
Both kinds are reclaimable, with different regeneration costs. Deleting a fetched
dataset under datasets/ just means it re-downloads on next use; deleting a produced
artifact under cached/ means it recomputes. Neither destroys irreplaceable state — the
manifest and the producing function are the sources of truth — which is exactly why a
curated delete is safe and an automatic collector is unnecessary. storage_path data and
anything outside the tool-managed datasets/ / cached/ trees are never touched by
maintenance.
Why not automatic reachability. An earlier draft computed liveness as "no manifest references this key." That model has a hole: the record documents an artifact's producer, but a read-only consumer — another project, or a fresh clone, reading a shared- or group-scoped artifact it did not produce and does not list — never registers as a referrer, so an automatic collector would reap an artifact still in active use. Rather than patch this with ever-more-complete bookkeeping, maintenance keeps a human in the loop; liveness signals are advisory inputs to a filter, not a deletion authority:
- Last-access is filesystem-derived and best-effort — never written on read. A tool reads it at inspect time from the artifact's filesystem metadata (the stat access time, falling back to the modification time or created when atime is unusable); it MUST NOT rewrite any sidecar, index, or .toml on read to record access. (Touching a file on the lock-free read path would contend with the produce .lock, serialize concurrent readers, and put I/O on the hot path — all for a value that is purely advisory.) Because the OS maintains it, a read-only consumer's use is still reflected wherever the filesystem records it, but the signal is coarse and may be absent: relatime advances atime at most once a day, and noatime, network, and read-only filesystems record nothing — so a tool MAY report last-access as unknown. It is advisory only, never the sole basis for deletion; created (stamped once at produce time in metadata.toml) is the always-available age signal.
- The referenced field is advisory too. "Is this orphaned?" (no still-present local .toml lists its key) is one more column to show and filter on — input to the user's choice, not an automatic trigger. An orphan is a strong delete candidate, never an automatic deletion.
- The per-artifact metadata.toml back-pointer remains audit only — never a deletion authority (it goes stale and cannot express multiple references).
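A read-only way to obtain the advisory last-access value is to consult `os.stat` and nothing else, as in the sketch below. The function name is hypothetical, and taking the later of atime and mtime is one possible heuristic for "atime is unusable", not something the spec prescribes.

```python
import os
from datetime import datetime, timezone
from typing import Optional


def last_access(path: str) -> Optional[datetime]:
    """Best-effort last-access time from filesystem metadata; never writes anything."""
    try:
        st = os.stat(path)
    except OSError:
        return None  # reported as unknown
    # Heuristic (assumption): when atime was not advanced (noatime, relatime lag),
    # the modification time is the better lower bound on last use.
    stamp = max(st.st_atime, st.st_mtime)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)
```

A caller that gets `None` would show the field as unknown and rely on `created` from `metadata.toml` as the age signal.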
Because each project's data lives under its own folder (datacache_dir / datasets_dir),
"show/delete only this project" is a trivial path filter, and inspecting across projects never
risks an accidental cross-project wipe because deletion is always an explicit selection.
What this spec does not specify¶
- The @cached macro / decorator API (Julia macro, Python decorator) is the ergonomic surface over this model and is per-language, not normative — a tool exposes it however fits the language. Normative are: the on-disk formats (key hash, config.toml, metadata.toml, the state file .datamanifest-state.toml), the path composition (the datasets_dir / datacache_dir folders), the identity rules (cachetype default + the stable-name requirement; the (cachetype, version) same-process conflict guard; the index lifecycle), and the maintenance rules (user-driven; no automatic deletion). How a tool derives a default cachetype or detects the conflict is implementation-defined; that it does is not.
- The artifact serialization format (jls/jld2/pickle/…) is a per-tool, per-format choice; produced artifacts are not assumed cross-language-loadable. A tool SHOULD define a RECOMMENDED language-native default for a format-less produced dataset (see Default serialization format), but the spec does not mandate which.
- In-place access (no local copy) is the lazy_access mode (see the dataset field table): the uri is opened where it lives by a loader. The mechanism — streaming, an sshfs/FUSE mount, an object-store filesystem — is implementation-defined and not specced; the former standalone mount store is subsumed by lazy_access (one materialization axis: download vs. in-place), so no separate mount capability is defined. What a tool actually supports depends on its loaders and backends.
- Cloud / fsspec / CAS backends are not a core model; if a tool adds them, they are optional per-language extras behind the recipe interface, not a spec contract.
Cross-machine sync¶
A stored object — a fetched dataset or an expensive produced artifact — can be transferred
between machines instead of re-downloaded or recomputed, because every object has a
machine-independent address (a fetched dataset's source key; a produced artifact's
cachetype[/version]/hash): the same object has the same logical address everywhere, only
the physical root differs. Sync is a transfer between two stores, gated by the optional
sync capability.
- Target = an SSH address (`user@host`): SSH is both the transport (rsync over ssh) and the host identity; no separate remote registry is required.
- Each end resolves its own store from its own environment (`DATAMANIFEST_*`) plus the manifest's `[_STORAGE._HOST]` rules, not from any knowledge of the remote's project folder. This works because syncable objects live in machine-global folders (`$user_data_dir` / `$user_cache_dir` / user-defined); a `$repo`-relative (local) object is not syncable, which is exactly why the remote project location never has to be known.
- Symmetric. `push` and `pull` differ only in transfer direction; each side resolves its own store identically, so there is no asymmetry between them.
- Writes no manifest. Sync moves bytes only; it never edits `datasets.toml` or the state file on either end. A transferred object lands in the global store as an orphan (present, unreferenced). It is immediately usable via read-resolution, and it is registered by the receiver's normal flow if and when its own project uses it.
- Integrity is the transport's (rsync verifies every file as it copies); the spec adds no separate artifact-level digest. Idempotent: a no-op when the target already holds the object complete (its `.complete` marker is present).
- The receiver's folders route it: a transferred object lands under the receiver's own `datasets_dir` / `datacache_dir`, resolved from its environment. The recipe `version` keeps produced-sync safe: logic that changed bumps `version`, so a divergent artifact never overwrites at the same address.
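The addressing and idempotency rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example root path, and the placement of the `.complete` marker as a file inside the object's folder are hypothetical, not part of the spec.

```python
from pathlib import Path
from typing import Optional


def logical_address(cachetype: str, digest: str, version: Optional[str] = None) -> str:
    """Machine-independent address of a produced artifact: cachetype[/version]/hash."""
    parts = [cachetype, version, digest] if version else [cachetype, digest]
    return "/".join(parts)


def physical_path(root: Path, address: str) -> Path:
    """Same logical address on every machine; only the root differs."""
    return root.joinpath(*address.split("/"))


def needs_transfer(target_root: Path, address: str) -> bool:
    # Assumption: the marker is a file named ".complete" inside the object's folder.
    return not (physical_path(target_root, address) / ".complete").exists()


address = logical_address("features", "ab12cd", version="2")
print(address)  # features/2/ab12cd
# Hypothetical receiver root; each end would resolve its own from its environment.
print(needs_transfer(Path("/nonexistent/datacache"), address))
```

A push or pull would call something like `needs_transfer` on the receiving side first and skip the rsync step when it returns `False`.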
Addressing for sync. An object is named by its identifier: a fetched dataset by name /
alias / doi; a produced artifact by cachetype[/version]/hash (full, or an unambiguous
hash prefix). Resolution to exactly one object is the contract — an identifier matching
several is a fail-loud error (the general exact-or-error rule; see Identifier resolution),
so a bare cachetype or a shared doi is ambiguous and must be disambiguated.
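A minimal Python sketch of the exact-or-error rule follows. The record layout (`name`, `alias`, `doi`, `address` keys), the exception name, and the segment-wise prefix matching are assumptions made for illustration; the spec only fixes the outcome (exactly one match, or a loud failure).

```python
class AmbiguousIdentifier(LookupError):
    """Raised when an identifier matches more than one stored object."""


def _address_matches(identifier: str, address: str) -> bool:
    # Compare segment by segment; only the final (hash) segment may be a prefix.
    want, have = identifier.split("/"), address.split("/")
    if len(want) > len(have):
        return False
    *head, last = want
    if head != have[: len(head)]:
        return False
    segment = have[len(head)]
    is_hash_segment = len(want) == len(have)
    return segment.startswith(last) if is_hash_segment else segment == last


def resolve_one(identifier: str, objects: list) -> dict:
    matches = [
        obj
        for obj in objects
        if identifier in (obj.get("name"), obj.get("alias"), obj.get("doi"))
        or ("address" in obj and _address_matches(identifier, obj["address"]))
    ]
    if not matches:
        raise KeyError(f"no stored object matches {identifier!r}")
    if len(matches) > 1:
        raise AmbiguousIdentifier(f"{identifier!r} matches {len(matches)} objects")
    return matches[0]


# Hypothetical store contents, for illustration only.
store = [
    {"name": "ocean", "doi": "10.1234/shared"},
    {"name": "ice", "alias": "cryo", "doi": "10.1234/shared"},
    {"address": "features/2/ab12cd"},
    {"address": "features/2/ff0033"},
]
print(resolve_one("cryo", store)["name"])  # ice
```

With this store, `resolve_one("features", store)` and `resolve_one("10.1234/shared", store)` both raise `AmbiguousIdentifier`, mirroring the bare-`cachetype` and shared-`doi` cases named above.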
Reference CLI (non-normative). First-order `datamanifest push <id> <ssh-host>` / `pull <id> <ssh-host>` transfer a single object; an ambiguous `<id>` errors unless `--batch` is given, and `--dry-run` reports the selection and total size first. Filtered bulk transfer reuses the selection model (`datamanifest list <filters> --push/--pull <host>`) rather than duplicating filters onto `push` / `pull`.
Preservation contract¶
A conforming writer of language L MUST:
- Regenerate its own `[<dataset>._LANG.L]` from its internal state;
- Copy every other `[<dataset>._LANG.X]` (X ≠ L) verbatim, without parsing or reordering;
- Regenerate its own top-level `[_LANG.L]` (config and `loaders` map);
- Copy every other top-level `[_LANG.X]` (X ≠ L) verbatim;
- Preserve any `_`-prefixed structural table that it does not own (`_META`, unknown future `_*`) verbatim;
- Preserve legacy `[_LOADERS]` verbatim if present and not explicitly migrated;
- Preserve `[_STORAGE]` verbatim unless it implements the `storage` capability, in which case it MAY regenerate its own `_STORAGE` entries (a shared, non-language-namespaced table).
Implementation pattern: a DatasetEntry keeps foreign _LANG.X subtrees and unknown
scalar keys in its extra; the Database keeps foreign top-level [_LANG.X] and
unknown _* tables in a database-level extra. Both splice back on write.
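The dataset-level half of this pattern might look like the following Python sketch. It is not either tool's actual code: `KNOWN_FIELDS` is a deliberately tiny illustrative subset, and the `read_entry` / `write_entry` helpers and their signatures are hypothetical.

```python
from dataclasses import dataclass, field

# Illustrative subset only; the real contract-field list is defined elsewhere in the spec.
KNOWN_FIELDS = {"uri", "sha256", "version", "format"}


@dataclass
class DatasetEntry:
    fields: dict                                 # language-agnostic contract fields
    own_lang: dict                               # this tool's own _LANG.<self> binding
    extra: dict = field(default_factory=dict)    # foreign _LANG.X + unknown keys


def read_entry(table: dict, self_lang: str) -> DatasetEntry:
    lang = table.get("_LANG", {})
    extra = {k: v for k, v in table.items() if k not in KNOWN_FIELDS and k != "_LANG"}
    foreign = {k: v for k, v in lang.items() if k != self_lang}
    if foreign:
        extra["_LANG"] = foreign                 # kept as-is, never interpreted
    return DatasetEntry(
        fields={k: v for k, v in table.items() if k in KNOWN_FIELDS},
        own_lang=dict(lang.get(self_lang, {})),
        extra=extra,
    )


def write_entry(entry: DatasetEntry, self_lang: str) -> dict:
    out = {**entry.extra, **entry.fields}
    lang = dict(entry.extra.get("_LANG", {}))    # splice foreign subtrees back
    if entry.own_lang:
        lang[self_lang] = entry.own_lang         # regenerate only our own subtree
    if lang:
        out["_LANG"] = lang
    return out
```

A Python-side writer that changes its own fetcher would then leave a `[<dataset>._LANG.julia]` subtree and any unknown scalar key exactly as it read them. Canonical key ordering is a separate step applied when the document is serialized.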
Conformance levels¶
A capability is a named feature that an implementation may support independently of others. An implementation declares the capability set it supports and runs only the fixture-suite tests tagged for those capabilities.
| Capability | Description |
|---|---|
| `lang-read` | Parse `[<ds>._LANG.<lang>]` and `[_LANG.<lang>.loaders]`; apply the load ladder. |
| `lang-write` | Regenerate own `_LANG.<self>` and preserve foreign `_LANG.*` verbatim on write (full lossless round-trip). |
| `shell-fetch` | Execute the dataset's `shell` command template in the fetch ladder. |
| `delegation` | Cross-language fetch (rung 3, the rare case): run a fetcher defined in another language. The mechanism is implementation-defined (call the language's runtime, or a peer `datamanifest` CLI), with fall-through to `uri`; it is controlled by `delegate` / `--delegate` (see Cross-language fetch, Peer-CLI contract). |
| `storage` | Honor the `datasets_dir` / `datacache_dir` folders, the optional `datasets_pools` / `datacache_pools` read pools, `$`-symbol resolution (`$user_data_dir` / `$user_cache_dir` / `$repo` + user-defined, host-aware via `_HOST`), and per-dataset `storage_path` (see Storage). |
| `byte-identity` | Emit the canonical lexicographic key ordering so the same logical manifest is semantically identical across tools: same keys, same values, same order at every level (verified by the cross-tool fixture). This is the guaranteed constraint. Literal byte-for-byte identity is not assured by default: current TOML writers differ in cosmetic formatting (indentation, blank lines, inline-vs-multiline arrays), so a one-to-one byte match is not always achievable. The Python tool is the normative reference for the canonical byte form; tools MAY offer an opt-in path to it (e.g. `datamanifest format`, or Julia `write(...; canonical=true)`). |
| `binding-args` | Execute the table form of a binding (`{ ref, args, kwargs }`): call `ref(*args; kwargs...)` with `$var` substitution in string values. |
| `cache-produce` | Cache-layer produce-or-load: function-backed (produced) datasets with parameter-hash keying, optional recipe `version`, the `config.toml` / `metadata.toml` sidecars, and the state file's datacache inventory, materialized under the `datacache_dir` folder (§Produced datasets). Declared by the cache layer, never by the core fetch capability (packaging, whether a separate package or a submodule, is unconstrained). |
| `inspect` | The user-driven store-inspection toolkit (§Maintenance): enumerate stored objects (datasets + cached, via the state file `.datamanifest-state.toml`) with their fields (kind, key/hash, location, referenced/orphan, dirty state, format, size, created, last-access), filter them, and act on a selection (delete, refresh, optional move). There is no automatic collector: deletion is always an explicit user selection; referenced/last-access are advisory. The reference CLI exposes it as `datamanifest list … --delete`. |
| `sync` | Cross-machine transfer (`push` / `pull`) of a stored object between two stores over SSH/rsync, addressed by its machine-independent identifier (name/alias/doi, or `cachetype[/version]/hash`); each end resolves its own store from env + `_HOST` (`$repo` excluded); writes no manifest; integrity via rsync; idempotent (§Cross-machine sync). |
Capabilities are independent — a partial implementation may ship lang-read and
lang-write without shell-fetch or delegation. The spec and its fixture suite are
never forked per language package; divergent per-language pace is expressed by each
implementation declaring its supported capability set and pinning to a spec tag.
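As a rough illustration of capability-gated testing, the Python sketch below filters a fixture list by a declared capability set. The fixture names and the rule that a fixture runs only when every one of its tags is supported are assumptions; the spec states only that an implementation runs the tests tagged for its capabilities.

```python
# Capability set declared by a hypothetical partial implementation.
SUPPORTED = {"lang-read", "lang-write"}

# Hypothetical fixture names mapped to the capabilities they exercise.
FIXTURES = {
    "roundtrip_foreign_lang": ["lang-read", "lang-write"],
    "shell_template": ["shell-fetch"],
    "read_default_loaders": ["lang-read"],
}


def select_fixtures(fixtures: dict, supported: set) -> list:
    """Keep a fixture only if all of its capability tags are supported."""
    return [name for name, tags in fixtures.items() if set(tags) <= supported]


print(select_fixtures(FIXTURES, SUPPORTED))
# ['roundtrip_foreign_lang', 'read_default_loaders']
```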
_META.schema (the integer stored in the file) is the data-model compatibility version
and is bumped only on breaking structural changes. The spec-document version (git tag,
e.g. spec-v1.0) tracks prose and fixture evolution independently. An implementation
conforms to "schema N, spec ≥ vX" — these two axes are independent.
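A reader might detect the data-model version along the lines of this Python sketch, which operates on the dictionary a TOML parser would return. The helper names are hypothetical, and two behaviours are assumptions not stated here: a `[_META]` table without `schema` is treated as v0, and a file with a schema newer than the reader's is refused.

```python
SUPPORTED_SCHEMA = 1  # the current value of _META.schema


def schema_version(doc: dict) -> int:
    """Return the file's schema; a file without [_META] is schema v0 (legacy flat)."""
    meta = doc.get("_META")
    if meta is None:
        return 0
    return int(meta.get("schema", 0))  # assumption: missing key also means v0


def can_read(doc: dict) -> bool:
    # Assumption: a higher schema signals a breaking change this reader predates.
    return schema_version(doc) <= SUPPORTED_SCHEMA


print(schema_version({"ocean": {"uri": "https://example.org/o.nc"}}))  # 0
print(schema_version({"_META": {"schema": 1}}))  # 1
```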
Peer-CLI contract¶
One way to do cross-language fetch (rung 3) is to call a peer-language datamanifest CLI.
This section is the normative invocation interface for any tool that does so. (A tool that
instead runs the foreign language's runtime directly does not use this contract.)
Invocation¶
- `<name>`: the dataset key as it appears in the manifest.
- `--datasets-toml <path>`: absolute or project-relative path to the manifest file.
- `--datasets-folder <dir>`: (optional) directory that holds the shared download cache. If omitted, the tool's default cache location applies.
The peer tool resolves its own [<dataset>._LANG.<lang>].fetcher (using its own
fetch ladder), writes the result into the shared cache, verifies sha256 if
present, and exits non-zero on any failure. It produces no dataset bytes on
stdout — the artifact lands in the cache on disk and the calling tool reads it
from there.
Discovery and availability¶
Each language's CLI is discoverable on PATH under a language-specific name, e.g.
datamanifest (Python), DataManifest or datamanifest-julia (Julia). The Python
datamanifest CLI is the reference peer (the fallback target for cross-language fetch).
Before delegating, a tool MUST probe that the peer CLI (and its runtime) is installed and
usable; if the probe fails, the delegation rung is silently skipped and the ladder
advances to rung 4 (uri download). Probe commands and PATH names are left to each
implementation to document.
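The probe-then-fall-through behaviour could be sketched in Python as below. This is a partial illustration: it covers only the PATH lookup via the standard library's `shutil.which`, whereas a real probe would also have to confirm that the peer's runtime is usable. The function names and the returned rung labels are hypothetical.

```python
import shutil
from typing import Optional

# Candidate executable names taken from the examples above.
JULIA_PEER_NAMES = ["DataManifest", "datamanifest-julia"]


def find_peer_cli(candidates: list) -> Optional[str]:
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


def next_rung(candidates: list) -> str:
    # A failed probe skips delegation silently and advances to the uri download.
    return "delegate" if find_peer_cli(candidates) else "uri"


print(next_rung(JULIA_PEER_NAMES))
```

No expected output is shown because the result depends on what is installed on the machine running the sketch.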
Deprecations¶
The following v0 forms are still read for backward compatibility but SHOULD NOT be written by conforming v1 tools:
- Per-dataset language-named flat fields `julia=` / `python=` / `callable=` (and any other `<lang>=`): these historically held inline code, which v1 forbids. Replaced by a `module:function` binding under `[<dataset>._LANG.<lang>].fetcher` (or the bare, language-implicit `fetcher`). Kept verbatim in the dataset's `extra` on read (no auto-rewrite, to avoid touching another language's data); `migrate` rewrites the ref-shaped ones. A tool MAY emit a one-time deprecation notice.
- `julia_modules` / `python_includes`: retired; the manifest's directory is on the tool's import path by convention. Legacy `*_includes` values are still read as extra import-path entries for back-compat.
Note: bare fetcher / loader (language-implicit), bare shell (language-agnostic), and
top-level [_LOADERS] are not deprecated — they are supported forms (see
Language-implicit bindings and shell fetcher). Only the inline-code language-named
fields above are legacy.
An opt-in datamanifest migrate command (not normative in this spec) may rewrite a v0
flat file to v1 _LANG form for the tool's own language.
Conformance notes¶
- Readers MUST ignore unknown top-level tables and unknown fields rather than erroring, so that new datasets, new `_LANG` entries, and other tools' extension keys do not break an older reader.
- Readers MUST preserve unknown `_*` structural keys verbatim: not treat them as datasets, not drop them on write.
- Writers SHOULD omit derived fields (`host`, `path`, `scheme`) and any field left at its default value.
- Writers MUST emit all keys, at every nesting level, sorted by Unicode code-point lexicographic order (the shared default of Python `sorted()` and Julia `TOML.print(sorted=true)`). This covers top-level tables (structural `_*` and datasets alike) and the fields within each table, including keys nested in inline `{ }` tables. No table is special-cased (no `_LOADERS`/`_META`-first). This guarantees semantic identity across tools: the same logical manifest round-trips to the same keys, values, and ordering through either tool (the weaker constraint that is always met). It does not, by itself, guarantee byte-for-byte identity: the Python (`tomli_w`) and Julia (`TOML.print`) serializers differ in cosmetic formatting (indentation, blank lines, inline-vs-multiline arrays), and current tooling does not always permit a one-to-one byte match. For literal byte-identity the Python tool is the normative reference for the canonical form, and a tool MAY route its output through it opt-in (`datamanifest format`, or Julia `write(...; canonical=true)`). Note: because `_` (U+005F) sorts after uppercase but before lowercase ASCII letters, an uppercase dataset name sorts before the `_*` structural tables and a lowercase one after. This is intended; the canonical ordering is the requirement, not structural-table placement.
- `uri` and `uris` are mutually exclusive on a single dataset.
- A file with no `[_META]` section is read as schema v0 (legacy flat), leniently.
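The key-ordering rule can be illustrated with a short Python sketch that sorts a parsed manifest recursively before serialization. The `canonicalize` helper and the sample dataset names are hypothetical; the sketch relies only on the fact that Python's `sorted()` compares strings by code point.

```python
def canonicalize(value):
    """Return a copy with dict keys sorted by Unicode code point at every level."""
    if isinstance(value, dict):
        return {key: canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # Arrays are ordered data: keep element order, but sort keys inside elements.
        return [canonicalize(item) for item in value]
    return value


# Hypothetical manifest content, as a TOML parser would return it.
doc = {
    "ocean": {"uri": "https://example.org/o.nc", "format": "nc"},
    "_META": {"schema": 1},
    "Arctic": {"uri": "https://example.org/a.nc"},
}
ordered = canonicalize(doc)
print(list(ordered))           # ['Arctic', '_META', 'ocean']
print(list(ordered["ocean"]))  # ['format', 'uri']
```

The first printed list shows the placement described in the note above: the uppercase name precedes `_META` and the lowercase one follows it.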