datamanifest.toml — manifest schema specification

This document is the normative description of the TOML manifest format shared by the DataManifest.jl (Julia) and datamanifest (Python) tools. A manifest declares the data dependencies of a project: each dataset's source URI, checksum, version, format, and how to fetch and load it. Either implementation can read and write a conforming file; each reads the language-agnostic contract fields plus its own _LANG-namespaced bindings, and preserves the rest verbatim.

Versioning

Two independent version axes govern this format:

  • _META.schema (integer, stored inside the file) is the data-model compatibility version. It increments only on a breaking structural change. A file without [_META] is treated as schema v0 (legacy flat) and read leniently. Current value: 1.
  • Spec-document version (a git tag such as spec-v1.0) versions the prose, examples, and fixture suite. An implementation conforming to "schema 1, spec ≥ v1.0" pins to a spec tag; the spec may advance without retroactively breaking a pinned implementation. The spec is never forked per language package: one normative document, one fixture suite, multiple implementations at varying capability levels.

Structural keys

Keys beginning with _ are structural — they are not dataset tables. At the top level, the defined structural tables are _META, _LANG, _STORAGE, and _LOADERS. Within a dataset table, the only defined structural sub-table is _LANG. Readers MUST preserve unknown _* keys verbatim and MUST NOT treat them as datasets or drop them on write.

Top-level layout

A v1 manifest is a TOML document with:

  • [_META] — schema metadata (schema = 1).
  • [_LANG.<lang>] — project-wide execution-context configuration for language <lang>. The sub-key loaders is a format → ref map of default loaders for that language.
  • One table per dataset, keyed by the dataset name. Dataset tables hold the language-agnostic contract fields, an optional _LANG sub-table for per-dataset bindings, and optional bare (language-implicit) fetcher/loader bindings (see Language-implicit bindings).
  • [_STORAGE] — optional storage configuration: the two folder fields (datasets_dir / datacache_dir), optional read-pool lists (datasets_pools / datacache_pools), reusable $-symbols (predefined $user_data_dir / $user_cache_dir / $repo plus user-defined keys), and the _HOST host-override sub-table. See Storage.
  • [_LOADERS] — a language-implicit format → binding loaders map (tolerated; the bare counterpart of [_LANG.<self>.loaders]). See Language-implicit bindings.

Example:

[_META]
schema = 1

[_LANG.python.loaders]
csv = "pandas.io.parsers:read_csv"
nc  = "xarray:open_dataset"

[_LANG.julia.loaders]
csv = "CSV:read"
nc  = "NCDatasets:Dataset"

[foo]
uri    = "https://example.com/foo.csv"
sha256 = "abc123"
format = "csv"

[bar]
sha256 = "def456"
format = "nc"
shell  = "make-bar -o $download_path"   # language-agnostic shell fetcher

[bar._LANG.julia]
fetcher = "MyPkg:build_bar"
loader  = "MyPkg:load_bar"

[bar._LANG.python]
fetcher = "mypkg.build:bar"
loader  = "mypkg.load:bar"

Language-agnostic contract (common fields)

Every field is optional and defaults to the empty string / empty list / false shown. Types are TOML types (string, array of string, bool).

Field Type Default Semantics
uri string "" Single source URI. HTTP(S), git/ssh+git/*.git, ssh/sshfs/rsync, file://, or an object-store scheme (s3://, gs://, gcs://, az://, abfs://, abfss://, adl://, gdrive:// — see Download schemes). Mutually exclusive with uris.
uris array of string [] Batch of source URIs written into a single dataset folder under disambiguated relative paths. Mutually exclusive with uri.
host string "" Parsed from the URI (derived; tools omit it on write).
path string "" Parsed from the URI (derived; tools omit it on write).
scheme string "" Parsed from the URI (derived; tools omit it on write).
version string "" Dataset version; participates in the storage key so multiple versions coexist on disk.
branch string "" For git sources: branch/tag to clone (--branch).
doi string "" DOI of the dataset; also usable as a search key.
aliases array of string [] Alternative names this dataset can be looked up by.
description string "" Human-readable description (replaces TOML comments).
key string "" Storage key (relative path under the datasets folder). Derived from host + path + version when absent.
storage_path string $datasets_dir/$key Path expression for where this dataset lives on disk, overriding the default. May interpolate $-symbols ($datasets_dir, $key, $user_data_dir, $scratch, …), $USER/env, and ~; relative ⇒ resolved against the project root. Containing $key ⇒ a tool-managed keyed location; an exact path without $key ⇒ a user-managed location used verbatim that maintenance never touches. Generalizes the former local_path and subsumes the former store. See Storage. Honored under the storage capability; other tools preserve it verbatim.
sha256 string "" Expected SHA-256 of the downloaded file/folder. Auto-filled on first successful download and verified at fetch time; not re-verified on every load (re-verification is opt-in).
skip_checksum bool false Disable checksum verification for this dataset.
skip_download bool false Management mode — treat the dataset as a passive, externally-managed dependency: it is not downloaded, not checksum-verified, and never moved or deleted by maintenance; the documented uri/path is returned as-is. For data the user provides and maintains (e.g. a large shared archive that should not be fetched over the network). Distinct from lazy_access — this is about who manages the bytes, not how they are read.
lazy_access bool false Access mode — access the dataset in place instead of materializing a local copy: the uri is handed to a loader that opens it where it lives (typically a remote object store), with no local copy, no checksum, and no state-file record. Requires a loader (a bare lazy_access with no loader is an error). The access mechanism (streaming, mount, FUSE, …) is implementation-defined — the spec fixes only that the bytes are not materialized. Distinct from skip_download (a management mode); the two are independent and not meant to combine.
delegate bool (run default) Force the cross-language fetch rung (rung 3) on (true) or off (false) for this dataset. When omitted, the tool's run-level default applies (--delegate / configuration). Honored under the delegation capability; other tools preserve it verbatim. See Cross-language fetch.
extract bool false After download, extract the archive (zip / tar / tar.gz) and use the extracted directory as the dataset path.
format string "" Data format hint used to pick a default loader (csv, parquet, nc, json, yaml, toml, md, txt, zip, tar, tar.gz, …). Inferred from the URI when absent.
requires array of string [] Names of datasets that must be downloaded before this one; defines a dependency graph resolved in topological order.
fetcher string | table "" Language-implicit fetcher binding — read as the running tool's own language (see Language-implicit bindings). Equivalent to [<dataset>._LANG.<self>].fetcher.
loader string | table "" Language-implicit loader binding — read as the running tool's own language. Equivalent to [<dataset>._LANG.<self>].loader.
shell string "" Language-agnostic shell fetcher — a command template run as a subprocess (the same command for every tool). Fetcher only; see shell fetcher.
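The requires field describes a dependency graph that is resolved in topological order. A minimal sketch of that ordering, using Python's standard graphlib; fetch_order and the dataset names are hypothetical, and the sketch assumes requires entries are dataset names (not aliases):

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9


def fetch_order(datasets: dict) -> list[str]:
    """Order dataset names so each one comes after everything it requires.

    A dependency cycle raises graphlib.CycleError.
    """
    graph = {name: set(entry.get("requires", [])) for name, entry in datasets.items()}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical datasets: "model" needs "grid" and "obs"; "obs" needs "grid".
example = {
    "model": {"requires": ["grid", "obs"]},
    "obs": {"requires": ["grid"]},
    "grid": {},
}
order = fetch_order(example)
print(order)
```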

Identifier resolution is exact-or-error. A dataset is looked up by its name, an alias, or its doi. When an operation must resolve to a single dataset, an identifier matching more than one dataset is a fail-loud error that names the candidates — never a silent first-match. This matters because a doi may be shared by several datasets (e.g. one archive split into parts), and acting on an arbitrary one of N is a correctness footgun. (The same rule governs sync addressing, where an ambiguous id requires an explicit --batch; see Cross-machine sync.)
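The exact-or-error rule can be sketched as follows. The function resolve, the exception AmbiguousIdentifier, and the sample DOI are hypothetical; the only behavior taken from the text is that name, aliases, and doi are all searched and that more than one match is an error naming the candidates.

```python
class AmbiguousIdentifier(LookupError):
    """Raised when an identifier matches more than one dataset."""


def resolve(datasets: dict, ident: str) -> str:
    """Return the single dataset name matching ident, or raise."""
    matches = [
        name
        for name, entry in datasets.items()
        if ident == name
        or ident in entry.get("aliases", [])
        or (entry.get("doi") and ident == entry["doi"])
    ]
    if not matches:
        raise KeyError(f"no dataset matches {ident!r}")
    if len(matches) > 1:
        # Fail loud and name the candidates; never pick the first match.
        raise AmbiguousIdentifier(
            f"{ident!r} matches several datasets: {', '.join(sorted(matches))}"
        )
    return matches[0]


# One archive split into two parts that share a (made-up) DOI.
parts = {
    "part1": {"doi": "10.0000/example", "aliases": ["p1"]},
    "part2": {"doi": "10.0000/example"},
}
print(resolve(parts, "p1"))
```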

Language bindings (_LANG)

Executable bindings live under a structural _LANG namespace, keyed by language tag (python, julia, r, …). The dataset table itself stays fully agnostic. (The language-agnostic shell fetcher is a bare shell field, not a _LANG tag — see shell fetcher.)

All executable references are module:function references — never inline code, in any language. A local module is importable because the manifest's directory (the project root) is on the language tool's import path by convention. There are no includes or modules fields in v1. A binding may additionally carry arguments as data (args / kwargs); these are passed to the referenced function and are never interpreted as code (see Binding forms).

Binding forms (string or table)

A binding is the single, unified concept used at every executable site — a per-dataset fetcher or loader ([<dataset>._LANG.<lang>]) and every entry in a project-wide [_LANG.<lang>.loaders] format map. It takes one of two interchangeable forms:

  1. string: a bare module:function reference; or
  2. table: { ref = "module:function", args = [...], kwargs = {...} }, with args/kwargs optional (see Parameterized bindings).

The string is an alias for the ref-only table ("M:f" is equivalent to { ref = "M:f" }), so a reader MUST accept either form anywhere a binding is allowed. Call semantics follow the arguments, not the syntax: with no args/kwargs the tool makes its conventional call (a loader receives the dataset path; a fetcher the standard fetch context); with args/kwargs the call is explicit, ref(*args; kwargs...), with nothing auto-injected, and runtime values are passed via $var substitution ($path, …).

Canonical writing. A binding with no args and no kwargs MUST be written as the string; the bare { ref = … } table is accepted on read but normalized to the string on write. A binding that carries args/kwargs is written as a table.

The shell fetcher is not a module:function binding — it is a language-agnostic command-template string (see shell fetcher) — so it is always a string, never a table.
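The two binding forms and the canonical-writing rule can be sketched as a pair of helpers. The names normalize_binding and canonical_binding are hypothetical. How a table with a present but empty args or kwargs should be written is not settled by the text above, so this sketch only treats absent keys as "no arguments"; that choice is an assumption.

```python
def normalize_binding(value) -> dict:
    """Read either form and return the table form {"ref": ..., ...}."""
    if isinstance(value, str):
        return {"ref": value}
    if isinstance(value, dict) and isinstance(value.get("ref"), str):
        return dict(value)
    raise ValueError(f"not a binding: {value!r}")


def canonical_binding(value):
    """Form to write: the bare string when no args/kwargs are carried."""
    table = normalize_binding(value)
    # Assumption: only absent keys count as "no args and no kwargs".
    if "args" not in table and "kwargs" not in table:
        return table["ref"]
    return table


print(canonical_binding({"ref": "MyPkg:load_bar"}))
print(canonical_binding({"ref": "MyPkg:load_esm", "kwargs": {"grid": "5x5"}}))
```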

Per-dataset bindings

[<dataset>._LANG.<lang>] holds singular bindings for a specific dataset in language <lang>. Both keys are optional, and each is a binding in either form (see Binding forms).

Key Type Semantics
fetcher string | table module:function ref (or { ref, args } table) called to produce the dataset bytes, instead of (or in addition to) downloading the uri.
loader string | table module:function ref (or { ref, args } table) called to load the dataset into memory, overriding the format default.

Parameterized bindings (ref + args / kwargs)

A binding may be written as a table so one function is reused across datasets that differ only in arguments — the same loader called with grid = "5x5" for one dataset and grid = "10x10" for another:

[esm_5x5._LANG.julia.loader]
ref    = "MyPkg:load_esm"
args   = ["$path"]                                     # positional, in order
kwargs = { grid = "5x5", skip_models = ["CESM.*"] }    # keyword

[esm_10x10._LANG.julia.loader]
ref    = "MyPkg:load_esm"
args   = ["$path"]
kwargs = { grid = "10x10" }

Key Type Semantics
ref string The module:function reference (required).
args array Positional arguments, in order (optional).
kwargs table Keyword arguments (optional).
  • args and kwargs are plain data (string, number, bool, array, table) — never code. args is an ordered list of positional values; kwargs keys become keyword parameters. Values map to each language's native types.
  • A binding carrying args/kwargs is called explicitly: the tool calls ref(*args; kwargs...) and does not auto-inject any standard value. Runtime values are referenced by $var substitution in string values — the same variables the shell fetcher exposes ($key, $version, $doi, $format, $branch, $uri, $project_root; $download_path for fetchers, $path — the resolved dataset path — for loaders). The ref-only form (the bare string, equivalently { ref = … }) instead makes the tool's conventional call and is written as the string (see Binding forms).
  • Type mapping is language-neutral. A value with no TOML type — e.g. a Julia Symbol — is written as its plain string form (weighting_method = "model" for :model); the target function accepts the string (or coerces it at its boundary). A binding's arguments MUST be representable as TOML data.
  • A tool that executes <lang> bindings but does not implement the binding-args capability MUST error when it encounters args/kwargs, rather than silently calling the function without them (which would change results). The bare-string form requires no such capability.
  • For canonical serialization, kwargs keys are emitted in lexicographic order like all other keys (including inside an inline { } table); args is an ordered array, so its element order is preserved as data (arrays are never reordered). Both therefore carry the same key order and element order across tools — semantically identical (and byte-identical via the canonical reference form; see the byte-identity capability).
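The explicit call described above (resolve ref, substitute $var in string values, call with exactly args and kwargs) might look like this for Python references. This is a sketch under assumptions: resolve_ref, substitute, and call_binding are hypothetical names, the refs posixpath:join and json:dumps are stand-ins chosen because they exist in the standard library, and the handling of unknown variables or a literal $ (left untouched here via safe_substitute) is not fixed by the text.

```python
import importlib
from string import Template


def resolve_ref(ref: str):
    """Import "module:function" and return the function object."""
    module_name, sep, func_name = ref.partition(":")
    if not (sep and module_name and func_name):
        raise ValueError(f"not a module:function reference: {ref!r}")
    return getattr(importlib.import_module(module_name), func_name)


def substitute(value, variables: dict):
    """Expand $var inside string values; all other data passes through."""
    if isinstance(value, str):
        # Assumption: unknown variables are left as written.
        return Template(value).safe_substitute(variables)
    if isinstance(value, list):
        return [substitute(v, variables) for v in value]
    if isinstance(value, dict):
        return {k: substitute(v, variables) for k, v in value.items()}
    return value


def call_binding(binding: dict, variables: dict):
    """Explicit call: ref(*args, **kwargs), nothing auto-injected."""
    func = resolve_ref(binding["ref"])
    args = substitute(binding.get("args", []), variables)
    kwargs = substitute(binding.get("kwargs", {}), variables)
    return func(*args, **kwargs)


runtime = {"path": "/data/esm", "format": "nc"}
joined = call_binding({"ref": "posixpath:join", "args": ["$path", "tas.nc"]}, runtime)
dumped = call_binding(
    {"ref": "json:dumps", "args": [{"path": "$path"}], "kwargs": {"sort_keys": True}},
    runtime,
)
print(joined)
print(dumped)
```

Arguments stay plain data throughout: substitution only rewrites strings, and the referenced function is the only code that runs.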

shell fetcher

shell is a bare, language-agnostic dataset field: a command template run as a subprocess to fetch the dataset. Unlike a bare fetcher/loader (language-implicit — the running tool's own language), shell is the same command for every tool, so it belongs on the dataset table, not under the language namespace. It is a fetcher only — a subprocess cannot return a live in-memory object, so there is no shell loader. The value is a command template supporting variable substitutions: $download_path, $project_root, $uri, $key, $version, $doi, $format, $branch, $path_<ref>, $path_<i>, $requires_paths.

The bare shell field is the canonical (and only) form; the former [<dataset>._LANG.shell].fetcher is not part of the spec.

Project-wide loaders

[_LANG.<lang>.loaders] is a format → binding map of project-wide default loaders for language <lang>: each value is a binding in either form (a bare module:function string, or a { ref, args, kwargs } table — see Binding forms), so a format default may be parameterized exactly like a per-dataset loader. It applies when a dataset has no per-dataset loader for that language. Note the singular loader key per dataset vs. the plural loaders format map at the top level.

Language-implicit bindings (bare fetcher / loader)

For a single-language project the [<dataset>._LANG.<lang>] wrapper is needless ceremony. A dataset table MAY therefore carry a bare fetcher and/or loader directly (a binding in either form), and a top-level [_LOADERS] table MAY carry a bare format → binding map. "Bare" means language-implicit: a reading tool interprets these as bindings in its own language, exactly as if they appeared under [<dataset>._LANG.<self>] / [_LANG.<self>.loaders].

  • Precedence — explicit wins. An explicit own-language binding overrides the bare one: [<dataset>._LANG.<self>].loader > bare loader, and [_LANG.<self>.loaders][fmt] > [_LOADERS][fmt] (likewise for fetcher).
  • Strict — fail loud. A bare binding is present for the running language (bare = the running language), so it is treated exactly like an explicit [<dataset>._LANG.<self>] binding: if it fails to resolve it is an error, and if it resolves and then raises at run time the error propagates — never a silent fall-through to a different loader/fetcher (which could hand a program wrong-shaped data behind only a warning). The ladder falls through only for bindings that are absent for the running language. A manifest meant to be read by more than one language uses explicit [<dataset>._LANG.<lang>] bindings (absent — and so correctly skipped — in the other languages); sharing a bare binding across languages and expecting the others to ignore it is unsupported. (A tool-wide best-effort mode — e.g. "fetch everything that succeeds, skip the rest" — is a separate concern, out of scope for this rule and not introduced here.)
  • Preserve verbatim (round-trip). A writer MUST keep a bare binding bare — it MUST NOT promote loader = … into [<dataset>._LANG.<self>].loader. A tool writes under _LANG.<self> only for bindings it generates itself, so hand-authored bare bindings survive a read-write round-trip unchanged and one tool never rewrites another language's view.

[_LOADERS] was previously a deprecated back-compat table; it is now tolerated as the language-implicit counterpart of [_LANG.<self>.loaders] — read as the running tool's format-default loaders and preserved verbatim on write. The shell fetcher is the language-agnostic sibling of these language-implicit bindings: a bare dataset field carrying the same command for every tool (see shell fetcher).

Resolution semantics

At runtime, each language tool collapses the _LANG tree to a single effective fetcher and loader for each dataset. The full _LANG tree is retained internally for lossless round-trip.

Fetch ladder

The tool tries each rung in order, using the first that applies:

  1. [<dataset>._LANG.<self>].fetcher, else the bare [<dataset>].fetcher — in-process call (own language, fastest);
  2. the dataset's shell command — run the command template (cheap subprocess);
  3. cross-language fetch — the rare case: run a fetcher defined in another language (mechanism implementation-defined; the Python CLI can serve as a fallback), controlled by delegate / --delegate; see Cross-language fetch below;
  4. plain uri download (if uri is set) — dispatched by scheme (see Download schemes);
  5. else error.
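The rung selection can be sketched as a pure function over a parsed dataset table. It follows the order exactly as listed above; fetch_rung and its return shape are hypothetical, the run-level delegate default is assumed to be off, and only uri (not uris) is considered for rung 4.

```python
def fetch_rung(entry: dict, self_lang: str, delegate_default: bool = False):
    """Return (rung number, detail) for the first fetch rung that applies."""
    lang = entry.get("_LANG", {})
    own = lang.get(self_lang, {})
    if "fetcher" in own:                       # rung 1, explicit own language
        return 1, own["fetcher"]
    if "fetcher" in entry:                     # rung 1, bare (language-implicit)
        return 1, entry["fetcher"]
    if "shell" in entry:                       # rung 2, shell command template
        return 2, entry["shell"]
    foreign = [name for name, b in lang.items() if name != self_lang and "fetcher" in b]
    if foreign and entry.get("delegate", delegate_default):
        return 3, foreign[0]                   # rung 3, cross-language fetch
    if entry.get("uri"):                       # rung 4, plain download
        return 4, entry["uri"]
    raise LookupError("no applicable fetch rung")  # rung 5


bar = {
    "format": "nc",
    "shell": "make-bar -o $download_path",
    "_LANG": {"julia": {"fetcher": "MyPkg:build_bar"}, "python": {"fetcher": "mypkg.build:bar"}},
}
foo = {"uri": "https://example.com/foo.csv", "format": "csv"}
julia_only = {"_LANG": {"julia": {"fetcher": "MyPkg:build_baz"}}}

print(fetch_rung(bar, "python"), fetch_rung(bar, "r"), fetch_rung(foo, "python"))
```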

Download schemes

The plain-uri rung dispatches on the URI scheme. The spec fixes the scheme set and its semantics — "fetch the named object, then verify sha256 as usual" — but not the mechanism: each implementation fetches with whatever backend fits the language.

Scheme(s) Fetch
http / https streaming GET
git / ssh+git / https://*.git shallow clone (--branch honors branch)
ssh / sshfs / rsync rsync over SSH
file:// copy (or rsync from a remote host)
object stores (s3://, gs://, gcs://, az://, abfs://, abfss://, adl://, gdrive://) fetch the object from the named store, then verify sha256
  • Object-store schemes are normative, but mechanism-agnostic: a tool MAY implement them with any backend (the Python tool uses fsspec behind an optional extra; a peer tool uses its own packages). A tool that cannot serve a scheme delegates it (cross-language fetch) or errors with unsupported scheme — it MUST NOT silently skip it.
  • HTTP/HTTPS are deliberately not in the object-store set — they keep their own dedicated GET path.
  • A uri fetched by any scheme is sha256-verified like any other download; an object-store URI is just another source of bytes. (To open such a URI without downloading, set lazy_access; see the field table.)
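Scheme dispatch for the plain-uri rung might be sketched like this. The function download_backend and its return labels are hypothetical; detecting a git source by a .git path suffix on http(s) is an assumption based on the table's https://*.git row. The one rule taken directly from the text is that an unserved scheme is an error, never a silent skip.

```python
from urllib.parse import urlparse

OBJECT_STORE_SCHEMES = {"s3", "gs", "gcs", "az", "abfs", "abfss", "adl", "gdrive"}


def download_backend(uri: str) -> str:
    """Map a URI to the kind of fetch the scheme table prescribes."""
    parsed = urlparse(uri)
    scheme = parsed.scheme.lower()
    if scheme in {"git", "ssh+git"}:
        return "git"
    if scheme in {"http", "https"}:
        # Assumption: a ".git" path suffix marks a repository to clone.
        return "git" if parsed.path.endswith(".git") else "http"
    if scheme in {"ssh", "sshfs", "rsync"}:
        return "rsync"
    if scheme == "file":
        return "file"
    if scheme in OBJECT_STORE_SCHEMES:
        return "object-store"
    raise ValueError(f"unsupported scheme: {scheme!r}")


print(download_backend("https://example.com/foo.csv"))
print(download_backend("s3://bucket/path/data.nc"))
```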

Load ladder

The tool tries each rung in order:

  1. [<dataset>._LANG.<self>].loader, else the bare [<dataset>].loader;
  2. [_LANG.<self>.loaders][<dataset>.format], else [_LOADERS][<dataset>.format] — manifest-configured format default;
  3. the tool's built-in default loader for <dataset>.format;
  4. else error.

At each own-language rung the explicit _LANG.<self> binding takes precedence over the bare one. A binding that is present for the running language (bare, or explicit _LANG.<self>) but fails to resolve is an error; one that resolves and then raises propagates. The ladder falls through only to skip rungs that are absent (another language's _LANG.<other> fetcher, or no own loader), never to paper over a broken present binding (see Language-implicit bindings).
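A sketch of the load ladder over a parsed manifest, with the explicit binding checked before the bare one at each own-language rung. The function effective_loader, the builtin table, and the refs mypkg.load:json and builtin:read_text are hypothetical; format inference from the URI is left out.

```python
def effective_loader(manifest: dict, name: str, self_lang: str, builtin: dict):
    """Return (rung number, binding) for the first load rung that applies."""
    entry = manifest[name]
    own = entry.get("_LANG", {}).get(self_lang, {})
    if "loader" in own:                        # rung 1, explicit
        return 1, own["loader"]
    if "loader" in entry:                      # rung 1, bare
        return 1, entry["loader"]
    fmt = entry.get("format", "")
    lang_defaults = manifest.get("_LANG", {}).get(self_lang, {}).get("loaders", {})
    if fmt in lang_defaults:                   # rung 2, explicit format default
        return 2, lang_defaults[fmt]
    bare_defaults = manifest.get("_LOADERS", {})
    if fmt in bare_defaults:                   # rung 2, bare format default
        return 2, bare_defaults[fmt]
    if fmt in builtin:                         # rung 3, tool built-in
        return 3, builtin[fmt]
    raise LookupError(f"no loader for dataset {name!r} (format {fmt!r})")


manifest = {
    "_LANG": {"python": {"loaders": {"csv": "pandas.io.parsers:read_csv"}}},
    "_LOADERS": {"json": "mypkg.load:json"},
    "foo": {"format": "csv"},
    "bar": {"format": "nc", "_LANG": {"python": {"loader": "mypkg.load:bar"}}},
    "baz": {"format": "json"},
    "notes": {"format": "txt"},
}
builtin = {"txt": "builtin:read_text"}  # hypothetical built-in default

print(effective_loader(manifest, "bar", "python", builtin))
print(effective_loader(manifest, "foo", "python", builtin))
```

Resolving and calling the chosen binding is a separate step; a binding that is present but fails there is an error, not a reason to continue down the ladder.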

Load never delegates. A loader returns a live in-memory native object, which cannot cross a process boundary. Cross-language data preparation is modeled as one language's fetcher writing a normalized artifact (Arrow/parquet/netcdf) that another language then loads with its own format default.

Cross-language fetch (rung 3)

Reached only in the rare case that a dataset has no fetcher in the running tool's own language, no shell command, and no uri — its bytes can be produced only by a fetcher defined in another language ([<ds>._LANG.<other>].fetcher). Native / shell / plain uri cases never reach here, so each implementation is self-sufficient for nearly all datasets.

How a tool runs a foreign fetcher is implementation-defined. It MAY invoke that language's runtime directly (e.g. julia --project=<env> -e '…', writing to $download_path and materializing the result itself), MAY delegate to a peer-language datamanifest CLI (see Peer-CLI contract), or MAY skip the rung. The Python implementation is the reference and aims to cover every language, so a tool with no native way to run a foreign fetcher can simply call the Python CLI as a fallback.

Either way it moves bytes on disk only (load never crosses languages); a tool MUST fall through to uri when the needed toolchain is absent; and it applies to fetched datasets only — produced (@cached) datasets are not cross-language. Gated by the delegation capability; the delegate field / --delegate toggles it.

Storage

Storage reduces to two paths: where fetched datasets go, and where the produced cache goes. Both are set in [_STORAGE] and default to local, repo-relative folders, so a casual user gets visible ./datasets/ and ./cached/ with zero configuration and nothing derived:

[_STORAGE]
datasets_dir  = "datasets"     # fetched datasets (default; relative => <repo>/datasets/)
datacache_dir = "cached"       # produced cache   (default; relative => <repo>/cached/)

  • Relative ⇒ relative to the project root (the manifest's directory, $repo). Absolute, ~-, or $symbol-rooted paths are used as written.
  • Resulting paths are flat: a fetched dataset lands at <datasets_dir>/<key>; a produced artifact at <datacache_dir>/<cachetype>/[<version>/]<hash>/ (see Produced datasets and caching). No partition, no prefix, no derived name in between — the folder you set is the location.

Storage is a portable location layer: the symbols below resolve to the same place in every implementation, so peer tools (Python / Julia) share an on-disk location without configuration. The core knows locations only — no lifetime policy; disposability and GC of produced datasets are the cache layer's concern (see Produced datasets and caching), not the core fetch engine's. [_STORAGE] and its _HOST sub-table are defined structural keys: a conforming tool parses them identically and preserves them verbatim (a tool without the storage capability treats the whole table as a preserved unknown).

Read pools (reuse what is already on the machine)

Two optional [_STORAGE] list fields name read-only locations to reuse before materializing, so an object another project already has is not fetched or recomputed again:

[_STORAGE]
datasets_pools  = ["~/.cache/Datasets", "$user_data_dir/datamanifest/datasets"]
datacache_pools = ["$team/cache"]        # opt-in; no default
  • datasets_pools — fetched datasets. Resolution probes each pool (at the dataset's keyed sub-path) after the recorded and directive-derived locations and before downloading. On a hit the declared sha256 is verified (a mismatch is skipped, never trusted), the location is recorded in the state file, and the bytes are used in place — no copy. A genuine download still goes to datasets_dir (the directive — gold standard). Default when the field is absent: the built-in well-known pools (~/.cache/Datasets, $user_data_dir/datamanifest/datasets); an explicit list is used verbatim; an empty list disables pools.
  • datacache_pools — produced artifacts. Symmetric: the produced-dataset hit-search also probes each pool at <pool>/<cachetype>[/<version>]/<hash>, gated by the usual config.toml validation, and self-heals the record on a hit. It is opt-in — absent means no pools (and no built-in default): there is no de-facto shared compute location, and a produced artifact carries no content checksum (only its cachetype/version/hash identity), so cross-project adoption must be deliberate.
  • Read-only and host-local. Pools are never written to and never edit datasets.toml; they only add candidate locations to read-resolution. They are resolved like any other path expression ($-symbols, ~, env) and are host-composable via [_STORAGE._HOST] (a pool list set per hostname glob); environment overrides are DATAMANIFEST_DATASETS_POOLS / DATAMANIFEST_DATACACHE_POOLS. Honored under the storage capability; other tools preserve the fields verbatim.
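The datasets_pools probe can be illustrated for the single-file case. This is a sketch under assumptions: probe_pools and sha256_file are hypothetical names, pool paths are taken as already resolved (no $-symbol expansion), the checksum is computed over the file's bytes, and recording the hit in the state file is omitted.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def probe_pools(pools, key: str, expected_sha256: str):
    """Return the first pool copy whose checksum matches, else None.

    Pools are only read; a mismatching copy is skipped, never trusted.
    """
    for pool in pools:
        candidate = Path(pool).expanduser() / key
        if candidate.is_file() and sha256_file(candidate) == expected_sha256:
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    stale, good = Path(tmp, "stale"), Path(tmp, "good")
    for pool, payload in ((stale, b"old bytes"), (good, b"a,b\n1,2\n")):
        target = pool / "example.com" / "foo.csv"
        target.parent.mkdir(parents=True)
        target.write_bytes(payload)
    expected = hashlib.sha256(b"a,b\n1,2\n").hexdigest()
    hit = probe_pools([stale, good], "example.com/foo.csv", expected)
    hit_pool = hit.parent.parent.name
    miss = probe_pools([stale], "example.com/foo.csv", expected)

print(hit_pool, miss)
```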

Symbols

A folder path may interpolate $-symbols: $NAME / ${NAME} expands to a defined symbol, else to the environment variable NAME; ~ expands to home. A symbol is defined by a bare key and referenced with $.

Predefined (platform-resolved; the names map directly to platformdirs):

Symbol Resolves to
$user_data_dir platformdirs.user_data_dir() — the machine's user data dir (persistent)
$user_cache_dir platformdirs.user_cache_dir() — the machine's user cache dir (reclaimable)
$repo the project root (the manifest's directory); the base for relative paths

They are bare — no datamanifest/app name is appended; you namespace explicitly (datasets_dir = "$user_data_dir/myproj"). Every implementation MUST resolve them to the identical path and MUST NOT substitute a language-native location (e.g. a package depot), so Python and Julia agree. $USER and any other environment variable are available too.

User-defined — any other bare key under [_STORAGE] is a reusable symbol, and may be made host-specific in [_STORAGE._HOST."<glob>"] (matched against the hostname). The two fields themselves may also be host-specific:

[_STORAGE]
datacache_dir = "$scratch/cache"   # reference a custom symbol
scratch       = "/scratch/$USER"   # its default definition

[_STORAGE._HOST."login*.hpc.edu"]        # hostname glob / regex
scratch      = "/work/$USER"             # host-specific value
datasets_dir = "$user_data_dir/shared"   # a field, host-specific

Resolution ladder (symbol or field; first match wins): DATAMANIFEST_<NAME> environment variable → matching [_STORAGE._HOST.<glob>].<name> → base [_STORAGE].<name> → (predefined symbols only) the platformdirs / project-root default. Host-specificity always lives in the symbol's resolution, never a per-dataset host map.
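The resolution ladder for one name can be sketched as below. The function resolve_name is hypothetical, and three details are assumptions not fixed by the text: the environment variable is the upper-cased name, host patterns are matched as shell-style globs, and the first matching host table in file order wins. The returned value is still a path expression; expanding $USER and other symbols inside it is a separate step.

```python
import fnmatch
import os


def resolve_name(name, storage, hostname, environ=None, predefined=None):
    """Resolve a [_STORAGE] symbol or field; first match wins."""
    environ = os.environ if environ is None else environ
    env_key = "DATAMANIFEST_" + name.upper()   # assumption: upper-cased name
    if env_key in environ:
        return environ[env_key]
    for pattern, overrides in storage.get("_HOST", {}).items():
        if fnmatch.fnmatchcase(hostname, pattern) and name in overrides:
            return overrides[name]
    if name in storage:
        return storage[name]
    if predefined and name in predefined:      # predefined symbols only
        return predefined[name]
    raise KeyError(name)


storage = {
    "datacache_dir": "$scratch/cache",
    "scratch": "/scratch/$USER",
    "_HOST": {
        "login*.hpc.edu": {"scratch": "/work/$USER", "datasets_dir": "$user_data_dir/shared"},
    },
}

print(resolve_name("scratch", storage, "login3.hpc.edu", environ={}))
print(resolve_name("scratch", storage, "laptop.local", environ={}))
```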

To centralize and share across clones, branches, or projects, point the two fields at a machine dir under a name you choose — one self-documenting edit, nothing derived:

[_STORAGE]
datasets_dir  = "$user_data_dir/myproj"
datacache_dir = "$user_cache_dir/myproj"

Environment

Exactly two environment variables override the fields — for HPC / CI / containers where editing the manifest is inconvenient: DATAMANIFEST_DATASETS_DIR and DATAMANIFEST_DATACACHE_DIR. (User-defined symbols override as DATAMANIFEST_<NAME>.)

Per-dataset path (storage_path)

A dataset MAY override where it lives with the storage_path field — a path expression that defaults to $datasets_dir/$key ($key is the dataset's storage key, see key). It generalizes the former local_path (which was exact-only) and subsumes the former store; the only distinction was whether the key is appended:

  • contains $key ⇒ a tool-managed, keyed location; the dataset is materialized there and maintenance MAY act on it. storage_path = "$scratch/$key" parks one heavy dataset on scratch.
  • an exact path without $key ⇒ a user-managed location, used verbatim, bypassing the keyed layout — and maintenance never touches it. storage_path = "$cmip/AMIP/tas.nc" points at a file you manage; with $cmip resolved host-specifically, that is the heavy-archive-per-host pattern (host-specificity in the symbol, never a per-dataset host map).

(The name path is not used for this — it is already the URI's parsed path component.)
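The keyed versus exact distinction reduces to whether the expression mentions $key. A minimal sketch, assuming symbols are already resolved to plain strings; storage_location is a hypothetical name, POSIX-style paths are used for determinism, and ~ and environment-variable expansion are left out.

```python
import re
from pathlib import PurePosixPath
from string import Template

KEY_REFERENCE = re.compile(r"\$(?:key\b|\{key\})")


def storage_location(entry: dict, key: str, symbols: dict, project_root: str):
    """Return (path, tool_managed) for a dataset's storage_path."""
    expr = entry.get("storage_path", "$datasets_dir/$key")
    tool_managed = KEY_REFERENCE.search(expr) is not None
    expanded = Template(expr).safe_substitute({**symbols, "key": key})
    path = PurePosixPath(expanded)
    if not path.is_absolute():
        # Relative expressions resolve against the project root.
        path = PurePosixPath(project_root) / path
    return path, tool_managed


symbols = {"datasets_dir": "datasets", "scratch": "/scratch/me", "cmip": "/archive/cmip"}
default_loc = storage_location({}, "example.com/foo.csv", symbols, "/proj")
exact_loc = storage_location({"storage_path": "$cmip/AMIP/tas.nc"}, "unused", symbols, "/proj")
print(default_loc)
print(exact_loc)
```

A tool would run maintenance only on locations where the second element is True.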

In-memory and multiple manifests (library use)

A manifest is a logical structure; a datasets.toml file is its canonical serialized form, but a tool MAY construct and hold one in memory, and several MAY be live at once in one process — each resolving its own datasets, bindings, and storage independently (an in-memory manifest resolves identically to a file-backed one). The construction surface is per-language and non-normative (Python's Database, Julia's Database); whether and where such a manifest is persisted is the author's choice.

This is how a library owns its own data without touching the end user's datasets.toml: it builds its own manifest, declares its datasets via the language API, and sets its datasets_dir / datacache_dir explicitly (e.g. under $user_data_dir/<library>) — a relative default would otherwise resolve against the end user's project, never the installed library's. Recommended: a library shipping data dependencies should own them this way (its own manifest + explicit folders); without explicit folders, its data correctly falls back to the end user's project, who owns the location.

Concurrent access and completeness

A folder may be shared between tools and between concurrent processes (e.g. HPC jobs), so materialization MUST be safe under concurrency, and peer tools sharing a folder MUST agree on these conventions:

  • Atomic publish. Materialize into a temporary path within the same store partition as the final path (e.g. <key>.tmp beside <key>) and atomically rename it into place, so a killed process never leaves a partial entry that looks complete. Staging is never a central/global cache: a download (or produce) destined for $scratch stages on $scratch — required for the rename to be atomic (same filesystem) and essential for voluminous data, which must never transit a small ~/.cache first.
  • Completion marker. An entry is complete iff its marker exists — <key>/.complete for a directory, <key>.complete for a file. Readers MUST treat an entry without its marker as absent (re-fetch); a writer MUST create the marker only after a successful, verified materialization.
  • Lock. A writer SHOULD hold an exclusive lock <key>.lock (a pidfile; a lock whose PID is dead and older than a grace period MAY be reclaimed) while materializing, so concurrent workers neither recompute nor clobber the same entry.
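Atomic publish and the completion marker can be sketched for a single-file entry. The helpers publish_file and is_complete are hypothetical; checksum verification before the marker is written, the <key>.lock pidfile, and fsync calls are omitted for brevity, so this shows only the ordering: stage beside the final path, rename, then mark.

```python
import os
import tempfile
from pathlib import Path


def publish_file(final: Path, data: bytes) -> None:
    """Stage next to the final path, rename atomically, then mark complete."""
    staging = final.with_name(final.name + ".tmp")   # same directory, same filesystem
    staging.write_bytes(data)
    # A real writer verifies the staged bytes (e.g. sha256) before this point.
    os.replace(staging, final)                       # atomic on one filesystem
    final.with_name(final.name + ".complete").touch()


def is_complete(final: Path) -> bool:
    """An entry counts as present only if its marker exists."""
    if final.is_dir():
        return (final / ".complete").exists()
    return final.with_name(final.name + ".complete").exists()


with tempfile.TemporaryDirectory() as tmp:
    entry = Path(tmp) / "foo.csv"
    entry.write_bytes(b"partial")          # leftover from a killed process
    before = is_complete(entry)            # no marker, so treated as absent
    publish_file(entry, b"a,b\n1,2\n")
    after = is_complete(entry)
    content = entry.read_bytes()
    staging_left = entry.with_name("foo.csv.tmp").exists()

print(before, after, staging_left)
```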

Produced datasets and caching (companion layer)

Spec-v2.1 — a companion layer, not a core capability. The produce-or-load (@cached) layer sits outside the core fetch engine as a distinct capability layer built on the shared substrate it reuses (safe-materialization, folder resolution, loaders). This section is the cross-tool format spec for that layer — it stays in this document so both languages agree on the on-disk shape.

Packaging is not constrained by this spec. Whether an implementation ships the layer as a separate package that depends on the core, or as an optional module of the same package, is the implementation's choice (spec-v2 over-specified this as a separate package; spec-v2.1 relaxes it). What the spec fixes is the boundary: the cache-produce / inspect capabilities are never declared by the core fetch capability, and the core fetch engine keeps no garbage collection and no disposability policy.

The format is additive over a datasets.toml: it adds no field to the hand-authored datasets.toml and does not change its _META.schema (still 1). A produced dataset reuses the existing engine — the storage model, the safe-materialization primitive, and the load ladder — but is not declared in datasets.toml; its only on-disk record is the machine-generated config.toml / metadata.toml sidecars (each _META.schema = 1) and the project's state file (_META.schema = 5). The format is gated by two independent capabilities, cache-produce and inspect, so a companion may ship neither, one, or both.

A produced dataset is one whose bytes come from running a project function rather than downloading a uri. It is the same "recipe + key + store + policy" object as a fetched dataset; the distinction is purely two slots:

  • the recipe — a uri/shell/git recipe with a source-identity key, versus a function recipe with a parameter-hash key;
  • the authorship — a fetched dataset is hand-authored in datasets.toml; a produced dataset is machine-generated.

Everything else — the storage folders, the safe-materialization primitive, loaders, the preservation contract — is recipe-agnostic and applies unchanged. The only new normative axis is parameter-hash keying and its on-disk bookkeeping.

A produced dataset has no entry in datasets.toml. It originates from a function that a tool exposes through its produce-or-load surface (the @cached decorator / macro — per-language and non-normative; see below), and is recorded only after it runs. cachetype is therefore not a datasets.toml field; it is a namespace that appears solely in the machine-generated records — the state file's datacache entry, the config.toml [_META] block, and the on-disk path. A conforming fetch path (download_dataset and the fetch ladder) never encounters a produced dataset: the two concerns share the engine, not the manifest. This keeps datasets.toml clean (the hand-authored, git-committed spec) and confines produced, parameter-hash-keyed churn to the git-ignored state file (regenerable ground truth).

A produced dataset is identified by its keyword parameters, not by content: its storage key is <cachetype>/<param-hash>, its parameters are the hash inputs, and the config.toml sidecar is the re-checkable record of those inputs (it is not content-pinned by a manifest sha256). It is materialized under datacache_dir (default cached, relative ⇒ local ./cached/), the producing layer's folder.

Keyword-only. Because the parameters double as identity, the producing function is keyword-only for hashing: an ordered positional argument list has no stable name→value identity to hash, so a cache-produce tool MUST derive the key table from keyword parameters only. This is a property of the produce-or-load surface, not a datasets.toml rule — fetched datasets keep their positional args per spec-v1.1.

Parameter-hash keying

A produced dataset's identity is (cachetype, hash-of-its-parameters). The hash inputs are the producing function's hash-affecting keyword parameters as a key table: a mapping parameter name → value. Parameters split three ways (normative — the split, not the exact source mapping):

Class                            | In the hash? | Stored where           | Example
hash-affecting params            | yes          | config.toml sidecar    | grid = "5x5"
runtime knobs (_-prefixed keys)  | no           | nowhere (transient)    | _parallel = true
audit-only extras                | no           | metadata.toml sidecar  | producing git commit

How a tool derives the key table from a function's declared keyword parameters (signature introspection, an explicit key selector like LGMIO's key=(args -> (;…)), etc.) is implementation-defined; the serialization and hash are normative so the same parameters yield the same key everywhere:

  1. Build the key table from the hash-affecting keyword parameters, excluding every key whose name begins with _ (those are runtime knobs).
  2. Serialize it to canonical JSON (JCS, RFC 8785): object members sorted by Unicode code point at every nesting level, no insignificant whitespace (member separator ,, name separator :), UTF-8 output with minimal JSON string escaping. To keep canonicalization unambiguous, hash-input values are restricted to strings, integers, finite floats, booleans, and arrays/objects composed of those. A finite float serializes through this same canonical-JSON projection: the Python reference json.dumps float form is normative (1.0 → 1.0, 0.1 → 0.1) and a non-Python tool MUST reproduce it byte-for-byte. NaN and ±Inf are disallowed (no JSON representation), as are nulls (an absent parameter is omitted, not encoded as null). A float-valued parameter MAY still be passed as a string when maximal cross-tool hash stability matters, since float formatting is the most implementation-sensitive case. Array element order is significant (arrays are data); object key order is not (sorted).
  3. The parameter hash is the lowercase hex SHA-256 of those canonical UTF-8 bytes.
  4. The storage key is "<cachetype>/<hash>" — or "<cachetype>/<version>/<hash>" when an optional recipe version is set (see Produced-artifact location). (Tools MAY display a short hash prefix, but the on-disk directory and all references use the full 64-hex digest.)

Canonical JSON (rather than TOML) is the hash input precisely because it has a fully-pinned byte form that Python (json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)) and Julia produce identically today for strings, integers, booleans, and their arrays/objects, independent of the cross-tool TOML byte-identity work (for finite floats the Python form is the normative reference a non-Python tool reproduces). So a produced dataset resolves to the same <cache_root>/<cachetype>/<hash> path under either tool (even though the artifact bytes a given tool writes there may be language-specific; cross-tool loading of a produced artifact is not implied, only cross-tool addressing and maintenance). The config.toml sidecar stores the same key table in human-readable TOML; the hash is over its canonical JSON projection, not over the TOML bytes.
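The four steps above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the helper names (`canonical_json`, `param_hash`, `storage_key`) are hypothetical, and the sketch relies on `json.dumps` with the arguments quoted above as the canonical projection.

```python
import hashlib
import json
import math


def _check_value(value):
    """Reject anything outside the value types allowed as hash input."""
    if isinstance(value, bool) or isinstance(value, (str, int)):
        return
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("NaN and infinities are not valid hash inputs")
        return
    if isinstance(value, list):
        for item in value:
            _check_value(item)
        return
    if isinstance(value, dict):
        for name, item in value.items():
            if not isinstance(name, str):
                raise TypeError("object member names must be strings")
            _check_value(item)
        return
    # None, tuples, sets, datetimes, ... are all rejected here.
    raise TypeError(f"unsupported hash-input type: {type(value).__name__}")


def canonical_json(key_table):
    """Serialize with sorted members and no insignificant whitespace."""
    _check_value(key_table)
    return json.dumps(key_table, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False, allow_nan=False)


def param_hash(params):
    """Drop `_`-prefixed runtime knobs, then hash the canonical UTF-8 bytes."""
    key_table = {k: v for k, v in params.items() if not k.startswith("_")}
    return hashlib.sha256(canonical_json(key_table).encode("utf-8")).hexdigest()


def storage_key(cachetype, params, version=None):
    """Compose "<cachetype>/<hash>" or "<cachetype>/<version>/<hash>"."""
    digest = param_hash(params)
    return f"{cachetype}/{version}/{digest}" if version else f"{cachetype}/{digest}"
```

For the key table `{"skip_models": ["CESM.*", "FGOALS.*"], "grid": "5x5"}`, `canonical_json` yields `{"grid":"5x5","skip_models":["CESM.*","FGOALS.*"]}`; adding `_parallel=True` to the parameters leaves the hash unchanged.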

Identity: the cachetype namespace

cachetype is the disambiguation namespace that, paired with the parameter hash, identifies a produced artifact. It has a default and an override:

  • Default — the producing function's canonical importable name. Absent an explicit cachetype, a tool MUST derive it from the function's fully-qualified, importable name in the host language (Python: module.qualname, e.g. mypkg.analysis.produce; the cross-language rule is "the canonical name by which the runtime re-imports that callable"). An explicit cachetype = "<name>" override remains — for a stable hand-chosen name, or to deliberately group several functions under one namespace. The auto and explicit forms share one namespace: an explicit cachetype equal to the derived name denotes the same identity. (So the default cachetype coincides with the state file's datacache entry ref — by default the namespace is the producer's identity.)
  • Why unique-per-function is the right default. The worst cache failure is silently mixing unrelated results under one key, so the default must be unique per producing function. The accepted, prominently documented consequence: renaming or moving the function (or restructuring its package) changes its cachetype and orphans the prior artifacts — the correct behavior, since the code identity that produced them is gone. version remains the tool for deliberate busting; re-pinning an explicit cachetype is the tool for deliberate continuity across a rename.
  • No stable identity ⇒ require an explicit cachetype. When the producing function has no stable importable identity — a top-level script, a REPL, an eval-string, a notebook — a tool MUST NOT synthesize an ambiguous cachetype; it MUST require an explicit cachetype and error otherwise. This is the same constraint native serialization already imposes (an object defined in the entry-point module has no portable qualified name). (Python reference, non-normative: a __main__ function is resolved via the launch's recorded module identity — __main__.__spec__.name, which Python sets for python -m pkg.mod → pkg.mod but leaves None for a loose script, -c, the REPL, and notebooks — so those require an explicit cachetype.)
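The default-and-override rule can be sketched in Python as below. The function name `derive_cachetype` is hypothetical, and the sketch makes one simplifying assumption that is stricter than the Python reference described above: every `__main__` function is treated as lacking a stable identity, without consulting `__main__.__spec__.name`.

```python
def derive_cachetype(func, explicit=None):
    """Return the explicit cachetype if given, else module.qualname.

    Assumption (simplification): any `__main__` function is rejected; a fuller
    tool could resolve `python -m` launches through `__main__.__spec__.name`.
    """
    if explicit is not None:
        if "@" in explicit:
            # `@` is reserved as the version separator in state-file keys.
            raise ValueError("cachetype must not contain '@'")
        return explicit
    module = getattr(func, "__module__", None)
    qualname = getattr(func, "__qualname__", "")
    unstable = (
        not module
        or module == "__main__"
        or "<locals>" in qualname   # nested function or closure
        or "<lambda>" in qualname   # anonymous function
    )
    if unstable:
        raise ValueError("no stable importable name; pass an explicit cachetype")
    return f"{module}.{qualname}"
```

A function importable as `mypkg.analysis.produce` gets that string as its cachetype, while a lambda or a closure raises unless `explicit` is supplied.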

Identity conflicts (same (cachetype, version), same process)

Because cachetype can be set explicitly, two distinct producing functions can be made to claim the same namespace. A tool SHOULD guard the dangerous case: if two distinct functions claim the same (cachetype, version) pair while simultaneously live in one process, it SHOULD raise immediately and name both, rather than let them silently share a key.

  • The key is the pair. The same cachetype with different versions is a valid, tolerated case (e.g. two functions running calibration v1 and v2 at once). Storage location is irrelevant to the check — a cachetype must be unique regardless of where its artifacts are written.
  • The guard is intentionally same-process / same-time. Equally-named functions used in separate runs simply share the slot, which is permitted — a user may engineer that, or split modules / set explicit cachetypes to keep them apart. There is no static cross-process check, and none is wanted: "live in one process at the same time" is exactly the boundary where a collision is unambiguously a mistake.
  • Transient functions are exempt. Nested, local, or anonymous functions (a closure, or a function defined inside another) are dynamic and short-lived, so they are exempt from the guard — and they typically lack a stable importable identity, so they already require an explicit cachetype (above) to be cached at all.
  • The mechanism is implementation-defined. (Reference, non-normative: a process-local registry populated at decoration/registration time — keyed by the function's ref so a re-import overwrites rather than duplicates — with no disk writes; a tool that never imports the user's code never sees the recipes, and separate processes keep separate registries, so the guard never entangles caches across projects.)

When and how to detect is the implementer's call. A tool SHOULD run the check at the earliest reasonable and practical point for how its host language loads and runs code. In a dynamically-loaded language (Python) that is decoration / import time. In a language that mixes ahead-of-time precompilation with live execution (notably Julia) the right point is less obvious — registration done at top level may run during precompilation rather than in the user's session — so a faithful, low-surprise guard might instead belong at first call or a runtime-init hook, or, if precompilation makes a sound guard impractical, be narrowed or omitted. The spec fixes only the semantics — two distinct functions, the same (cachetype, version), simultaneously live — never the timing or mechanism; that is left to each language's implementer to realize (or to judge infeasible), which is why the guard is a SHOULD.

Produced-artifact location

A produced artifact composes its path under the datacache_dir (see Storage):

<datacache_dir>/<cachetype>/[<version>/]<hash>/

The produce surface resolves datacache_dir from the same [_STORAGE] as fetched data — it MUST read the nearest discovered manifest's [_STORAGE] (the same upward walk used to find the project root; a plain TOML read, no fetch layer), so produced and fetched data share one storage configuration. By default datacache_dir = "cached" (relative ⇒ local ./cached/), so produced artifacts are local and visible like everything else; centralizing/sharing is the explicit datacache_dir = "$user_cache_dir/<name>" edit (Storage). There is no scope, prefix, or partition in the path — the folder you set is the location.

  • version — an optional recipe/code version segment (below).
  • An explicit per-call location — given per @cached call (cache_dir = …) — is used verbatim (<cache_dir>/<cachetype>/[<version>/]<hash>), the experiment-folder workflow.

The location affects location only — never hit validity, which is the key (cachetype + hash) alone.

Recipe version (optional). A producing call MAY carry a short version string (e.g. "v3", a date). When set it becomes a path segment between cachetype and hash (<cachetype>/<version>/<hash>) and is recorded in the config.toml sidecar and the state file's datacache entry (in the recipe key, <cachetype>@<version>). version does not enter the parameter hash; it is an explicit, human-set recipe/code version, orthogonal to the parameter key and distinct from _META.schema (the on-disk format version). Its purpose is correctness under sharing: a change to a function's logic that leaves its parameters unchanged would otherwise read a stale hit from another branch or clone — bumping version forces a miss. When unset, the key stays <cachetype>/<hash>.
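The path composition, including the optional version segment and the per-call override, can be sketched as follows. The function name `artifact_dir` and its parameter names are hypothetical; `PurePosixPath` is used only so the example prints the same on every platform.

```python
from pathlib import PurePosixPath


def artifact_dir(cachetype, digest, version=None,
                 datacache_dir="cached", cache_dir=None):
    """Compose <root>/<cachetype>/[<version>/]<hash>.

    `cache_dir`, when given, stands for the explicit per-call location and is
    used verbatim as the root; otherwise `datacache_dir` is the root.
    """
    if len(digest) != 64:
        raise ValueError("the on-disk directory uses the full 64-hex digest")
    root = PurePosixPath(cache_dir if cache_dir is not None else datacache_dir)
    parts = [cachetype] + ([version] if version else []) + [digest]
    return root.joinpath(*parts)
```

Changing `version` or the root changes only where the artifact lives; the parameter hash passed in as `digest` is untouched by either.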

Cache layout and sidecars

A produced artifact is materialized at the composed path above — <datacache_dir>/<cachetype>/[<version>/]<hash>/ (see Produced-artifact location) — via the same safe-materialization primitive (atomic publish, .complete marker, .lock pidfile) as any other store write. The directory is self-describing through two sidecars written next to the artifact:

…/<cachetype>/[<version>/]<hash>/
├── <basename>.<ext>      # the produced artifact (format-determined)
├── config.toml           # the re-hashable hash inputs (the key table)
├── metadata.toml         # provenance / audit (never hashed)
└── .complete             # completion marker (file form: <hash>.complete alongside)

config.toml (cache-produce) — the key table verbatim plus a [_META] block, so any tool can recompute the hash and confirm the directory's identity. The key table is written at the root and first (TOML requires root-table keys to precede any table header), so [_META] comes last; reading back, the key table is every root key except the [_META] block:

# --- hash-affecting parameters (the key table) ---
grid        = "5x5"
skip_models = ["CESM.*", "FGOALS.*"]

[_META]
schema    = 1
cachetype = "esm_20c_anomaly"
# version = "v3"   # optional recipe/code version (becomes a path segment when set)
# hash = SHA-256( {"grid":"5x5","skip_models":["CESM.*","FGOALS.*"]} ), canonical JSON:
hash      = "83425a30d111562d46c1fce9de7618ea7f1f54e1be72e086cba0ac63c6f2ce9b"

(83425a3… is a verifiable reference vector: it is the SHA-256 of the canonical JSON {"grid":"5x5","skip_models":["CESM.*","FGOALS.*"]}. Every conforming cache-produce implementation MUST reproduce it.)

A tool with cache-produce MUST be able to recompute the hash from config.toml's key table and MUST treat a directory whose recomputed hash ≠ _META.hash as not a valid cache hit (re-produce).

Because format is a serialization choice and not a hash input, several formats of the same computation share one <cachetype>/[<version>/]<hash> directory (a data.<ext> per format). A hit is therefore valid only when the data file for the requested format is present: a complete, hash-valid directory whose data.<ext> for this format is absent recomputes (writing that format) rather than failing — so two recipes that share a cachetype and hash to the same key but emit different formats coexist instead of colliding.

metadata.toml (cache-produce) — provenance only, never an input to the hash and never an authority for cache validity:

[_META]
schema = 1

created = "2026-06-02T15:04:05Z"        # RFC 3339 UTC
tool    = "datamanifestpy 0.17.0"        # producing tool + version
host    = "login3.hpc.edu"
user    = "mahe"

[git]
commit = "1f8839c…"
branch = "main"
dirty  = false

[origin]
state_file = "/home/mahe/proj/.datamanifest-state.toml"   # the state file that inventories this artifact

Default serialization format (per language)

A produced dataset MAY omit format. When it does, a tool serializes the returned value with its language-native default format and reads it back with the matching built-in loader — so a bare return value round-trips with no configuration. This default is per language and RECOMMENDED, not normative (native serialization is language-private and version-sensitive; the spec pins cross-tool addressing, not the blob), but each conforming cache-produce tool SHOULD define one and SHOULD ship both the saver (value → bytes) and the matching built-in loader (bytes → value):

Language | RECOMMENDED default format | saver / loader
Python   | pickle (data.pickle)       | pickle.dump / pickle.load
Julia    | jld2 (data.jld2)           | JLD2.save / JLD2.load

An explicit format always overrides. Bytes in a language-native default format are not cross-language-loadable by construction (pickle is Python-only, jld2 Julia-only) — which is consistent with the spec pinning only cross-tool addressing and maintenance, never the blob format.
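For the Python row, the saver/loader pair might look like the sketch below. The function names `save_default` and `load_default` are hypothetical; only `pickle.dump` / `pickle.load` and the `data.pickle` file name come from the table.

```python
import pickle
from pathlib import Path


def save_default(value, artifact_dir):
    """Serialize a returned value to <artifact_dir>/data.pickle."""
    path = Path(artifact_dir) / "data.pickle"
    with open(path, "wb") as fh:
        pickle.dump(value, fh)
    return path


def load_default(artifact_dir):
    """Read the value back with the matching built-in loader."""
    with open(Path(artifact_dir) / "data.pickle", "rb") as fh:
        return pickle.load(fh)
```

As with any pickle, the file should only be loaded from a store the user trusts, since unpickling can execute arbitrary code.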

The state file (.datamanifest-state.toml)

The hand-authored datasets.toml stays clean — it is the spec: what to track and how to obtain it (a dataset's uri/fetcher/shell, a @cached function's code), hand-authored and git-committed, the source of intent. What it records is the expectation — a per-dataset directive storage_path (where bytes should go) and a contract sha256 (what they should hash to).

Where each object actually landed on this machine is recorded separately, in a sibling .datamanifest-state.toml — the state file: a tool-maintained, per-object inventory of resolved on-disk locations. It is the machine analogue of a lockfile, except it is not a committed reproducibility lock — it says nothing about how to re-obtain a resource (that lives in the spec), so it is no help to a fresh clone; it is regenerable local state, and therefore git-ignored by default. (Produced artifacts and many fetched datasets live on one machine, often outside the repo, and cannot be pulled from the internet — so committing the inventory would only record paths no other clone can use. A project whose data sits on a shared drive every clone can reach MAY choose to track it, but that is a user's setup, not the design intent.) The leading dot marks it as a CLI-read dotfile, not a hand-edited document.

The state file inventories both kinds of materialized object — fetched datasets and produced artifacts — under two top-level namespaces (datasets and datacache) parallel to the two storage folders, so the two never collide and each is greppable on its own. _META.schema = 5:

[_META]
schema = 5

# --- fetched datasets: storage key → resolved location (+ actual checksum) ---
[datasets."example.com/foo.nc"]
storage_path = "datasets/example.com/foo.nc"   # where the bytes actually are
sha256       = "abc123…"                         # actual digest; omitted under skip_checksum

# --- produced artifacts: cachetype[@version] → instances{hash → location} ---
[datacache."lgmpre.data.load_20c@v3"]            # bare cachetype when unversioned
ref    = "lgmpre.data:load_20c"                  # producing module:function (refreshed across a refactor)
format = "nc"

  [datacache."lgmpre.data.load_20c@v3".instances]
  "83425a30d111562d46c1fce9de7618ea7f1f54e1be72e086cba0ac63c6f2ce9b" = "cached/lgmpre.data.load_20c/v3/83425a30…"
  • datasets — fetched. Keyed by the dataset's existing storage key (host/path[#version], its machine-independent identity — no new id). Each entry records the resolved storage_path (where the bytes actually are) and the actual sha256 of what is on disk (omitted when the dataset sets skip_checksum, so a very large file need never be hashed).
  • datacache — produced. One table per recipe, keyed by <cachetype>[@<version>]. The @ is a reserved version separator, so a bare key is the unversioned recipe and two versions of one cachetype never collide (a cachetype MUST NOT contain @, which module.qualname never does). It carries recipe-level ref / format and an instances table mapping each produced variation's parameter hash to the full artifact directory it was written to. The params are not stored here — they live in each artifact's config.toml sidecar (read at enumeration), so a single artifact's recorded location is its own complete record.
  • Spec vs. state — the same two fields, two meanings. A dataset's storage_path and sha256 appear in both files on purpose: in datasets.toml they are the expectation (the directive for where bytes go / the contract digest); in the state file they are ground truth (the resolved location / the actual digest). The duplication is intentional and harmless — the state file is derived and disposable. (Fully separating directive-from-resolved and expected-from-actual is a future cleanup; see ROADMAP.md.)
  • The state file is a defined structural sibling format with its own _META.schema. A tool that implements neither inspect nor produced caching need not read it. Earlier shapes (_META.schema 1–4: the produced-only flat and nested cached.toml forms) are still read and migrated forward, and cached.toml is the recognized legacy name. A produced artifact's metadata.toml carries a state_file back-pointer to the file that inventories it (audit only).

The state file is read-only inventory (the gold standard)

The guiding invariant: the state file records where things are and is consulted to find an existing object — it never directs a write. Every (re)materialization follows the current directive (datasets_dir / datacache_dir, a per-dataset storage_path, or an explicit @cached(storage_path=…)), which is the gold standard; the recorded location only short-circuits the lookup.

  • Read-resolution checks the state file first. Resolving where an object lives consults the recorded storage_path ahead of any derivation rule: if the bytes are actually there (and, for a dataset, checksum-valid when a digest is recorded), that is a hit — no re-derive, no re-download, no recompute. Only on a miss does resolution fall back to the directive-derived path, then any read pools (§Storage — reuse a copy another project already has, recording it on a hit), and only then fetch or produce. This is how a moved object is still found at its new home.
  • Self-heal is additive, never destructive. Active resolution refreshes the record to match reality: a relocated object (bytes at the derived path, record stale) has its recorded location refreshed; an untracked object (bytes present, no entry) is registered; a missing one is re-materialized at the current directive and recorded. Because any access that consults the state file and finds nothing proceeds to fetch/produce, resolution can never leave a stale record. The relocate-refresh is the only automatic mutation; active resolution never deletes.
  • A deleted or missing state file repopulates as objects are accessed — it is regenerable by construction.
  • Concurrency. Every write re-reads the file, merges (additive union, last-writer-wins per object), then writes via a temp file + atomic rename, so parallel downloads / produces cannot clobber each other; additive-only updates make the merge conflict-free.
  • Garbage vs. dirty. A malformed entry that roots nothing (e.g. an instance-less residue from a format change) is corruption, not a tracked-but-missing object — it is cleaned silently on read. A well-formed entry whose bytes are merely absent is a dirty state (below), surfaced rather than silently dropped.
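The concurrency rule (re-read, merge additively, publish by atomic rename) can be sketched as below. The names `merge_state` and `atomic_write_text` are hypothetical, the merge is assumed to recurse through nested tables, and serializing the merged table back to TOML is left to the caller because the Python standard library ships no TOML writer.

```python
import os
import tempfile


def merge_state(on_disk, update):
    """Additive union of two nested tables; on a leaf conflict the update wins."""
    merged = dict(on_disk)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_state(merged[key], value)
        else:
            merged[key] = value
    return merged


def atomic_write_text(path, text):
    """Write to a temp file in the target directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
        os.replace(tmp, path)  # atomic on POSIX when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Because the merge never removes a key, two writers that each add a different object end up with both entries regardless of which one renames last, provided each re-reads the file immediately before merging.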

Dirty states and explicit reconciliation

The state file is a first-order source of truth for where objects are, kept non-destructively (git-style). A tool MAY classify each object's state-vs-disk as clean, missing (recorded bytes gone), relocated (recorded at L, bytes at the derived path D), untracked (bytes present, no entry — an orphan for produced artifacts), or modified (recorded sha256 ≠ actual; the full expected-vs-actual digest treatment is deferred). Passive listing only labels these states — it never mutates.
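A passive classifier for these states might look like the sketch below. It is illustrative only: the function name `classify` is hypothetical, the `"absent"` label for "no entry and no bytes" is not a state named above, and `modified` is omitted because it needs the digest comparison that is deferred.

```python
import os


def classify(recorded, derived):
    """Label one object's state-vs-disk status without mutating anything.

    recorded: path stored in the state file, or None when there is no entry.
    derived:  path the current directive would compose for this object.
    """
    if recorded is None:
        # "absent" is a hypothetical label for an object that was never materialized.
        return "untracked" if os.path.exists(derived) else "absent"
    if os.path.exists(recorded):
        return "clean"
    if os.path.exists(derived):
        return "relocated"
    return "missing"
```

The function only reads the filesystem, which matches the rule that listing labels states and never changes them.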

Two explicit, user-invoked actions reconcile:

  • --refresh — fix the state file only (no downloads, no file moves): re-point relocated entries to where the bytes actually are, and drop stale / missing entries. A pure state↔disk reconcile. (Untracked artifacts are picked up by active access, not here.)
  • --delete — remove the selected objects' bytes and their entries; the only byte-removing action.

Removal is therefore explicit-only: passive listing and active resolution never delete, --refresh only edits the record, and --delete is the sole byte remover.

Maintenance (inspect, filter, delete)

Both fetched datasets and produced artifacts accumulate, so a tool MAY offer maintenance (the inspect capability): enumerate the physical store, filter it, and delete an explicit selection. Maintenance is user-driven, never automatic: it surfaces what is stored and deletes only what the user explicitly selects. There is no automatic garbage collector — see Why not automatic reachability below.

Reference CLI (non-normative). A tool exposes this however fits — like the @cached surface, the command shape is per-tool, not part of the spec. The reference design folds it into the existing datamanifest list: a default summary view with --field to pick columns; filter flags over the object fields (--kind, --folder, --orphan, --dirty, --older-than, --format, size); and action flags on the selected set — --delete, --refresh (reconcile the state file only), and optionally --move (dry-run/confirm by default). There is no separate gc command.

  • Object fields. Each stored object — a fetched dataset or a produced artifact — exposes a common set of inspectable fields:
      • kind: data (fetched) or cached (produced);
      • key: <key> (fetched) or <cachetype>[/<version>]/<hash> (produced); hash for produced;
      • location: the resolved absolute path on disk;
      • referenced: whether a still-present local .toml roots it or it is an orphan. For a produced artifact the match is the full (cachetype, version, hash) tuple against a state-file datacache instance, so another project's artifact (in a different folder) is not mistaken for referenced; for a fetched dataset, its key listed in a datasets.toml;
      • format, size, created, and a best-effort, filesystem-derived last-access time (read from stat, never written on read; MAY be unknown).

    These fields are the cross-tool inspectable surface; a tool with the inspect capability MUST be able to report them.
  • View. A tool presents a default summary (the most useful columns) and SHOULD let the user choose which fields to show.
  • Filter. Any field is a filter predicate — kind = cached, folder = <path>, referenced = false (orphans), last-access older than 90d, format, size — so the user can target, e.g., "orphaned produced artifacts in this folder not accessed in 90 days."
  • Act on the selection. Actions operate on exactly the filtered set, uniformly across fetched datasets and produced artifacts: delete (remove the bytes and prune the object's state-file entry) and optionally move (relocate the bytes and repoint the recorded storage_path; datasets.toml is not edited, so a later re-fetch still follows the datasets_dir directive; gold standard). A tool MUST NOT delete everything by default and MUST NOT delete as a side effect of any other command. The explicit filter + action is itself the selection (typing the action over a filtered set is the confirmation), so a tool MAY apply directly; it SHOULD offer a --dry-run preview. Deletion is always of a user-chosen set, never an automatic sweep.
  • Protections (the rule is unchanged, generalized). Maintenance never touches data the user owns: a fetched dataset whose storage_path is a user-managed exact path (no $key) or that is skip_download (a passive, externally-managed dependency) is reported as skipped, never moved or deleted — the same guard already used for deletion, applied to both kinds. A lazy_access dataset has no local copy to touch in the first place. Tool-managed (keyed) objects under datasets_dir / datacache_dir are fair game.
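The filter half of this model can be sketched as a pure selection over object records. Everything named here (`StoredObject`, `select`, the parameter names) is hypothetical and covers only a subset of the fields; treating an unknown last-access as never matching an age filter is a conservative assumption of this sketch, not a rule stated above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class StoredObject:
    kind: str                        # "data" (fetched) or "cached" (produced)
    key: str
    location: str                    # resolved absolute path
    referenced: bool                 # False means orphan
    last_access: Optional[datetime]  # None when the filesystem records nothing


def select(objects, kind=None, folder=None, orphan=None, older_than=None, now=None):
    """Return the objects matching every given predicate; never deletes anything."""
    now = now or datetime.now(timezone.utc)
    chosen = []
    for obj in objects:
        if kind is not None and obj.kind != kind:
            continue
        if folder is not None and not obj.location.startswith(folder.rstrip("/") + "/"):
            continue
        if orphan is not None and obj.referenced == orphan:
            continue
        if older_than is not None:
            # Unknown last-access never matches an age filter (assumption).
            if obj.last_access is None or now - obj.last_access < older_than:
                continue
        chosen.append(obj)
    return chosen
```

Calling `select(objs, kind="cached", folder="/store/cached", orphan=True, older_than=timedelta(days=90))` expresses "orphaned produced artifacts in this folder not accessed in 90 days"; any action would then run over exactly the returned list.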

Both kinds are reclaimable, with different regeneration costs. Deleting a fetched dataset under datasets/ just means it re-downloads on next use; deleting a produced artifact under cached/ means it recomputes. Neither destroys irreplaceable state — the manifest and the producing function are the sources of truth — which is exactly why a curated delete is safe and an automatic collector is unnecessary. storage_path data and anything outside the tool-managed datasets/ / cached/ trees are never touched by maintenance.

Why not automatic reachability. An earlier draft computed liveness as "no manifest references this key." That model has a hole: the record documents an artifact's producer, but a read-only consumer — another project, or a fresh clone, reading a shared- or group-scoped artifact it did not produce and does not list — never registers as a referrer, so an automatic collector would reap an artifact still in active use. Rather than patch this with ever-more-complete bookkeeping, maintenance keeps a human in the loop; liveness signals are advisory inputs to a filter, not a deletion authority:

  • Last-access is filesystem-derived and best-effort — never written on read. A tool reads it at inspect time from the artifact's filesystem metadata (the stat access time, falling back to the modification time or created when atime is unusable); it MUST NOT rewrite any sidecar, index, or .toml on read to record access. (Touching a file on the lock-free read path would contend with the produce .lock, serialize concurrent readers, and put I/O on the hot path — all for a value that is purely advisory.) Because the OS maintains it, a read-only consumer's use is still reflected wherever the filesystem records it, but the signal is coarse and may be absent: relatime advances atime at most once a day, and noatime, network, and read-only filesystems record nothing — so a tool MAY report last-access as unknown. It is advisory only, never the sole basis for deletion; created (stamped once at produce time in metadata.toml) is the always-available age signal.
  • The referenced field is advisory too. "Is this orphaned?" (no still-present local .toml lists its key) is one more column to show and filter on — input to the user's choice, not an automatic trigger. An orphan is a strong delete candidate, never an automatic deletion.
  • The per-artifact metadata.toml back-pointer remains audit only — never a deletion authority (it goes stale and cannot express multiple references).
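Reading the advisory last-access signal without writing anything can be sketched as below. The function name `last_access` is hypothetical, and treating a non-positive atime as "unusable" is an assumption of this sketch; a real tool may use other heuristics and may also fall back to `created` from metadata.toml.

```python
import os
from datetime import datetime, timezone


def last_access(path):
    """Best-effort last-access time from stat; None when it cannot be read."""
    try:
        st = os.stat(path)  # read-only: nothing is touched or rewritten
    except OSError:
        return None
    # Assumption: a non-positive atime is treated as unusable, so use mtime.
    stamp = st.st_atime if st.st_atime > 0 else st.st_mtime
    return datetime.fromtimestamp(stamp, tz=timezone.utc)
```

On a `noatime` or network filesystem the returned value may simply mirror an old timestamp, which is why it can inform a filter but should not decide a deletion on its own.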

Because each project's data lives under its own folder (datacache_dir / datasets_dir), "show/delete only this project" is a trivial path filter, and inspecting across projects never risks an accidental cross-project wipe because deletion is always an explicit selection.

What this spec does not specify

  • The @cached macro / decorator API (Julia macro, Python decorator) is the ergonomic surface over this model and is per-language, not normative — a tool exposes it however fits the language. Normative are: the on-disk formats (key hash, config.toml, metadata.toml, the state file .datamanifest-state.toml), the path composition (the datasets_dir / datacache_dir folders), the identity rules (cachetype default + the stable-name requirement; the (cachetype, version) same-process conflict guard; the index lifecycle), and the maintenance rules (user-driven; no automatic deletion). How a tool derives a default cachetype or detects the conflict is implementation-defined; that it does is not.
  • The artifact serialization format (jls/jld2/pickle/…) is a per-tool, per-format choice; produced artifacts are not assumed cross-language-loadable. A tool SHOULD define a RECOMMENDED language-native default for a format-less produced dataset (see Default serialization format), but the spec does not mandate which.
  • In-place access (no local copy) is the lazy_access mode (see the dataset field table): the uri is opened where it lives by a loader. The mechanism — streaming, an sshfs/FUSE mount, an object-store filesystem — is implementation-defined and not specced; the former standalone mount store is subsumed by lazy_access (one materialization axis: download vs. in-place), so no separate mount capability is defined. What a tool actually supports depends on its loaders and backends.
  • Cloud / fsspec / CAS backends are not a core model; if a tool adds them, they are optional per-language extras behind the recipe interface, not a spec contract.

Cross-machine sync

A stored object — a fetched dataset or an expensive produced artifact — can be transferred between machines instead of re-downloaded or recomputed, because every object has a machine-independent address (a fetched dataset's source key; a produced artifact's cachetype[/version]/hash): the same object has the same logical address everywhere, only the physical root differs. Sync is a transfer between two stores, gated by the optional sync capability.

  • Target = an SSH address (user@host): SSH is both the transport (rsync over ssh) and the host identity; no separate remote registry is required.
  • Each end resolves its own store from its own environment (DATAMANIFEST_*) plus the manifest's [_STORAGE._HOST] rules — not from any knowledge of the remote's project folder. This works because syncable objects live in machine-global folders ($user_data_dir / $user_cache_dir / user-defined); a $repo-relative (local) object is not syncable, which is exactly why the remote project location never has to be known.
  • Symmetric. push and pull differ only in transfer direction; each side resolves its own store identically.
  • Writes no manifest. Sync moves bytes only; it never edits datasets.toml or the state file on either end. A transferred object lands in the global store as an orphan (present, unreferenced) — immediately usable via read-resolution, and registered by the receiver's normal flow if and when its own project uses it.
  • Integrity is delegated to the transport (rsync verifies every file as it copies); the spec adds no separate artifact-level digest. Sync is idempotent: it is a no-op when the target already holds the complete object (its .complete marker is present).
  • The receiver's folders route it: a transferred object lands under the receiver's own datasets_dir / datacache_dir, resolved from its environment. The recipe version keeps sync of produced artifacts safe: changed logic bumps version, so a divergent artifact never overwrites another at the same address.
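
The addressing and idempotence rules above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and it assumes the .complete marker is a file inside the object's own directory, which this section does not specify.

```python
from pathlib import Path


def physical_path(store_root: Path, address: str) -> Path:
    """Join a machine-independent logical address onto this machine's store root."""
    return store_root.joinpath(*address.split("/"))


def needs_transfer(target_root: Path, address: str) -> bool:
    """Return False when the target already holds the object complete.

    Assumption: the .complete marker lives inside the object's directory.
    """
    return not (physical_path(target_root, address) / ".complete").exists()
```

The same address string is used on both ends; only store_root differs, which is why neither side needs to know the other's project location.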

Addressing for sync. An object is named by its identifier: a fetched dataset by name / alias / doi; a produced artifact by cachetype[/version]/hash (full, or an unambiguous hash prefix). Resolution to exactly one object is the contract — an identifier matching several is a fail-loud error (the general exact-or-error rule; see Identifier resolution), so a bare cachetype or a shared doi is ambiguous and must be disambiguated.
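
A rough sketch of the exact-or-error rule, assuming a hypothetical in-memory inventory. The StoredObject record and its field names are illustrative only, and treating a hash prefix as a prefix of the full address string is one possible reading of "unambiguous hash prefix", not something the spec fixes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical inventory record (field names are illustrative)."""
    address: str                       # source key, or cachetype[/version]/hash
    identifiers: tuple[str, ...] = ()  # name / alias / doi of a fetched dataset


def resolve(identifier: str, inventory: list[StoredObject]) -> StoredObject:
    """Resolve an identifier to exactly one stored object, or fail loudly."""
    matches = [
        obj for obj in inventory
        if identifier in obj.identifiers or obj.address.startswith(identifier)
    ]
    if not matches:
        raise LookupError(f"no stored object matches {identifier!r}")
    if len(matches) > 1:
        found = ", ".join(obj.address for obj in matches)
        raise LookupError(f"{identifier!r} is ambiguous: {found}")
    return matches[0]
```

With two artifacts under the same cachetype, a bare cachetype matches both and raises, while a long enough hash prefix selects one.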

Reference CLI (non-normative). First-order datamanifest push <id> <ssh-host> / pull <id> <ssh-host> transfer a single object; an ambiguous <id> errors unless --batch is given, and --dry-run reports the selection and total size first. Filtered bulk transfer reuses the selection model — datamanifest list <filters> --push/--pull <host> — rather than duplicating filters onto push/pull.

Preservation contract

A conforming writer of language L MUST:

  • Regenerate its own [<dataset>._LANG.L] from its internal state;
  • Copy every other [<dataset>._LANG.X] (X ≠ L) verbatim, without parsing or reordering;
  • Regenerate its own top-level [_LANG.L] (config and loaders map);
  • Copy every other top-level [_LANG.X] (X ≠ L) verbatim;
  • Preserve any _-prefixed structural table that it does not own (_META, unknown future _*) verbatim;
  • Preserve legacy [_LOADERS] verbatim if present and not explicitly migrated;
  • Preserve [_STORAGE] verbatim unless it implements the storage capability, in which case it MAY regenerate its own _STORAGE entries (a shared, non-language-namespaced table).

Implementation pattern: a DatasetEntry keeps foreign _LANG.X subtrees and unknown scalar keys in its extra; the Database keeps foreign top-level [_LANG.X] and unknown _* tables in a database-level extra. Both splice back on write.
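
As an illustration of the splice-back step for a single dataset table, here is a minimal sketch that works on already-parsed TOML data (plain dicts). The function name is hypothetical, and the real tools keep foreign subtrees in a DatasetEntry's extra rather than in one dict; the point is only that foreign _LANG.X subtrees and unknown keys are carried over without being interpreted.

```python
def write_dataset_table(entry: dict, own_lang: str, own_binding: dict) -> dict:
    """Build the dataset table to write for a writer of language own_lang.

    The writer's own _LANG.<own_lang> is regenerated from own_binding;
    every other key, including foreign _LANG.X subtrees, is copied as-is.
    """
    out = dict(entry)                    # unknown scalar keys are preserved
    lang = dict(entry.get("_LANG", {}))  # foreign subtrees carried over untouched
    lang[own_lang] = own_binding         # regenerate only our own binding
    out["_LANG"] = lang
    return out
```

For example, a Python writer updating its fetcher leaves a julia binding in the same table unchanged, and the input entry itself is not mutated.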

Conformance levels

A capability is a named feature that an implementation may support independently of others. An implementation declares the capability set it supports and runs only the fixture-suite tests tagged for those capabilities.

| Capability | Description |
|---|---|
| lang-read | Parse [<ds>._LANG.<lang>] and [_LANG.<lang>.loaders]; apply the load ladder. |
| lang-write | Regenerate own _LANG.<self> and preserve foreign _LANG.* verbatim on write (full lossless round-trip). |
| shell-fetch | Execute the dataset's shell command template in the fetch ladder. |
| delegation | Cross-language fetch (rung 3, the rare case): run a fetcher defined in another language (mechanism implementation-defined: call the language's runtime, or a peer datamanifest CLI; with fall-through to uri), controlled by delegate / --delegate (see Cross-language fetch, Peer-CLI contract). |
| storage | Honor the datasets_dir / datacache_dir folders, the optional datasets_pools / datacache_pools read pools, $-symbol resolution ($user_data_dir / $user_cache_dir / $repo + user-defined, host-aware via _HOST), and per-dataset storage_path (see Storage). |
| byte-identity | Emit the canonical lexicographic key ordering so the same logical manifest is semantically identical across tools: same keys, same values, same order at every level (verified by the cross-tool fixture). This is the guaranteed constraint. Literal byte-for-byte identity is not assured by default: current TOML writers differ in cosmetic formatting (indentation, blank lines, inline-vs-multiline arrays), so a one-to-one byte match is not always achievable. The Python tool is the normative reference for the canonical byte form; tools MAY offer an opt-in path to it (e.g. datamanifest format, or Julia write(...; canonical=true)). |
| binding-args | Execute the table form of a binding ({ ref, args, kwargs }): call ref(*args; kwargs...) with $var substitution in string values. |
| cache-produce | Cache-layer produce-or-load: function-backed (produced) datasets with parameter-hash keying, optional recipe version, the config.toml / metadata.toml sidecars, and the state file's datacache inventory, materialized under the datacache_dir folder (§Produced datasets). Declared by the cache layer, never by the core fetch capability (packaging, whether a separate package or a submodule, is unconstrained). |
| inspect | The user-driven store-inspection toolkit (§Maintenance): enumerate stored objects (datasets + cached, via the state file .datamanifest-state.toml) with their fields (kind, key/hash, location, referenced/orphan, dirty state, format, size, created, last-access), filter them, and act on a selection (delete, refresh, optional move). There is no automatic collector: deletion is always an explicit user selection; referenced/last-access are advisory. The reference CLI exposes it as datamanifest list … --delete. |
| sync | Cross-machine transfer (push / pull) of a stored object between two stores over SSH/rsync, addressed by its machine-independent identifier (name/alias/doi, or cachetype[/version]/hash); each end resolves its own store from env + _HOST ($repo excluded); writes no manifest; integrity via rsync; idempotent (§Cross-machine sync). |

Capabilities are independent — a partial implementation may ship lang-read and lang-write without shell-fetch or delegation. The spec and its fixture suite are never forked per language package; divergent per-language pace is expressed by each implementation declaring its supported capability set and pinning to a spec tag.
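
One way an implementation might select its share of the fixture suite is sketched below. The fixture names and the tag mapping are hypothetical, and the sketch assumes a fixture tagged with several capabilities requires all of them, which the spec does not state.

```python
def select_fixtures(fixture_tags: dict[str, set[str]], supported: set[str]) -> list[str]:
    """Return the fixtures whose capability tags are all in the supported set."""
    return sorted(
        name for name, tags in fixture_tags.items() if tags <= supported
    )


# Hypothetical declaration for a partial implementation.
SUPPORTED = {"lang-read", "lang-write"}
```

A tool declaring only lang-read and lang-write would then skip any fixture tagged shell-fetch or delegation instead of failing it.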

_META.schema (the integer stored in the file) is the data-model compatibility version and is bumped only on breaking structural changes. The spec-document version (git tag, e.g. spec-v1.0) tracks prose and fixture evolution independently. An implementation conforms to "schema N, spec ≥ vX" — these two axes are independent.

Peer-CLI contract

One way to do cross-language fetch (rung 3) is to call a peer-language datamanifest CLI. This section is the normative invocation interface for any tool that does so. (A tool that instead runs the foreign language's runtime directly does not use this contract.)

Invocation

datamanifest fetch <name> --datasets-toml <path> [--datasets-folder <dir>]
  • <name> — the dataset key as it appears in the manifest.
  • --datasets-toml <path> — absolute or project-relative path to the manifest file.
  • --datasets-folder <dir> — (optional) directory that holds the shared download cache. If omitted, the tool's default cache location applies.

The peer tool resolves its own [<dataset>._LANG.<lang>].fetcher (using its own fetch ladder), writes the result into the shared cache, verifies sha256 if present, and exits non-zero on any failure. It produces no dataset bytes on stdout — the artifact lands in the cache on disk and the calling tool reads it from there.

Discovery and availability

Each language's CLI is discoverable on PATH under a language-specific name, e.g. datamanifest (Python), DataManifest or datamanifest-julia (Julia). The Python datamanifest CLI is the reference peer (the fallback target for cross-language fetch). Before delegating, a tool MUST probe that the peer CLI (and its runtime) is installed and usable; if the probe fails, the delegation rung is silently skipped and the ladder advances to rung 4 (uri download). Probe commands and PATH names are left to each implementation to document.
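
A minimal sketch of the calling side, using only the Python standard library. The function names are hypothetical, and the probe shown (a PATH lookup with shutil.which) is the weakest possible one: it does not check that the peer's runtime is usable, which a real implementation would also need to do.

```python
import shutil
import subprocess


def peer_fetch_command(name, manifest_path, datasets_folder=None, cli="datamanifest"):
    """Build the peer invocation described in the Invocation section."""
    cmd = [cli, "fetch", name, "--datasets-toml", str(manifest_path)]
    if datasets_folder is not None:
        cmd += ["--datasets-folder", str(datasets_folder)]
    return cmd


def delegate_fetch(name, manifest_path, datasets_folder=None, cli="datamanifest"):
    """Return True if the peer fetched the dataset into the shared cache.

    False means the caller should advance to the uri download rung, either
    because the probe failed or because the peer exited non-zero.
    """
    if shutil.which(cli) is None:  # minimal probe: is the CLI on PATH?
        return False
    cmd = peer_fetch_command(name, manifest_path, datasets_folder, cli)
    return subprocess.run(cmd).returncode == 0
```

The caller never reads dataset bytes from the subprocess; on success it looks for the artifact in the shared cache.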

Deprecations

The following v0 forms are still read for backward compatibility but SHOULD NOT be written by conforming v1 tools:

  • Per-dataset language-named flat fields julia= / python= / callable= (and any other <lang>=) — historically held inline code, which v1 forbids. Replaced by a module:function binding under [<dataset>._LANG.<lang>].fetcher (or the bare, language-implicit fetcher). Kept verbatim in the dataset's extra on read (no auto-rewrite, to avoid touching another language's data); migrate rewrites the ref-shaped ones. A tool MAY emit a one-time deprecation notice.
  • julia_modules / python_includes — retired; the manifest's directory is on the tool's import path by convention. Legacy *_includes values are still read as extra import-path entries for back-compat.

Note: bare fetcher / loader (language-implicit), bare shell (language-agnostic), and top-level [_LOADERS] are not deprecated — they are supported forms (see Language-implicit bindings and shell fetcher). Only the inline-code language-named fields above are legacy.

An opt-in datamanifest migrate command (not normative in this spec) may rewrite a v0 flat file to v1 _LANG form for the tool's own language.

Conformance notes

  • Readers MUST ignore unknown top-level tables and unknown fields rather than erroring, so that new datasets, new _LANG entries, and other tools' extension keys do not break an older reader.
  • Readers MUST preserve unknown _* structural keys verbatim — not treat them as datasets, not drop them on write.
  • Writers SHOULD omit derived fields (host, path, scheme) and any field left at its default value.
  • Writers MUST emit all keys, at every nesting level — top-level tables (structural _* and datasets alike) and the fields within each table, including keys nested in inline { } tables — sorted by Unicode code-point lexicographic order (the shared default of Python sorted() and Julia TOML.print(sorted=true)). No table is special-cased (no _LOADERS/_META-first). This guarantees semantic identity across tools — the same logical manifest round-trips to the same keys, values, and ordering through either tool (the weaker constraint that is always met). It does not, by itself, guarantee byte-for-byte identity: the Python (tomli_w) and Julia (TOML.print) serializers differ in cosmetic formatting (indentation, blank lines, inline-vs-multiline arrays), and current tooling does not always permit a one-to-one byte match. For literal byte-identity the Python tool is the normative reference for the canonical form, and a tool MAY route its output through it opt-in (datamanifest format, or Julia write(...; canonical=true)). Note: because _ (U+005F) sorts after uppercase but before lowercase ASCII letters, an uppercase dataset name sorts before the _* structural tables and a lowercase one after — intended; the canonical ordering is the requirement, not structural-table placement.
  • uri and uris are mutually exclusive on a single dataset.
  • A file with no [_META] section is read as schema v0 (legacy flat), leniently.
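
The ordering rule can be illustrated with a short Python sketch that canonicalizes parsed manifest data before it is handed to a TOML serializer. The function name is hypothetical; the sketch sorts keys at every nesting level and leaves array element order unchanged.

```python
def canonicalize(value):
    """Recursively sort table keys by Unicode code point.

    Python compares str values by code point, so plain sorted() gives the
    required order. Dicts preserve insertion order, so the result iterates
    in canonical order.
    """
    if isinstance(value, dict):
        return {key: canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [canonicalize(item) for item in value]
    return value


manifest = {"alpha": {"uri": "u", "format": "nc"}, "_META": {"schema": 1}, "Zeta": {}}
print(list(canonicalize(manifest)))  # ['Zeta', '_META', 'alpha']
```

The printed order shows the code-point effect noted above: the uppercase dataset name sorts before _META, and the lowercase one after it.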