Changelog
spec-v4.3 (schema _META.schema = 1), unreleased
Remote sources — object-store download schemes, a lazy_access mode for never-materialized
in-place access, and fail-loud identifier resolution. Additive (no _META.schema change); two
new optional dataset fields.
- Object-store uri schemes are normative. s3://, gs://, gcs://, az://, abfs://, abfss://, adl://, gdrive:// mean "fetch the named object, then verify sha256 as usual." The spec fixes the scheme set and semantics, not the mechanism: a tool fetches with any backend (the Python tool via fsspec; a peer tool via its own packages), and a tool that cannot serve a scheme delegates it or errors with unsupported scheme; it never silently skips it. HTTP/HTTPS keep their dedicated GET path and are deliberately not in this set. New Download schemes section; manifest.v3.json documents the uri schemes.
- lazy_access (new bool): an access mode. Open the uri in place via a loader instead of materializing a local copy: no copy, no checksum, no state-file record; a loader is required (a bare lazy_access is an error). The mechanism (streaming, an sshfs/FUSE mount, an object-store filesystem) is implementation-defined; this subsumes the former deferred standalone mount store (one materialization axis: download vs. in-place), so no mount capability is added. The "no materialization axis" / deferred-mount spec text is rewritten accordingly.
- skip_download clarified: a management mode (unchanged behavior, sharpened wording). It marks a passive, externally-managed dependency: not downloaded, not verified, never touched by maintenance (e.g. a large user-maintained archive). It is orthogonal to lazy_access (who manages the bytes vs. how they are read) and the two are not meant to combine. This un-overloads skip_download: in-place/on-the-fly access is now lazy_access, not skip_download + a loader.
- Identifier resolution is exact-or-error. Resolving a single dataset by name/alias/doi to more than one match is a fail-loud error naming the candidates, never a silent first-match, because a doi may be shared across split datasets. New Identifier resolution rule; the sync addressing contract now references it.
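The exact-or-error rule can be illustrated with a minimal sketch. The Dataset record, the AmbiguousIdentifier exception, and the resolve function are hypothetical names chosen for illustration; they are not part of the spec or of any tool's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Hypothetical in-memory record; only the identifier fields matter here.
    name: str
    aliases: list[str] = field(default_factory=list)
    doi: str | None = None


class AmbiguousIdentifier(LookupError):
    """Raised when one identifier matches more than one dataset."""


def resolve(identifier: str, datasets: list[Dataset]) -> Dataset:
    """Return the single dataset matching name, alias, or doi; fail loudly otherwise."""
    matches = [
        ds for ds in datasets
        if identifier == ds.name or identifier in ds.aliases or identifier == ds.doi
    ]
    if not matches:
        raise KeyError(f"no dataset matches {identifier!r}")
    if len(matches) > 1:
        # Name every candidate instead of silently taking the first match.
        names = ", ".join(sorted(ds.name for ds in matches))
        raise AmbiguousIdentifier(f"{identifier!r} is ambiguous; candidates: {names}")
    return matches[0]
```

With two split datasets sharing one doi, resolving by that doi raises and lists both names, while resolving by either name still succeeds.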
spec-v4.2 (schema _META.schema = 1), unreleased
Read pools — reuse a dataset or @cached result that already exists elsewhere on the
machine instead of fetching or recomputing it — plus two resolution/maintenance
clarifications. Additive: no _META.schema change.
- [_STORAGE].datasets_pools / datacache_pools. Optional lists of read-only locations added to read-resolution: probed after the recorded and directive-derived location and before downloading/producing. A datasets_pools hit is checksum-verified against the declared sha256 (mismatch skipped), recorded in the state file, and used in place with no copy (a genuine download still goes to datasets_dir; gold standard). datacache_pools is symmetric (<pool>/<cachetype>[/<version>]/<hash>, config.toml-gated). Defaults differ by trust: datasets_pools absent ⇒ built-in well-known pools (~/.cache/Datasets, $user_data_dir/datamanifest/datasets), since a fetched dataset is checksum-verifiable; datacache_pools absent ⇒ no pools (opt-in), since produced artifacts have no de-facto shared location and no content checksum. An explicit list is used verbatim; an empty list disables pools. Host-composable via [_STORAGE._HOST]; env DATAMANIFEST_DATASETS_POOLS / DATAMANIFEST_DATACACHE_POOLS. manifest.v3.json types both.
- Single recorded location (multiple locations decided out). The state file keeps one storage_path per object: the last location it was found at or written to, refreshed by self-heal, never grown into a set. Resolution only needs to find one copy; a stray second copy reads as untracked and is cleaned up explicitly, and a shared copy elsewhere is found via a read pool. (Reclassifies the spec-v4.1 ROADMAP deferral as out-of-scope.)
- Maintenance applies on the explicit selection. A filtered list … --delete/--move is itself the explicit user selection, so a tool MAY apply directly and SHOULD offer a --dry-run preview (the spec-v4.1 "default to a dry run" wording is relaxed; the hard rules stand: never delete everything by default, never as a side effect).
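The probe order for a fetched dataset (recorded location, then directive-derived location, then read pools, then download) might look like the sketch below. It assumes a single-file dataset and a <pool>/<key> layout inside each pool; find_existing and sha256_of are hypothetical helper names, not the reference tool's API.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_existing(key, expected_sha256, recorded, derived, pools):
    """Return (path, source) for an existing copy, or None when a download is needed.

    Assumption: a pool stores the dataset at <pool>/<key>.
    """
    # 1. Recorded location first, then the directive-derived one.
    for source, candidate in (("recorded", recorded), ("derived", derived)):
        if candidate is not None and Path(candidate).is_file():
            return Path(candidate), source
    # 2. Read pools: a hit must match the declared checksum, else it is skipped.
    for pool in pools:
        candidate = Path(pool) / key
        if candidate.is_file() and sha256_of(candidate) == expected_sha256:
            return candidate, "pool"  # used in place, no copy
    # 3. Nothing usable found: the caller downloads into datasets_dir.
    return None
```

A pool copy with the wrong checksum is passed over without error, so a later pool (or the download) still gets its chance.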
spec-v4.1 (schema _META.schema = 1), unreleased
Unify the produced-only cached.toml index into a single git-ignored state file,
.datamanifest-state.toml. Splits the committed spec (datasets.toml = what to
track and how — the expectation) from regenerable local state (where each object
actually landed — the ground truth). The manifest's _META.schema is unchanged (still 1);
this is a structural change to the sibling index format (its own _META.schema = 5).
- One inventory for both kinds. The state file records fetched datasets and produced artifacts under two top-level namespaces, parallel to the two storage folders:
  - datasets: storage key ⇒ resolved storage_path + actual sha256 (omitted under skip_checksum);
  - datacache: cachetype[@version] ⇒ a ref/format recipe + an instances table mapping each parameter hash to its full artifact directory (@ is the reserved version separator).
  Params leave the index; they live in each artifact's config.toml.
- Read-only inventory; the directive is the gold standard. The state file records where things are and is consulted to find an existing object; it never directs a write. Every (re)materialization follows the current directive (datasets_dir / datacache_dir / per-dataset storage_path / @cached(storage_path=…)). Read-resolution checks the recorded location first, ahead of any derivation rule, so a moved object is found at its new home.
- Git-ignored by default. Artifacts are local, often outside the repo, and not re-fetchable, so the inventory is regenerable per-machine state, not a committed reproducibility lock. (A shared-drive project MAY track it, but that is a user's setup, not the design intent.) The Manifest.toml-analogue / "applications commit it" framing is dropped.
- Self-heal additive, removal explicit-only. Active resolution refreshes a relocated record, registers an untracked object, and re-materializes a missing one, but never deletes. A tool MAY label each object clean/missing/relocated/untracked/modified; passive listing only labels. Two explicit user actions reconcile: --refresh (fix the state file only: re-point relocated, drop stale) and --delete (remove bytes + entry, the sole byte remover). Maintenance (--delete/--move) now spans fetched datasets too, with the user-managed-path / skip_download skip-guard generalized to both kinds.
- Concurrency. Every write re-reads + merges (additive union, last-writer-wins per object) + atomic-renames, so parallel downloads/produces don't clobber each other.
- Schema. New state.v4.json validates the state file (datasets + datacache, _META.schema = 5), superseding cached.v3.json (kept for the legacy cached.toml shapes, _META.schema 1–4, which conforming readers migrate forward; cached.toml is the recognized legacy name). metadata-sidecar.v3.json renames the cached_toml back-pointer to state_file.
- Spec vs. state: same fields, two meanings. A dataset's storage_path and sha256 live in both files on purpose: in datasets.toml they are the expectation (directive / contract), in the state file the ground truth (resolved / actual). Intentional, harmless duplication; fully separating them (expected-vs-actual sha256, resolved storage_path, multiple recorded locations, the modified state) is deferred; see ROADMAP.md.
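The concurrency rule (re-read, merge, atomic rename) could be sketched as follows. JSON stands in for the TOML state file only because the Python standard library has no TOML writer; merge_state and write_state are hypothetical names, and the sketch adds no locking beyond what the bullet describes.

```python
import json
import os
import tempfile
from pathlib import Path


def merge_state(on_disk: dict, mine: dict) -> dict:
    """Additive union per namespace; on a key collision the current writer wins."""
    merged = {ns: dict(entries) for ns, entries in on_disk.items()}
    for ns, entries in mine.items():
        merged.setdefault(ns, {}).update(entries)
    return merged


def write_state(path: Path, mine: dict) -> dict:
    """Re-read the file, merge our entries in, and publish with an atomic rename."""
    on_disk = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_state(on_disk, mine)
    # Stage in the same directory so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(merged, fh, indent=2, sort_keys=True)
        os.replace(tmp, path)  # atomic replacement of the destination
    except BaseException:
        os.unlink(tmp)
        raise
    return merged
```

Two writers that each add a different dataset both end up in the file; if they record the same object, the later write's entry replaces the earlier one.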
spec-v4 (schema _META.schema = 1), unreleased
A breaking storage-layout revision that radically simplifies storage to two paths and
makes everything local by default. The TOML shape stays additive (_META.schema = 1);
the break is in where bytes land on disk (versioned on the spec-tag axis, as spec-v3 was).
Existing stores need migration or a clean re-fetch. The machinery of the earlier spec-v4
drafts — scope, content prefixes, the datamanifest appname, project-name derivation,
DATAMANIFEST_DIR, and the scope/prefix ladders — is removed in favor of letting the user
write the two paths directly.
- Two folder fields, local by default. [_STORAGE].datasets_dir (fetched) and datacache_dir (produced) are the whole model. They default to the relative paths "datasets" / "cached" ⇒ repo-relative ⇒ visible ./datasets/, ./cached/. A fetched dataset lands at <datasets_dir>/<key>, a produced artifact at <datacache_dir>/<cachetype>/[<version>/]<hash>/, with no scope, no prefix, and no derived name in between. Adding a pyproject.toml no longer moves anything.
- $-symbols. Predefined $user_data_dir / $user_cache_dir (straight from platformdirs, bare, with no app name) and $repo; any other bare [_STORAGE] key is a user-defined symbol, made host-specific in [_STORAGE._HOST."<glob>"]. $USER/env and ~ expand. Centralize/share with one edit, e.g. datasets_dir = "$user_data_dir/myproj".
- Per-dataset storage_path replaces both the former store and local_path. Default $datasets_dir/$key; contains $key ⇒ tool-managed/keyed, an exact path without $key ⇒ user-managed and never touched by maintenance. (Named storage_path, not path, because path is the URI's parsed component.)
- Two environment variables, DATAMANIFEST_DATASETS_DIR / DATAMANIFEST_DATACACHE_DIR (user symbols override as DATAMANIFEST_<NAME>); $user_data_dir / $user_cache_dir keep per-OS resolution. DATAMANIFEST_DIR, DATAMANIFEST_SCOPE, DATAMANIFEST_PREFIX_* are gone.
- Partition-local staging. Materialization stages within the target folder's partition (a $scratch dataset stages on $scratch). This is required for an atomic rename and ensures voluminous data never transits a small ~/.cache. A tool's app-internal files live alongside, never in a separate global folder.
- cached.toml drops scope (and the recipe store). Recipes are keyed by (cachetype, version); the on-disk location is the manifest's datacache_dir, and reachability is (cachetype, version, hash). (Schema-2 nested structure, hit self-healing, and the index lifecycle are unchanged from spec-v3.7.)
- In-memory & multiple manifests (library use). A manifest is a logical structure that MAY be built in memory and several MAY be live at once, each resolving independently (the API is per-language, Python / Julia Database; whether/where it persists is the author's choice). Recommended: a library shipping data dependencies owns them this way, with its own manifest and explicit datasets_dir / datacache_dir (e.g. under $user_data_dir/<library>), rather than touching the end user's datasets.toml; without explicit folders its data falls back to the end user's project, who owns the location.
- manifest.v3.json types [_STORAGE].datasets_dir / datacache_dir and the dataset storage_path; cached.v3.json drops the recipe scope / store. README and the reference guide carry the storage recipe (shared downloads pool + per-project versioned cache, per-host roots).
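The two-path layout and the storage_path classification above reduce to a few lines of path composition. This is a sketch only: dataset_path, artifact_path, and is_tool_managed are hypothetical helper names, and $-symbol expansion is left out.

```python
from pathlib import PurePosixPath


def dataset_path(datasets_dir: str, key: str) -> PurePosixPath:
    """A fetched dataset lands at <datasets_dir>/<key>."""
    return PurePosixPath(datasets_dir) / key


def artifact_path(datacache_dir, cachetype, hash_, version=None) -> PurePosixPath:
    """A produced artifact lands at <datacache_dir>/<cachetype>/[<version>/]<hash>."""
    parts = [cachetype] + ([version] if version else []) + [hash_]
    return PurePosixPath(datacache_dir).joinpath(*parts)


def is_tool_managed(storage_path: str) -> bool:
    """A storage_path containing $key is tool-managed; an exact path is user-managed."""
    return "$key" in storage_path
```

For example, artifact_path("cached", "features", "ab12", version="v2") gives cached/features/v2/ab12, and the version segment simply disappears when no version is set.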
spec-v3.7 (schema _META.schema = 1)
Reconcile the produced-dataset / cache model with the implementation: the cached.toml index
becomes a self-healing, nested schema-2 registry, and the produced-dataset identity model
(cachetype / scope / conflicts / default format) is formalized cross-language. The manifest
_META.schema stays 1; cached.toml's own _META.schema goes 1 → 2 (schema 1 is
still read, always rewritten as 2).
- cached.toml field project → scope. The declarable knob is named for what it sets (the scope partition), not for the project id that merely defaults it. There is no per-dataset project; scope is recorded per recipe (parallel to a dataset's store).
- cached.toml index lifecycle: self-healing transparency view. The index is reframed from a write-once log to a per-machine transparency view that converges to the artifacts present on disk. A tool registers on produce (miss), registers-if-missing on cache hit (so a deleted cached.toml repopulates as datasets are accessed; the steady-state check is read-only, off the hot path), and prunes a stale entry when it observes the artifact is gone (single ops reconcile their own entry, inspect the whole file). Pruning a dangling pointer is bookkeeping and does not fall under the "never auto-delete data" rule (which protects present bytes). On-disk config.toml stays the cache-validity authority; index mutations use the produce atomic-write+lock and are idempotent. metadata.toml provenance remains write-if-absent (hits don't re-stamp it); only the index re-registers.
- Produced-dataset identity model (cross-language).
  - (i) cachetype default + stable-name rule: absent an explicit cachetype, it derives from the producing function's canonical importable name (so it coincides with the entry ref); unique-per-function is the right default (mixing unrelated caches is the worst failure), with the accepted rename-orphans consequence and version as the deliberate-bust tool; and when the function has no stable importable identity (script / REPL / eval / notebook) a tool MUST require an explicit cachetype, never synthesize one (the generalized pickle constraint).
  - (ii) scope is ownership, not disambiguation: resolved from the caller's project, isolation-by-default / share-by-opt-in; never affects hit validity.
  - (iii) (cachetype, version) conflict guard: a tool SHOULD raise when two distinct functions claim the same pair live in one process at once (same-process/same-time only; scope irrelevant). Transient/local/anonymous functions are exempt; when and how to detect is left to each implementer (the earliest practical point for the language: import time in Python; trickier under Julia's precompilation, and possibly infeasible there), hence a SHOULD with no fixed mechanism.
  - (iv) Format coexists under one hash: format is not a hash input, so several formats share a <cachetype>/[<version>/]<hash> dir; a hit requires the data.<ext> for the requested format, else recompute.
  - (v) Per-language default format (RECOMMENDED, not normative): pickle (Python) / jld2 (Julia), each with a built-in saver + loader, so a format-less produced dataset round-trips.
- cached.toml schema 2 (nested) + scope ladder + centralized [_STORAGE].
  - (a) Schema 2 is nested: cached.toml becomes a produced array of recipe tables keyed by (scope, cachetype, version), each with one instances entry per produced variation (parameter hash + the params key table). Registering accumulates instances (so a recipe's many parameterizations all stay reachable instead of orphaning); recipe-level ref / format / store are refreshed on register and on hit-if-drifted (not hash inputs). Schema 1 (flat, one hash, no params) is still read (→ one-instance recipe) but rewritten as schema 2.
  - (b) Scope resolution ladder: a producing-call scope= override (highest) → DATAMANIFEST_SCOPE_CACHED → [_STORAGE._SCOPE].cached → project id; scope = "" is one global unscoped store; the scope is resolved once and drives both the path and the recorded entry (no divergence), and reachability is the full (scope, cachetype, version, hash) tuple.
  - (c) Centralized storage: a produced artifact reads the nearest manifest's [_STORAGE] (a plain TOML read, no fetch layer), so produced and fetched data share one storage configuration; env overrides win, an explicit config wins over the manifest. The cached.v3.json schema now validates the nested schema-2 form.
spec-v3.6 (schema _META.schema = 1)
Replace the spec-v3.4 "warn and fall through" rule for language-implicit bindings with fail-loud semantics.
- A binding that is present for the running language (a bare fetcher/loader, or an explicit [<ds>._LANG.<self>]) and fails to resolve is an error; one that resolves and then raises propagates. No silent fall-through to a different loader/fetcher (which could hand a program wrong-shaped data behind only a warning, especially for a loader falling through to the format default).
- The fetch/load ladders still fall through only for bindings that are absent for the running language (e.g. another language's _LANG.<other> fetcher), unchanged.
- Multi-language manifests use explicit [<ds>._LANG.<lang>] bindings (absent, and so correctly skipped, in other languages); bare bindings are the single-language form.
- No --lenient flag. A tool-wide best-effort mode (e.g. "fetch all, skip failures") is a separate, broader concern, intentionally not introduced by this rule.
- Docs/semantics only; no schema change. _META.schema stays 1.
spec-v3.5 (schema _META.schema = 1)
Move the shell fetcher out of the _LANG namespace. shell is language-agnostic (the
same command for every tool), so it is now a bare dataset field — a command template run as
a subprocess — rather than a pseudo-language under [<ds>._LANG.shell].
- shell = "<command template>" on the dataset is the canonical form; same $var substitutions as before; fetcher only.
- The legacy [<ds>._LANG.shell].fetcher is still read and preserved verbatim.
- schemas/manifest.v3.json: datasets gain a shell string field.
- The bare julia= / python= / callable= flat fields stay legacy (they historically held inline code, which v1 forbids): tolerated on read and rewritten by migrate; not a forward form. The single-language convenience is already covered by bare fetcher.
- Reconciled the Deprecations section with spec-v3.4: bare fetcher/loader (language-implicit), bare shell (language-agnostic), and [_LOADERS] are supported forms, not deprecated.
- _META.schema stays 1: additive field + a relocation with legacy tolerance; spec-tag axis.
spec-v3.4 (schema _META.schema = 1)
Language-implicit ("bare") bindings, so a single-language project can skip the
[<dataset>._LANG.<lang>] ceremony. A dataset MAY carry a bare fetcher/loader
directly, and a top-level [_LOADERS] MAY carry a bare format → binding map; a reading
tool interprets these as bindings in its own language.
- Precedence: an explicit [<dataset>._LANG.<self>] binding (and [_LANG.<self>.loaders]) overrides the bare one.
- Tolerant: a bare binding that does not resolve in the running language warns and falls through the ladder; it is never a hard error (a shared single-author manifest will legitimately fail in the other language).
- Round-trip: a writer preserves a bare binding verbatim and never promotes it into _LANG.<self>; tools write _LANG.<self> only for bindings they generate.
- [_LOADERS] is reclassified from deprecated back-compat to the tolerated language-implicit counterpart of [_LANG.<self>.loaders].
- schemas/manifest.v3.json: datasets gain fetcher/loader (binding-typed); _LOADERS is a format → binding map.
- _META.schema stays 1: additive (new optional fields + a tolerant read rule); versioned on the spec-tag axis.
spec-v3.3 (schema _META.schema = 1)
Harmonize executable bindings into one form, used identically at every site. A
binding — a per-dataset fetcher/loader and now every entry in a
[_LANG.<lang>.loaders] format map — is either a module:function string or a
{ ref, args, kwargs } table.
- The string is an alias for the ref-only table ("M:f" ≡ { ref = "M:f" }); a reader accepts either form anywhere a binding is allowed.
- Writer rule: a binding with no args and no kwargs MUST be written as the string; the table form is used only when it carries arguments.
- Call semantics follow the arguments: none → the tool's conventional call (path injected for a loader, fetch context for a fetcher); args/kwargs → an explicit ref(*args; kwargs...) with $var substitution and no auto-injection.
- schemas/manifest.v3.json: [_LANG.<lang>.loaders] values now accept the table form (were string-only), via the shared binding definition. shell.fetcher stays a command-template string (not a module:function binding).
- _META.schema stays 1: the TOML shape is back-compatible (a widening plus a writer rule); versioned on the spec-tag axis.
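The reader and writer rules for the two binding forms can be sketched as a pair of functions over already-parsed TOML values (a string or a dict). normalize_binding and serialize_binding are hypothetical names used only for illustration.

```python
def normalize_binding(binding):
    """Reader side: accept either form and return a full {ref, args, kwargs} dict."""
    if isinstance(binding, str):
        # "M:f" is an alias for { ref = "M:f" }.
        return {"ref": binding, "args": [], "kwargs": {}}
    return {
        "ref": binding["ref"],
        "args": list(binding.get("args", [])),
        "kwargs": dict(binding.get("kwargs", {})),
    }


def serialize_binding(binding):
    """Writer side: the string form when there are no arguments, the table otherwise."""
    full = normalize_binding(binding)
    if not full["args"] and not full["kwargs"]:
        return full["ref"]
    out = {"ref": full["ref"]}
    if full["args"]:
        out["args"] = full["args"]
    if full["kwargs"]:
        out["kwargs"] = full["kwargs"]
    return out
```

A ref-only table therefore round-trips to the plain string, while a binding carrying kwargs keeps the table form.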
spec-v3.2 (schema _META.schema = 1)
Fix the inspect last-access rule. The previous wording ("a tool SHOULD touch an
entry's access time on read") invited a write-on-read implementation — rewriting a
sidecar/index .toml on every read — which contends with the produce .lock,
serializes concurrent readers, and puts I/O on the lock-free hot path, all for an
advisory value.
- last-access is now filesystem-derived and never written on read: a tool reads it from stat (access time, with modification-time / created fallback) at inspect time, and MUST NOT rewrite any sidecar/index/.toml to record access.
- The signal is explicitly coarse and may be unknown (relatime, noatime, network/read-only filesystems); created (stamped once at produce time) is the always-available age signal.
- _META.schema stays 1: no data-model change; this is a behavioural correction on the spec-tag axis.
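A read-only last-access lookup along these lines might be written as below. The last_access name and the "timestamp greater than zero" test for an unusable value are assumptions of this sketch; the spec only fixes the fallback order and the ban on writing.

```python
from __future__ import annotations

import os
from datetime import datetime, timezone


def last_access(path, created: datetime | None = None) -> datetime | None:
    """Best-effort last-access time read from the filesystem; never writes anything."""
    try:
        st = os.stat(path)
    except OSError:
        # Artifact cannot be stat-ed: fall back to the produce-time stamp, if any.
        return created
    # Access time first, then modification time; both are advisory and coarse.
    for ts in (st.st_atime, st.st_mtime):
        if ts > 0:
            return datetime.fromtimestamp(ts, tz=timezone.utc)
    return created
```

On a noatime or relatime mount the returned value can lag well behind real use, which is why the created stamp remains the dependable age signal.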
spec-v3.1 (schema _META.schema = 1)
Refinement of the parameter-hash rules: finite floats are now permitted as hash
inputs. A float serializes through the same canonical-JSON projection as every other
value — the Python reference json.dumps form is normative (1.0 → 1.0, 0.1 →
0.1) and a non-Python tool MUST reproduce it byte-for-byte. NaN / ±Inf remain
disallowed (no JSON representation) and nulls remain disallowed (an absent parameter is
omitted, not encoded as null). Passing a float-valued knob as a string is still
allowed and remains the most cross-tool-stable option.
- SCHEMA.md §Parameter-hash keying: the value-restriction now lists finite floats and pins their canonical form.
- schemas/config-sidecar.v3.json: hashValue now admits number.
- New conformance fixture config_sidecar_float pins a float reference vector ({"grid":"5x5","sigma":0.5,"threshold":1.0} → SHA-256 acc37c63…).
- _META.schema stays 1: the data-model shape is unchanged; this is a hash-input validation widening on the spec-tag axis (the v2 → v2.1 convention).
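A sketch of float-aware parameter hashing under these rules follows. The sorted keys and compact separators are assumptions inferred from the shape of the fixture vector above, and the sketch makes no claim to reproduce the fixture's digest; canonical_json and param_hash are hypothetical names, and only top-level values are validated.

```python
import hashlib
import json
import math


def canonical_json(params: dict) -> str:
    """Serialize hash-affecting parameters; floats use Python's json.dumps form."""
    for name, value in params.items():
        if value is None:
            raise ValueError(f"{name}: null is not allowed; omit the parameter instead")
        if isinstance(value, float) and not math.isfinite(value):
            raise ValueError(f"{name}: NaN and infinities have no JSON representation")
    # Assumed projection: sorted keys, no whitespace; allow_nan=False is a second guard.
    return json.dumps(params, sort_keys=True, separators=(",", ":"), allow_nan=False)


def param_hash(params: dict) -> str:
    """SHA-256 hex digest of the canonical JSON projection."""
    return hashlib.sha256(canonical_json(params).encode("utf-8")).hexdigest()
```

Key order in the input dict does not change the digest, and 1.0 is written as 1.0 rather than 1, which is the detail a non-Python tool has to match byte-for-byte.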
spec-v3 (schema _META.schema = 1)
A breaking behavioral revision of the storage and cache model. _META.schema stays
1 — the TOML shape is back-compatible (changes are resolution semantics + additive
structural tables); the break is in resolution and layout, which is what the spec-tag
axis versions. Nothing implemented spec-v2 storage yet, so practical migration is nil.
- Storage: top-level folder roots + layer-applied prefixes + scope. Folder variables ($data / $cache / $repo / user-defined) are now bare top-level roots ($data = user_data_dir, not …/Datasets). The lowercase content prefixes datasets/ (fetch) and cached/ (produce) are applied by the consuming layer, configurable via [_STORAGE._PREFIX] / DATAMANIFEST_PREFIX_*. A new scope partition segment ([_STORAGE._SCOPE] / DATAMANIFEST_SCOPE_*) controls sharing: empty for datasets (shared), the project id for cached (project-isolated). New DATAMANIFEST_DIR application base. Breaking: [_STORAGE] folder values drop their /datasets suffixes; fetched paths become <root>/datasets/[scope/]<key>. _PROFILE is shelved (reserved, preserved verbatim); _HOST kept.
- Produced datasets: composition + recipe version. A produced artifact composes its path via folder / cached prefix / scope: <folder>/cached/<project-id>/<cachetype>/[<version>/]<hash>. The cached scope defaults to the project id (declared [_META].project → pyproject.toml / Project.toml name/uuid → path hash). New optional version path segment: a human-set recipe/code version (not in the parameter hash) that prevents a stale cross-branch/clone hit.
- inspect (renamed from cache-gc): user-driven maintenance, no automatic collector. Replaces root-reachability GC (which had a read-only-consumer hole) with a field-oriented store listing: enumerate objects (datasets + cached) by kind, key/hash, location, referenced/orphan, scope, format, size, created, last-access; filter; and act on an explicit selection (delete, optional move). Never deletes by default; liveness / last-access are advisory. Reference CLI: datamanifest list … --delete.
- sync: cross-machine transfer. New optional capability: push/pull a stored object between two stores over SSH/rsync, addressed by name/alias/doi (fetched) or cachetype[/version]/hash (produced). Each end resolves its own store from env + _HOST ($repo excluded); symmetric; writes no manifest (objects arrive as orphans); integrity via rsync; idempotent.
spec-v2.1 (schema _META.schema = 1)
Prose-only correction on the spec-document axis — no _META.schema bump, no on-disk
format change, fixtures unaffected.
- Produce-or-load is a layer, not necessarily a separate package. spec-v2 baked a distribution decision into the format spec ("lives in a companion package that depends on datamanifest"). spec-v2.1 separates the two concerns it conflated: it keeps the normative capability boundary (cache-produce / cache-gc are never declared by the core fetch capability; the core keeps no GC and no disposability) and relaxes the packaging mandate, so shipping the layer as a separate package or as an optional module of the same package is now explicitly the implementation's choice. Rationale and the per-language packaging asymmetry: design/package-architecture.md.
spec-v2 (schema _META.schema = 1)
Two changes, both on the spec-document axis (no _META.schema bump — see
design/storage-model-revision.md):
- Storage model revision. Stores-with-policy become a $-folder-variable namespace: locations only, no lifetime policy in the core. This revises the spec-v1.1 store / [_STORAGE] semantics (the store value-grammar changes; bare names are hard-migrated to $-form), gated by the storage capability.
- Produce-or-load as a companion layer. Promotes the @cached design (design/caching-and-dataset-storage.md §6.D) as a cross-tool format spec, but the layer itself lives in a companion package (one per language), not the core. Additive over a datasets.toml: no new hand-authored field, no schema change; a produced dataset is recorded only in machine-generated sidecars + the cached.toml index (each carrying its own _META.schema = 1). cache-produce / cache-gc are declared by the companion, not the core; the core keeps no GC and no disposability.
Storage model revision (storage)
- Folders are a $-variable namespace. [_STORAGE] holds folder variables (built-in $data / $cache / $repo plus any user-defined key, such as scratch = "…" → $scratch) and the new project-wide default selector. Built-ins resolve to dataset-root locations: $data = user_data_dir("datamanifest")/Datasets, $cache = user_cache_dir("datamanifest")/Datasets, $repo = <project_root>/datasets (the exact v1.1 on-disk paths, so no re-download).
- Two field kinds. Selectors (default, a dataset's store) are $-folder references, optionally with a sub-path ($cache/sub), keying the dataset at <resolved-folder>[/sub]/<key>; store defaults to default, default to $data. Path expressions ([_STORAGE] values, local_path) are full paths interpolating $-folders, $USER/env, and ~.
- $-references only (hard migration). Bare store = "data" (the v1.1 form) is no longer valid; a spec-v2 storage tool MUST reject it. Bare keys appear only as folder definitions in [_STORAGE].
- One host-aware resolution ladder for every variable (built-in and user-defined): DATAMANIFEST_<NAME>_DIR env → _PROFILE.<name> → _HOST.<glob>.<name> → [_STORAGE].<name> → built-in default. Host-specificity lives entirely in resolving the variable: there is no per-dataset _HOST map; a machine-specific exact path is a local_path interpolating a host-resolved variable.
- mount removed from the model. A locations-only model has no home for never-materialized in-place access; spec-v2 defines no mount capability. In-place access is deferred to a future revision; see ROADMAP.md.
Produce-or-load (companion-layer) features
- Produced datasets. A dataset whose bytes come from running a project function rather than a uri. It has no datasets.toml entry: it originates from the @cached surface and is recorded only in machine-generated files. cachetype is not a datasets.toml field; it is a namespace that appears only in those records (the cached.toml entry, the config.toml _META, and the on-disk path). Defaults to store = "$cache" and is keyed by a parameter hash rather than host/path/version. Unifies external-vs-produced into one "recipe + key + store + policy" object; the only new axis is parameter-hash keying.
- Parameter-hash keying. A produced dataset's key is <cachetype>/<hash>, where the hash is the SHA-256 of the canonical JSON (JCS, RFC 8785) of its hash-affecting parameters: the producing function's keyword parameters (produced datasets are keyword-only; positional args have no stable name→value identity to hash). A three-way parameter split is normative: hash-affecting params (in the hash, in config.toml) vs _-prefixed runtime knobs (excluded) vs audit-only extras (in metadata.toml).
- Self-describing sidecars (cache-produce). config.toml (the re-hashable key table + _META.cachetype / hash) and metadata.toml (provenance: created, tool+version, host, user, [git], [origin]) sit next to each produced artifact, materialized via the v1.1 safe-materialization primitive.
- cached.toml index (cache-gc). A sibling Manifest.toml-analogue listing produced datasets by portable key (cachetype + hash), kept out of the hand-authored datasets.toml. Gitignored per-machine by default; opt-in commit for shared reproducibility.
- Garbage collection (cache-gc). The companion's gc is a root-reachability collector: roots are still-existing datasets.toml (incl. $cache-folder entries) and cached.toml files, discovered via a depot-level usage log. An artifact under $cache is collectable iff no live root references its key and it is older than a grace age; the per-artifact back-pointer is audit-only. The core keeps no GC.
- New capabilities. cache-produce (produced datasets + sidecars) and cache-gc (the cached.toml index + usage log + gc), both declared by the companion package, not the core fetch tool.
Cross-language fetch (rung 3) clarified
- delegate is now a defined field (per-dataset bool) plus the per-run --delegate flag, and a brief SCHEMA.md §Cross-language fetch frames the rung as the rare case (a dataset whose bytes need a fetcher in another language, with no native/shell/uri).
- Mechanism left to the implementation: a tool may call the other language's runtime directly or fall back to the Python CLI (the reference implementation, which aims to cover every language). Fall-through to uri when the toolchain is absent. Does not extend to produced (@cached) datasets.
Explicitly deferred
- The @cached macro/decorator API is per-language, not normative (only the on-disk formats + GC rule are).
- Produced-artifact serialization format is a per-tool/per-format choice (no cross-language loading implied).
- Cloud / fsspec / CAS backends remain optional per-language extras, not a spec contract. In-place / mounted access (the former mount store) is deferred past spec-v2; see ROADMAP.md.
- Where the companion package keeps its own app-internal state is a companion concern, not part of this cross-tool format spec.
spec-v1.1 (schema _META.schema = 1, additive)
Additive storage model — no schema bump (old readers preserve the new field/table verbatim), tracked on the spec-document axis. New capabilities gate it.
New features
- Storage model (store field + [_STORAGE]). A dataset is materialized into a named store: data (persistent, default), cache (disposable), repo (project-tracked), or mount (transient, accessed in place). Stores have two policy axes: materialization (local/mount) and retention. The optional [_STORAGE] structural table configures each store's root, with _HOST (hostname glob/regex) and _PROFILE override sub-tables.
- Language-independent default locations. Default data / cache roots follow the platformdirs user_data_dir / user_cache_dir conventions and are normative, so Python and Julia resolve the same dataset to the same path (at minimum for reading) and genuinely share a store. Read resolution MUST cover the canonical locations.
- New capabilities. storage (honor store + [_STORAGE] resolution) and mount (the transient mounted store). Tools without storage preserve store / [_STORAGE] verbatim. mount mechanics are not yet specified; tools should not advertise it yet.
- Canonical key ordering. All keys at every level are emitted in Unicode code-point lexicographic order (no _LOADERS / _META-first special case), so a logical manifest serializes to byte-identical output across tools. New byte-identity capability + a planned cross-tool fixture guard it. (Previously each tool sorted differently, Python by dataclass field order and Julia alphabetically, causing churn.)
- Parameterized bindings. A per-dataset fetcher/loader may be a { ref, args, kwargs } table instead of a bare string, so one function is reused across datasets that differ only in arguments. args is an ordered positional array and kwargs a keyword table (both plain data, never code), and the tool calls ref(*args; kwargs) explicitly (no auto-injection), with shell-style $var substitution in string values. Values with no TOML type (e.g. a Julia Symbol) are written as plain strings. New binding-args capability; a tool that runs the language but lacks it MUST error on args/kwargs rather than ignore them.
- Normative resolution & concurrency. Fixed read order (repo → data → cache), shared env-var names (DATAMANIFEST_DATA_DIR / _CACHE_DIR / DATAMANIFEST_PROFILE) and precedence, and a cross-tool concurrency convention (atomic publish, .complete marker, .lock pidfile) so peer tools share a store safely. platformdirs is the reference for default paths. sha256 is verified at fetch, not re-verified on load.
v1 (schema _META.schema = 1)
Breaking structural changes
- Structural _* keys. Keys beginning with _ are reserved at the top level (_META, _LANG, legacy _LOADERS) and within a dataset table (_LANG). They are not datasets and must not be treated as such.
- _LANG namespace. Per-dataset executable bindings (fetcher, loader) now live under [<dataset>._LANG.<lang>]. Project-wide format defaults live under [_LANG.<lang>.loaders]. The flat per-dataset julia= / python= / callable= / shell= / loader= keys are deprecated.
- module:function refs only. All executable references are module:function strings. Inline code (e.g. Julia include_string) and *_modules / *_includes fields are retired; the tool puts the manifest's directory on the import path by convention.
- [_META] header. A v1 manifest carries [_META] with schema = 1. A file without [_META] is read as v0 (legacy flat), leniently.
New features
- Resolution ladders. The fetch ladder is: own language → shell → (opt-in) peer-CLI delegation → uri → error. The load ladder is: own → manifest format default → built-in default → error. Load never delegates across a process boundary.
- Preservation contract. A conforming writer regenerates its own _LANG.<self> and copies every other _LANG.* verbatim on write, ensuring lossless round-trips in multi-language projects.
- Conformance levels. Named capabilities (lang-read, lang-write, shell-fetch, delegation) let partial implementations declare what they support and run the matching fixture-suite tests. The spec is never forked per package.
- Peer-CLI contract. Normative invocation interface for opt-in delegation: datamanifest fetch <name> --datasets-toml <path>.
- Conformance fixture suite. tests/fixtures/ holds example manifests and machine-readable expected outcomes consumed as tests by all implementations.
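The v1 fetch ladder can be read as a rung-selection function over a parsed dataset table. This sketch uses the v1 layout, where the shell fetcher lives under [<dataset>._LANG.shell]; choose_fetch_rung is a hypothetical name, and it only picks the rung without running anything.

```python
def choose_fetch_rung(dataset: dict, lang: str, delegate: bool = False) -> str:
    """Pick the fetch rung: own language, shell, peer delegation (opt-in), uri, or error."""
    bindings = dataset.get("_LANG", {})
    # Rung 1: a fetcher bound for the running language.
    if "fetcher" in bindings.get(lang, {}):
        return "own"
    # Rung 2: the language-agnostic shell command template.
    if "fetcher" in bindings.get("shell", {}):
        return "shell"
    # Rung 3: another language's fetcher, only when delegation is opted into.
    has_peer = any(
        "fetcher" in table
        for name, table in bindings.items()
        if name not in (lang, "shell")
    )
    if delegate and has_peer:
        return "delegate"
    # Rung 4: plain download of the declared uri.
    if "uri" in dataset:
        return "uri"
    raise LookupError("no fetch rung applies to this dataset")
```

A dataset with only a Julia fetcher and a uri resolves to uri from Python by default, and to delegate only when the caller opts in.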
Deprecations (still read; should not be written by v1 tools)
- [_LOADERS]: replaced by [_LANG.<lang>.loaders].
- Per-dataset julia=, python=, callable=, shell=, loader=: replaced by [<dataset>._LANG.<lang>].fetcher / .loader.
- julia_modules, python_includes: retired; legacy *_includes still accepted as extra import-path entries.
v0 (no [_META] header)
Original flat format: one table per dataset with optional top-level [_LOADERS]
and per-dataset language-specific keys (julia=, python=, callable=, etc.).