datamanifest.toml
A small, normative specification for a TOML file that declares the data dependencies of a scientific project — read by tools in different languages.
- One manifest, many languages. A single datasets.toml declares each dataset's source, checksum, format, and how to fetch and load it, and the same file is read unchanged by tools in Python and Julia.
- Fetch, verify, extract, load. A tool downloads the dataset, verifies its checksum, unpacks the archive, and hands your code the local path, re-fetching only when it's missing. Add a format and it loads the data into a native object too.
- Portable, local-by-default storage. Fetched datasets and produced artifacts live in repo-relative folders by default, and can be centralized per host via [_STORAGE._HOST] glob rules without touching the rest of the manifest.
- Produce-or-load caching. An optional companion layer keys produced artifacts by a hash of their parameters, so derived data is rebuilt only when its inputs change.
- Normative and conformance-tested. The prose spec is the source of truth, backed by machine-readable JSON Schemas and a shared fixture suite that both implementations run.
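The produce-or-load idea from the list above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the companion layer's actual API: the names `params_key` and `produce_or_load`, the 16-character key, and the choice of JSON files as the cache format are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def params_key(params):
    """Stable short hash of a JSON-serializable parameter dict (hypothetical helper)."""
    # Sorting keys makes the hash independent of dict insertion order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def produce_or_load(cache_dir, params, produce):
    """Return the cached result for `params`, calling `produce` only on a cache miss."""
    path = Path(cache_dir) / f"{params_key(params)}.json"
    if path.exists():
        # Cache hit: the same parameters were already produced.
        return json.loads(path.read_text())
    result = produce(**params)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

Calling `produce_or_load` twice with the same parameters runs `produce` once. Changing any parameter changes the key, so the artifact is rebuilt and stored under a new file name.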
Get started
# datasets.toml
["jesstierney/lgmDA"]
uri = "https://github.com/jesstierney/lgmDA/archive/refs/tags/v2.1.zip"
sha256 = "da5f85235baf7f858f1b52ed73405f5d4ed28a8f6da92e16070f86b724d8bb25"
extract = true
- Quickstart: the manifest in one minute, declaring datasets.
- Language bindings: fetcher/loader references, per language.
- Storage: where fetched datasets and the produced cache live.
- Schema spec: the full normative reference.
Guide
- The manifest in one minute
- Declaring datasets
- Language bindings
- Resolution: the fetch and load ladders
- Storage
- Produced datasets and caching
- Maintenance (inspect)
- Cross-machine sync
- Conformance and versioning
- Migration and deprecations
Reference
- Schema specification: the normative SCHEMA.md.
- JSON Schemas: machine-readable validation.
- Examples: a full, runnable manifest.
- Conformance fixtures: the shared test suite.
From the same author
A few other open-source tools I maintain.
Scientific writing & data
- texmark — write scientific articles in Markdown and convert them to journal-ready LaTeX/PDF.
- papers — command-line BibTeX bibliography and PDF library manager.
- datamanifest — declarative, reproducible dataset management. (See also the DataManifest.jl Julia port.)
Speech to Text (dictate) and Text to Speech (read-aloud) tools