datamanifest[py]
Keep track of the datasets used in a scientific project.
- A transparent, trackable manifest. Every dataset a project depends on (URLs, DOIs, checksums, formats) is listed in a single `datamanifest.toml` you can read at a glance and version with git. The format is language-agnostic (today Python and Julia) and can be edited by hand, from code, or through the CLI.
- Fetch from a wide range of sources. Direct URLs, Zenodo/figshare DOIs, git repos, object stores (`s3://`, `gs://`, …), and bulk imports from pooch, intake or DVC: all checksum-verified, extracted, and adopted in place when already on disk.
- Cache your own computed data too. The same tooling backs a robust `@cached` mechanism that stores your own results under a PID lock, keyed by their inputs, to speed up calculations locally. It is a separate, local concern, not a remote source, but it shares some of the same benefits, such as data management via the CLI.
- A powerful CLI for data download, local management and synchronization across machines. Add and download datasets, inspect and repair what's on disk, move or centralize where data is stored, and push/pull datasets and cached results between machines over rsync+ssh, all without touching your analysis code.
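To make the idea of results "keyed by their inputs" and guarded by a PID lock concrete, here is a minimal standard-library sketch. It is not datamanifest's `@cached` implementation: the decorator name `cached_sketch`, the cache directory, the pickle-based key, and the lock-file layout are all assumptions made for illustration.

```python
import functools
import hashlib
import os
import pickle
import tempfile
from pathlib import Path

# Hypothetical cache location, used by this sketch only.
CACHE_DIR = Path(tempfile.gettempdir()) / "cached_sketch"


def cached_sketch(func):
    """Store each result on disk under a key derived from the call's inputs."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        # Key = hash of the function name plus its pickled arguments.
        payload = pickle.dumps((func.__qualname__, args, sorted(kwargs.items())))
        key = hashlib.sha256(payload).hexdigest()
        result_path = CACHE_DIR / f"{key}.pkl"
        lock_path = CACHE_DIR / f"{key}.lock"

        if result_path.exists():
            return pickle.loads(result_path.read_bytes())

        # O_EXCL makes creation atomic: a second process computing the same
        # key gets FileExistsError instead of racing (this sketch does not wait).
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        try:
            os.write(fd, str(os.getpid()).encode())  # record the owner's PID
            result = func(*args, **kwargs)
            tmp_path = result_path.with_suffix(".tmp")
            tmp_path.write_bytes(pickle.dumps(result))
            os.replace(tmp_path, result_path)  # publish the result atomically
        finally:
            os.close(fd)
            os.unlink(lock_path)
        return result

    return wrapper


@cached_sketch
def slow_square(x):
    return x * x


print(slow_square(12))  # 144, computed on the first run and read from disk afterwards
```

Writing the owner's PID into the lock file is what would let a tool detect and clear a stale lock left by a crashed process. That recovery step is omitted here.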
Get started
pip install datamanifestpy
datamanifest init
datamanifest add https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv --name co2
- Installation — the package and its optional loader backends.
- Quickstart — declare your first dataset and load it.
- Using it from your code: `load_dataset`, the `@cached` decorator, the file-less `Database`.
- CLI reference: every command and flag.
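Downloads are described above as checksum-verified. The sketch below shows what such a check can look like with only the standard library. It does not use datamanifest's API. The helper names `sha256_of` and `verify` are hypothetical, and SHA-256 is an assumption, since the page does not say which hash the manifest records.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_sha256):
    """Return True when the file on disk matches the recorded checksum."""
    return sha256_of(path) == expected_sha256.lower()


# Demonstration on a throwaway file rather than a real download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"year,mean\n")
    recorded = hashlib.sha256(b"year,mean\n").hexdigest()
    ok_before = verify(sample, recorded)

    # Simulate corruption: the content changes, the recorded checksum does not.
    sample.write_bytes(b"year,mean\ncorrupted")
    ok_after = verify(sample, recorded)

print(ok_before, ok_after)  # True False
```

A mismatch like the second case is the kind of situation a repair command would need to detect before re-downloading.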
Guide
- Use cases — the CLI workflows end to end: add, repair, store, sync.
- Storage model — where data lives on disk and how to centralize it.
- Adding datasets — direct URLs, Zenodo DOIs, object stores, Git LFS.
- Importing from other tools — pooch, intake, DVC, CSV/URL lists.
- Language bindings — one manifest across Python and Julia.
- Related projects — the DataManifest family, and Python alternatives.
From the same author
A few other open-source tools I maintain.
Scientific writing & data
- texmark — write scientific articles in Markdown and convert them to journal-ready LaTeX/PDF.
- papers — command-line BibTeX bibliography and PDF library manager.
Speech to Text (dictate) and Text to Speech (read-aloud) tools
Development
- Conformance — the shared manifest format and what this implementation supports.
- Roadmap — parked ideas and deferred decisions.