
datamanifest.toml

datamanifest[py]

Keep track of the datasets used in a scientific project.

  • A transparent, trackable manifest. Every dataset a project depends on, with its URLs, DOIs, checksums and formats, is listed in a single datamanifest.toml that you can read at a glance and version with git. The format is language-agnostic, with Python and Julia implementations today, and can be edited by hand, from code, or through the CLI.
  • Fetch from a wide range of sources. Direct URLs, Zenodo/figshare DOIs, git repos, object stores (s3://, gs://, …), and bulk imports from pooch, intake or DVC. Every download is checksum-verified and extracted, and files already on disk are adopted in place.
  • Cache your own computed data too. The same tooling backs a robust @cached mechanism that stores your own results under a PID lock, keyed by their inputs, to speed up calculations locally. It is a separate, local concern, not a remote source, but it shares some of the same benefits, such as data management via the CLI.
  • A powerful CLI for data download, local management and synchronization across machines. Add and download datasets, inspect and repair what's on disk, move or centralize where data is stored, and push/pull datasets and cached results between machines over rsync+ssh — all without touching your analysis code.
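The idea of caching results "keyed by their inputs" can be illustrated with a minimal standard-library sketch. This is not the datamanifest @cached implementation: the name disk_cached and its cache_dir parameter are hypothetical, the key is assumed to be a SHA-256 of the pickled arguments, and the PID lock is left out (an atomic rename stands in for it here).

```python
import functools
import hashlib
import os
import pickle
import tempfile
from pathlib import Path


def disk_cached(cache_dir):
    """Hypothetical decorator: store a function's result on disk, keyed by its inputs."""
    cache_dir = Path(cache_dir)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Derive the cache key from the function name and its arguments.
            payload = pickle.dumps((func.__qualname__, args, sorted(kwargs.items())))
            key = hashlib.sha256(payload).hexdigest()
            target = cache_dir / f"{key}.pkl"

            # Cache hit: load the stored result instead of recomputing.
            if target.exists():
                with target.open("rb") as fh:
                    return pickle.load(fh)

            # Cache miss: compute, write to a temporary file, then rename
            # atomically so a reader never sees a partially written file.
            result = func(*args, **kwargs)
            cache_dir.mkdir(parents=True, exist_ok=True)
            fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
            with os.fdopen(fd, "wb") as fh:
                pickle.dump(result, fh)
            os.replace(tmp_path, target)
            return result

        return wrapper

    return decorator
```

Decorating a slow function with @disk_cached("some/dir") would then run it once per distinct set of arguments and read later calls from disk. A real implementation also needs locking between concurrent processes, which the PID lock mentioned above presumably provides.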

Get started

pip install datamanifestpy
datamanifest init
datamanifest add https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv --name co2

import datamanifest
df = datamanifest.load_dataset("co2")   # download on first use, then load
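
For readers curious what "download on first use" combined with checksum verification can look like, here is a minimal sketch using only the standard library. It is an illustration, not datamanifest's code: fetch_once, sha256_of, and the injected download callable are all hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_once(target, download, expected_sha256):
    """Hypothetical helper: call download(target) only if the file is missing,
    then verify its checksum. `download` is any callable that writes the file."""
    target = Path(target)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        download(target)

    # Verify on every call, so a file that was already on disk is checked too.
    actual = sha256_of(target)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {target}: got {actual}")
    return target
```

Passing the downloader in as a callable keeps the sketch independent of any particular transport (HTTP, object store, git) and makes it easy to exercise without network access.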

Guide

From the same author

A few other open-source tools I maintain.

Scientific writing & data

  • texmark — write scientific articles in Markdown and convert them to journal-ready LaTeX/PDF.
  • papers — command-line BibTeX bibliography and PDF library manager.

Speech-to-text (dictation) and text-to-speech (read-aloud) tools

  • scribe — speech-to-text dictation.
  • bard — text-to-speech reader.

Development

  • Conformance — the shared manifest format and what this implementation supports.
  • Roadmap — parked ideas and deferred decisions.