Quickstart
After installing, declare your first dataset and load it.
datamanifest init # create datamanifest.toml here
datamanifest add https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv --name co2
datamanifest list # what's tracked, and where it lives
datamanifest path co2 # resolve the on-disk path (for a script)
datamanifest storage # where data goes on this host; `storage set` to change
The `add` command above downloaded the Mauna Loa CO₂ record and wrote one entry to
datamanifest.toml, a plain TOML file you can read and edit by hand:
[co2]
sha256 = "0058b3788040b5c27b2b5c1dd6d26226b7e4deef85e34c153e64806c37df7c75"
uri = "https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv"
Commit datamanifest.toml: it is the recipe (what to fetch and how). The
downloaded data and a local .datamanifest-state.toml (which records where
each file landed on this machine) stay git-ignored. A collaborator clones the
repo and runs `datamanifest download` to materialize everything. Data lives
under ./datasets/ and ./cached/ by default; see the storage model to point it
elsewhere.
The CLI / API split
The CLI and the API have separate jobs:
- the CLI manages the project's data — set it up, share it, maintain it;
- the API consumes it — your analysis code resolves and loads what the manifest declares, and never edits it.
So you set things up once on the command line, then your scripts just ask for data by name.
Load it from your code
import datamanifest
df = datamanifest.load_dataset("co2") # download on first use, then load
# (pandas/xarray/… per format)
path = datamanifest.get_dataset_path("co2") # just the on-disk path
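When only the path is needed, plain Python is enough to read the file. The sketch below is one possible approach and not part of datamanifest: read_annual_means is a hypothetical helper, and it assumes the CSV begins with `#` comment lines followed by a header row that includes `year` and `mean` columns. Check the downloaded file before relying on that layout.

```python
import csv


def read_annual_means(path):
    """Return [(year, mean), ...] from a CSV that may start with '#' comments.

    Assumption: the header row has columns named 'year' and 'mean'.
    """
    with open(path, newline="") as fh:
        # Drop comment lines so DictReader sees the header row first.
        data_lines = (line for line in fh if not line.startswith("#"))
        reader = csv.DictReader(data_lines)
        return [(int(row["year"]), float(row["mean"])) for row in reader]


# Usage with the path resolved in the quickstart:
#   rows = read_annual_means(datamanifest.get_dataset_path("co2"))
```

Keeping the parser separate from the path lookup means the same function works on any local copy of the file.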
That's the whole loop: declare on the CLI, consume from code. From here:
- Using it from your code: load_dataset, the @cached decorator, and the file-less Database.
- CLI reference: every command and flag.
- Storage model: where data lives and how to centralize it.
- Adding datasets / importing: Zenodo DOIs, object stores, and other tools' catalogs.