Quickstart

After installing, declare your first dataset and load it.

datamanifest init                  # create datamanifest.toml here
datamanifest add https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv --name co2
datamanifest list                  # what's tracked, and where it lives
datamanifest path co2              # resolve the on-disk path (for a script)
datamanifest storage               # where data goes on this host; `storage set` to change

The add above downloaded the Mauna Loa CO₂ record and wrote one entry to datamanifest.toml — a plain TOML file you can read and edit by hand:

[co2]
sha256 = "0058b3788040b5c27b2b5c1dd6d26226b7e4deef85e34c153e64806c37df7c75"
uri = "https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv"

Commit datamanifest.toml: it's the recipe (what to fetch and how). The downloaded data and a local .datamanifest-state.toml (which records where each file landed on this machine) stay git-ignored. A collaborator clones the repo and runs datamanifest download to materialize everything. Data lives under ./datasets/ and ./cached/ by default; to put it elsewhere, use datamanifest storage set (see the storage model).

The CLI / API split

Keep one distinction in mind:

  • the CLI manages the project's data — set it up, share it, maintain it;
  • the API consumes it — your analysis code resolves and loads what the manifest declares, and never edits it.

So you set things up once on the command line, then your scripts just ask for data by name.

Load it from your code

import datamanifest

df = datamanifest.load_dataset("co2")          # download on first use, then load
                                               # (pandas/xarray/… per format)
path = datamanifest.get_dataset_path("co2")    # just the on-disk path
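If you would rather parse the file yourself than rely on the built-in loader, the resolved path is all you need. This is a minimal sketch with hypothetical helper names (read_rows, rows_for). It assumes the file is a CSV that may start with "#" comment lines and that the resolver returns a string or path-like object. The resolver is passed in as an argument so the helper stays independent of the library.

```python
import csv


def read_rows(csv_path, comment_prefix="#"):
    """Read a CSV into a list of dicts, skipping lines that start with a comment prefix."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        data_lines = (line for line in fh if not line.startswith(comment_prefix))
        return list(csv.DictReader(data_lines))


def rows_for(name, resolve_path):
    """Resolve a dataset name to a path with `resolve_path`, then parse the file."""
    return read_rows(resolve_path(name))
```

In a project, the call would look like rows_for("co2", datamanifest.get_dataset_path); every value comes back as a string, so convert columns to numbers as needed.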

That's the whole loop: declare on the CLI, consume from code. From here: