Using it from your code¶
Where the CLI manages a project's data, the Python API consumes it: your analysis code resolves and loads what the manifest declares, and never edits it.
import datamanifest
df = datamanifest.load_dataset("co2") # download on first use, then load
# (pandas/xarray/… per format)
path = datamanifest.get_dataset_path("co2") # just the on-disk path
load_dataset downloads on first use, verifies the checksum, then returns the
loaded object using the backend for the dataset's format (install the matching
extra). get_dataset_path stops at the on-disk path, for
when you want to open the file yourself.
Caching computed results¶
Cache an expensive computation, keyed by its keyword arguments:
from datamanifest.cache import cached
@cached
def load_anomaly(*, grid="5x5"):
... # expensive; returns e.g. an xarray.Dataset
return ds
ds = load_anomaly(grid="5x5") # first call: computes and stores
ds = load_anomaly(grid="5x5") # later calls: loads and returns
ds = load_anomaly(grid="5x5", cached=False) # force recompute
Each distinct keyword combination is stored separately. The result is saved
with pickle by default; pass format="nc"/"csv"/… to pick a serialization,
and version="v2" to invalidate when the function's logic changes.
datamanifest list shows cached results grouped by function with their
parameters; datamanifest list --orphan --delete cleans up.
The @cached cache shares the same storage and bookkeeping as fetched data — it
lands under datacache_dir (./cached/ by default) and shows up in list
alongside your datasets. The design notes cover how an
artifact's identity (cachetype, version, parameter hash) is derived.
Finding the manifest, and targeting a specific database¶
The module-level functions find the project's manifest automatically (walking
up from the working directory; DATAMANIFEST_TOML overrides). To use a specific
database instead, either call the function with db=
(datamanifest.download_dataset("co2", db=mydb)) or use the database's own
methods (mydb.download_dataset("co2")). Every datamanifest.X(...) is just
resolve_db(db).X(...) — the method on db, or on the default database when
db is None. See the docstrings (help(datamanifest)) and the
design notes.
A file-less database (no manifest)¶
For library code that wants checksummed downloads into a folder it controls —
an OS-appropriate data dir, say — a file-less database skips the manifest
entirely: no datamanifest.toml, no state file, nothing written but the data.
The folder accepts the same $-symbols as the storage model, and
the database's methods do everything the module-level functions do: