Using it from your code

Where the CLI manages a project's data, the in-code API consumes it: your analysis code resolves and loads what the manifest declares, and never edits it. This page is the narrative guide; the complete list of functions and classes is in the Python API reference. The Julia tabs show the equivalent calls in DataManifest.jl, which reads the same manifest.

Python:

import datamanifest

db = datamanifest.Database("datamanifest.toml")

df = db.load_dataset("co2")          # download on first use, then load
                                     # (pandas/xarray/… per format)
path = db.get_dataset_path("co2")    # just the on-disk path

Julia:

using DataManifest

db = read_dataset("datamanifest.toml")

df = load_dataset(db, "co2")         # download on first use, then load
path = get_dataset_path(db, "co2")   # just the on-disk path

load_dataset downloads on first use, verifies the checksum, then returns the loaded object using the backend for the dataset's format (install the matching extra). get_dataset_path stops at the on-disk path, for when you want to open the file yourself.
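The verification step can be pictured as hashing the downloaded file and comparing the digest with the recorded one. The sketch below is only an illustration under stated assumptions: it assumes a SHA-256 checksum (this page does not name the algorithm), and `file_sha256` and `verify_checksum` are hypothetical helper names, not part of the datamanifest API.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large datasets never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Return the path if the digest matches; raise ValueError otherwise."""
    actual = file_sha256(path)
    if actual != expected_hex.lower():
        raise ValueError(f"checksum mismatch for {Path(path).name}: got {actual}")
    return Path(path)
```

Failing loudly on a mismatch, rather than returning a flag, keeps a corrupted or changed download from reaching the loader unnoticed.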

Caching computed results

Cache an expensive computation, keyed by its keyword arguments:

Python:

from datamanifest.cache import cached

@cached
def load_anomaly(*, grid="5x5"):
    ...        # expensive; returns e.g. an xarray.Dataset
    return ds

ds = load_anomaly(grid="5x5")                # first call: computes and stores
ds = load_anomaly(grid="5x5")                # later calls: loads and returns
ds = load_anomaly(grid="5x5", cached=False)  # force recompute

Julia:

using DataManifest

@cached key=(a -> (; a.grid,)) function load_anomaly(; grid::String = "5x5")
    # … expensive computation …
    return ds
end

ds = load_anomaly(grid="5x5")                # first call: computes and stores
ds = load_anomaly(grid="5x5")                # later calls: loads and returns
ds = load_anomaly(grid="5x5", cached=false)  # run the body, no disk I/O

Julia's @cached takes the cache key explicitly (key= maps the keyword arguments to the parameters that identify the result) and saves with the stdlib Serialization (jls) by default — see the Julia caching page.

Each distinct keyword combination is stored separately. The cache key is shared across languages: it is the SHA-256 of the canonical JSON (RFC 8785) of the keyword arguments, with Python's json.dumps float form as the reference. The Julia tool computes the identical key, so caches produced in one language are read by the other.

In Python, the result is saved with pickle by default; pass format="nc"/"csv"/… to pick a serialization, and version="v2" to invalidate the cache when the function's logic changes. On the command line, datamanifest list shows cached results grouped by function with their parameters, and datamanifest list --orphan --delete cleans up orphaned ones.
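The key derivation described above can be approximated with the standard library. This is a rough sketch, not the library's implementation: `cache_key` is a hypothetical name, and sorted keys with compact separators only stand in for RFC 8785 in simple cases (ASCII keys with string, integer, or boolean values). It is not guaranteed to reproduce the keys datamanifest writes.

```python
import hashlib
import json


def cache_key(**kwargs):
    """Hash keyword arguments into a hex digest that ignores argument order.

    Assumption: sorted keys plus compact separators approximate RFC 8785
    canonical JSON. The full scheme also fixes number formatting and
    orders keys by UTF-16 code units, which this sketch does not handle.
    """
    canonical = json.dumps(
        kwargs, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key_a = cache_key(grid="5x5", smooth=True)
key_b = cache_key(smooth=True, grid="5x5")   # same arguments, different order
key_c = cache_key(grid="1x1", smooth=True)   # one value differs
```

Here `key_a` and `key_b` are equal while `key_c` differs, which is the property that lets each keyword combination map to its own stored result.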

The @cached cache shares the same storage and bookkeeping as fetched data — it lands under datacache_dir (default: $user_cache_dir/datamanifest/projects/$project/cached) and shows up in list alongside your datasets. The design notes cover how an artifact's identity (cachetype, version, parameter hash) is derived.
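To make the default location concrete, the template can be expanded by hand. The snippet below is a sketch that uses `string.Template` as a stand-in for the library's own $-symbol resolution; `expand_datacache_dir` is a hypothetical helper, and the cache directory and project name are made-up example values.

```python
from string import Template

# Default template quoted from this page.
DEFAULT_DATACACHE_DIR = "$user_cache_dir/datamanifest/projects/$project/cached"


def expand_datacache_dir(user_cache_dir, project, template=DEFAULT_DATACACHE_DIR):
    """Substitute the two $-symbols in the datacache_dir template."""
    return Template(template).substitute(
        user_cache_dir=user_cache_dir, project=project
    )


# Assumed example values: a Linux-style cache dir and a project named "climate".
resolved = expand_datacache_dir("/home/alice/.cache", "climate")
```

With these inputs, `resolved` is `/home/alice/.cache/datamanifest/projects/climate/cached`; the real `$user_cache_dir` is platform-dependent.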

Library cache bundles (database-scoped caching)

A library that ships @cached functions should not write into whatever project happens to call it. Binding the cache to a Database gives the library its own cache bundle — its own cache folder, name, and bookkeeping — without touching the host project's folders or state:

Python:

# mylib/_data.py
from datamanifest import Database

_DB = Database(datasets_folder="$user_data_dir/mylib",  # fetched bytes
               storage_config={"project": "mylib"},     # names the cache bundle
               persist=False)

@_DB.cached(key=["grid"])
def landmask(*, grid):
    ...

Julia:

# MyLib.jl
using DataManifest

const LIBDB = Database(datasets_folder=raw"$user_data_dir/mylib",
                       storage_config=Dict("project" => "mylib"),
                       persist=false)

@cached key=(a -> (; a.grid)) db=LIBDB function landmask(; grid::String)
    ...
end

In Julia, the db= option takes any expression that evaluates to a Database. The expression is evaluated at call time, so the database may be defined after the @cached function.

storage_config supplies the [_STORAGE]-shaped configuration a manifest would normally carry. Here project = "mylib" names the bundle, so produced artifacts land under …/projects/mylib/cached. The whole cache context (datacache_dir, $project, lock_stale_age, the state file) comes from the database's frozen configuration instead of the working directory.

An in-memory database (persist=False) never creates a .datamanifest/ outside its own storage roots: the fetched-dataset inventory lives under the datasets folder and the produced-artifact inventory under the resolved datacache_dir (<root>/.datamanifest/state.toml in each), so the caller's project and working directory stay clean.

The bare forms (Python's module-level cached, Julia's @cached without db=) resolve over the default database when a manifest is discoverable. That database anchors at the same project as before, so behavior in a normal project is unchanged. When no manifest is discoverable, they fall back to the ambient derivation, so caching keeps working in projects without a manifest.

Collision and identity checks are per database (one project's inventory). Two databases share artifacts exactly when they resolve the same datacache_dir; the tools make no cross-project claims about caches that happen to share a directory.

The Database object, and the module-level shortcuts

The recommended style is to load the database once and call its methods:

Python:

import datamanifest

db = datamanifest.Database("datamanifest.toml")

df = db.load_dataset("co2")
path = db.get_dataset_path("co2")

Julia:

using DataManifest

db = read_dataset("datamanifest.toml")

df = load_dataset(db, "co2")
path = get_dataset_path(db, "co2")

This is explicit about which project's manifest the code uses, lets several databases coexist in one program, and pins the configuration: a Database takes its configuration snapshot — config files, environment, host — once, when it is created.
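The "snapshot once" behavior can be illustrated with a small frozen object. This is a toy model, not the library's class: `ConfigSnapshot` and its `capture` method are hypothetical, and only the DATAMANIFEST_TOML variable is captured to keep the example short.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class ConfigSnapshot:
    """Hypothetical stand-in for the configuration a database freezes."""

    env: MappingProxyType

    @classmethod
    def capture(cls, environ, names=("DATAMANIFEST_TOML",)):
        # Copy the values now; later changes to `environ` are not seen.
        return cls(MappingProxyType({n: environ.get(n) for n in names}))


environ = {"DATAMANIFEST_TOML": "/tmp/a/datamanifest.toml"}
snapshot = ConfigSnapshot.capture(environ)
environ["DATAMANIFEST_TOML"] = "/tmp/b/datamanifest.toml"  # changed afterwards
```

After the change, `snapshot.env["DATAMANIFEST_TOML"]` still holds the first path, which mirrors why a long-lived database is unaffected by later environment edits.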

The module-level functions are shortcuts over a default database. On first use they locate the project's manifest by walking up from the working directory, looking for the canonical datamanifest.toml or one of the alternate names (DataManifest.toml, datasets.toml, Datasets.toml); the DATAMANIFEST_TOML environment variable overrides this search. They then build the default Database from the manifest and keep it for the rest of the process, so the manifest is read once, not on every call. A no-argument Database() runs the same discovery, so you can hold an explicit db without naming the file. Every datamanifest.X(...) is the method X on that default database, or on the database you pass explicitly. This includes add, which registers and downloads either way:

Python:

datamanifest.download_dataset("co2")            # the auto-discovered default
datamanifest.download_dataset("co2", db=mydb)   # a specific database

Julia:

download_dataset("co2")          # the active project's manifest
download_dataset(mydb, "co2")    # a specific database

In Julia the default manifest comes from the active project (julia --project / Pkg.activate): the first of the recognized manifest names found next to the active Project.toml. The DATAMANIFEST_TOML (or DATASETS_TOML) environment variable points at a manifest explicitly and takes precedence over the project discovery.
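For comparison, the Python-side discovery (walking up from the working directory) can be sketched with pathlib. This is an illustration under assumptions, not the library's code: `find_manifest` is a hypothetical name, the environment override is assumed to win outright, and the order in which names are tried within one directory is a guess.

```python
import os
from pathlib import Path

# Canonical name first, then the alternates listed on this page.
MANIFEST_NAMES = ("datamanifest.toml", "DataManifest.toml",
                  "datasets.toml", "Datasets.toml")


def find_manifest(start=None, environ=None):
    """Return the manifest path, or None if nothing is found.

    Assumptions: DATAMANIFEST_TOML short-circuits the search, and within
    one directory the names are tried in the order of MANIFEST_NAMES.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("DATAMANIFEST_TOML")
    if override:
        return Path(override)

    here = Path(start or Path.cwd()).resolve()
    # Check the starting folder first, then each parent up to the root.
    for folder in (here, *here.parents):
        for name in MANIFEST_NAMES:
            candidate = folder / name
            if candidate.is_file():
                return candidate
    return None
```

Calling `find_manifest()` from a nested source folder returns the nearest manifest above it, which is why scripts can run from any subdirectory of a project.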

The rest of the surface — registering and deleting datasets, downloading in bulk, validating loader bindings — is in the Python API reference; the design notes cover the rationale.

A file-less database (no manifest)

For library code that wants checksummed downloads into a folder it controls — an OS-appropriate data dir, say — a file-less database skips the manifest entirely: no datamanifest.toml, and nothing written outside the folders the database owns (its inventory lives under the data folder itself). The folder accepts the same $-symbols as the storage model, and the database's methods do everything the module-level functions do:

Python:

from datamanifest import Database

db = Database(datasets_folder="$user_data_dir/mylib", persist=False)
db.add("https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv", name="co2")
path = db.get_dataset_path("co2")   # → ~/.local/share/mylib/gml.noaa.gov/…/co2_annmean_mlo.csv

Julia:

using DataManifest

db = Database(datasets_folder=raw"$user_data_dir/mylib", persist=false)   # raw"": keep Julia
                                                                          # from interpolating $
add(db, "https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv"; name="co2")
path = get_dataset_path(db, "co2")  # → ~/.local/share/mylib/gml.noaa.gov/…/co2_annmean_mlo.csv

To give such a library its own @cached bundle as well, see library cache bundles.
