Quick Start

A complete predict→reward cycle for a sleep improvement campaign.

Sleep Improvement

One-size-fits-all sleep advice ignores individual physiology. A 25-year-old male athlete and a 60-year-old sedentary woman respond differently to the same environmental change. BanditDB learns those differences automatically — routing each participant to the intervention most likely to work for their profile, improving with every reported outcome.

from banditdb import Client

db = Client("http://localhost:8080", api_key="your-secret-key")

# 1. Create the campaign once at startup
db.create_campaign(
    "sleep",
    arms=["decrease_temperature", "decrease_light", "decrease_noise"],
    feature_dim=5,
)

# 2. A participant is ready for tonight's intervention.
# Context: [sex, age/100, weight_kg/150, activity (0–1), bedtime_hour/24]
context = [
    1.0,   # female
    0.35,  # age 35
    0.50,  # 75 kg
    0.60,  # moderately active
    0.96,  # bedtime 23:00
]

# 3. Ask BanditDB which intervention to apply
arm, interaction_id = db.predict("sleep", context)
print(f"Tonight's intervention: {arm}")  # e.g., "decrease_temperature"

# 4. Apply the intervention, then reward the next morning
score_before = 62
score_after  = 79
reward = (score_after - score_before) / score_before  # → 0.27

db.reward(interaction_id, reward)

Rewards must be normalised to [0, 1]. Divide your business metric by its maximum possible value, or use a ratio like the example above.
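One way to satisfy that constraint is a small helper that computes the relative improvement and clamps it into range. A hypothetical sketch, not part of the SDK:

```python
def normalise_reward(score_before: float, score_after: float) -> float:
    """Relative improvement, clamped to BanditDB's required [0, 1] range."""
    if score_before <= 0:
        return 0.0
    ratio = (score_after - score_before) / score_before
    return max(0.0, min(1.0, ratio))
```

With the sleep example above, `normalise_reward(62, 79)` yields roughly 0.27; a regression (score dropped overnight) clamps to 0.0 rather than sending a negative reward.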

Native Agent Tool Use (MCP)

Give any Claude-based agent persistent decision memory via the Model Context Protocol.

Standard LLM agents are stateless — if they route a task to the wrong model and fail, they repeat the same mistake tomorrow. BanditDB's built-in MCP server gives the entire agent swarm shared persistent memory.

Add the server to your Claude Desktop config at ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "banditdb": {
      "command": "banditdb-mcp",
      "env": {
        "BANDITDB_URL": "http://localhost:8080",
        "BANDITDB_API_KEY": "your-secret-key"
      }
    }
  }
}

The agent now has five tools:

| Tool | What it does |
|---|---|
| create_campaign | Create a new decision campaign. Accepts algorithm and alpha. |
| list_campaigns | List all active campaigns with arm count and alpha. |
| campaign_diagnostics | Per-arm theta_norm, prediction count, reward rate. Use when a campaign isn't learning. |
| get_intuition | Returns the recommended action and an interaction_id to save. |
| record_outcome | Reports success (1.0) or failure (0.0) and updates the shared model. |

Every agent in a swarm shares the same BanditDB instance, so the learned model improves with every interaction across the entire fleet.

Choosing an Algorithm

Both algorithms share identical per-arm state. Switching is a single field at campaign creation.

| Algorithm | Value | How it explores | When to use |
|---|---|---|---|
| LinUCB | "linucb" (default) | Deterministic UCB bonus: θ·x + α·√(x·A⁻¹·x) | Predictable, tunable. Sweep alpha offline to calibrate exploration. |
| Linear Thompson Sampling | "thompson_sampling" | Samples θ̃ ~ N(θ, α²·A⁻¹), scores by θ̃·x | Natural Bayesian exploration. alpha=1.0 is the principled default. Concurrent users automatically diversify arm coverage. |

# LinUCB — tune alpha to control exploration duration
db.create_campaign("routing", ["fast", "cheap"], feature_dim=4, alpha=1.5)

# Thompson Sampling — natural Bayesian exploration, alpha=1.0 is ideal
db.create_campaign("routing_ts", ["fast", "cheap"], feature_dim=4,
                   algorithm="thompson_sampling")

The algorithm field is stored in both the WAL and checkpoint files. Old WAL records and checkpoints without an algorithm field recover as "linucb" automatically.
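The two exploration rules in the table reduce to a few lines of linear algebra. A minimal numpy sketch, using toy values rather than BanditDB's learned state (the server's actual implementation is not shown here):

```python
import numpy as np

rng = np.random.default_rng(42)

def linucb_score(theta, A_inv, x, alpha):
    # Exploit term theta.x plus a deterministic, alpha-scaled uncertainty bonus
    return theta @ x + alpha * np.sqrt(x @ A_inv @ x)

def thompson_score(theta, A_inv, x, alpha=1.0):
    # Sample a plausible parameter vector from the posterior, score greedily with it
    theta_tilde = rng.multivariate_normal(theta, alpha**2 * A_inv)
    return theta_tilde @ x

theta = np.array([0.4, 0.1])
A_inv = np.eye(2)           # fresh campaign: A = I, so A_inv = I
x = np.array([1.0, 0.35])   # context vector
```

LinUCB's score is identical on every call with the same inputs; Thompson Sampling's varies draw to draw, which is why concurrent callers naturally spread across arms.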

The Data Science Escape Hatch

Every interaction is event-sourced. Export to Parquet and evaluate policies offline.

POST /checkpoint compiles completed prediction→reward pairs into Snappy-compressed Apache Parquet files — one file per campaign — for offline analysis in Pandas or Polars.

Every prediction will eventually appear in the Parquet file even if its reward arrives hours later. BanditDB re-emits in-flight interactions at each checkpoint so delayed rewards are always captured in a future cycle.

Each row includes a propensity column — the softmax-normalised probability that the logging policy selected the chosen arm given the context (LinUCB only; null for Thompson Sampling). This is the P(a | x) term required by Inverse Propensity Scoring estimators.
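The propensity column plugs directly into that formula. A pure-Python sketch of clipped IPS for intuition only; the SDK's `ips` in `banditdb.eval` is the supported implementation:

```python
def ips_estimate(rewards, propensities, target_probs, clip=10.0):
    """Clipped IPS: reweight each logged reward by how much more (or less)
    often the target policy would pick the logged arm than the logging
    policy did, i.e. P_target(a|x) / P_logging(a|x)."""
    total = 0.0
    for r, p_log, p_tgt in zip(rewards, propensities, target_probs):
        weight = min(p_tgt / p_log, clip)  # clipping tames high-variance weights
        total += weight * r
    return total / len(rewards)
```

Rows where the target policy would never choose the logged arm get weight 0 and contribute nothing; rows logged under a tiny propensity get large (but clipped) weights.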

import polars as pl
import requests

HEADERS = {"X-Api-Key": "your-secret-key"}

# Snapshot models, export Parquet, rotate the WAL
requests.post("http://localhost:8080/checkpoint", headers=HEADERS)

# Flat schema: interaction_id | arm_id | reward | predicted_at | rewarded_at | propensity | feature_0 …
df = pl.read_parquet("/data/exports/sleep.parquet")
print(df.head())

Offline Policy Evaluation

The Python SDK ships three OPE estimators in banditdb.eval. Install with:

pip install "banditdb-python[eval]"

| Estimator | Function | When to use |
|---|---|---|
| Replay | replay(df) | Sanity check baseline. Unbiased but low coverage (~1/K interactions used). |
| IPS / SNIPS | ips(df, clip=10.0) | Primary estimator. Uses every interaction with importance weights. |
| Doubly Robust | doubly_robust(df, clip=10.0) | Best statistical efficiency. Use when comparing multiple policies or sweeping alpha. |

from banditdb.eval import replay, ips, doubly_robust

df = pl.read_parquet("/data/exports/sleep.parquet")

print(replay(df))
# OPEResult(method='replay', estimate=0.4821, std_error=0.0312, coverage=22.1% [33/149])

print(ips(df))
# OPEResult(method='ips', estimate=0.5103, std_error=0.0187, coverage=100.0% [149/149])

print(doubly_robust(df))
# OPEResult(method='doubly_robust', estimate=0.5219, std_error=0.0141, coverage=100.0% [149/149])

# Compare against the observed reward of the logging policy:
print("Observed:", df["reward"].mean())
# If estimate >> observed, the current policy beats the logging policy's
# average reward: the campaign has learned something real.

Inspecting the WAL

The WAL is plain JSONL — every event is human-readable on disk.

# All campaigns ever created
grep "CampaignCreated" /data/bandit_wal.jsonl | jq '.CampaignCreated.campaign_id'

# Campaigns that have been deleted
grep "CampaignDeleted" /data/bandit_wal.jsonl | jq '.CampaignDeleted.campaign_id'

How Recovery Works

BanditDB survives crashes and restarts automatically. No manual intervention required.

The Two Files

| File | Purpose |
|---|---|
| checkpoint.json | Snapshot of all campaign matrices (A⁻¹, b, θ, counts) at a specific WAL byte offset. |
| bandit_wal.jsonl | Append-only event log: CampaignCreated, Predicted, Rewarded, CampaignDeleted. |

Phase 1 — Load the Checkpoint

If checkpoint.json exists, BanditDB reads it and restores all campaign matrices directly into memory — no replaying, just deserialisation. The checkpoint records the WAL byte offset at which it was taken.

If no checkpoint exists, BanditDB starts from an empty state and replays the entire WAL from byte 0.

Phase 2 — Replay the WAL Tail

BanditDB opens bandit_wal.jsonl, seeks to the checkpoint's byte offset, and replays every event written after that point. One edge case: after WAL rotation the stored offset may exceed the current file size. BanditDB detects this and seeks to byte 0 instead.

checkpoint.json found?
├── YES → restore all matrices from snapshot
│         → open WAL, seek to checkpoint.wal_offset
│         → if offset > file size (post-rotation): seek to 0
│         → replay events from that position
└── NO  → open WAL, replay from byte 0

Data Loss Window

Everything in the WAL is durable. The WAL writer flushes after every write burst and fsyncs before acknowledging a checkpoint. A crash between checkpoints is fully recovered by replaying the WAL tail.

The only data at risk is in-flight predictions — interactions predicted but not yet rewarded at the moment of a crash. These live in the Moka TTL cache in memory. After a crash those interaction IDs are lost and any reward sent for them will return 404.

Mitigate this by checkpointing frequently. BanditDB re-emits in-flight predictions into the WAL tail at each checkpoint, so rewards arriving before the next crash are captured.

What POST /checkpoint Does

  1. Flush barrier — drains all pending events and fsyncs to disk, responds with confirmed byte offset.
  2. Snapshot — serialises all campaign matrices to checkpoint.tmp, atomically renames to checkpoint.json.
  3. Parquet export — joins Predicted + Rewarded events, appends matched pairs to per-campaign Parquet files. Unmatched predictions are re-emitted into the WAL tail.
  4. WAL rotation — truncates WAL to only the tail. Pre-checkpoint history is no longer needed for recovery.

Parquet files are analytics exports only — not used for recovery. Losing them does not affect model state. Recovery uses only checkpoint.json + bandit_wal.jsonl.

Recommended Production Setup

# Auto-checkpoint every 10,000 rewards
BANDITDB_CHECKPOINT_INTERVAL=10000

# Or cap WAL size (useful on edge deployments)
BANDITDB_MAX_WAL_SIZE_MB=50

# Back up the two recovery files on a schedule
cp /data/checkpoint.json  /backup/checkpoint-$(date +%s).json
cp /data/bandit_wal.jsonl /backup/wal-$(date +%s).jsonl

To move BanditDB to a new host: copy checkpoint.json and bandit_wal.jsonl to the same DATA_DIR on the new machine and start. Recovery is automatic.