Docs — llmci

Installation

Install llmci from PyPI. Requires Python 3.10 or later. The CLI command is llmci.

pip install llmci

For agent evals with the OpenAI Agents SDK adapter:

pip install 'llmci[agents]'

For development, install from source:

git clone https://github.com/llmci-cli/llmci.git
cd llmci
pip install -e ".[dev]"

Verify your installation:

llmci --version

Privacy: llmci is a CLI you run in your own CI — there is no hosted llmci SaaS and eval data stays in your repo/runner by default. If you configure direct API targets or LLM judges, prompts and outputs are sent to the providers you choose (OpenAI, Anthropic, etc.) using your API keys. Deterministic judges (exact match, RAG retrieval, PII scan, structured JSON Schema) do not call external APIs.

Quickstart

Get up and running in under 5 minutes.

1. Try a deterministic example

Start with the ticket-classifier example. It does not call an LLM provider, so it works without credentials:

git clone https://github.com/llmci-cli/llmci.git
cd llmci/examples/01-ci-regression
llmci run

You should see a passing eval report. This is the smallest llmci loop: a JSONL dataset, a command target, an exact_match judge, and thresholded metrics.

2. Initialize your project

Run llmci init to generate a config and starter dataset interactively:

llmci init

# Prompts you for:
#   Target mode:  command / direct
#   Task type:    classification / open_ended / agent
#   Eval name:    my-eval

This creates llmci.yaml and evals/my-eval.jsonl with starter examples.

If you are not sure what to pick, start with command, classification, and the default eval name. That path is deterministic and does not require an API key.

3. Add your eval data

Edit the generated JSONL file. Each line is one test case:

{"input": "My printer won't connect to wifi", "expected": "hardware"}
{"input": "I need a refund for order #882", "expected": "billing"}
{"input": "How do I reset my password?", "expected": "account"}

Or add examples interactively:

llmci dataset add --name my-eval

Schema rules: required vs optional fields, command I/O, judge config, and eval level values are defined in one place — the Contracts Reference.

4. Connect your target

For command mode, create the adapter script referenced by llmci.yaml. It reads the JSON file passed as --input and writes a JSON object with an output key to --output:

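import argparse
import json

def classify(text: str) -> str:
    # Placeholder: swap in your own classifier or pipeline call.
    return "billing" if "refund" in text.lower() else "account"

parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--output", required=True)
args = parser.parse_args()

# llmci writes one JSON object per example to the --input path.
with open(args.input) as f:
    row = json.load(f)
actual = classify(row["input"])
# llmci reads the result from the "output" key of the --output file.
with open(args.output, "w") as f:
    json.dump({"output": actual}, f)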

For direct API mode, set the provider credential your model needs, for example OPENAI_API_KEY.

5. Run evals

llmci run

You'll see a report like this:

## llmci Eval Report

| Eval   | Metric   | Score | Threshold | Status |
|--------|----------|-------|-----------|--------|
| my-eval| accuracy | 0.950 | ≥ 0.9     | ✅     |

Exit code 0 means all thresholds passed. Exit code 1 means at least one threshold failed, that is, a regression was detected, so the command can gate CI directly.

Which path should I start with?

Goal	Start here	API key?
Try llmci locally	`examples/01-ci-regression`	No
Test a classifier or deterministic pipeline	`llmci init` with `command` + `classification`	No
Test an LLM prompt directly	`llmci init` with `direct` + `open_ended`	Yes
Gate a RAG pipeline	`examples/12-rag-retrieval`	No for retrieval metrics
Add a full CI gate	`examples/17-integrated-ci-gate`	No
Evaluate an agent	`examples/05-agent-single-turn`	No for constraint checks
Migrate models or providers	`examples/19-cross-provider-migration`	Usually yes

Recommended path

Classification / deterministic — examples/01-ci-regression: command target + exact_match + accuracy.
CI baselines — run llmci run --update-baseline on main, then --compare-to=origin/main on PRs (Baselines & CI).
Pipeline / RAG — examples/12-rag-retrieval: command target returns contexts / retrieved_ids; dataset rows include relevant_ids.
Integrated gate — examples/17-integrated-ci-gate: quality + cost regression + safety in one config.
Agents, migration, Promptfoo — optional workflows once the core gate is green (Agents, Migration, Promptfoo).

Core Concepts

The mental model behind llmci.

Eval = Unit Test for LLMs

An eval is like a test suite. It has a dataset of input/expected pairs, a target (the thing you're testing), a judge (how to score), and metric thresholds (pass/fail criteria).

Targets

The target is whatever you're testing — a prompt, a script, a full pipeline. llmci sends each input to your target and collects the output. Two modes:

Command mode — run any executable. Language-agnostic.
Direct mode — call an LLM API directly via litellm.

Judges

A judge scores each output. llmci includes exact match, LLM-as-judge, custom Python functions, composite judges for agents, and dedicated RAG, safety, pairwise, and structured-output judges.

Thresholds

Each metric has a threshold. Two modes:

absolute — the score must be at least X (e.g., accuracy ≥ 0.90)
max_regression — the drop from baseline must be at most X% (e.g., ≤ 5% drop)

Baselines

A baseline is a snapshot of metric scores (and per-example outputs) stored under .llmci/baselines/{eval_name}.json. PRs compare against baselines to detect regressions. See Baselines & CI for the full workflow.

Contracts Reference

The authoritative reference for eval data, command I/O, judge config, and llmci.yaml fields. Use this section when wiring CI — examples elsewhere link back here.

In this section: Eval levels · Dataset rows · Command I/O · Agent command I/O · Judge config · Eval config · Metric names

Eval `level` values

`level`	Runtime effect	Dataset schema	Required eval fields
`pipeline` (default)	Standard eval loop: load JSONL → run target → judge each example	Standard JSONL (`input` + optional `expected`)	`name`, `dataset`, `judge`, `metrics`
`prompt`	Same as `pipeline` — documentation label for prompt-only testing	Standard JSONL	Same as `pipeline`
`agent`	Agent runner: single- or multi-turn scenarios, trace output, composite judge	Agent JSONL (separate schema)	Above + `level: agent`, command target, `composite` judge; optional `mode`

Only agent changes the loader, the target runner, and the dataset shape. prompt and pipeline behave identically; the label is for humans reading the config, so pick whichever matches what you are testing.

Dataset rows (JSONL file format)

One JSON object per line (JSONL). Blank lines are ignored.
Each line must be a JSON object (not an array or string).
Standard evals and agent evals use different row shapes (see below).

Quick reference

Eval type	Required fields	Optional fields	Example JSONL row
Classification	`input`, `expected`	`id`, `metadata`, `category`	`{"input": "Refund…", "expected": "billing"}`
Reference-free LLM judge	`input`	`metadata`, custom tags	`{"input": "Summarize…"}`
RAG	`input`	`expected`, `relevant_ids`	`{"input": "…", "relevant_ids": ["doc1"]}`
Safety / red-team	`input`	`attack`, `category`, `seed`	`{"input": "…", "attack": "jailbreak"}`
Agent single-turn	`input`, `expected`	`expected.constraints`	`{"input": {"query": "…"}, "expected": {"outcome": "…"}}`
Agent multi-turn	`turns`	`conversation_constraints`	`{"turns": [{"user_message": "…", "expected": {"outcome": "…"}}]}`

Standard eval rows (`level: prompt` or `level: pipeline`)

Used for every eval except level: agent. Each row is loaded into input, expected, and an extra bag for additional fields.

Field	Required?	Type	Notes
`input`	Yes	string	Prompt text or user message. Must be a JSON string (not an object) for standard evals.
`expected`	Depends on judge	string	Required for `exact_match` and classification metrics. May be omitted for `llm`, `rag`, `safety`, `pairwise`, and `structured` judges (defaults to `""`).
Any other key	No	any JSON	Allowed. Stored in `extra`, merged into command input files, and available to judges (e.g. `id`, `metadata`, `relevant_ids`, `images`).
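
The row rules above can be checked before a run with a few lines of Python. This is a minimal sketch, not llmci's own loader: the function name validate_standard_rows is hypothetical, and it only enforces the rules stated in this section (one JSON object per line, blank lines ignored, input as a string, expected as a string when present).

```python
import json


def validate_standard_rows(lines):
    """Return (line_number, problem) tuples for standard eval rows."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((lineno, "row must be a JSON object"))
            continue
        if not isinstance(row.get("input"), str):
            problems.append((lineno, "'input' must be a string"))
        if "expected" in row and not isinstance(row["expected"], str):
            problems.append((lineno, "'expected' must be a string when present"))
    return problems


sample = [
    '{"input": "Refund my subscription", "expected": "billing"}',
    "",
    '["not", "an", "object"]',
    '{"input": {"query": "object input"}}',
]
print(validate_standard_rows(sample))
```

Rows for level: agent use a different shape and would fail this check by design.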

Example rows

Use case	Example JSONL row
Classification	`{"input": "Refund my subscription", "expected": "billing"}`
Reference-free LLM judge	`{"input": "Summarize this article…"}` (no `expected`)
RAG retrieval labels	`{"input": "What is Python?", "expected": "…", "relevant_ids": ["python"]}`
Multimodal direct target	`{"input": "Describe this", "images": ["fixtures/photo.jpg"]}`
Red-team metadata	`{"input": "…", "attack": "jailbreak", "category": "injection"}`

Command-mode input / output (`{input_file}` / `{output_file}`)

For each example, llmci writes one JSON object to a temp file and substitutes its path into your command. Your script reads that file — not the JSONL directly.

{
  "input": "user text",
  "expected": "gold label",
  // plus every field from extra, merged at the top level:
  "relevant_ids": ["doc1"],
  "category": "billing"
}

Your command should write one JSON object to {output_file}:

{
  "output": "model or pipeline answer",
  // optional — for cost/token gates:
  "usage": {"tokens_in": 120, "tokens_out": 45},
  "cost": 0.001,
  // optional — for RAG judges (any other keys become judge metadata):
  "contexts": ["passage 1"],
  "retrieved_ids": ["doc1"]
}

Reserved output keys: output, usage, cost. All other keys are passed to judges as metadata.
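
To make the reserved-key rule concrete, the sketch below splits a command's output object into the reserved part and the judge metadata. It illustrates the contract described above and is not llmci's internal code; split_command_output is a hypothetical helper name.

```python
RESERVED_KEYS = {"output", "usage", "cost"}


def split_command_output(payload):
    """Split a command's output object into (reserved, judge_metadata)."""
    if "output" not in payload:
        raise ValueError("command output must contain an 'output' key")
    reserved = {k: v for k, v in payload.items() if k in RESERVED_KEYS}
    metadata = {k: v for k, v in payload.items() if k not in RESERVED_KEYS}
    return reserved, metadata


reserved, metadata = split_command_output({
    "output": "Python is a programming language.",
    "usage": {"tokens_in": 120, "tokens_out": 45},
    "cost": 0.001,
    "contexts": ["passage 1"],
    "retrieved_ids": ["doc1"],
})
print(sorted(reserved))   # reserved keys that were present
print(sorted(metadata))   # everything else, passed to judges
```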

Agent eval rows (`level: agent`)

Agent datasets use a separate schema. Requires a composite judge and a command-mode target. Set mode: full_replay (default) or history_injection on the eval.

Single-turn

{"input": {"query": "Return order #5678"}, "expected": {
  "outcome": "return initiated",
  "constraints": {
    "required_tools": ["lookup_order", "initiate_return"],
    "forbidden_tools": ["delete_account"],
    "max_tool_calls": 4
  }
}}

input may be a string or object. The command receives the object (or {"input": "…"} if string).

Multi-turn

{"turns": [
  {"user_message": "What's my order status?", "expected": {"outcome": "status shown"}},
  {"user_message": "Cancel it", "expected": {"outcome": "cancelled",
    "constraints": {"required_tools": ["cancel_order"]}}}
]}

Agent command input by mode

Agent evals also write one JSON object to {input_file} per invocation. Output is trace JSON (see Agent Evaluation).

Single-turn

If dataset input is a string, the command receives {"input": "…"}. If it is an object, that object is written as-is:

{"query": "Return order #5678"}

Multi-turn — `full_replay` (one command call per turn)

Turn 0 — empty history:

{
  "user_message": "What's my order status?",
  "history": [],
  "turn_index": 0
}

Turn 1 — history includes prior user/assistant messages from actual agent outputs:

{
  "user_message": "Cancel it",
  "history": [
    {"role": "user", "content": "What's my order status?"},
    {"role": "assistant", "content": "Order #1234 is shipped."}
  ],
  "turn_index": 1
}

Multi-turn — `history_injection` (one command call total)

Prior turns are pre-filled with placeholder assistant text "(prior response)"; only the final user message is executed. For a two-turn scenario:

{
  "user_message": "Cancel it",
  "history": [
    {"role": "user", "content": "What's my order status?"},
    {"role": "assistant", "content": "(prior response)"}
  ],
  "turn_index": 1
}

Optional per-turn context from the dataset is merged into the input object when present.
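
The two multi-turn modes differ only in how history is built. The sketch below reproduces the input objects shown above from a list of turns. It is an illustration under stated assumptions: the function names are hypothetical, run_agent stands in for one invocation of your command and returns the assistant's reply text, and per-turn context merging is left out.

```python
PLACEHOLDER = "(prior response)"


def full_replay_inputs(turns, run_agent):
    """One payload per turn; history comes from the agent's real replies."""
    history, payloads = [], []
    for index, turn in enumerate(turns):
        payload = {
            "user_message": turn["user_message"],
            "history": list(history),  # copy so earlier payloads stay unchanged
            "turn_index": index,
        }
        payloads.append(payload)
        reply = run_agent(payload)
        history.append({"role": "user", "content": turn["user_message"]})
        history.append({"role": "assistant", "content": reply})
    return payloads


def history_injection_input(turns):
    """A single payload; prior assistant replies are placeholders."""
    if not turns:
        raise ValueError("a scenario needs at least one turn")
    history = []
    for turn in turns[:-1]:
        history.append({"role": "user", "content": turn["user_message"]})
        history.append({"role": "assistant", "content": PLACEHOLDER})
    return {
        "user_message": turns[-1]["user_message"],
        "history": history,
        "turn_index": len(turns) - 1,
    }


turns = [{"user_message": "What's my order status?"}, {"user_message": "Cancel it"}]
replayed = full_replay_inputs(turns, lambda payload: "Order #1234 is shipped.")
injected = history_injection_input(turns)
```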

Judge config schema

type: llm uses rubric only — not criteria. The word "criteria" in prose refers to rubric items evaluated pass/fail.

Judge `type`	Required fields	Scoring config field	Shape
`exact_match`	—	—	Shorthand: `judge: exact_match`
`llm`	`model`, `rubric`	`rubric`	String, or list of `{id, prompt}`
`custom`	`module`, `function`	—	Python file with `evaluate(input, expected, actual)` — all three args are strings (see Judges)
`composite`	`criteria`	`criteria`	List of `{name, type, weight, …}` — trajectory entries may include a nested `rubric` string
`rag`	`criteria`	`criteria`	List of RAG criterion objects (`retrieval_recall`, `faithfulness`, …)
`safety`	`criteria`	`criteria`	List of safety criterion objects (`pii_leakage`, `toxicity`, …)
`pairwise`	`model`	`rubric` (optional)	String comparison instruction
`structured`	`json_schema`	`json_schema`	Inline schema or path to `.json` file

`llmci.yaml` eval fields

Field	Required?	Description
`evals[].name`	Yes	Eval identifier; baseline filename and report label.
`evals[].dataset`	Yes	Path to JSONL, or S3/HTTPS `{source, cache}` object.
`evals[].judge`	Yes	Judge config (see table above).
`evals[].metrics`	Yes (for gating)	List of `{name, threshold, mode}` — names must match computed metrics.
`evals[].level`	No	`pipeline` (default), `prompt`, or `agent`.
`evals[].mode`	No	Agent only: `full_replay` (default) or `history_injection`.
`evals[].target`	No	Override root `target` for this eval only.
`target`	Yes	`command` or `provider`+`model` (+ optional `prompt_file`, `base_url`).
`settings`	No	`parallelism`, `timeout_per_call`, `retries`, sampling, `price_overrides`, etc.

Metric names

Threshold name must match a computed metric. Built-in aggregates:

accuracy, pass_rate, rubric_pass_rate (alias of pass_rate for LLM rubrics), mean_score, median_score, min_score, max_score, error_rate, f1_macro, f1_micro, f1_weighted, precision_*, recall_*, latency_mean, latency_p50, latency_p90, latency_p99, cost_total, cost_mean, tokens_in_mean, tokens_out_mean, tokens_total_mean, cosine_similarity.

Multi-criterion judges also expose each criterion by name as a metric (e.g. retrieval_recall, pii_leakage, win_rate, faithfulness). Plugins may register additional metric names.

llmci.yaml

The config file defines your target, evals, and settings. Field-level contracts (dataset rows, judge shapes, metric names) are in the Contracts Reference.

version: 1

target:
  command: "python3 run.py --input {input_file} --output {output_file}"

evals:
  - name: ticket-classification
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.90
        mode: absolute

settings:
  parallelism: 5
  timeout_per_call: 30
  retries: 1

Remote datasets (S3 / HTTPS)

Datasets can live outside the repo. Use a URI string or the object form with optional caching:

evals:
  - name: ticket-classification
    dataset: s3://company-evals/tickets.jsonl

  - name: response-quality
    dataset:
      source: https://example.com/evals/quality.jsonl
      cache: true

S3 downloads use your normal AWS credentials (env vars, ~/.aws/credentials, or IAM role in CI). Install the optional extra: pip install 'llmci[s3]'. Cached files are stored in .llmci/cache/datasets/.

Field	Type	Description
`version`	int	Config version. Always `1`.
`target`	object	What to test. See Targets.
`evals`	list	One or more eval definitions.
`evals[].name`	string	Eval identifier; used in reports and baseline filenames.
`evals[].level`	string	`prompt`, `pipeline` (labels only), or `agent` (separate dataset schema). Default `pipeline`.
`evals[].dataset`	string \| object	Path to JSONL, or S3/HTTPS source. See Dataset Schemas.
`evals[].judge`	object \| string	Scoring method. See Judges.
`evals[].metrics`	list	Thresholds to gate on. Names must match computed metrics.
`evals[].mode`	string	Agent only: `full_replay` or `history_injection`.
`settings`	object	Parallelism, timeouts, retries, sampling, price overrides.

Targets

Define what llmci tests — a script, a service, or a direct LLM call.

Command Mode

Wrap any executable. Your script receives a JSON input file and writes a JSON output file:

target:
  command: "python3 my_script.py --input {input_file} --output {output_file}"

See Contracts Reference for the exact input/output JSON contracts. In short: llmci writes one merged JSON object per example to {input_file} (including input, expected, and any extra dataset fields); your script writes a JSON object with at least output to {output_file}.

Command mode is language-agnostic. Your script can be Python, Node.js, Go, a Docker container — anything that reads/writes JSON files.

Direct API Mode

Call an LLM provider directly via litellm:

target:
  direct:
    provider: openai
    model: gpt-4o-mini
  prompt_file: prompt.txt

The prompt file uses {input} as a placeholder:

Classify this ticket into: hardware, billing, account, software.

Respond with only the category name.

Ticket: {input}

Set API credentials via environment variables (e.g., OPENAI_API_KEY). All litellm-supported providers work: OpenAI, Anthropic, Azure, Bedrock, Vertex, Ollama, etc.

Dataset rows can include images and/or audio (paths relative to the dataset file, or HTTPS URLs) for multimodal direct targets. See examples/18-multimodal-vision.

Custom Base URL / Proxy

If your organization uses an internal LLM proxy or gateway, set base_url to route requests through it:

target:
  direct:
    provider: openai
    model: gpt-4o
    base_url: https://llm-proxy.internal.company.com/v1
  prompt_file: prompt.txt

Alternatively, you can set the base URL via environment variables (e.g., OPENAI_API_BASE).

Judges

Judges score each example's output, either against the expected value or, for reference-free judges, against a rubric or schema. Config shapes and field names are in the Judge config section of the Contracts Reference.

Exact Match

For classification and deterministic tasks:

judge: exact_match

Strips whitespace and compares strings. Score is 1.0 for match, 0.0 for mismatch.

LLM-as-Judge

For open-ended tasks where there's no single correct answer:

judge:
  type: llm
  model: gpt-4o
  rubric:
    - id: accuracy
      prompt: "Is the response factually correct?"
    - id: completeness
      prompt: "Does the response fully address the question?"

The judge LLM evaluates each criterion independently (pass/fail), and the final score is the fraction of criteria passed. Responses are cached to avoid redundant API calls.

You can also use a single-string rubric for simpler setups:

judge:
  type: llm
  model: gpt-4o-mini
  rubric: "Is the response accurate, complete, and well-written?"

Reference-free evaluation

LLM judges don't require a reference answer. If your dataset only has input fields (no expected), the judge evaluates the output purely against the input and rubric. This is useful for:

Tone and style checking ("Is the response professional and empathetic?")
Safety evaluation ("Does the response contain harmful content?")
Format validation ("Is the response valid JSON with the required fields?")
Relevance checking ("Does the response address the user's question?")

# Dataset without expected — just inputs
{"input": "Write me a professional email declining a meeting"}
{"input": "Explain quantum computing to a 10 year old"}

judge:
  type: llm
  model: gpt-4o-mini
  rubric:
    - id: tone
      prompt: "Is the response appropriately professional?"
    - id: relevance
      prompt: "Does the response directly address what the user asked for?"

When a reference answer is provided, the judge sees both the expected and actual outputs for comparison. When it's omitted, the judge evaluates the output on its own merits against the rubric criteria.

Custom Judge

Write your own scoring logic in Python:

judge:
  type: custom
  module: ./my_judge.py
  function: evaluate

Your function receives three strings — llmci always passes string values at the Python boundary:

def evaluate(input: str, expected: str, actual: str) -> dict:
    # input / expected: from the dataset row (expected is "" when omitted)
    # actual: target output string
    return {"score": 1.0, "reason": "Looks good"}

Standard eval datasets require input as a JSON string. Use type hints like Any only if you parse JSON inside the function yourself.

Return a dict with score (0.0–1.0) and optionally reason.
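
As a slightly fuller illustration, here is a sketch of a custom judge that tolerates case, surrounding whitespace, and trailing punctuation. It uses only the evaluate(input, expected, actual) signature and return shape described above; the normalization rules and the _normalize helper are assumptions chosen for the example.

```python
import string


def _normalize(text: str) -> str:
    """Lowercase and trim surrounding whitespace and punctuation."""
    return text.strip().lower().strip(string.punctuation + string.whitespace)


def evaluate(input: str, expected: str, actual: str) -> dict:
    # expected is "" when the dataset row omits it.
    if not expected:
        return {"score": 0.0, "reason": "No expected label to compare against"}
    if _normalize(actual) == _normalize(expected):
        return {"score": 1.0, "reason": "Labels match after normalization"}
    return {"score": 0.0, "reason": f"Expected {expected!r}, got {actual!r}"}
```

A judge like this is useful when a model answers "Billing." where the dataset says "billing", which exact_match would count as a miss.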

Composite Judge (Agents)

Combine multiple evaluation criteria for agent workflows:

judge:
  type: composite
  criteria:
    - name: constraints
      type: constraint
      weight: 1.0
    - name: outcome
      type: outcome
      weight: 2.0
    - name: trajectory
      type: trajectory
      weight: 1.0
      rubric: "Did the agent use tools efficiently?"

See Agent Evaluation for details.

RAG Judge

First-class metrics for retrieval-augmented pipelines. Each criterion surfaces as a gateable metric by name:

judge:
  type: rag
  model: gpt-4o-mini
  criteria:
    - {name: faithfulness,        type: faithfulness}
    - {name: retrieval_recall,    type: retrieval_recall,    k: 5}
    - {name: retrieval_precision, type: retrieval_precision, k: 5}

Command targets write structured output; gold retrieval labels use relevant_ids on each dataset row. Retrieval criteria are deterministic (no API key). See the RAG case study and examples/12-rag-retrieval.
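
For intuition, the retrieval criteria can be sketched with the textbook recall@k and precision@k definitions. This is an assumption about what the criteria compute, not llmci's source: in particular, dividing precision by the number of retrieved items in the top k (rather than by k itself) and returning 0.0 for empty inputs are choices made for this example.

```python
def retrieval_recall(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumption: no gold labels scores zero
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def retrieval_precision(retrieved_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0  # assumption: nothing retrieved scores zero
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


retrieved = ["a", "b", "c", "d"]   # from the command output's retrieved_ids
relevant = ["a", "d", "z"]         # from the dataset row's relevant_ids
print(retrieval_recall(retrieved, relevant, k=4))
print(retrieval_precision(retrieved, relevant, k=4))
```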

Safety Judge

Gate on PII leakage, toxicity, and jailbreak resistance. Higher scores are safer:

judge:
  type: safety
  model: gpt-4o-mini
  criteria:
    - {name: pii_leakage,          type: pii_leakage}
    - {name: jailbreak_resistance, type: jailbreak_resistance}

pii_leakage is deterministic — it scans for emails, phones, SSNs, credit cards, IPv4, and AWS keys. Narrow with categories: [email, ssn] or exempt known-safe values with allow_list: [support@acme.com] / allow_list: [regex:@example\.com$]. Generate adversarial inputs with llmci redteam generate (examples/15-redteam).

Pairwise Judge

Compare each output against the baseline output for the same input and report a win_rate metric. Position-swap averaging is enabled by default to reduce the judge LLM's position bias. Requires --compare-to (or committed baselines with per-example outputs).

Structured-Output Judge

Validate JSON output against a JSON Schema (inline or a .json file). Deterministic, no API key. See examples/16-structured-output.

Metrics

Metrics aggregate per-example judge scores into a single number.

Score-Based

Metric	Description	Best for
`accuracy`	Fraction of examples with score = 1.0	Classification
`pass_rate`	Fraction of examples with score ≥ 0.5	Open-ended tasks
`mean_score`	Average judge score across all examples	Rubric-based evaluation
`median_score`	Median judge score (robust to outliers)	Rubric-based evaluation
`min_score`	Lowest score in the dataset	Worst-case analysis
`max_score`	Highest score in the dataset	Sanity checks
`error_rate`	Fraction of examples that errored (timeout, API failure)	Reliability monitoring
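
The score-based aggregates in the table follow directly from the per-example scores. The sketch below computes them as described; score_metrics is a hypothetical helper, and error_rate is omitted because how errored examples are represented is not specified here.

```python
from statistics import mean, median


def score_metrics(scores):
    """Aggregate per-example judge scores (each between 0.0 and 1.0)."""
    if not scores:
        raise ValueError("need at least one score")
    n = len(scores)
    return {
        "accuracy": sum(1 for s in scores if s == 1.0) / n,
        "pass_rate": sum(1 for s in scores if s >= 0.5) / n,
        "mean_score": mean(scores),
        "median_score": median(scores),
        "min_score": min(scores),
        "max_score": max(scores),
    }


metrics = score_metrics([1.0, 0.5, 0.0, 1.0])
print(metrics)
```

Note how accuracy and pass_rate diverge once a judge emits partial scores: the 0.5 example counts toward pass_rate but not accuracy.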

Classification

Metric	Description	Best for
`f1_macro`	Macro-averaged F1 across categories	Balanced multi-class
`f1_micro`	Micro-averaged F1 (global TP/FP/FN)	Imbalanced datasets
`f1_weighted`	Weighted F1 by class support	Imbalanced datasets
`precision_macro`	Macro-averaged precision	When false positives are costly
`precision_micro`	Micro-averaged precision	Imbalanced datasets
`precision_weighted`	Weighted precision by class support	Imbalanced datasets
`recall_macro`	Macro-averaged recall	When false negatives are costly
`recall_micro`	Micro-averaged recall	Imbalanced datasets
`recall_weighted`	Weighted recall by class support	Imbalanced datasets
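
The difference between the averaging modes is easiest to see in code. This sketch uses the standard definitions for single-label classification; it is not taken from llmci, and f1_scores is a hypothetical name. Macro averages over every label seen in either the expected or predicted values, which is one common convention.

```python
from collections import Counter


def f1_scores(expected, actual):
    """Macro, micro, and support-weighted F1 for single-label predictions."""
    labels = sorted(set(expected) | set(actual))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(expected, actual):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label gains a false positive
            fn[gold] += 1  # gold label gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    per_class = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    support = Counter(expected)
    return {
        "f1_macro": sum(per_class.values()) / len(labels),
        "f1_micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "f1_weighted": sum(per_class[l] * support[l] for l in labels) / len(expected),
    }


expected = ["billing", "billing", "hardware", "account"]
actual = ["billing", "hardware", "hardware", "account"]
scores = f1_scores(expected, actual)
print(scores)
```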

Similarity

Metric	Description	Best for
`cosine_similarity`	Token-overlap cosine similarity (bag-of-words)	Text generation, translation
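
A bag-of-words cosine similarity can be sketched in a few lines. The tokenization here (lowercasing and splitting on whitespace) is an assumption made for the example; llmci's tokenizer may differ.

```python
import math
from collections import Counter


def cosine_similarity(expected: str, actual: str) -> float:
    """Cosine of the angle between two token-count vectors."""
    a = Counter(expected.lower().split())
    b = Counter(actual.lower().split())
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine_similarity("the cat sat", "the cat ran"))
```

Because the measure ignores word order, "dog bites man" and "man bites dog" score as identical, which is why it suits rough overlap checks better than correctness checks.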

Latency

Metric	Description	Best for
`latency_mean`	Average response time (ms)	Performance budgets
`latency_p50`	Median response time (ms)	Typical performance
`latency_p90`	90th percentile response time (ms)	Tail latency
`latency_p99`	99th percentile response time (ms)	Worst-case latency

Cost & Tokens (lower is better)

Metric	Description	Best for
`cost_total` / `cost_mean`	Total and per-example cost (USD)	Cost regression gates
`tokens_in_mean` / `tokens_out_mean`	Average input/output tokens	Token budget monitoring
`tokens_total_mean`	Average combined token usage	Overall spend drivers

Direct targets read usage from the provider. When litellm cannot price a model (internal proxies), set settings.price_overrides with per-model input_per_token / output_per_token USD rates. Command targets can opt in by adding usage and cost to output JSON. See examples/17-integrated-ci-gate for a stacked quality + cost + safety gate.
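
The arithmetic behind a per-token price override is simple enough to sketch. The rates below are made-up illustration values, and example_cost and cost_metrics are hypothetical helper names; only the usage keys (tokens_in, tokens_out) and rate keys (input_per_token, output_per_token) come from the text above.

```python
def example_cost(usage, rates):
    """USD cost of one example from token counts and per-token rates."""
    return (usage["tokens_in"] * rates["input_per_token"]
            + usage["tokens_out"] * rates["output_per_token"])


def cost_metrics(usages, rates):
    """Total and per-example mean cost across a run."""
    costs = [example_cost(usage, rates) for usage in usages]
    return {"cost_total": sum(costs), "cost_mean": sum(costs) / len(costs)}


# Hypothetical rates: $1 per million input tokens, $2 per million output tokens.
rates = {"input_per_token": 0.000001, "output_per_token": 0.000002}
usages = [
    {"tokens_in": 120, "tokens_out": 45},
    {"tokens_in": 80, "tokens_out": 30},
]
costs = cost_metrics(usages, rates)
print(costs)
```

A command target could run the same calculation itself and write the result into the cost key of its output JSON.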

Judge sub-scores

RAG, safety, pairwise, and composite judges expose each criterion as a gateable metric by name — e.g. faithfulness, retrieval_recall, pii_leakage, win_rate.

Threshold Modes

Absolute

The metric must meet a fixed threshold:

- name: accuracy
  threshold: 0.90
  mode: absolute   # accuracy must be ≥ 0.90

Max Regression

The drop from the baseline must not exceed a percentage:

- name: accuracy
  threshold: 0.05
  mode: max_regression   # at most 5% drop from baseline

max_regression thresholds require a stored baseline. Run llmci run --update-baseline on your main branch first. For lower-is-better metrics (cost, tokens, latency, error_rate), absolute checks invert (value must be ≤ threshold) and max_regression fails on a rise past the threshold.
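
The two threshold modes, including the inversion for lower-is-better metrics, can be sketched as follows. This is not llmci's implementation. It assumes that max_regression measures the change relative to the baseline value (so 0.05 means 5% of the baseline) and that lower-is-better metrics can be recognized by name; both are assumptions made for the example.

```python
LOWER_IS_BETTER_PREFIXES = ("cost_", "tokens_", "latency_")


def lower_is_better(name):
    """Assumed naming rule for metrics where a smaller value is better."""
    return name == "error_rate" or name.startswith(LOWER_IS_BETTER_PREFIXES)


def passes_absolute(name, value, threshold):
    if lower_is_better(name):
        return value <= threshold
    return value >= threshold


def passes_max_regression(name, value, baseline, threshold):
    # Positive "worsening" means the metric moved in the bad direction.
    worsening = (value - baseline) if lower_is_better(name) else (baseline - value)
    if baseline != 0:
        worsening /= abs(baseline)  # assumption: relative to the baseline
    return worsening <= threshold


print(passes_absolute("accuracy", 0.95, 0.90))
print(passes_max_regression("accuracy", 0.86, baseline=0.90, threshold=0.05))
print(passes_max_regression("latency_p90", 1200, baseline=1000, threshold=0.10))
```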

Monorepos and multiple configs

Use llmci discover to find every llmci.yaml in a repo, then run all discovered configs with one command:

llmci discover
llmci run --all
llmci run --all --root services/ticket-classifier
llmci run --all --include "services/**" --exclude "services/summarizer/llmci.yaml"

Filters are matched against discovered config paths and can be repeated when you need to include or exclude several service folders.

Baselines & CI

Store baseline scores and detect regressions in pull requests.

Storing baselines

Baselines live at .llmci/baselines/{eval_name}.json — one file per eval, containing aggregate metrics, per-example outputs/scores, timestamp, and commit SHA.

Initialize on main (after your eval passes):

llmci run --update-baseline
git add .llmci/baselines/ && git commit -m "Add eval baselines"

Comparing on pull requests

Baselines are resolved in the following order (the first available source wins):

--compare-to=origin/main — read baseline files from a git ref (typical CI: checkout with fetch-depth: 0).
Committed files in .llmci/baselines/ on the current branch — loaded automatically when you omit --compare-to.
No baseline — absolute thresholds still work; max_regression and pairwise judges warn and skip.

llmci run --compare-to=origin/main

On main pushes, re-run with --update-baseline to refresh committed baselines after intentional changes.

Baselines store per-example outputs; regressed examples show an Output Diffs vs Baseline section in markdown and HTML reports.

Output formats

PR comments stay markdown. For GitLab, Bitbucket, Azure DevOps, Jenkins, or artifacts, use --output-format:

llmci run --output-format junit --output results.xml
llmci run --output-format sarif --output results.sarif
llmci run --output-format html  --output report.html
llmci run --output-format json  --output results.json

Response caching

Direct API targets cache responses under .llmci/cache/responses/ (keyed on provider, model, prompt, input). Judge LLM calls for RAG, safety, and pairwise share .llmci/cache/judges/. Use --no-cache or --refresh-cache to bypass or rebuild.

GitHub Actions

llmci auto-detects GitHub Actions and posts eval results as a PR comment.

Single job (composite action)

For one eval config per workflow run:

# .github/workflows/llmci.yml
name: llmci Evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # required for --compare-to git baselines
      - uses: llmci-cli/llmci@main
        with:
          compare-to: origin/main
          llmci-version: 0.4.1
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Matrix jobs (multiple services)

When parallel matrix jobs each run evals, set LLMCI_REPORT_SLICE so every job merges its report into one PR comment instead of overwriting the others:

strategy:
  matrix:
    include:
      - { service: ticket-classifier, config: llmci.yaml }
      - { service: ticket-classifier, config: llmci-gate.yaml }
      - { service: rag-qa, config: llmci.yaml }
steps:
  - uses: actions/checkout@v4
    with:
      fetch-depth: 0
  - run: pip install llmci
  - name: Run eval
    working-directory: services/${{ matrix.service }}
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      LLMCI_REPORT_SLICE: ${{ matrix.service }}/${{ matrix.config }}
    run: llmci run --config ${{ matrix.config }} --compare-to=origin/main

On main pushes, use --update-baseline instead of --compare-to. See the full pattern in llmci-testbed.

llmci uses a hidden HTML comment to find its own PR comments. Re-running the action updates the existing comment instead of creating duplicates. With LLMCI_REPORT_SLICE, each matrix job updates its own slice in that single comment.

Model Migration

Re-tune your prompt when switching models or providers.

When you move from one model to another — or across providers (OpenAI → Anthropic, etc.) — your prompt may need adjustments to maintain quality. llmci automates this:

llmci migrate \
  --from openai/gpt-4o-mini \
  --to anthropic/claude-3-haiku-20240307 \
  --eval ticket-classification \
  --optimizer-model openai/gpt-4o

Strategies

`--strategy`	What it does
`prompt` (default)	Iteratively rewrite the prompt from failure examples
`few_shot`	Greedily add train examples as inline few-shot demos (`--max-few-shot`)

Per-provider proxies: --from-base-url, --to-base-url, --optimizer-base-url. See examples/19-cross-provider-migration.

How it works

Split — dataset is split into train (70%), validation (15%), holdout (15%), stratified by category
Baseline — evaluate the original prompt on the holdout set with the source model (this is the quality bar)
Optimize — iteratively improve the prompt (rewrite or few-shot selection), evaluated on train/validation with the target model
Stop — early stopping when validation score plateaus
Report — evaluate the best prompt on holdout with the target model for an honest final score
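
The split step can be pictured with a small stratified splitter. This is a sketch of the general technique, not the migrate command's code: the function name, the use of the row's category field as the stratum, the fixed seed, and the rounding of group sizes are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, seed=0, train=0.70, validation=0.15):
    """Split rows into train/validation/holdout, preserving category mix."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row.get("category")].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "validation": [], "holdout": []}
    for group in by_category.values():
        group = list(group)
        rng.shuffle(group)
        n_train = round(len(group) * train)
        n_val = round(len(group) * validation)
        splits["train"] += group[:n_train]
        splits["validation"] += group[n_train:n_train + n_val]
        splits["holdout"] += group[n_train + n_val:]  # remainder
    return splits


rows = [{"id": i, "category": "billing" if i % 2 else "hardware"} for i in range(40)]
splits = stratified_split(rows)
print({name: len(part) for name, part in splits.items()})
```

Keeping the holdout set out of the optimization loop is what makes the final reported score an honest estimate.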

Options

Flag	Default	Description
`--patience`	3	Iterations without improvement before stopping
`--max-iterations`	20	Maximum optimization iterations
`--min-improvement`	0.005	Minimum score improvement to reset patience
`--max-edit-distance`	none	Reject prompts that change too much
`--max-few-shot`	5	Cap for `few_shot` strategy

Writing changes safely

Migration is designed to be non-destructive by default:

Read-only optimization — llmci reads target.prompt_file but never writes without confirmation.
Report with diff — stdout includes scores, parity verdict, a unified diff of original vs optimized prompt, and iteration history.
Confirm before write — only if the prompt changed, llmci prompts Write optimized prompt to disk? [y/N]. Answer N for a dry run.
No automatic backup — commit your prompt to git before migrating; rollback is git checkout -- prompt.txt (or decline the write).

Requires target.prompt_file in direct API mode. The few_shot strategy inlines selected train examples into the prompt text — same confirm-before-write flow applies.

Agent Evaluation

Test tool-using and conversational agents with composite judging. Dataset and command I/O contracts: Contracts Reference — Agent command input.

Agent Scenarios

Agent eval datasets use a different format from standard JSONL. See the quick-reference table in Contracts Reference. Single-turn:

{"input": {"query": "What's the weather?"}, "expected": {"outcome": "weather info", "constraints": {"max_tool_calls": 3, "required_tools": ["get_weather"]}}}

Multi-turn:

{"turns": [{"user_message": "Check my order", "expected": {"outcome": "order status"}}, {"user_message": "Cancel it", "expected": {"outcome": "cancellation confirmation"}}]}

Agent Trace Format

Your agent command must output a trace JSON:

{
  "final_output": "Your order has been cancelled.",
  "trace": [
    {"step": 1, "type": "tool_call", "tool": "cancel_order", "args": {"id": "1234"}},
    {"step": 2, "type": "response", "content": "Order cancelled."}
  ],
  "total_tool_calls": 1,
  "total_tokens": 150
}

Building trace output

Agent evals invoke your agent as a command that reads input JSON and writes output JSON. Use TraceBuilder for mocks and custom frameworks, or the OpenAI Agents adapter for SDK runs.

TraceBuilder (any framework)

from llmci.trace import TraceBuilder

tb = TraceBuilder()
tb.tool("get_weather", {"city": "London"}, result="58°F cloudy", tokens=25)
tb.response("It's 58°F and cloudy in London.")
output = tb.to_dict()   # write to {output_file}

OpenAI Agents SDK adapter

from llmci.integrations.openai_agents import run_for_llmci_sync

# build_agent() is your own function that returns a configured Agents SDK agent
result = run_for_llmci_sync(build_agent(), {"query": "Weather in Tokyo?"})
# result: final_output, trace, total_tool_calls, total_tokens

Requires pip install 'llmci[agents]' and OPENAI_API_KEY. For CI without an API key, use MOCK_LLM=1 with a TraceBuilder mock — see examples/10-agent-openai-agents.

Composite Judge Criteria

Type	How it works	Requires LLM
`constraint`	Checks tool call budgets, required/forbidden tools, token limits	No
`outcome`	LLM evaluates if the final output matches the expected outcome	Yes
`trajectory`	LLM evaluates the execution path against a rubric	Yes
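
Because the `constraint` criterion needs no LLM, its logic is easy to picture. The following is an illustrative sketch only, not llmci's judge: `check_constraints` is a hypothetical function that applies the `max_tool_calls`, `required_tools`, and `forbidden_tools` keys from the scenario examples to a trace in the format shown earlier. Token limits are omitted because this page does not show the key used for them.

```python
def check_constraints(trace_output: dict, constraints: dict) -> list[str]:
    """Return a list of constraint violations (empty means the constraints hold)."""
    # Collect the names of tools the agent actually called, in order.
    tools_used = [
        step["tool"]
        for step in trace_output.get("trace", [])
        if step.get("type") == "tool_call"
    ]

    violations = []
    max_calls = constraints.get("max_tool_calls")
    if max_calls is not None and len(tools_used) > max_calls:
        violations.append(f"{len(tools_used)} tool calls exceeds max_tool_calls={max_calls}")

    for tool in constraints.get("required_tools", []):
        if tool not in tools_used:
            violations.append(f"required tool not called: {tool}")

    for tool in constraints.get("forbidden_tools", []):
        if tool in tools_used:
            violations.append(f"forbidden tool called: {tool}")

    return violations
```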

Multi-Turn Modes

full_replay — command is invoked once per turn with cumulative conversation history built from real prior outputs
history_injection — command is invoked once; prior turns are injected with placeholder assistant replies

Exact JSON written to {input_file} for each mode is documented in Agent command input by mode.
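
The difference between the two modes is mostly about how conversation history is assembled and how often the agent runs. The sketch below illustrates that idea with a hypothetical message shape (`role` / `content` dicts), a hypothetical `PLACEHOLDER` string, and an in-process `agent` callable standing in for the command; none of these are llmci's actual wire format.

```python
from typing import Callable

Message = dict[str, str]
Agent = Callable[[list[Message]], str]

PLACEHOLDER = "(placeholder assistant reply)"  # hypothetical stand-in text


def full_replay(user_messages: list[str], agent: Agent) -> list[str]:
    """One agent invocation per turn; history uses the agent's real prior replies."""
    history: list[Message] = []
    replies = []
    for text in user_messages:
        history.append({"role": "user", "content": text})
        reply = agent(list(history))  # pass a copy so each call sees a snapshot
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies


def history_injection(user_messages: list[str], agent: Agent) -> str:
    """A single agent invocation; earlier turns get placeholder assistant replies."""
    history: list[Message] = []
    for text in user_messages[:-1]:
        history.append({"role": "user", "content": text})
        history.append({"role": "assistant", "content": PLACEHOLDER})
    history.append({"role": "user", "content": user_messages[-1]})
    return agent(history)
```

The trade-off this makes visible: `full_replay` costs one call per turn but tests each turn against what the agent really said, while `history_injection` is cheaper and only exercises the final turn.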

Dataset Tools

Create, curate, and analyze eval datasets.

Initialize a dataset

llmci dataset init --name my-eval --type classification

Creates an empty evals/my-eval.jsonl file.

Add examples interactively

llmci dataset add --name my-eval

Prompts for input/expected pairs and appends them to the dataset.

Check dataset quality

llmci dataset check --name my-eval

Reports category distribution, underrepresented categories, duplicate inputs, class imbalance, and input length statistics.
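
For intuition about what such a report contains, here is a rough standalone approximation. It is not the `llmci dataset check` implementation: `summarize_rows` is a hypothetical helper, it assumes classification-style rows with `input` and `expected` fields, and the imbalance ratio (largest class divided by smallest) is one common definition chosen for illustration.

```python
import json
from collections import Counter
from statistics import mean


def summarize_rows(jsonl_text: str) -> dict:
    """Summarize label balance, duplicate inputs, and input lengths for JSONL rows."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("dataset is empty")

    inputs = [str(row["input"]) for row in rows]
    labels = Counter(str(row.get("expected", "(none)")) for row in rows)
    input_counts = Counter(inputs)
    lengths = [len(text) for text in inputs]

    return {
        "rows": len(rows),
        "category_counts": dict(labels),
        # Inputs that appear on more than one line.
        "duplicate_inputs": sorted(t for t, n in input_counts.items() if n > 1),
        # Largest class size divided by smallest; 1.0 means perfectly balanced.
        "imbalance_ratio": max(labels.values()) / min(labels.values()),
        "input_length": {"min": min(lengths), "mean": mean(lengths), "max": max(lengths)},
    }
```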

Import from CSV or JSON

llmci dataset import --name my-eval --from data.csv
llmci dataset import --name my-eval --from data.json --input-column question --expected-column answer

Troubleshooting First Runs

Common setup errors and the fastest fix.

Symptom	Likely cause	Fix
`python3: can't open file 'run.py'`	`llmci init` created a command target, but your adapter script does not exist yet.	Create the script referenced by `target.command`, or start from `examples/01-ci-regression`.
Provider auth error, such as missing `OPENAI_API_KEY`	You chose direct mode or an LLM judge.	Export the provider API key, or use a deterministic command-mode example first.
Dataset parse error	JSONL requires one complete JSON object per line.	Run `llmci dataset check --name <eval-name>` and fix the reported line.
Eval fails with a low score	The actual output does not match the expected value or threshold.	Inspect the per-example output in the report, then adjust the target, dataset, judge, or threshold.
`max_regression` is skipped	There is no baseline to compare against.	Run `llmci run --update-baseline` on your main branch, then compare PRs with `--compare-to=origin/main`.
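
Dataset parse errors usually come from pretty-printed JSON that spans several lines. If you want to locate the offending line yourself, a small standalone check like the one below is enough. `first_bad_line` is a hypothetical helper, and skipping blank lines is an assumption; llmci may treat them differently.

```python
import json
from typing import Optional


def first_bad_line(jsonl_text: str) -> Optional[tuple]:
    """Return (line_number, reason) for the first invalid JSONL line, or None."""
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            return number, exc.msg
        if not isinstance(value, dict):
            return number, "line is valid JSON but not an object"
    return None
```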

CLI Reference

Command	Description
`llmci run`	Run evals and report results
`llmci discover`	List discovered llmci config files
`llmci run --all`	Run every discovered config
`llmci run --all --include "services/**"`	Run only discovered configs matching a glob
`llmci run --all --exclude "legacy/**"`	Skip discovered configs matching a glob
`llmci run --smoke`	Run on a subset of the dataset
`llmci run --update-baseline`	Save current scores as baseline
`llmci run --compare-to=main`	Compare against a baseline branch
`llmci run --output report.md`	Write report to a file
`llmci run --output-format junit|sarif|json|html`	Machine-readable or shareable report formats
`llmci run --no-cache` / `--refresh-cache`	Bypass or rebuild response/judge caches
`llmci run --samples N`	Multi-sample runs with statistical aggregation
`llmci migrate`	Optimize a prompt for a new model or provider
`llmci judge calibrate`	Measure judge↔human agreement; detect drift
`llmci redteam generate`	Generate adversarial inputs for safety evals
`llmci init`	Generate llmci.yaml interactively
`llmci dataset init`	Create a new eval dataset
`llmci dataset add`	Add examples interactively
`llmci dataset check`	Analyze dataset coverage
`llmci dataset import`	Import from CSV/JSON
`llmci import-promptfoo`	Convert a Promptfoo config

Global flags:

-v / --verbose — Show progress during runs
--debug — Full debug logging
--version — Show version and exit

GitHub Action

Drop llmci into any GitHub Actions workflow.

- uses: llmci-cli/llmci@main
  with:
    compare-to: origin/main       # baseline branch
    smoke: false                   # run full dataset
    working-directory: .           # dir with llmci.yaml
    llmci-version: 0.4.1            # exact package version
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

For monorepos, use config to point at one service config, or set all: "true" with optional include / exclude globs to run every discovered config.

Input	Default	Description
`compare-to`	`origin/main`	Branch to load baselines from
`smoke`	`false`	Run on a dataset subset
`update-baseline`	`false`	Save current scores as baselines
`config`	(none)	Path to a specific llmci config file
`all`	`false`	Run all discovered config files
`root`	`.`	Directory to search when `all` is true
`include`	(none)	Newline-separated globs of discovered config paths to include
`exclude`	(none)	Newline-separated globs of discovered config paths to exclude
`working-directory`	`.`	Directory containing llmci.yaml
`output`	(none)	Write report to a file path
`github-token`	`github.token`	Token for posting PR comments
`llmci-version`	`0.4.1`	Exact llmci package version to install

Migrating from Promptfoo

One command to convert an existing Promptfoo config.

llmci import-promptfoo promptfooconfig.yaml

This converts:

providers → target (direct API mode)
prompts → prompt_file
tests[].assert → metrics with thresholds
tests[].vars → JSONL dataset rows

⚠

Some Promptfoo features (red teaming plugins, custom providers, JavaScript assertions) are not supported. Warnings are printed during conversion.

FastAPI Classification Service

A common pattern: a FastAPI service that classifies customer support tickets using an LLM. The service has pre-processing (text cleaning, PII redaction) and post-processing (confidence thresholds, fallback routing) around the LLM call.

Full service example: llmci-testbed/services/ticket-classifier

The risk

Any change to the service can affect predictions — not just prompt edits. A developer updating the PII redaction regex might accidentally strip keywords the model relies on. A change to the confidence threshold logic could re-route tickets incorrectly. These bugs don't show up in unit tests.

Prompt-level gating

Test the LLM call in isolation, verifying that the prompt + model produce correct classifications:

version: 1

target:
  direct:
    provider: openai
    model: gpt-4o-mini
  prompt_file: prompts/classify.txt

evals:
  - name: prompt-classification
    level: prompt
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.95
        mode: absolute

This catches prompt regressions fast — no service startup needed, no HTTP overhead. But it misses bugs in the surrounding code.

Service-level gating

Test the full pipeline by hitting the actual FastAPI endpoint. A thin wrapper script calls the service and extracts the classification:

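# eval_service.py: llmci command-mode wrapper
import argparse
import json

import requests

parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--output", required=True)
args = parser.parse_args()

# Read the eval input that llmci wrote to {input_file}
with open(args.input, encoding="utf-8") as f:
    data = json.load(f)

# Call the running service; fail loudly on HTTP errors or a hung request
resp = requests.post("http://localhost:8000/classify", json={"text": data["input"]}, timeout=30)
resp.raise_for_status()
result = resp.json()

# Write the predicted category to {output_file}; the with-block closes and flushes the file
with open(args.output, "w", encoding="utf-8") as f:
    json.dump({"output": result["category"]}, f)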

version: 1

target:
  command: "python3 eval_service.py --input {input_file} --output {output_file}"

evals:
  - name: service-classification
    level: pipeline
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.92
        mode: absolute
      - name: accuracy
        threshold: 0.03
        mode: max_regression

Now any change — pre-processing, post-processing, prompt, model config — is caught if it degrades the end-to-end classification quality.
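
The two threshold modes in the config above can be read as two different pass/fail rules. The sketch below spells out one plausible interpretation: `gate` is a hypothetical function, it assumes a higher-is-better metric such as accuracy, and it assumes `max_regression` means "the drop from baseline must not exceed the threshold". The `skipped` result reflects the troubleshooting note that `max_regression` is skipped when no baseline exists.

```python
from typing import Optional

EPSILON = 1e-9  # tolerance so float rounding does not flip a borderline result


def gate(score: float, threshold: float, mode: str, baseline: Optional[float] = None) -> str:
    """Return 'pass', 'fail', or 'skipped' for a higher-is-better metric."""
    if mode == "absolute":
        # The score itself must reach the threshold.
        return "pass" if score >= threshold - EPSILON else "fail"
    if mode == "max_regression":
        if baseline is None:
            return "skipped"  # nothing to compare against
        drop = baseline - score  # positive when the score got worse
        return "pass" if drop <= threshold + EPSILON else "fail"
    raise ValueError(f"unknown mode: {mode}")
```

With the service-level config above, an accuracy of 0.93 passes the absolute gate (0.92) and also passes the regression gate against a baseline of 0.95, but it would fail the regression gate against a baseline of 0.97.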

✓

Best practice: Run both levels. The prompt-level eval runs in seconds (no service startup). The service-level eval runs in CI after the service is built. Use max_regression mode on the service-level eval so the pipeline can tolerate minor drops from non-prompt changes while still catching significant regressions. See examples/08-fastapi-service for a runnable version of this pattern.

CI workflow

jobs:
  prompt-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install llmci
      - run: llmci run              # prompt-level (fast, no service needed)
        working-directory: evals/prompt
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}   # direct mode calls the provider

  service-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install llmci
      - run: docker compose up -d
      - run: sleep 5              # wait for service startup
      - run: llmci run --compare-to=origin/main
        working-directory: evals/service

RAG Pipeline with Retrieval + Generation

A retrieval-augmented generation pipeline where a user question is embedded, matched against a vector store, and the top documents are fed as context to an LLM for answer generation.

Full service example: llmci-testbed/services/rag-qa

The risk

Bugs can appear at any stage: the embedding model could be swapped (changing retrieval quality), the chunking strategy could be adjusted (changing what context the LLM sees), or the generation prompt could be edited. Each affects the final answer differently, and prompt-level testing alone won't catch retrieval-side regressions.

Pipeline-level testing with the built-in RAG judge

Test the full pipeline end-to-end. The built-in rag judge scores retrieval quality deterministically and can gate on faithfulness/relevance with an LLM:

target:
  command: "python3 pipeline/run.py --input {input_file} --output {output_file}"

evals:
  - name: rag-qa
    level: pipeline
    dataset: ./evals/qa.jsonl
    judge:
      type: rag
      criteria:
        - {name: retrieval_recall,    type: retrieval_recall,    k: 2}
        - {name: retrieval_precision, type: retrieval_precision, k: 2}
    metrics:
      - {name: retrieval_recall,    threshold: 0.90, mode: absolute}
      - {name: retrieval_precision, threshold: 0.50, mode: absolute}

The pipeline writes structured output; gold labels use relevant_ids per row:

# output JSON from your command target
{"output": "...", "contexts": ["..."], "retrieved_ids": ["python", "docker"]}

# dataset row
{"input": "What is Python?", "relevant_ids": ["python"]}

See examples/12-rag-retrieval and the live testbed service for a deterministic, API-key-free setup.
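
For readers who want to see how recall and precision at k behave on the sample row above, here is a small worked sketch. It uses the textbook definitions (hits in the top k divided by the number of relevant documents, and hits divided by k); `retrieval_scores` is a hypothetical function, and llmci's built-in judge may handle edge cases such as duplicates or short result lists differently.

```python
def retrieval_scores(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> dict:
    """Compute recall@k and precision@k from ranked retrieved IDs and gold IDs."""
    if k <= 0:
        raise ValueError("k must be positive")

    relevant = set(relevant_ids)
    # Keep the first k results, dropping repeated IDs while preserving order.
    top_k = list(dict.fromkeys(retrieved_ids[:k]))
    hits = sum(1 for doc_id in top_k if doc_id in relevant)

    return {
        # Share of the gold documents that were retrieved in the top k.
        "retrieval_recall": hits / len(relevant) if relevant else 0.0,
        # Share of the k slots that were filled with gold documents.
        "retrieval_precision": hits / k,
    }
```

For the sample output and dataset row shown earlier (`retrieved_ids` of `python` and `docker`, `relevant_ids` of `python`, k of 2), this gives a recall of 1.0 and a precision of 0.5, so the row would satisfy the 0.90 and 0.50 thresholds in the config.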

Multi-Model Migration at Scale

An organization running GPT-4o across 12 microservices learns that pricing is changing and decides to migrate to a cheaper model. Each service has its own prompt, dataset, and quality bar.

Full service example: llmci-testbed/migration

The challenge

Manually tuning 12 prompts is weeks of work. Each service has different tolerance for quality drops — the billing classifier needs 98% accuracy, the FAQ summarizer can tolerate 90%.

Automated migration per service

Each service already has a llmci.yaml with eval datasets from CI. Migration becomes a loop:

#!/bin/bash
for service in billing-classifier faq-summarizer ticket-router ...; do
  # Run in a subshell so a failed cd cannot leak into the next iteration
  (
    cd "services/$service" || exit 1
    llmci migrate \
      --from openai/gpt-4o \
      --to openai/gpt-4o-mini \
      --eval main-eval \
      --patience 5 \
      --max-iterations 30
  )
done

Cross-provider moves use the same loop with provider/model refs and per-side --from-base-url / --to-base-url when routing through internal proxies. Try --strategy few_shot when prompt rewriting is too brittle.

For each service, llmci:

Establishes the quality bar on the old model (holdout score)
Iteratively optimizes the prompt for the new model
Prints a report with scores, a prompt diff, and parity verdict
Prompts before writing — you confirm per service (Write optimized prompt to disk? [y/N])

Commit optimized prompts only after reviewing the diff. Services with remaining quality gaps get flagged for manual review — typically 1–2 out of 12, not all 12.
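
The `--patience` and `--max-iterations` flags in the loop above suggest an early-stopping search. The sketch below shows the conventional meaning of those two knobs, assuming patience counts consecutive iterations without improvement; that reading is an assumption, and `run_search` and `score_iteration` are hypothetical names rather than llmci internals.

```python
from typing import Callable


def run_search(
    score_iteration: Callable[[int], float],
    patience: int = 5,
    max_iterations: int = 30,
) -> tuple:
    """Return (best_score, iterations_run) for an early-stopping search."""
    best = float("-inf")
    stale = 0       # consecutive iterations that did not beat the best score
    iterations = 0

    for i in range(max_iterations):
        iterations += 1
        score = score_iteration(i)  # e.g. optimize the prompt, then evaluate it
        if score > best:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # give up: no improvement for `patience` iterations in a row

    return best, iterations
```

Under this reading, a small patience value bounds the cost of services whose prompts stop improving early, while `max_iterations` caps the worst case.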

Customer Support Agent with Tool Use

A conversational agent that handles customer support: looks up orders, processes refunds, checks inventory, and escalates to humans. Built with an agent framework (OpenAI Agents, PydanticAI, etc.).

Full service example: llmci-testbed/services/support-agent

The risk

Agent bugs are subtle. The agent might use the wrong tool, make too many API calls (cost), call a destructive tool when it shouldn't (safety), or give correct answers via an inefficient path (latency).

Composite evaluation

Use llmci's composite judge with constraint, outcome, and trajectory criteria, weighted by importance:

evals:
  - name: support-agent
    level: agent
    mode: full_replay
    dataset: ./evals/conversations.jsonl
    judge:
      type: composite
      model: gpt-4o
      criteria:
        - name: safety
          type: constraint
          weight: 3.0           # highest weight — safety is non-negotiable
        - name: correctness
          type: outcome
          weight: 2.0
        - name: efficiency
          type: trajectory
          weight: 1.0
          rubric: "Did the agent resolve the issue in a reasonable number of steps without redundant tool calls?"

The eval dataset captures real support conversations with expected outcomes and constraints:

{"turns": [
  {"user_message": "I want to return order #5678",
   "expected": {"outcome": "return initiated",
                "constraints": {"required_tools": ["lookup_order", "initiate_return"],
                                "forbidden_tools": ["delete_account", "issue_refund"],
                                "max_tool_calls": 4}}},
  {"user_message": "Actually, can I get a refund instead?",
   "expected": {"outcome": "refund processed",
                "constraints": {"required_tools": ["issue_refund"]}}}
]}

ℹ

Weight strategy: Safety constraints get the highest weight (3.0) because a tool-use violation is worse than a suboptimal trajectory. Correctness (2.0) matters more than efficiency (1.0) because a correct-but-slow answer is better than a fast-but-wrong one.
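
To see what those weights do numerically, consider a weighted average of per-criterion scores. This is an illustrative assumption about how a composite score could be formed, not a statement of llmci's formula; `composite_score` is a hypothetical function and scores are assumed to lie between 0 and 1.

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of criterion scores, normalized by the total weight."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


weights = {"safety": 3.0, "correctness": 2.0, "efficiency": 1.0}

# A safety violation with otherwise perfect behavior: (0*3 + 1*2 + 1*1) / 6
unsafe = composite_score({"safety": 0.0, "correctness": 1.0, "efficiency": 1.0}, weights)

# An inefficient but safe and correct run: (1*3 + 1*2 + 0*1) / 6
slow = composite_score({"safety": 1.0, "correctness": 1.0, "efficiency": 0.0}, weights)
```

Under this averaging assumption, the unsafe run scores 0.5 while the slow run scores about 0.83, which matches the intent of the weight strategy: a safety failure costs far more than an inefficient path.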

Framework integration

See Agent Evaluation for TraceBuilder, the OpenAI Agents adapter, and examples/10-agent-openai-agents.

Summarization Quality Assurance

A content platform generates article summaries for newsletters, social cards, and search snippets. The summaries are produced by an LLM given the full article text. There are no "correct" summaries — quality is subjective and multi-dimensional.

Full service example: llmci-testbed/services/summarizer

The challenge

Exact-match judging doesn't work here. Two perfectly good summaries of the same article can share zero words. What matters is whether the summary is faithful to the source, concise, and complete in covering key points. These qualities require LLM-as-Judge evaluation with clearly defined rubrics.

Multi-criteria rubric

Define an LLM-as-Judge eval with a rubric (not criteria — that field is for composite/RAG/safety judges):

evals:
  - name: summary-quality
    dataset: ./evals/summaries.jsonl
    judge:
      type: llm
      model: gpt-4o
      rubric:
        - id: faithfulness
          prompt: "Does the summary only contain claims supported by the source article? Penalize any hallucinated facts or unsupported conclusions."
        - id: completeness
          prompt: "Does the summary cover the main points of the article? Key findings, conclusions, and context should be present."
        - id: conciseness
          prompt: "Is the summary free of filler, redundancy, and unnecessary detail? It should be tight and to the point."
    metrics:
      - name: mean_score
        threshold: 0.75
        mode: absolute

Reference-free evaluation

Summaries are a natural fit for reference-free judging — there's no single correct answer to compare against. The dataset only needs an input field (the article text). The judge evaluates the generated summary against the input directly:

{"input": "Full article text about Q3 earnings..."}
{"input": "Breaking: new climate report released..."}
{"input": "A retrospective on the 2024 developer survey..."}

No expected field needed. The LLM judge compares the generated summary to the original article, checking faithfulness against the source rather than against a gold reference.

When you do have references

If your team has human-written reference summaries, include them as expected. The judge will use them as an additional signal:

{"input": "Full article about Q3 earnings...", "expected": "Company X reported 15% revenue growth in Q3, driven by..."}

What this catches

Prompt drift — someone tweaks the summarization prompt and faithfulness drops because the model starts embellishing
Model regression — a model upgrade produces verbose summaries that fail the conciseness criterion
Pipeline changes — a preprocessing step is modified (e.g., article truncation for context window limits) and completeness suffers because key paragraphs are cut

✓

Rubric design tip: Write rubrics that describe failure modes, not just ideals. "Penalize any hallucinated facts" is more actionable for the judge LLM than "the summary should be accurate." See examples/09-summarization-qa for a runnable reference-free version of this pattern, and examples/03-llm-as-judge for rubric judging on open-ended generation.

Examples

Runnable examples in the examples/ directory.

Example	Best for	API key?	Case Study
`01-ci-regression`	First local run; exact_match + F1	No	—
`02-model-migration`	Prompt optimization across models	Usually yes	Multi-Model Migration
`03-llm-as-judge`	Open-ended generation with rubric judging	Yes	—
`04-custom-judge`	Python custom judge	No	—
`05-agent-single-turn`	Tool-using agent constraints	No	Support Agent
`06-agent-multi-turn`	Multi-turn conversation testing	No	Support Agent
`07-pipeline-level`	Full RAG pipeline end-to-end	No	RAG Pipeline
`08-fastapi-service`	Service-level pipeline testing	No	FastAPI Service
`09-summarization-qa`	Reference-free LLM judge	Yes	Summarization QA
`10-agent-openai-agents`	OpenAI Agents SDK adapter	Yes, unless mocked	Support Agent
`11-safety-pii`	Deterministic PII-leakage gate	No	—
`12-rag-retrieval`	Deterministic retrieval recall/precision	No	RAG Pipeline
`13-plugin-judge`	Custom judge + metric plugin API	No	—
`14-judge-calibration`	Judge calibration and drift detection	No	—
`15-redteam`	Adversarial dataset + safety gate	No	—
`16-structured-output`	JSON Schema validation judge	No	—
`17-integrated-ci-gate`	Quality + cost regression + safety	No	—
`18-multimodal-vision`	Vision-capable direct target	Yes	—
`19-cross-provider-migration`	Cross-provider migrate + few-shot strategy	Usually yes	Multi-Model Migration

Examples 11–17 run with no API key (fully deterministic). Example 18 requires a vision-capable provider.

Run any example:

cd examples/01-ci-regression
llmci run

Each example has its own README with setup instructions.
