Installation

Install llmci from PyPI. Requires Python 3.10 or later. The CLI command is llmci.

pip install llmci

For agent evals with the OpenAI Agents SDK adapter:

pip install 'llmci[agents]'

For development, install from source:

git clone https://github.com/llmci-cli/llmci.git
cd llmci
pip install -e ".[dev]"

Verify your installation:

llmci --version

Privacy: llmci is a CLI you run in your own CI — there is no hosted llmci SaaS and eval data stays in your repo/runner by default. If you configure direct API targets or LLM judges, prompts and outputs are sent to the providers you choose (OpenAI, Anthropic, etc.) using your API keys. Deterministic judges (exact match, RAG retrieval, PII scan, structured JSON Schema) do not call external APIs.


Quickstart

Get up and running in under 5 minutes.

1. Try a deterministic example

Start with the ticket-classifier example. It does not call an LLM provider, so it works without credentials:

git clone https://github.com/llmci-cli/llmci.git
cd llmci/examples/01-ci-regression
llmci run

You should see a passing eval report. This is the smallest llmci loop: a JSONL dataset, a command target, an exact_match judge, and thresholded metrics.

2. Initialize your project

Run llmci init to generate a config and starter dataset interactively:

llmci init

# Prompts you for:
#   Target mode:  command / direct
#   Task type:    classification / open_ended / agent
#   Eval name:    my-eval

This creates llmci.yaml and evals/my-eval.jsonl with starter examples.

If you are not sure what to pick, start with command, classification, and the default eval name. That path is deterministic and does not require an API key.

3. Add your eval data

Edit the generated JSONL file. Each line is one test case:

{"input": "My printer won't connect to wifi", "expected": "hardware"}
{"input": "I need a refund for order #882", "expected": "billing"}
{"input": "How do I reset my password?", "expected": "account"}

Or add examples interactively:

llmci dataset add --name my-eval

Schema rules: required vs optional fields, command I/O, judge config, and eval level values are defined in one place — the Contracts Reference.

4. Connect your target

For command mode, create the adapter script referenced by llmci.yaml. It reads the JSON file passed as --input and writes a JSON object with an output key to --output:

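import argparse
import json


def classify(text: str) -> str:
    # Placeholder for illustration: call your own classifier or pipeline here.
    return "account"


parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--output", required=True)
args = parser.parse_args()

# Read the single example that llmci wrote to the --input path.
with open(args.input) as f:
    row = json.load(f)

# Write the result where llmci expects it: a JSON object with an "output" key.
with open(args.output, "w") as f:
    json.dump({"output": classify(row["input"])}, f)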

For direct API mode, set the provider credential your model needs, for example OPENAI_API_KEY.

5. Run evals

llmci run

You'll see a report like this:

## llmci Eval Report

| Eval   | Metric   | Score | Threshold | Status |
|--------|----------|-------|-----------|--------|
| my-eval| accuracy | 0.950 | ≥ 0.9     | ✅     |

Exit code 0 means all thresholds passed. Exit code 1 means a regression was detected — perfect for CI gates.

Which path should I start with?

| Goal | Start here | API key? |
|------|------------|----------|
| Try llmci locally | examples/01-ci-regression | No |
| Test a classifier or deterministic pipeline | llmci init with command + classification | No |
| Test an LLM prompt directly | llmci init with direct + open_ended | Yes |
| Gate a RAG pipeline | examples/12-rag-retrieval | No for retrieval metrics |
| Add a full CI gate | examples/17-integrated-ci-gate | No |
| Evaluate an agent | examples/05-agent-single-turn | No for constraint checks |
| Migrate models or providers | examples/19-cross-provider-migration | Usually yes |

Recommended path

  1. Classification / deterministic: examples/01-ci-regression (command target + exact_match + accuracy).
  2. CI baselines: run llmci run --update-baseline on main, then --compare-to=origin/main on PRs (Baselines & CI).
  3. Pipeline / RAG: examples/12-rag-retrieval (command target returns contexts / retrieved_ids; dataset rows include relevant_ids).
  4. Integrated gate: examples/17-integrated-ci-gate (quality + cost regression + safety in one config).
  5. Agents, migration, Promptfoo: optional workflows once the core gate is green (Agents, Migration, Promptfoo).

Core Concepts

The mental model behind llmci.

Eval = Unit Test for LLMs

An eval is like a test suite. It has a dataset of input/expected pairs, a target (the thing you're testing), a judge (how to score), and metric thresholds (pass/fail criteria).

Targets

The target is whatever you're testing — a prompt, a script, a full pipeline. llmci sends each input to your target and collects the output. Two modes:

  • Command mode — run any executable. Language-agnostic.
  • Direct mode — call an LLM API directly via litellm.

Judges

A judge scores each output. llmci includes exact match, LLM-as-judge, custom Python functions, and composite judges for agents.

Thresholds

Each metric has a threshold. Two modes:

  • absolute — the score must be at least X (e.g., accuracy ≥ 0.90)
  • max_regression — the drop from baseline must be at most X% (e.g., ≤ 5% drop)

Baselines

A baseline is a snapshot of metric scores (and per-example outputs) stored under .llmci/baselines/{eval_name}.json. PRs compare against baselines to detect regressions. See Baselines & CI for the full workflow.


Contracts Reference

The authoritative reference for eval data, command I/O, judge config, and llmci.yaml fields. Use this section when wiring CI — examples elsewhere link back here.

In this section: Eval levels · Dataset rows · Command I/O · Agent command I/O · Judge config · Eval config · Metric names

Eval level values

| level | Runtime effect | Dataset schema | Required eval fields |
|-------|----------------|----------------|----------------------|
| pipeline (default) | Standard eval loop: load JSONL → run target → judge each example | Standard JSONL (input + optional expected) | name, dataset, judge, metrics |
| prompt | Same as pipeline; a documentation label for prompt-only testing | Standard JSONL | Same as pipeline |
| agent | Agent runner: single- or multi-turn scenarios, trace output, composite judge | Agent JSONL (separate schema) | Above + level: agent, command target, composite judge; optional mode |

Only agent changes loader, target runner, and dataset shape. prompt vs pipeline is for humans reading the config — pick whichever label matches what you are testing.

Dataset rows (JSONL file format)

  • One JSON object per line (JSONL). Blank lines are ignored.
  • Each line must be a JSON object (not an array or string).
  • Standard evals and agent evals use different row shapes (see below).
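
The row rules above can be sketched as a small loader. This is an illustration only, not llmci's actual loader; the function name load_jsonl_rows is hypothetical.

```python
import json


def load_jsonl_rows(text: str) -> list[dict]:
    """Parse JSONL text: one JSON object per line, blank lines skipped."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        value = json.loads(line)
        if not isinstance(value, dict):
            # Arrays, strings, and numbers are valid JSON but not valid rows.
            raise ValueError(
                f"line {line_number}: expected a JSON object, "
                f"got {type(value).__name__}"
            )
        rows.append(value)
    return rows
```

Running a check like this locally before committing a dataset can catch a stray array or string row early.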

Quick reference

| Eval type | Required fields | Optional fields | Example JSONL row |
|-----------|-----------------|-----------------|-------------------|
| Classification | input, expected | id, metadata, category | {"input": "Refund…", "expected": "billing"} |
| Reference-free LLM judge | input | metadata, custom tags | {"input": "Summarize…"} |
| RAG | input | expected, relevant_ids | {"input": "…", "relevant_ids": ["doc1"]} |
| Safety / red-team | input | attack, category, seed | {"input": "…", "attack": "jailbreak"} |
| Agent single-turn | input, expected | expected.constraints | {"input": {"query": "…"}, "expected": {"outcome": "…"}} |
| Agent multi-turn | turns | conversation_constraints | {"turns": [{"user_message": "…", "expected": {"outcome": "…"}}]} |

Standard eval rows (level: prompt or level: pipeline)

Used for every eval except level: agent. Each row is loaded into input, expected, and an extra bag for additional fields.

| Field | Required? | Type | Notes |
|-------|-----------|------|-------|
| input | Yes | string | Prompt text or user message. Must be a JSON string (not an object) for standard evals. |
| expected | Depends on judge | string | Required for exact_match and classification metrics. May be omitted for llm, rag, safety, pairwise, and structured judges (defaults to ""). |
| Any other key | No | any JSON | Allowed. Stored in extra, merged into command input files, and available to judges (e.g. id, metadata, relevant_ids, images). |

Example rows

| Use case | Example JSONL row |
|----------|-------------------|
| Classification | {"input": "Refund my subscription", "expected": "billing"} |
| Reference-free LLM judge | {"input": "Summarize this article…"} (no expected) |
| RAG retrieval labels | {"input": "What is Python?", "expected": "…", "relevant_ids": ["python"]} |
| Multimodal direct target | {"input": "Describe this", "images": ["fixtures/photo.jpg"]} |
| Red-team metadata | {"input": "…", "attack": "jailbreak", "category": "injection"} |

Command-mode input / output ({input_file} / {output_file})

For each example, llmci writes one JSON object to a temp file and substitutes its path into your command. Your script reads that file — not the JSONL directly.

{
  "input": "user text",
  "expected": "gold label",
  // plus every field from extra, merged at the top level:
  "relevant_ids": ["doc1"],
  "category": "billing"
}

Your command should write one JSON object to {output_file}:

{
  "output": "model or pipeline answer",
  // optional — for cost/token gates:
  "usage": {"tokens_in": 120, "tokens_out": 45},
  "cost": 0.001,
  // optional — for RAG judges (any other keys become judge metadata):
  "contexts": ["passage 1"],
  "retrieved_ids": ["doc1"]
}

Reserved output keys: output, usage, cost. All other keys are passed to judges as metadata.
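
As a rough illustration of the reserved-key rule, the sketch below separates the reserved keys from everything else in a command's output object. The function name split_command_output is hypothetical, and llmci's own parsing may differ.

```python
RESERVED_OUTPUT_KEYS = {"output", "usage", "cost"}


def split_command_output(result: dict) -> tuple[dict, dict]:
    """Separate reserved keys from the keys that become judge metadata."""
    if "output" not in result:
        raise ValueError("command output must contain an 'output' key")
    reserved = {k: v for k, v in result.items() if k in RESERVED_OUTPUT_KEYS}
    metadata = {k: v for k, v in result.items() if k not in RESERVED_OUTPUT_KEYS}
    return reserved, metadata
```

With the example output above, contexts and retrieved_ids would land in the metadata half, which is where a RAG judge would look for them.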

Agent eval rows (level: agent)

Agent datasets use a separate schema. Requires a composite judge and a command-mode target. Set mode: full_replay (default) or history_injection on the eval.

Single-turn

{"input": {"query": "Return order #5678"}, "expected": {
  "outcome": "return initiated",
  "constraints": {
    "required_tools": ["lookup_order", "initiate_return"],
    "forbidden_tools": ["delete_account"],
    "max_tool_calls": 4
  }
}}

input may be a string or object. The command receives the object (or {"input": "…"} if string).

Multi-turn

{"turns": [
  {"user_message": "What's my order status?", "expected": {"outcome": "status shown"}},
  {"user_message": "Cancel it", "expected": {"outcome": "cancelled",
    "constraints": {"required_tools": ["cancel_order"]}}}
]}

Agent command input by mode

Agent evals also write one JSON object to {input_file} per invocation. Output is trace JSON (see Agent Evaluation).

Single-turn

If dataset input is a string, the command receives {"input": "…"}. If it is an object, that object is written as-is:

{"query": "Return order #5678"}

Multi-turn — full_replay (one command call per turn)

Turn 0 — empty history:

{
  "user_message": "What's my order status?",
  "history": [],
  "turn_index": 0
}

Turn 1 — history includes prior user/assistant messages from actual agent outputs:

{
  "user_message": "Cancel it",
  "history": [
    {"role": "user", "content": "What's my order status?"},
    {"role": "assistant", "content": "Order #1234 is shipped."}
  ],
  "turn_index": 1
}

Multi-turn — history_injection (one command call total)

Prior turns are pre-filled with placeholder assistant text "(prior response)"; only the final user message is executed. For a two-turn scenario:

{
  "user_message": "Cancel it",
  "history": [
    {"role": "user", "content": "What's my order status?"},
    {"role": "assistant", "content": "(prior response)"}
  ],
  "turn_index": 1
}

Optional per-turn context from the dataset is merged into the input object when present.
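
The two multi-turn payload shapes can be reproduced with a short sketch. The helper names are hypothetical and the optional per-turn context merge is left out; the field names and the "(prior response)" placeholder are taken from the examples above.

```python
PLACEHOLDER_REPLY = "(prior response)"


def full_replay_input(user_messages: list[str], replies: list[str], turn_index: int) -> dict:
    """Build the input for one full_replay call.

    replies holds the agent's actual outputs for the turns before turn_index.
    """
    history = []
    for message, reply in zip(user_messages[:turn_index], replies[:turn_index]):
        history.append({"role": "user", "content": message})
        history.append({"role": "assistant", "content": reply})
    return {
        "user_message": user_messages[turn_index],
        "history": history,
        "turn_index": turn_index,
    }


def history_injection_input(user_messages: list[str]) -> dict:
    """Build the single history_injection input: prior turns get placeholder replies."""
    last = len(user_messages) - 1
    return full_replay_input(user_messages, [PLACEHOLDER_REPLY] * last, last)
```

A mock agent command can be exercised against both shapes this way before wiring it into an eval.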

Judge config schema

type: llm uses rubric only — not criteria. The word "criteria" in prose refers to rubric items evaluated pass/fail.

| Judge type | Required fields | Scoring config field | Shape |
|------------|-----------------|----------------------|-------|
| exact_match | | | Shorthand: judge: exact_match |
| llm | model, rubric | rubric | String, or list of {id, prompt} |
| custom | module, function | | Python file with evaluate(input, expected, actual); all three args are strings (see Judges) |
| composite | criteria | criteria | List of {name, type, weight, …}; trajectory entries may include a nested rubric string |
| rag | criteria | criteria | List of RAG criterion objects (retrieval_recall, faithfulness, …) |
| safety | criteria | criteria | List of safety criterion objects (pii_leakage, toxicity, …) |
| pairwise | model | rubric (optional) | String comparison instruction |
| structured | json_schema | json_schema | Inline schema or path to .json file |

llmci.yaml eval fields

| Field | Required? | Description |
|-------|-----------|-------------|
| evals[].name | Yes | Eval identifier; baseline filename and report label. |
| evals[].dataset | Yes | Path to JSONL, or S3/HTTPS {source, cache} object. |
| evals[].judge | Yes | Judge config (see table above). |
| evals[].metrics | Yes (for gating) | List of {name, threshold, mode}; names must match computed metrics. |
| evals[].level | No | pipeline (default), prompt, or agent. |
| evals[].mode | No | Agent only: full_replay (default) or history_injection. |
| evals[].target | No | Override root target for this eval only. |
| target | Yes | command or provider+model (+ optional prompt_file, base_url). |
| settings | No | parallelism, timeout_per_call, retries, sampling, price_overrides, etc. |

Metric names

Threshold name must match a computed metric. Built-in aggregates:

accuracy, pass_rate, rubric_pass_rate (alias of pass_rate for LLM rubrics), mean_score, median_score, min_score, max_score, error_rate, f1_macro, f1_micro, f1_weighted, precision_*, recall_*, latency_mean, latency_p50, latency_p90, latency_p99, cost_total, cost_mean, tokens_in_mean, tokens_out_mean, tokens_total_mean, cosine_similarity.

Multi-criterion judges also expose each criterion by name as a metric (e.g. retrieval_recall, pii_leakage, win_rate, faithfulness). Plugins may register additional metric names.


llmci.yaml

The config file defines your target, evals, and settings. Field-level contracts (dataset rows, judge shapes, metric names) are in the Contracts Reference.

version: 1

target:
  command: "python3 run.py --input {input_file} --output {output_file}"

evals:
  - name: ticket-classification
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.90
        mode: absolute

settings:
  parallelism: 5
  timeout_per_call: 30
  retries: 1

Remote datasets (S3 / HTTPS)

Datasets can live outside the repo. Use a URI string or the object form with optional caching:

evals:
  - name: ticket-classification
    dataset: s3://company-evals/tickets.jsonl

  - name: response-quality
    dataset:
      source: https://example.com/evals/quality.jsonl
      cache: true

S3 downloads use your normal AWS credentials (env vars, ~/.aws/credentials, or IAM role in CI). Install the optional extra: pip install 'llmci[s3]'. Cached files are stored in .llmci/cache/datasets/.

| Field | Type | Description |
|-------|------|-------------|
| version | int | Config version. Always 1. |
| target | object | What to test. See Targets. |
| evals | list | One or more eval definitions. |
| evals[].name | string | Eval identifier; used in reports and baseline filenames. |
| evals[].level | string | prompt, pipeline (labels only), or agent (separate dataset schema). Default pipeline. |
| evals[].dataset | string \| object | Path to JSONL, or S3/HTTPS source. See Dataset Schemas. |
| evals[].judge | object \| string | Scoring method. See Judges. |
| evals[].metrics | list | Thresholds to gate on. Names must match computed metrics. |
| evals[].mode | string | Agent only: full_replay or history_injection. |
| settings | object | Parallelism, timeouts, retries, sampling, price overrides. |

Targets

Define what llmci tests — a script, a service, or a direct LLM call.

Command Mode

Wrap any executable. Your script receives a JSON input file and writes a JSON output file:

target:
  command: "python3 my_script.py --input {input_file} --output {output_file}"

See Contracts Reference for the exact input/output JSON contracts. In short: llmci writes one merged JSON object per example to {input_file} (including input, expected, and any extra dataset fields); your script writes a JSON object with at least output to {output_file}.

Command mode is language-agnostic. Your script can be Python, Node.js, Go, a Docker container — anything that reads/writes JSON files.

Direct API Mode

Call an LLM provider directly via litellm:

target:
  direct:
    provider: openai
    model: gpt-4o-mini
  prompt_file: prompt.txt

The prompt file uses {input} as a placeholder:

Classify this ticket into: hardware, billing, account, software.

Respond with only the category name.

Ticket: {input}

Set API credentials via environment variables (e.g., OPENAI_API_KEY). All litellm-supported providers work: OpenAI, Anthropic, Azure, Bedrock, Vertex, Ollama, etc.

Dataset rows can include images and/or audio (paths relative to the dataset file, or HTTPS URLs) for multimodal direct targets. See examples/18-multimodal-vision.

Custom Base URL / Proxy

If your organization uses an internal LLM proxy or gateway, set base_url to route requests through it:

target:
  direct:
    provider: openai
    model: gpt-4o
    base_url: https://llm-proxy.internal.company.com/v1
  prompt_file: prompt.txt

Alternatively, you can set the base URL via environment variables (e.g., OPENAI_API_BASE).


Judges

Judges score each example by comparing the target's output against the expected value. Config shapes and field names: Contracts Reference — Judge config.

Exact Match

For classification and deterministic tasks:

judge: exact_match

Strips whitespace and compares strings. Score is 1.0 for match, 0.0 for mismatch.
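
A minimal sketch of that behavior, assuming the stripping applies to leading and trailing whitespace on both sides and that the comparison is case-sensitive (neither detail is spelled out here):

```python
def exact_match_score(expected: str, actual: str) -> float:
    """Return 1.0 when the stripped strings are identical, else 0.0."""
    return 1.0 if expected.strip() == actual.strip() else 0.0
```

Under these assumptions a trailing newline from the target does not cause a mismatch, but "Billing" versus "billing" does.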

LLM-as-Judge

For open-ended tasks where there's no single correct answer:

judge:
  type: llm
  model: gpt-4o
  rubric:
    - id: accuracy
      prompt: "Is the response factually correct?"
    - id: completeness
      prompt: "Does the response fully address the question?"

The judge LLM evaluates each criterion independently (pass/fail), and the final score is the fraction of criteria passed. Responses are cached to avoid redundant API calls.

You can also use a single-string rubric for simpler setups:

judge:
  type: llm
  model: gpt-4o-mini
  rubric: "Is the response accurate, complete, and well-written?"

Reference-free evaluation

LLM judges don't require a reference answer. If your dataset only has input fields (no expected), the judge evaluates the output purely against the input and rubric. This is useful for:

  • Tone and style checking ("Is the response professional and empathetic?")
  • Safety evaluation ("Does the response contain harmful content?")
  • Format validation ("Is the response valid JSON with the required fields?")
  • Relevance checking ("Does the response address the user's question?")

# Dataset without expected: inputs only
{"input": "Write me a professional email declining a meeting"}
{"input": "Explain quantum computing to a 10 year old"}

judge:
  type: llm
  model: gpt-4o-mini
  rubric:
    - id: tone
      prompt: "Is the response appropriately professional?"
    - id: relevance
      prompt: "Does the response directly address what the user asked for?"

When a reference answer is provided, the judge sees both the expected and actual outputs for comparison. When it's omitted, the judge evaluates the output on its own merits against the rubric criteria.

Custom Judge

Write your own scoring logic in Python:

judge:
  type: custom
  module: ./my_judge.py
  function: evaluate

Your function receives three strings — llmci always passes string values at the Python boundary:

def evaluate(input: str, expected: str, actual: str) -> dict:
    # input / expected: from the dataset row (expected is "" when omitted)
    # actual: target output string
    return {"score": 1.0, "reason": "Looks good"}

Standard eval datasets require input as a JSON string. Use type hints like Any only if you parse JSON inside the function yourself.

Return a dict with score (0.0–1.0) and optionally reason.
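
As one possible my_judge.py, the sketch below scores a label match while ignoring case and surrounding whitespace. The matching rule is an illustrative choice; only the evaluate(input, expected, actual) signature and the score / reason keys come from the contract above.

```python
def evaluate(input: str, expected: str, actual: str) -> dict:
    """Score 1.0 when labels match, ignoring case and surrounding whitespace."""
    want = expected.strip().lower()
    got = actual.strip().lower()
    if not want:
        # expected is "" when the dataset row omits it.
        return {"score": 0.0, "reason": "no expected label in dataset row"}
    if got == want:
        return {"score": 1.0, "reason": "labels match"}
    return {"score": 0.0, "reason": f"expected {want!r}, got {got!r}"}
```

The parameter name input shadows the Python builtin, which is harmless here and keeps the signature identical to the documented one.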

Composite Judge (Agents)

Combine multiple evaluation criteria for agent workflows:

judge:
  type: composite
  criteria:
    - name: constraints
      type: constraint
      weight: 1.0
    - name: outcome
      type: outcome
      weight: 2.0
    - name: trajectory
      type: trajectory
      weight: 1.0
      rubric: "Did the agent use tools efficiently?"

See Agent Evaluation for details.

RAG Judge

First-class metrics for retrieval-augmented pipelines. Each criterion surfaces as a gateable metric by name:

judge:
  type: rag
  model: gpt-4o-mini
  criteria:
    - {name: faithfulness,        type: faithfulness}
    - {name: retrieval_recall,    type: retrieval_recall,    k: 5}
    - {name: retrieval_precision, type: retrieval_precision, k: 5}

Command targets write structured output; gold retrieval labels use relevant_ids on each dataset row. Retrieval criteria are deterministic (no API key). See the RAG case study and examples/12-rag-retrieval.
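
For intuition, the retrieval criteria can be sketched with the textbook definitions of recall@k and precision@k. These formulas are an assumption about what llmci computes, and the function names are hypothetical; edge cases such as an empty relevant_ids list may be handled differently by the real judge.

```python
def retrieval_recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumption: no gold labels means nothing to recall
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def retrieval_precision_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the top k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Divides by the number actually retrieved, which may be fewer than k.
    return hits / len(top_k)
```

Both functions need only retrieved_ids from the command output and relevant_ids from the dataset row, which is why no API key is involved.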

Safety Judge

Gate on PII leakage, toxicity, and jailbreak resistance. Higher scores are safer:

judge:
  type: safety
  model: gpt-4o-mini
  criteria:
    - {name: pii_leakage,          type: pii_leakage}
    - {name: jailbreak_resistance, type: jailbreak_resistance}

pii_leakage is deterministic — it scans for emails, phones, SSNs, credit cards, IPv4, and AWS keys. Narrow with categories: [email, ssn] or exempt known-safe values with allow_list: [support@acme.com] / allow_list: [regex:@example\.com$]. Generate adversarial inputs with llmci redteam generate (examples/15-redteam).
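
To show how the two allow_list forms might interact with a scan, here is a sketch limited to the email category. The regular expression and helper names are illustrative assumptions, not llmci's detectors; only the literal-value and regex: prefix forms come from the text above.

```python
import re

# Simplified email pattern for illustration; real detectors are usually stricter.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_allowed(value: str, allow_list: list[str]) -> bool:
    """True if value equals a literal entry or matches a 'regex:' entry."""
    for entry in allow_list:
        if entry.startswith("regex:"):
            if re.search(entry[len("regex:"):], value):
                return True
        elif entry == value:
            return True
    return False


def find_email_leaks(text: str, allow_list: list[str]) -> list[str]:
    """Return email addresses found in text that the allow list does not exempt."""
    return [m for m in EMAIL_PATTERN.findall(text) if not is_allowed(m, allow_list)]
```

An empty result corresponds to the safe case; how llmci turns findings into a score is not covered by this sketch.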

Pairwise Judge

Compare each output against the baseline output for the same input and report a win_rate metric. Position-swap averaging is applied by default to reduce LLM position bias. Requires --compare-to (or committed baselines with per-example outputs).

Structured-Output Judge

Validate JSON output against a JSON Schema (inline or a .json file). Deterministic, no API key. See examples/16-structured-output.


Metrics

Metrics aggregate per-example judge scores into a single number.

Score-Based

| Metric | Description | Best for |
|--------|-------------|----------|
| accuracy | Fraction of examples with score = 1.0 | Classification |
| pass_rate | Fraction of examples with score ≥ 0.5 | Open-ended tasks |
| mean_score | Average judge score across all examples | Rubric-based evaluation |
| median_score | Median judge score (robust to outliers) | Rubric-based evaluation |
| min_score | Lowest score in the dataset | Worst-case analysis |
| max_score | Highest score in the dataset | Sanity checks |
| error_rate | Fraction of examples that errored (timeout, API failure) | Reliability monitoring |
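
The score-based aggregates follow directly from their definitions. The sketch below computes them from a list of per-example judge scores; it is not llmci's implementation, the function name is hypothetical, and error_rate is omitted because it depends on how errored examples are recorded.

```python
from statistics import mean, median


def score_metrics(scores: list[float]) -> dict[str, float]:
    """Aggregate per-example judge scores into the score-based metrics."""
    if not scores:
        raise ValueError("at least one score is required")
    total = len(scores)
    return {
        "accuracy": sum(1 for s in scores if s == 1.0) / total,
        "pass_rate": sum(1 for s in scores if s >= 0.5) / total,
        "mean_score": mean(scores),
        "median_score": median(scores),
        "min_score": min(scores),
        "max_score": max(scores),
    }
```

For a binary judge such as exact_match, accuracy, pass_rate, and mean_score coincide; they diverge once a judge returns partial scores.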

Classification

| Metric | Description | Best for |
|--------|-------------|----------|
| f1_macro | Macro-averaged F1 across categories | Balanced multi-class |
| f1_micro | Micro-averaged F1 (global TP/FP/FN) | Imbalanced datasets |
| f1_weighted | Weighted F1 by class support | Imbalanced datasets |
| precision_macro | Macro-averaged precision | When false positives are costly |
| precision_micro | Micro-averaged precision | Imbalanced datasets |
| precision_weighted | Weighted precision by class support | Imbalanced datasets |
| recall_macro | Macro-averaged recall | When false negatives are costly |
| recall_micro | Micro-averaged recall | Imbalanced datasets |
| recall_weighted | Weighted recall by class support | Imbalanced datasets |

Similarity

| Metric | Description | Best for |
|--------|-------------|----------|
| cosine_similarity | Token-overlap cosine similarity (bag-of-words) | Text generation, translation |
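
A bag-of-words cosine similarity can be sketched as follows. The tokenization (lowercasing and whitespace splitting) is an assumption; llmci may tokenize differently.

```python
import math
from collections import Counter


def cosine_similarity(expected: str, actual: str) -> float:
    """Cosine of the angle between the two token-count vectors."""
    a = Counter(expected.lower().split())
    b = Counter(actual.lower().split())
    # Counter returns 0 for tokens missing from b.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string shares nothing with anything
    return dot / (norm_a * norm_b)
```

Because word order is discarded, this measure rewards shared vocabulary rather than meaning, so it works best alongside a judge-based metric.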

Latency

| Metric | Description | Best for |
|--------|-------------|----------|
| latency_mean | Average response time (ms) | Performance budgets |
| latency_p50 | Median response time (ms) | Typical performance |
| latency_p90 | 90th percentile response time (ms) | Tail latency |
| latency_p99 | 99th percentile response time (ms) | Worst-case latency |

Cost & Tokens (lower is better)

| Metric | Description | Best for |
|--------|-------------|----------|
| cost_total / cost_mean | Total and per-example cost (USD) | Cost regression gates |
| tokens_in_mean / tokens_out_mean | Average input/output tokens | Token budget monitoring |
| tokens_total_mean | Average combined token usage | Overall spend drivers |

Direct targets read usage from the provider. When litellm cannot price a model (internal proxies), set settings.price_overrides with per-model input_per_token / output_per_token USD rates. Command targets can opt in by adding usage and cost to output JSON. See examples/17-integrated-ci-gate for a stacked quality + cost + safety gate.
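
The arithmetic behind a per-token price override is simple enough to sketch. The dictionary layout below (model name mapped to input_per_token / output_per_token) is an assumed shape for illustration; check the actual settings.price_overrides schema before relying on it.

```python
def cost_from_overrides(model: str, usage: dict, price_overrides: dict) -> float:
    """Per-example cost in USD from token counts and per-token rates."""
    rates = price_overrides[model]
    return (
        usage["tokens_in"] * rates["input_per_token"]
        + usage["tokens_out"] * rates["output_per_token"]
    )
```

A command target could use the same calculation to fill in the cost field of its output JSON.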

Judge sub-scores

RAG, safety, pairwise, and composite judges expose each criterion as a gateable metric by name — e.g. faithfulness, retrieval_recall, pii_leakage, win_rate.

Threshold Modes

Absolute

The metric must meet a fixed threshold:

- name: accuracy
  threshold: 0.90
  mode: absolute   # accuracy must be ≥ 0.90

Max Regression

The drop from the baseline must not exceed a percentage:

- name: accuracy
  threshold: 0.05
  mode: max_regression   # at most 5% drop from baseline

max_regression thresholds require a stored baseline. Run llmci run --update-baseline on your main branch first. For lower-is-better metrics (cost, tokens, latency, error_rate), absolute checks invert (value must be ≤ threshold) and max_regression fails on a rise past the threshold.
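
The two threshold modes, including the inversion for lower-is-better metrics, can be sketched as below. This assumes max_regression is measured as a relative change against the baseline; whether llmci uses relative or absolute points is not stated here, and the helper names are hypothetical.

```python
LOWER_IS_BETTER_PREFIXES = ("cost", "tokens", "latency")


def lower_is_better(metric: str) -> bool:
    return metric == "error_rate" or metric.startswith(LOWER_IS_BETTER_PREFIXES)


def passes_absolute(metric: str, value: float, threshold: float) -> bool:
    """Higher-is-better metrics need value >= threshold; the rest need value <= threshold."""
    if lower_is_better(metric):
        return value <= threshold
    return value >= threshold


def passes_max_regression(metric: str, value: float, baseline: float, threshold: float) -> bool:
    """Pass when the metric worsened by at most `threshold` relative to the baseline."""
    # Positive "worsening" means the metric moved in the bad direction.
    worsening = value - baseline if lower_is_better(metric) else baseline - value
    if worsening <= 0:
        return True  # unchanged or improved
    if baseline == 0:
        return False  # assumption: any worsening from a zero baseline fails
    return worsening / abs(baseline) <= threshold
```

With a 0.05 threshold, accuracy falling from 0.90 to 0.86 passes (about a 4.4% relative drop), while p90 latency rising from 1000 ms to 1100 ms fails.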

Monorepos and multiple configs

Use llmci discover to find every llmci.yaml in a repo, then run all discovered configs with one command:

llmci discover
llmci run --all
llmci run --all --root services/ticket-classifier
llmci run --all --include "services/**" --exclude "services/summarizer/llmci.yaml"

Filters are matched against discovered config paths and can be repeated when you need to include or exclude several service folders.


Baselines & CI

Store baseline scores and detect regressions in pull requests.

Storing baselines

Baselines live at .llmci/baselines/{eval_name}.json — one file per eval, containing aggregate metrics, per-example outputs/scores, timestamp, and commit SHA.

Initialize on main (after your eval passes):

llmci run --update-baseline
git add .llmci/baselines/ && git commit -m "Add eval baselines"

Comparing on pull requests

Three ways to load baselines (first match wins when multiple are available):

  1. --compare-to=origin/main: read baseline files from a git ref (typical CI: checkout with fetch-depth: 0).
  2. Committed files in .llmci/baselines/ on the current branch: loaded automatically when you omit --compare-to.
  3. No baseline: absolute thresholds still work; max_regression and pairwise judges warn and skip.

llmci run --compare-to=origin/main

On main pushes, re-run with --update-baseline to refresh committed baselines after intentional changes.

Baselines store per-example outputs; regressed examples show an Output Diffs vs Baseline section in markdown and HTML reports.

Output formats

PR comments stay markdown. For GitLab, Bitbucket, Azure DevOps, Jenkins, or artifacts, use --output-format:

llmci run --output-format junit --output results.xml
llmci run --output-format sarif --output results.sarif
llmci run --output-format html  --output report.html
llmci run --output-format json  --output results.json

Response caching

Direct API targets cache responses under .llmci/cache/responses/ (keyed on provider, model, prompt, input). Judge LLM calls for RAG, safety, and pairwise share .llmci/cache/judges/. Use --no-cache or --refresh-cache to bypass or rebuild.

GitHub Actions

llmci auto-detects GitHub Actions and posts eval results as a PR comment.

Single job (composite action)

For one eval config per workflow run:

# .github/workflows/llmci.yml
name: llmci Evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # required for --compare-to git baselines
      - uses: llmci-cli/llmci@main
        with:
          compare-to: origin/main
          llmci-version: 0.4.1
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Matrix jobs (multiple services)

When parallel matrix jobs each run evals, set LLMCI_REPORT_SLICE so every job merges its report into one PR comment instead of overwriting the others:

strategy:
  matrix:
    include:
      - { service: ticket-classifier, config: llmci.yaml }
      - { service: ticket-classifier, config: llmci-gate.yaml }
      - { service: rag-qa, config: llmci.yaml }
steps:
  - uses: actions/checkout@v4
    with:
      fetch-depth: 0
  - run: pip install llmci
  - name: Run eval
    working-directory: services/${{ matrix.service }}
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      LLMCI_REPORT_SLICE: ${{ matrix.service }}/${{ matrix.config }}
    run: llmci run --config ${{ matrix.config }} --compare-to=origin/main

On main pushes, use --update-baseline instead of --compare-to. See the full pattern in llmci-testbed.

llmci uses a hidden HTML comment to find its own PR comments. Re-running the action updates the existing comment instead of creating duplicates. With LLMCI_REPORT_SLICE, each matrix job updates its own slice in that single comment.


Model Migration

Re-tune your prompt when switching models or providers.

When you move from one model to another — or across providers (OpenAI → Anthropic, etc.) — your prompt may need adjustments to maintain quality. llmci automates this:

llmci migrate \
  --from openai/gpt-4o-mini \
  --to anthropic/claude-3-haiku-20240307 \
  --eval ticket-classification \
  --optimizer-model openai/gpt-4o

Strategies

| --strategy | What it does |
|------------|--------------|
| prompt (default) | Iteratively rewrite the prompt from failure examples |
| few_shot | Greedily add train examples as inline few-shot demos (--max-few-shot) |

Per-provider proxies: --from-base-url, --to-base-url, --optimizer-base-url. See examples/19-cross-provider-migration.

How it works

  1. Split — dataset is split into train (70%), validation (15%), holdout (15%), stratified by category
  2. Baseline — evaluate the original prompt on the holdout set with the source model (this is the quality bar)
  3. Optimize — iteratively improve the prompt (rewrite or few-shot selection), evaluated on train/validation with the target model
  4. Stop — early stopping when validation score plateaus
  5. Report — evaluate the best prompt on holdout with the target model for an honest final score
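
The split step can be illustrated with a small stratified splitter. It assumes the stratification key is each row's category field, falling back to expected; the rounding and shuffling details are illustrative and may not match llmci's.

```python
import random
from collections import defaultdict


def stratified_split(rows: list[dict], seed: int = 0,
                     train: float = 0.70, validation: float = 0.15) -> dict[str, list[dict]]:
    """Split rows into train/validation/holdout, keeping category proportions."""
    by_category = defaultdict(list)
    for row in rows:
        # Assumption: stratify on "category", else on the expected label.
        by_category[row.get("category", row.get("expected", ""))].append(row)

    rng = random.Random(seed)  # seeded so the split is reproducible
    splits = {"train": [], "validation": [], "holdout": []}
    for category in sorted(by_category):
        group = by_category[category][:]
        rng.shuffle(group)
        n_train = round(len(group) * train)
        n_val = round(len(group) * validation)
        splits["train"] += group[:n_train]
        splits["validation"] += group[n_train:n_train + n_val]
        splits["holdout"] += group[n_train + n_val:]  # remainder is the holdout
    return splits
```

Splitting per category keeps rare labels represented in the holdout set, so the final score is not dominated by the majority class.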

Options

| Flag | Default | Description |
|------|---------|-------------|
| --patience | 3 | Iterations without improvement before stopping |
| --max-iterations | 20 | Maximum optimization iterations |
| --min-improvement | 0.005 | Minimum score improvement to reset patience |
| --max-edit-distance | none | Reject prompts that change too much |
| --max-few-shot | 5 | Cap for few_shot strategy |
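
How --patience, --max-iterations, and --min-improvement interact can be sketched with a generic early-stopping loop. Here a list of precomputed validation scores stands in for the optimize-and-evaluate step, and the function name is hypothetical; llmci's exact bookkeeping may differ.

```python
def run_with_patience(scores: list[float], patience: int = 3,
                      max_iterations: int = 20,
                      min_improvement: float = 0.005) -> tuple[float, int]:
    """Return (best_score, iterations_run) for a sequence of validation scores."""
    best = float("-inf")
    stale = 0  # consecutive iterations without a large enough improvement
    iterations = 0
    for score in scores[:max_iterations]:
        iterations += 1
        if score - best >= min_improvement:
            best = score
            stale = 0  # a real improvement resets patience
        else:
            stale += 1
            if stale >= patience:
                break  # validation score has plateaued
    return best, iterations
```

In this sketch, gains smaller than min_improvement neither reset patience nor replace the best score, which guards against chasing noise in a small validation set.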

Writing changes safely

Migration is designed to be non-destructive by default:

  1. Read-only optimization — llmci reads target.prompt_file but never writes without confirmation.
  2. Report with diff — stdout includes scores, parity verdict, a unified diff of original vs optimized prompt, and iteration history.
  3. Confirm before write — only if the prompt changed, llmci prompts Write optimized prompt to disk? [y/N]. Answer N for a dry run.
  4. No automatic backup — commit your prompt to git before migrating; rollback is git checkout -- prompt.txt (or decline the write).

Requires target.prompt_file in direct API mode. The few_shot strategy inlines selected train examples into the prompt text — same confirm-before-write flow applies.


Agent Evaluation

Test tool-using and conversational agents with composite judging. Dataset and command I/O contracts: Contracts Reference — Agent command input.

Agent Scenarios

Agent eval datasets use a different format from standard JSONL. See the quick-reference table in Contracts Reference. Single-turn:

{"input": {"query": "What's the weather?"}, "expected": {"outcome": "weather info", "constraints": {"max_tool_calls": 3, "required_tools": ["get_weather"]}}}

Multi-turn:

{"turns": [{"user_message": "Check my order", "expected": {"outcome": "order status"}}, {"user_message": "Cancel it", "expected": {"outcome": "cancellation confirmation"}}]}

Agent Trace Format

Your agent command must output a trace JSON:

{
  "final_output": "Your order has been cancelled.",
  "trace": [
    {"step": 1, "type": "tool_call", "tool": "cancel_order", "args": {"id": "1234"}},
    {"step": 2, "type": "response", "content": "Order cancelled."}
  ],
  "total_tool_calls": 1,
  "total_tokens": 150
}

Building trace output

Agent evals invoke your agent as a command that reads input JSON and writes output JSON. Use TraceBuilder for mocks and custom frameworks, or the OpenAI Agents adapter for SDK runs.

TraceBuilder (any framework)

from llmci.trace import TraceBuilder

tb = TraceBuilder()
tb.tool("get_weather", {"city": "London"}, result="58°F cloudy", tokens=25)
tb.response("It's 58°F and cloudy in London.")
output = tb.to_dict()   # write to {output_file}

OpenAI Agents SDK adapter

from llmci.integrations.openai_agents import run_for_llmci_sync

result = run_for_llmci_sync(build_agent(), {"query": "Weather in Tokyo?"})
# result: final_output, trace, total_tool_calls, total_tokens

Requires pip install 'llmci[agents]' and OPENAI_API_KEY. For CI without an API key, use MOCK_LLM=1 with a TraceBuilder mock — see examples/10-agent-openai-agents.

Composite Judge Criteria

Type | How it works | Requires LLM
constraint | Checks tool call budgets, required/forbidden tools, token limits | No
outcome | LLM evaluates if the final output matches the expected outcome | Yes
trajectory | LLM evaluates the execution path against a rubric | Yes
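
To make the constraint row concrete, here is a rough sketch of a deterministic check over a trace. It is an illustration, not llmci's implementation: check_constraints is a hypothetical helper, and it only covers the constraint keys that appear in this document's dataset examples (max_tool_calls, required_tools, forbidden_tools).

```python
def check_constraints(trace: list[dict], constraints: dict) -> list[str]:
    """Return human-readable violations; an empty list means the trace passes."""
    tools_used = [step["tool"] for step in trace if step.get("type") == "tool_call"]
    violations = []

    max_calls = constraints.get("max_tool_calls")
    if max_calls is not None and len(tools_used) > max_calls:
        violations.append(f"{len(tools_used)} tool calls exceeds budget of {max_calls}")

    for tool in constraints.get("required_tools", []):
        if tool not in tools_used:
            violations.append(f"required tool not called: {tool}")

    for tool in constraints.get("forbidden_tools", []):
        if tool in tools_used:
            violations.append(f"forbidden tool called: {tool}")

    return violations


trace = [
    {"step": 1, "type": "tool_call", "tool": "lookup_order", "args": {"id": "5678"}},
    {"step": 2, "type": "tool_call", "tool": "issue_refund", "args": {"id": "5678"}},
    {"step": 3, "type": "response", "content": "Refund issued."},
]
constraints = {
    "required_tools": ["lookup_order", "initiate_return"],
    "forbidden_tools": ["delete_account", "issue_refund"],
    "max_tool_calls": 4,
}
violations = check_constraints(trace, constraints)
for violation in violations:
    print(violation)
```

Because nothing here calls a model, a check of this kind can run in CI without credentials, which is the property the table attributes to constraint criteria.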

Multi-Turn Modes

  • full_replay — command is invoked once per turn with cumulative conversation history built from real prior outputs
  • history_injection — command is invoked once; prior turns are injected with placeholder assistant replies

Exact JSON written to {input_file} for each mode is documented in Agent command input by mode.
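
The difference between the two modes is easiest to see as control flow. The sketch below is a simplified model under stated assumptions: the role/content message shape, the PLACEHOLDER text, and the reading of history_injection as a single call for the final turn are all hypothetical. The real payloads are the ones documented in Agent command input by mode.

```python
from typing import Callable

Message = dict[str, str]
Agent = Callable[[list[Message]], str]  # takes a conversation, returns a reply

PLACEHOLDER = "[assistant reply omitted]"  # hypothetical placeholder text


def full_replay(agent: Agent, user_messages: list[str]) -> list[str]:
    """Invoke the agent once per turn, feeding back its own earlier replies."""
    history: list[Message] = []
    replies = []
    for message in user_messages:
        history.append({"role": "user", "content": message})
        reply = agent(list(history))
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies


def history_injection(agent: Agent, user_messages: list[str]) -> str:
    """Invoke the agent once, with placeholder replies standing in for prior turns."""
    history: list[Message] = []
    for message in user_messages[:-1]:
        history.append({"role": "user", "content": message})
        history.append({"role": "assistant", "content": PLACEHOLDER})
    history.append({"role": "user", "content": user_messages[-1]})
    return agent(history)


calls = []


def echo_agent(conversation: list[Message]) -> str:
    """Stand-in agent that records how much history it received."""
    calls.append(len(conversation))
    return f"reply to: {conversation[-1]['content']}"


turns = ["Check my order", "Cancel it"]
replay_replies = full_replay(echo_agent, turns)
injected_reply = history_injection(echo_agent, turns)
print(calls)
```

In this model full_replay costs one agent invocation per turn but sees the agent's real earlier answers, while history_injection is cheaper and hides them behind placeholders.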


Dataset Tools

Create, curate, and analyze eval datasets.

Initialize a dataset

llmci dataset init --name my-eval --type classification

Creates an empty evals/my-eval.jsonl file.

Add examples interactively

llmci dataset add --name my-eval

Prompts for input/expected pairs and appends them to the dataset.

Check dataset quality

llmci dataset check --name my-eval

Reports category distribution, underrepresented categories, duplicate inputs, class imbalance, and input length statistics.
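
For intuition about what such a report measures, the same kinds of statistics can be approximated with the standard library. This is a minimal sketch over an in-memory dataset, not the command's actual implementation. The sample rows reuse the Quickstart examples plus one deliberate duplicate.

```python
import json
import statistics
from collections import Counter

rows_jsonl = """\
{"input": "My printer won't connect to wifi", "expected": "hardware"}
{"input": "I need a refund for order #882", "expected": "billing"}
{"input": "How do I reset my password?", "expected": "account"}
{"input": "I need a refund for order #882", "expected": "billing"}
"""

rows = [json.loads(line) for line in rows_jsonl.splitlines() if line.strip()]

# Category distribution: how many rows carry each expected label.
categories = Counter(row["expected"] for row in rows)

# Duplicate inputs: any input text that appears more than once.
input_counts = Counter(row["input"] for row in rows)
duplicates = [text for text, n in input_counts.items() if n > 1]

# Input length statistics, in characters.
lengths = [len(row["input"]) for row in rows]
length_stats = {"min": min(lengths), "max": max(lengths), "mean": statistics.mean(lengths)}

print(categories.most_common())
print(duplicates)
print(length_stats)
```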

Import from CSV or JSON

llmci dataset import --name my-eval --from data.csv
llmci dataset import --name my-eval --from data.json --input-column question --expected-column answer
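
If you need to preprocess a file before importing it, the column mapping is simple to reproduce by hand. The sketch below is an assumption about the shape of the conversion, not the importer's code: csv_to_jsonl is a hypothetical helper, and the default column names input and expected are guesses.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str, input_column: str = "input", expected_column: str = "expected") -> str:
    """Convert CSV text to JSONL rows with "input" and "expected" keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for record in reader:
        row = {"input": record[input_column], "expected": record[expected_column]}
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines) + "\n"


csv_text = "question,answer\nHow do I reset my password?,account\nWhere is my invoice?,billing\n"
jsonl = csv_to_jsonl(csv_text, input_column="question", expected_column="answer")
print(jsonl, end="")
```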

Troubleshooting First Runs

Common setup errors and the fastest fix.

Symptom | Likely cause | Fix
python3: can't open file 'run.py' | llmci init created a command target, but your adapter script does not exist yet. | Create the script referenced by target.command, or start from examples/01-ci-regression.
Provider auth error, such as missing OPENAI_API_KEY | You chose direct mode or an LLM judge. | Export the provider API key, or use a deterministic command-mode example first.
Dataset parse error | JSONL requires one complete JSON object per line. | Run llmci dataset check --name <eval-name> and fix the reported line.
Eval fails with a low score | The actual output does not match the expected value or threshold. | Inspect the per-example output in the report, then adjust the target, dataset, judge, or threshold.
max_regression is skipped | There is no baseline to compare against. | Run llmci run --update-baseline on your main branch, then compare PRs with --compare-to=origin/main.
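
The dataset parse error row is the one most often caused by hand-editing. As a rough illustration of the "one complete JSON object per line" rule, the sketch below flags lines that break it. find_jsonl_errors is a hypothetical helper, and skipping blank lines is an assumption, since this section does not say whether llmci tolerates them.

```python
import json


def find_jsonl_errors(text: str) -> list[tuple[int, str]]:
    """Return (line_number, message) for every line that is not a JSON object."""
    errors = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(value, dict):
            errors.append((number, "expected a JSON object"))
    return errors


sample = (
    '{"input": "How do I reset my password?", "expected": "account"}\n'
    '{"input": "I need a refund",\n'  # object split across two lines
    ' "expected": "billing"}\n'
    '["not", "an", "object"]\n'
)
errors = find_jsonl_errors(sample)
for number, message in errors:
    print(f"line {number}: {message}")
```

A pretty-printed object that spans several lines is the usual culprit: each fragment fails to parse on its own, so both halves are reported.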

CLI Reference

Command | Description
llmci run | Run evals and report results
llmci discover | List discovered llmci config files
llmci run --all | Run every discovered config
llmci run --all --include "services/**" | Run only discovered configs matching a glob
llmci run --all --exclude "legacy/**" | Skip discovered configs matching a glob
llmci run --smoke | Run on a subset of the dataset
llmci run --update-baseline | Save current scores as baseline
llmci run --compare-to=main | Compare against a baseline branch
llmci run --output report.md | Write report to a file
llmci run --output-format junit|sarif|json|html | Machine-readable or shareable report formats
llmci run --no-cache / --refresh-cache | Bypass or rebuild response/judge caches
llmci run --samples N | Multi-sample runs with statistical aggregation
llmci migrate | Optimize a prompt for a new model or provider
llmci judge calibrate | Measure judge↔human agreement; detect drift
llmci redteam generate | Generate adversarial inputs for safety evals
llmci init | Generate llmci.yaml interactively
llmci dataset init | Create a new eval dataset
llmci dataset add | Add examples interactively
llmci dataset check | Analyze dataset coverage
llmci dataset import | Import from CSV/JSON
llmci import-promptfoo | Convert a Promptfoo config

Global flags:

  • -v / --verbose — Show progress during runs
  • --debug — Full debug logging
  • --version — Show version and exit

GitHub Action

Drop llmci into any GitHub Actions workflow.

- uses: llmci-cli/llmci@main
  with:
    compare-to: origin/main       # baseline branch
    smoke: false                   # run full dataset
    working-directory: .           # dir with llmci.yaml
    llmci-version: 0.4.1            # exact package version
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

For monorepos, use config to point at one service config, or set all: "true" with optional include / exclude globs to run every discovered config.

Input | Default | Description
compare-to | origin/main | Branch to load baselines from
smoke | false | Run on a dataset subset
update-baseline | false | Save current scores as baselines
config | (none) | Path to a specific llmci config file
all | false | Run all discovered config files
root | . | Directory to search when all is true
include | (none) | Newline-separated globs of discovered config paths to include
exclude | (none) | Newline-separated globs of discovered config paths to exclude
working-directory | . | Directory containing llmci.yaml
output | (none) | Write report to a file path
github-token | github.token | Token for posting PR comments
llmci-version | 0.4.1 | Exact llmci package version to install

Migrating from Promptfoo

One command to convert an existing Promptfoo config.

llmci import-promptfoo promptfooconfig.yaml

This converts:

  • providers → target (direct API mode)
  • prompts → prompt_file
  • tests[].assert → metrics with thresholds
  • tests[].vars → JSONL dataset rows

Some Promptfoo features (red teaming plugins, custom providers, JavaScript assertions) are not supported. Warnings are printed during conversion.


FastAPI Classification Service

A common pattern: a FastAPI service that classifies customer support tickets using an LLM. The service has pre-processing (text cleaning, PII redaction) and post-processing (confidence thresholds, fallback routing) around the LLM call.

Full service example: llmci-testbed/services/ticket-classifier

The risk

Any change to the service can affect predictions — not just prompt edits. A developer updating the PII redaction regex might accidentally strip keywords the model relies on. A change to the confidence threshold logic could re-route tickets incorrectly. These bugs don't show up in unit tests.

Prompt-level gating

Test the LLM call in isolation, verifying that the prompt + model produce correct classifications:

version: 1

target:
  direct:
    provider: openai
    model: gpt-4o-mini
  prompt_file: prompts/classify.txt

evals:
  - name: prompt-classification
    level: prompt
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.95
        mode: absolute

This catches prompt regressions fast — no service startup needed, no HTTP overhead. But it misses bugs in the surrounding code.

Service-level gating

Test the full pipeline by hitting the actual FastAPI endpoint. A thin wrapper script calls the service and extracts the classification:

# eval_service.py — llmci command-mode wrapper
import argparse, json, requests

parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--output", required=True)
args = parser.parse_args()

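# Read the single eval example that llmci wrote to the input file
with open(args.input) as f:
    data = json.load(f)

# Call the running service; fail loudly on HTTP errors or a hung request
resp = requests.post("http://localhost:8000/classify", json={"text": data["input"]}, timeout=30)
resp.raise_for_status()
result = resp.json()

# Write the prediction where llmci expects it
with open(args.output, "w") as f:
    json.dump({"output": result["category"]}, f)

The matching llmci.yaml runs the wrapper as a command target:

version: 1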

target:
  command: "python3 eval_service.py --input {input_file} --output {output_file}"

evals:
  - name: service-classification
    level: pipeline
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.92
        mode: absolute
      - name: accuracy
        threshold: 0.03
        mode: max_regression

Now any change — pre-processing, post-processing, prompt, model config — is caught if it degrades the end-to-end classification quality.
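
The two metric modes in the config above can be read as two different comparisons. The sketch below is one plausible interpretation rather than llmci's code: it assumes absolute means "score must be at least the threshold" and max_regression means "score may fall at most threshold below the baseline, measured as an absolute difference". The skipped result mirrors the troubleshooting note that max_regression is skipped when there is no baseline. The function name gate is hypothetical.

```python
from typing import Optional


def gate(score: float, threshold: float, mode: str, baseline: Optional[float] = None) -> str:
    """Return "pass", "fail", or "skipped" for one metric check."""
    if mode == "absolute":
        return "pass" if score >= threshold else "fail"
    if mode == "max_regression":
        if baseline is None:
            return "skipped"  # nothing to compare against yet
        drop = baseline - score
        return "pass" if drop <= threshold else "fail"
    raise ValueError(f"unknown mode: {mode}")


# Baseline accuracy 0.96 on main; the PR scores 0.94, then 0.90.
print(gate(0.94, 0.92, "absolute"))
print(gate(0.94, 0.03, "max_regression", baseline=0.96))
print(gate(0.90, 0.03, "max_regression", baseline=0.96))
print(gate(0.94, 0.03, "max_regression"))
```

Pairing both checks on the same metric gives a floor that never moves plus a guard against sudden drops relative to main.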

Best practice: Run both levels. The prompt-level eval runs in seconds (no service startup). The service-level eval runs in CI after the service is built. Use max_regression mode on the service-level eval so the pipeline can tolerate minor drops from non-prompt changes while still catching significant regressions. See examples/08-fastapi-service for a runnable version of this pattern.

CI workflow

jobs:
  prompt-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: llmci run              # prompt-level (fast, no service needed)
        working-directory: evals/prompt

  service-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker compose up -d
      - run: sleep 5              # wait for service startup
      - run: llmci run --compare-to=main
        working-directory: evals/service

RAG Pipeline with Retrieval + Generation

A retrieval-augmented generation pipeline where a user question is embedded, matched against a vector store, and the top documents are fed as context to an LLM for answer generation.

Full service example: llmci-testbed/services/rag-qa

The risk

Bugs can appear at any stage: the embedding model could be swapped (changing retrieval quality), the chunking strategy could be adjusted (changing what context the LLM sees), or the generation prompt could be edited. Each affects the final answer differently, and prompt-level testing alone won't catch retrieval-side regressions.

Pipeline-level testing with the built-in RAG judge

Test the full pipeline end-to-end. The built-in rag judge scores retrieval quality deterministically and can gate on faithfulness/relevance with an LLM:

target:
  command: "python3 pipeline/run.py --input {input_file} --output {output_file}"

evals:
  - name: rag-qa
    level: pipeline
    dataset: ./evals/qa.jsonl
    judge:
      type: rag
      criteria:
        - {name: retrieval_recall,    type: retrieval_recall,    k: 2}
        - {name: retrieval_precision, type: retrieval_precision, k: 2}
    metrics:
      - {name: retrieval_recall,    threshold: 0.90, mode: absolute}
      - {name: retrieval_precision, threshold: 0.50, mode: absolute}

The pipeline writes structured output; gold labels use relevant_ids per row:

# output JSON from your command target
{"output": "...", "contexts": ["..."], "retrieved_ids": ["python", "docker"]}

# dataset row
{"input": "What is Python?", "relevant_ids": ["python"]}

See examples/12-rag-retrieval and the live testbed service for a deterministic, API-key-free setup.
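
For readers unfamiliar with the two retrieval metrics, here is a sketch using the common textbook definitions of recall@k and precision@k. Whether the built-in rag judge computes them exactly this way (for instance, dividing precision by k or by the number of documents actually returned) is an assumption. The function names simply mirror the criterion types in the config above.

```python
def retrieval_recall(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def retrieval_precision(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    relevant = set(relevant_ids)
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


# Values from the output and dataset rows shown above.
retrieved = ["python", "docker"]
relevant = ["python"]
recall = retrieval_recall(retrieved, relevant, k=2)
precision = retrieval_precision(retrieved, relevant, k=2)
print(recall, precision)
```

Under these definitions the sample row scores recall 1.0 and precision 0.5. With only one relevant document and k set to 2, precision cannot exceed 0.5, which may be why the example config sets that threshold at 0.50.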


Multi-Model Migration at Scale

An organization running GPT-4o across 12 microservices learns that pricing is changing and decides to migrate to a cheaper model. Each service has its own prompt, dataset, and quality bar.

Full service example: llmci-testbed/migration

The challenge

Manually tuning 12 prompts is weeks of work. Each service has different tolerance for quality drops — the billing classifier needs 98% accuracy, the FAQ summarizer can tolerate 90%.

Automated migration per service

Each service already has a llmci.yaml with eval datasets from CI. Migration becomes a loop:

#!/bin/bash
for service in billing-classifier faq-summarizer ticket-router ...; do
  cd services/$service
  llmci migrate \
    --from openai/gpt-4o \
    --to openai/gpt-4o-mini \
    --eval main-eval \
    --patience 5 \
    --max-iterations 30
  cd ../..
done

Cross-provider moves use the same loop with provider/model refs and per-side --from-base-url / --to-base-url when routing through internal proxies. Try --strategy few_shot when prompt rewriting is too brittle.

For each service, llmci:

  1. Establishes the quality bar on the old model (holdout score)
  2. Iteratively optimizes the prompt for the new model
  3. Prints a report with scores, a prompt diff, and parity verdict
  4. Prompts before writing — you confirm per service (Write optimized prompt to disk? [y/N])

Commit optimized prompts only after reviewing the diff. Services with remaining quality gaps get flagged for manual review — typically 1–2 out of 12, not all 12.
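
The --patience and --max-iterations flags suggest an early-stopping search. The sketch below shows the usual meaning of those terms: stop after a fixed number of consecutive non-improving iterations, or at the iteration cap. It is a generic illustration with toy scores, not llmci's optimizer. The optimize function, the candidate names, and the baseline value are all hypothetical.

```python
from typing import Callable, Iterable


def optimize(
    candidates: Iterable[str],
    score_fn: Callable[[str], float],
    patience: int = 5,
    max_iterations: int = 30,
) -> tuple[str, float, int]:
    """Return (best_prompt, best_score, iterations_run)."""
    best_prompt, best_score = "", float("-inf")
    stale = 0  # consecutive iterations without improvement
    iterations = 0
    for prompt in candidates:
        if iterations >= max_iterations or stale >= patience:
            break
        iterations += 1
        score = score_fn(prompt)
        if score > best_score:
            best_prompt, best_score, stale = prompt, score, 0
        else:
            stale += 1
    return best_prompt, best_score, iterations


# Toy scores standing in for holdout accuracy on the new model.
toy_scores = {"v1": 0.80, "v2": 0.88, "v3": 0.87, "v4": 0.86, "v5": 0.88, "v6": 0.95}
best_prompt, best_score, iterations = optimize(
    toy_scores, toy_scores.get, patience=3, max_iterations=30
)
baseline = 0.90  # hypothetical old-model holdout score
print(best_prompt, best_score, iterations)
print("parity" if best_score >= baseline else "flag for manual review")
```

The toy run also shows the trade-off behind patience: the search stops before reaching v6, so a low value saves API calls at the risk of missing a later improvement.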


Customer Support Agent with Tool Use

A conversational agent that handles customer support: looks up orders, processes refunds, checks inventory, and escalates to humans. Built with an agent framework (OpenAI Agents, PydanticAI, etc.).

Full service example: llmci-testbed/services/support-agent

The risk

Agent bugs are subtle. The agent might use the wrong tool, make too many API calls (cost), call a destructive tool when it shouldn't (safety), or give correct answers via an inefficient path (latency).

Composite evaluation

Use llmci's agent evaluation with constraint, outcome, and trajectory judges weighted by importance:

evals:
  - name: support-agent
    level: agent
    mode: full_replay
    dataset: ./evals/conversations.jsonl
    judge:
      type: composite
      model: gpt-4o
      criteria:
        - name: safety
          type: constraint
          weight: 3.0           # highest weight — safety is non-negotiable
        - name: correctness
          type: outcome
          weight: 2.0
        - name: efficiency
          type: trajectory
          weight: 1.0
          rubric: "Did the agent resolve the issue in a reasonable number of steps without redundant tool calls?"

The eval dataset captures real support conversations with expected outcomes and constraints:

{"turns": [
  {"user_message": "I want to return order #5678",
   "expected": {"outcome": "return initiated",
                "constraints": {"required_tools": ["lookup_order", "initiate_return"],
                                "forbidden_tools": ["delete_account", "issue_refund"],
                                "max_tool_calls": 4}}},
  {"user_message": "Actually, can I get a refund instead?",
   "expected": {"outcome": "refund processed",
                "constraints": {"required_tools": ["issue_refund"]}}}
]}

Weight strategy: Safety constraints get the highest weight (3.0) because a tool-use violation is worse than a suboptimal trajectory. Correctness (2.0) matters more than efficiency (1.0) because a correct-but-slow answer is better than a fast-but-wrong one.
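
To see how the weights shift the result, here is a small sketch that assumes the composite score is a weighted mean of per-criterion scores in the range 0 to 1. That aggregation rule is an assumption made for illustration, and composite_score is a hypothetical helper. Check the judge documentation for the actual formula.

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


weights = {"safety": 3.0, "correctness": 2.0, "efficiency": 1.0}

# A correct but slow run versus a fast run that violated a tool constraint.
slow_but_safe = composite_score({"safety": 1.0, "correctness": 1.0, "efficiency": 0.0}, weights)
fast_but_unsafe = composite_score({"safety": 0.0, "correctness": 1.0, "efficiency": 1.0}, weights)
print(round(slow_but_safe, 3), round(fast_but_unsafe, 3))
```

Under this assumption the inefficient but safe run still scores well above the run that broke a constraint, which matches the stated intent of the weights.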

Framework integration

See Agent Evaluation for TraceBuilder, the OpenAI Agents adapter, and examples/10-agent-openai-agents.


Summarization Quality Assurance

A content platform generates article summaries for newsletters, social cards, and search snippets. The summaries are produced by an LLM given the full article text. There are no "correct" summaries — quality is subjective and multi-dimensional.

Full service example: llmci-testbed/services/summarizer

The challenge

Exact-match judging doesn't work here. Two perfectly good summaries of the same article can share zero words. What matters is whether the summary is faithful to the source, concise, and complete in covering key points. These qualities require LLM-as-Judge evaluation with clearly defined rubrics.

Multi-criteria rubric

Define an LLM-as-Judge eval with a rubric (not criteria — that field is for composite/RAG/safety judges):

evals:
  - name: summary-quality
    dataset: ./evals/summaries.jsonl
    judge:
      type: llm
      model: gpt-4o
      rubric:
        - id: faithfulness
          prompt: "Does the summary only contain claims supported by the source article? Penalize any hallucinated facts or unsupported conclusions."
        - id: completeness
          prompt: "Does the summary cover the main points of the article? Key findings, conclusions, and context should be present."
        - id: conciseness
          prompt: "Is the summary free of filler, redundancy, and unnecessary detail? It should be tight and to the point."
    metrics:
      - name: mean_score
        threshold: 0.75
        mode: absolute

Reference-free evaluation

Summaries are a natural fit for reference-free judging — there's no single correct answer to compare against. The dataset only needs an input field (the article text). The judge evaluates the generated summary against the input directly:

{"input": "Full article text about Q3 earnings..."}
{"input": "Breaking: new climate report released..."}
{"input": "A retrospective on the 2024 developer survey..."}

No expected field needed. The LLM judge compares the generated summary to the original article, checking faithfulness against the source rather than against a gold reference.

When you do have references

If your team has human-written reference summaries, include them as expected. The judge will use them as an additional signal:

{"input": "Full article about Q3 earnings...", "expected": "Company X reported 15% revenue growth in Q3, driven by..."}

What this catches

  • Prompt drift — someone tweaks the summarization prompt and faithfulness drops because the model starts embellishing
  • Model regression — a model upgrade produces verbose summaries that fail the conciseness criterion
  • Pipeline changes — a preprocessing step is modified (e.g., article truncation for context window limits) and completeness suffers because key paragraphs are cut

Rubric design tip: Write rubrics that describe failure modes, not just ideals. "Penalize any hallucinated facts" is more actionable for the judge LLM than "the summary should be accurate." See examples/03-llm-as-judge for a runnable version of this pattern.


Examples

Runnable examples in the examples/ directory.

Example | Best for | API key? | Case Study
01-ci-regression | First local run; exact_match + F1 | No | (none)
02-model-migration | Prompt optimization across models | Usually yes | Multi-Model Migration
03-llm-as-judge | Open-ended generation with rubric judging | Yes | (none)
04-custom-judge | Python custom judge | No | (none)
05-agent-single-turn | Tool-using agent constraints | No | Support Agent
06-agent-multi-turn | Multi-turn conversation testing | No | Support Agent
07-pipeline-level | Full RAG pipeline end-to-end | No | RAG Pipeline
08-fastapi-service | Service-level pipeline testing | No | FastAPI Service
09-summarization-qa | Reference-free LLM judge | Yes | Summarization QA
10-agent-openai-agents | OpenAI Agents SDK adapter | Yes, unless mocked | Support Agent
11-safety-pii | Deterministic PII-leakage gate | No | (none)
12-rag-retrieval | Deterministic retrieval recall/precision | No | RAG Pipeline
13-plugin-judge | Custom judge + metric plugin API | No | (none)
14-judge-calibration | Judge calibration and drift detection | No | (none)
15-redteam | Adversarial dataset + safety gate | No | (none)
16-structured-output | JSON Schema validation judge | No | (none)
17-integrated-ci-gate | Quality + cost regression + safety | No | (none)
18-multimodal-vision | Vision-capable direct target | Yes | (none)
19-cross-provider-migration | Cross-provider migrate + few-shot strategy | Usually yes | Multi-Model Migration

Examples 11–17 run with no API key (fully deterministic). Example 18 requires a vision-capable provider.

Run any example:

cd examples/01-ci-regression
llmci run

Each example has its own README with setup instructions.