Installation
Install llmci from PyPI. Requires Python 3.10 or later. The CLI command is llmci.
pip install llmci
For agent evals with the OpenAI Agents SDK adapter:
pip install 'llmci[agents]'
For development, install from source:
git clone https://github.com/llmci-cli/llmci.git
cd llmci
pip install -e ".[dev]"
Verify your installation:
llmci --version
Privacy: llmci is a CLI you run in your own CI — there is no hosted llmci SaaS and eval data stays in your repo/runner by default. If you configure direct API targets or LLM judges, prompts and outputs are sent to the providers you choose (OpenAI, Anthropic, etc.) using your API keys. Deterministic judges (exact match, RAG retrieval, PII scan, structured JSON Schema) do not call external APIs.
Quickstart
Get up and running in under 5 minutes.
1. Try a deterministic example
Start with the ticket-classifier example. It does not call an LLM provider, so it works without credentials:
git clone https://github.com/llmci-cli/llmci.git
cd llmci/examples/01-ci-regression
llmci run
You should see a passing eval report. This is the smallest llmci loop: a JSONL dataset, a command target, an exact_match judge, and thresholded metrics.
2. Initialize your project
Run llmci init to generate a config and starter dataset interactively:
llmci init
# Prompts you for:
#   Target mode: command / direct
#   Task type: classification / open_ended / agent
#   Eval name: my-eval
This creates llmci.yaml and evals/my-eval.jsonl with starter examples.
If you are not sure what to pick, start with command, classification, and the default eval name. That path is deterministic and does not require an API key.
3. Add your eval data
Edit the generated JSONL file. Each line is one test case:
{"input": "My printer won't connect to wifi", "expected": "hardware"}
{"input": "I need a refund for order #882", "expected": "billing"}
{"input": "How do I reset my password?", "expected": "account"}
Or add examples interactively:
llmci dataset add --name my-eval
Schema rules: required vs optional fields, command I/O, judge config, and eval level values are defined in one place — the Contracts Reference.
4. Connect your target
For command mode, create the adapter script referenced by llmci.yaml. It reads the JSON file passed as --input and writes a JSON object with an output key to --output:
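import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--output", required=True)
args = parser.parse_args()

# Read the single example that llmci wrote for this invocation.
with open(args.input) as f:
    row = json.load(f)

# classify() stands in for your own model or pipeline call.
actual = classify(row["input"])

# Write a JSON object with an "output" key for llmci to judge.
with open(args.output, "w") as f:
    json.dump({"output": actual}, f)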
For direct API mode, set the provider credential your model needs, for example OPENAI_API_KEY.
5. Run evals
llmci run
You'll see a report like this:
## llmci Eval Report

| Eval | Metric | Score | Threshold | Status |
|--------|----------|-------|-----------|--------|
| my-eval | accuracy | 0.950 | ≥ 0.9 | ✅ |
Exit code 0 means all thresholds passed. Exit code 1 means a regression was detected — perfect for CI gates.
Which path should I start with?
| Goal | Start here | API key? |
|---|---|---|
| Try llmci locally | examples/01-ci-regression | No |
| Test a classifier or deterministic pipeline | llmci init with command + classification | No |
| Test an LLM prompt directly | llmci init with direct + open_ended | Yes |
| Gate a RAG pipeline | examples/12-rag-retrieval | No for retrieval metrics |
| Add a full CI gate | examples/17-integrated-ci-gate | No |
| Evaluate an agent | examples/05-agent-single-turn | No for constraint checks |
| Migrate models or providers | examples/19-cross-provider-migration | Usually yes |
Recommended path
- Classification / deterministic: examples/01-ci-regression (command target + exact_match + accuracy).
- CI baselines: run llmci run --update-baseline on main, then --compare-to=origin/main on PRs (Baselines & CI).
- Pipeline / RAG: examples/12-rag-retrieval (command target returns contexts / retrieved_ids; dataset rows include relevant_ids).
- Integrated gate: examples/17-integrated-ci-gate (quality + cost regression + safety in one config).
- Agents, migration, Promptfoo: optional workflows once the core gate is green (Agents, Migration, Promptfoo).
Core Concepts
The mental model behind llmci.
Eval = Unit Test for LLMs
An eval is like a test suite. It has a dataset of input/expected pairs, a target (the thing you're testing), a judge (how to score), and metric thresholds (pass/fail criteria).
Targets
The target is whatever you're testing — a prompt, a script, a full pipeline. llmci sends each input to your target and collects the output. Two modes:
- Command mode — run any executable. Language-agnostic.
- Direct mode — call an LLM API directly via litellm.
Judges
A judge scores each output. llmci includes exact match, LLM-as-judge, custom Python functions, and composite judges for agents.
Thresholds
Each metric has a threshold. Two modes:
- absolute: the score must be at least X (e.g., accuracy ≥ 0.90)
- max_regression: the drop from baseline must be at most X% (e.g., ≤ 5% drop)
Baselines
A baseline is a snapshot of metric scores (and per-example outputs) stored under .llmci/baselines/{eval_name}.json. PRs compare against baselines to detect regressions. See Baselines & CI for the full workflow.
Contracts Reference
The authoritative reference for eval data, command I/O, judge config, and llmci.yaml fields. Use this section when wiring CI — examples elsewhere link back here.
In this section: Eval levels · Dataset rows · Command I/O · Agent command I/O · Judge config · Eval config · Metric names
Eval level values
| level | Runtime effect | Dataset schema | Required eval fields |
|---|---|---|---|
| pipeline (default) | Standard eval loop: load JSONL → run target → judge each example | Standard JSONL (input + optional expected) | name, dataset, judge, metrics |
| prompt | Same as pipeline; a documentation label for prompt-only testing | Standard JSONL | Same as pipeline |
| agent | Agent runner: single- or multi-turn scenarios, trace output, composite judge | Agent JSONL (separate schema) | Above + level: agent, command target, composite judge; optional mode |
Only agent changes loader, target runner, and dataset shape. prompt vs pipeline is for humans reading the config — pick whichever label matches what you are testing.
Dataset rows (JSONL file format)
- One JSON object per line (JSONL). Blank lines are ignored.
- Each line must be a JSON object (not an array or string).
- Standard evals and agent evals use different row shapes (see below).
Quick reference
| Eval type | Required fields | Optional fields | Example JSONL row |
|---|---|---|---|
| Classification | input, expected | id, metadata, category | {"input": "Refund…", "expected": "billing"} |
| Reference-free LLM judge | input | metadata, custom tags | {"input": "Summarize…"} |
| RAG | input | expected, relevant_ids | {"input": "…", "relevant_ids": ["doc1"]} |
| Safety / red-team | input | attack, category, seed | {"input": "…", "attack": "jailbreak"} |
| Agent single-turn | input, expected | expected.constraints | {"input": {"query": "…"}, "expected": {"outcome": "…"}} |
| Agent multi-turn | turns | conversation_constraints | {"turns": [{"user_message": "…", "expected": {"outcome": "…"}}]} |
Standard eval rows (level: prompt or level: pipeline)
Used for every eval except level: agent. Each row is loaded into input, expected, and an extra bag for additional fields.
| Field | Required? | Type | Notes |
|---|---|---|---|
| input | Yes | string | Prompt text or user message. Must be a JSON string (not an object) for standard evals. |
| expected | Depends on judge | string | Required for exact_match and classification metrics. May be omitted for llm, rag, safety, pairwise, and structured judges (defaults to ""). |
| Any other key | No | any JSON | Allowed. Stored in extra, merged into command input files, and available to judges (e.g. id, metadata, relevant_ids, images). |
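As a rough illustration of the row rules above, the sketch below parses JSONL text, skips blank lines, rejects rows that are not objects, and splits each row into input, expected, and extra. The function name load_standard_rows and the returned dict shape are assumptions made for this example; they are not llmci's internal API.

```python
import json


def load_standard_rows(text):
    """Parse JSONL text into rows with input, expected, and extra (illustrative only)."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        obj = json.loads(line)
        if not isinstance(obj, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        if not isinstance(obj.get("input"), str):
            raise ValueError(f"line {line_no}: 'input' must be a JSON string")
        # Everything that is not input/expected goes into the extra bag.
        extra = {k: v for k, v in obj.items() if k not in ("input", "expected")}
        rows.append(
            {"input": obj["input"], "expected": obj.get("expected", ""), "extra": extra}
        )
    return rows


sample = (
    '{"input": "Refund my subscription", "expected": "billing"}\n'
    "\n"
    '{"input": "What is Python?", "relevant_ids": ["python"]}\n'
)
rows = load_standard_rows(sample)
```

Here the second row has no expected value, so it falls back to an empty string, and relevant_ids lands in extra.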
Example rows
| Use case | Example JSONL row |
|---|---|
| Classification | {"input": "Refund my subscription", "expected": "billing"} |
| Reference-free LLM judge | {"input": "Summarize this article…"} (no expected) |
| RAG retrieval labels | {"input": "What is Python?", "expected": "…", "relevant_ids": ["python"]} |
| Multimodal direct target | {"input": "Describe this", "images": ["fixtures/photo.jpg"]} |
| Red-team metadata | {"input": "…", "attack": "jailbreak", "category": "injection"} |
Command-mode input / output ({input_file} / {output_file})
For each example, llmci writes one JSON object to a temp file and substitutes its path into your command. Your script reads that file — not the JSONL directly.
{
"input": "user text",
"expected": "gold label",
// plus every field from extra, merged at the top level:
"relevant_ids": ["doc1"],
"category": "billing"
}
Your command should write one JSON object to {output_file}:
{
"output": "model or pipeline answer",
// optional — for cost/token gates:
"usage": {"tokens_in": 120, "tokens_out": 45},
"cost": 0.001,
// optional — for RAG judges (any other keys become judge metadata):
"contexts": ["passage 1"],
"retrieved_ids": ["doc1"]
}
Reserved output keys: output, usage, cost. All other keys are passed to judges as metadata.
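The two contracts above can be sketched as a pair of helpers: one that merges a dataset row into the object written to {input_file}, and one that separates the reserved output keys from judge metadata. Both function names and the row shape (input, expected, extra) are hypothetical and exist only for this illustration.

```python
RESERVED_OUTPUT_KEYS = ("output", "usage", "cost")


def build_command_input(row):
    """Merge extra fields at the top level next to input and expected."""
    payload = dict(row.get("extra", {}))
    payload["input"] = row["input"]
    payload["expected"] = row.get("expected", "")
    return payload


def split_command_output(payload):
    """Separate reserved keys from the keys that become judge metadata."""
    if "output" not in payload:
        raise ValueError("command output must contain an 'output' key")
    reserved = {k: payload[k] for k in RESERVED_OUTPUT_KEYS if k in payload}
    metadata = {k: v for k, v in payload.items() if k not in RESERVED_OUTPUT_KEYS}
    return reserved, metadata


command_input = build_command_input(
    {"input": "user text", "expected": "gold label", "extra": {"relevant_ids": ["doc1"]}}
)
reserved, metadata = split_command_output(
    {"output": "answer", "cost": 0.001, "retrieved_ids": ["doc1"]}
)
```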
Agent eval rows (level: agent)
Agent datasets use a separate schema. Requires a composite judge and a command-mode target. Set mode: full_replay (default) or history_injection on the eval.
Single-turn
{"input": {"query": "Return order #5678"}, "expected": {
"outcome": "return initiated",
"constraints": {
"required_tools": ["lookup_order", "initiate_return"],
"forbidden_tools": ["delete_account"],
"max_tool_calls": 4
}
}}
input may be a string or object. The command receives the object (or {"input": "…"} if string).
Multi-turn
{"turns": [
{"user_message": "What's my order status?", "expected": {"outcome": "status shown"}},
{"user_message": "Cancel it", "expected": {"outcome": "cancelled",
"constraints": {"required_tools": ["cancel_order"]}}}
]}
Agent command input by mode
Agent evals also write one JSON object to {input_file} per invocation. Output is trace JSON (see Agent Evaluation).
Single-turn
If dataset input is a string, the command receives {"input": "…"}. If it is an object, that object is written as-is:
{"query": "Return order #5678"}
Multi-turn — full_replay (one command call per turn)
Turn 0 — empty history:
{
"user_message": "What's my order status?",
"history": [],
"turn_index": 0
}
Turn 1 — history includes prior user/assistant messages from actual agent outputs:
{
"user_message": "Cancel it",
"history": [
{"role": "user", "content": "What's my order status?"},
{"role": "assistant", "content": "Order #1234 is shipped."}
],
"turn_index": 1
}
Multi-turn — history_injection (one command call total)
Prior turns are pre-filled with placeholder assistant text "(prior response)"; only the final user message is executed. For a two-turn scenario:
{
"user_message": "Cancel it",
"history": [
{"role": "user", "content": "What's my order status?"},
{"role": "assistant", "content": "(prior response)"}
],
"turn_index": 1
}
Optional per-turn context from the dataset is merged into the input object when present.
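To make the difference between the two multi-turn modes concrete, here is a minimal sketch that builds the input objects shown above. The helper names are hypothetical, and run_agent is an assumed callable that stands in for invoking your command and returning the assistant's text.

```python
def full_replay_inputs(turns, run_agent):
    """One payload per turn; history grows from the agent's actual replies."""
    history = []
    payloads = []
    for index, turn in enumerate(turns):
        payload = {
            "user_message": turn["user_message"],
            "history": list(history),  # copy so later turns do not mutate it
            "turn_index": index,
        }
        payloads.append(payload)
        reply = run_agent(payload)
        history.append({"role": "user", "content": turn["user_message"]})
        history.append({"role": "assistant", "content": reply})
    return payloads


def history_injection_input(turns, placeholder="(prior response)"):
    """A single payload; prior turns get placeholder assistant replies."""
    history = []
    for turn in turns[:-1]:
        history.append({"role": "user", "content": turn["user_message"]})
        history.append({"role": "assistant", "content": placeholder})
    return {
        "user_message": turns[-1]["user_message"],
        "history": history,
        "turn_index": len(turns) - 1,
    }


turns = [{"user_message": "What's my order status?"}, {"user_message": "Cancel it"}]
replay = full_replay_inputs(turns, lambda payload: "Order #1234 is shipped.")
injected = history_injection_input(turns)
```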
Judge config schema
type: llm uses rubric only — not criteria. The word "criteria" in prose refers to rubric items evaluated pass/fail.
| Judge type | Required fields | Scoring config field | Shape |
|---|---|---|---|
| exact_match | none | none | Shorthand: judge: exact_match |
| llm | model, rubric | rubric | String, or list of {id, prompt} |
| custom | module, function | none | Python file with evaluate(input, expected, actual); all three args are strings (see Judges) |
| composite | criteria | criteria | List of {name, type, weight, …}; trajectory entries may include a nested rubric string |
| rag | criteria | criteria | List of RAG criterion objects (retrieval_recall, faithfulness, …) |
| safety | criteria | criteria | List of safety criterion objects (pii_leakage, toxicity, …) |
| pairwise | model | rubric (optional) | String comparison instruction |
| structured | json_schema | json_schema | Inline schema or path to .json file |
llmci.yaml eval fields
| Field | Required? | Description |
|---|---|---|
| evals[].name | Yes | Eval identifier; baseline filename and report label. |
| evals[].dataset | Yes | Path to JSONL, or S3/HTTPS {source, cache} object. |
| evals[].judge | Yes | Judge config (see table above). |
| evals[].metrics | Yes (for gating) | List of {name, threshold, mode}; names must match computed metrics. |
| evals[].level | No | pipeline (default), prompt, or agent. |
| evals[].mode | No | Agent only: full_replay (default) or history_injection. |
| evals[].target | No | Override root target for this eval only. |
| target | Yes | command or provider+model (+ optional prompt_file, base_url). |
| settings | No | parallelism, timeout_per_call, retries, sampling, price_overrides, etc. |
Metric names
Threshold name must match a computed metric. Built-in aggregates:
accuracy, pass_rate, rubric_pass_rate (alias of pass_rate for LLM rubrics), mean_score, median_score, min_score, max_score, error_rate, f1_macro, f1_micro, f1_weighted, precision_*, recall_*, latency_mean, latency_p50, latency_p90, latency_p99, cost_total, cost_mean, tokens_in_mean, tokens_out_mean, tokens_total_mean, cosine_similarity.
Multi-criterion judges also expose each criterion by name as a metric (e.g. retrieval_recall, pii_leakage, win_rate, faithfulness). Plugins may register additional metric names.
llmci.yaml
The config file defines your target, evals, and settings. Field-level contracts (dataset rows, judge shapes, metric names) are in the Contracts Reference.
version: 1
target:
  command: "python3 run.py --input {input_file} --output {output_file}"
evals:
  - name: ticket-classification
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.90
        mode: absolute
settings:
  parallelism: 5
  timeout_per_call: 30
  retries: 1
Remote datasets (S3 / HTTPS)
Datasets can live outside the repo. Use a URI string or the object form with optional caching:
evals:
  - name: ticket-classification
    dataset: s3://company-evals/tickets.jsonl
  - name: response-quality
    dataset:
      source: https://example.com/evals/quality.jsonl
      cache: true
S3 downloads use your normal AWS credentials (env vars, ~/.aws/credentials, or IAM role in CI). Install the optional extra: pip install 'llmci[s3]'. Cached files are stored in .llmci/cache/datasets/.
| Field | Type | Description |
|---|---|---|
| version | int | Config version. Always 1. |
| target | object | What to test. See Targets. |
| evals | list | One or more eval definitions. |
| evals[].name | string | Eval identifier; used in reports and baseline filenames. |
| evals[].level | string | prompt, pipeline (labels only), or agent (separate dataset schema). Default pipeline. |
| evals[].dataset | string or object | Path to JSONL, or S3/HTTPS source. See Dataset Schemas. |
| evals[].judge | object or string | Scoring method. See Judges. |
| evals[].metrics | list | Thresholds to gate on. Names must match computed metrics. |
| evals[].mode | string | Agent only: full_replay or history_injection. |
| settings | object | Parallelism, timeouts, retries, sampling, price overrides. |
Targets
Define what llmci tests — a script, a service, or a direct LLM call.
Command Mode
Wrap any executable. Your script receives a JSON input file and writes a JSON output file:
target:
  command: "python3 my_script.py --input {input_file} --output {output_file}"
See Contracts Reference for the exact input/output JSON contracts. In short: llmci writes one merged JSON object per example to {input_file} (including input, expected, and any extra dataset fields); your script writes a JSON object with at least output to {output_file}.
Command mode is language-agnostic. Your script can be Python, Node.js, Go, a Docker container — anything that reads/writes JSON files.
Direct API Mode
Call an LLM provider directly via litellm:
target:
  direct:
    provider: openai
    model: gpt-4o-mini
    prompt_file: prompt.txt
The prompt file uses {input} as a placeholder:
Classify this ticket into: hardware, billing, account, software.
Respond with only the category name.
Ticket: {input}
Set API credentials via environment variables (e.g., OPENAI_API_KEY). All litellm-supported providers work: OpenAI, Anthropic, Azure, Bedrock, Vertex, Ollama, etc.
Dataset rows can include images and/or audio (paths relative to the dataset file, or HTTPS URLs) for multimodal direct targets. See examples/18-multimodal-vision.
Custom Base URL / Proxy
If your organization uses an internal LLM proxy or gateway, set base_url to route requests through it:
target:
  direct:
    provider: openai
    model: gpt-4o
    base_url: https://llm-proxy.internal.company.com/v1
    prompt_file: prompt.txt
Alternatively, you can set the base URL via environment variables (e.g., OPENAI_API_BASE).
Judges
Judges score each example by comparing the target's output against the expected value. Config shapes and field names: Contracts Reference — Judge config.
Exact Match
For classification and deterministic tasks:
judge: exact_match
Strips whitespace and compares strings. Score is 1.0 for match, 0.0 for mismatch.
LLM-as-Judge
For open-ended tasks where there's no single correct answer:
judge:
  type: llm
  model: gpt-4o
  rubric:
    - id: accuracy
      prompt: "Is the response factually correct?"
    - id: completeness
      prompt: "Does the response fully address the question?"
The judge LLM evaluates each criterion independently (pass/fail), and the final score is the fraction of criteria passed. Responses are cached to avoid redundant API calls.
You can also use a single-string rubric for simpler setups:
judge:
  type: llm
  model: gpt-4o-mini
  rubric: "Is the response accurate, complete, and well-written?"
Reference-free evaluation
LLM judges don't require a reference answer. If your dataset only has input fields (no expected), the judge evaluates the output purely against the input and rubric. This is useful for:
- Tone and style checking ("Is the response professional and empathetic?")
- Safety evaluation ("Does the response contain harmful content?")
- Format validation ("Is the response valid JSON with the required fields?")
- Relevance checking ("Does the response address the user's question?")
# Dataset without expected: inputs only
{"input": "Write me a professional email declining a meeting"}
{"input": "Explain quantum computing to a 10 year old"}

judge:
  type: llm
  model: gpt-4o-mini
  rubric:
    - id: tone
      prompt: "Is the response appropriately professional?"
    - id: relevance
      prompt: "Does the response directly address what the user asked for?"
When a reference answer is provided, the judge sees both the expected and actual outputs for comparison. When it's omitted, the judge evaluates the output on its own merits against the rubric criteria.
Custom Judge
Write your own scoring logic in Python:
judge:
  type: custom
  module: ./my_judge.py
  function: evaluate
Your function receives three strings — llmci always passes string values at the Python boundary:
def evaluate(input: str, expected: str, actual: str) -> dict:
    # input / expected: from the dataset row (expected is "" when omitted)
    # actual: target output string
    return {"score": 1.0, "reason": "Looks good"}
Standard eval datasets require input as a JSON string. Use type hints like Any only if you parse JSON inside the function yourself.
Return a dict with score (0.0–1.0) and optionally reason.
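For a slightly more realistic custom judge, the sketch below scores a case-insensitive label match and explains mismatches in reason. It follows the evaluate(input, expected, actual) signature and return shape described above; the matching policy itself is only an example, not a built-in llmci behavior.

```python
def evaluate(input: str, expected: str, actual: str) -> dict:
    """Case-insensitive label match that ignores surrounding whitespace."""
    want = expected.strip().lower()
    got = actual.strip().lower()
    if not want:
        # expected is "" when the dataset row omits it
        return {"score": 0.0, "reason": "No expected label in dataset row"}
    if want == got:
        return {"score": 1.0, "reason": "Labels match"}
    return {"score": 0.0, "reason": f"Expected {want!r}, got {got!r}"}


match = evaluate("Refund my subscription", "Billing", " billing\n")
mismatch = evaluate("Refund my subscription", "billing", "account")
```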
Composite Judge (Agents)
Combine multiple evaluation criteria for agent workflows:
judge:
  type: composite
  criteria:
    - name: constraints
      type: constraint
      weight: 1.0
    - name: outcome
      type: outcome
      weight: 2.0
    - name: trajectory
      type: trajectory
      weight: 1.0
      rubric: "Did the agent use tools efficiently?"
See Agent Evaluation for details.
RAG Judge
First-class metrics for retrieval-augmented pipelines. Each criterion surfaces as a gateable metric by name:
judge:
  type: rag
  model: gpt-4o-mini
  criteria:
    - {name: faithfulness, type: faithfulness}
    - {name: retrieval_recall, type: retrieval_recall, k: 5}
    - {name: retrieval_precision, type: retrieval_precision, k: 5}
Command targets write structured output; gold retrieval labels use relevant_ids on each dataset row. Retrieval criteria are deterministic (no API key). See the RAG case study and examples/12-rag-retrieval.
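The retrieval criteria can be pictured with the textbook definitions of recall@k and precision@k, comparing the command's retrieved_ids against the row's relevant_ids. This is a sketch under that assumption; llmci's exact formulas (for example, how it treats empty lists or fewer than k results) are not specified here and may differ.

```python
def retrieval_recall(retrieved_ids, relevant_ids, k):
    """Share of relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumption: no gold labels means nothing to recall
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def retrieval_precision(retrieved_ids, relevant_ids, k):
    """Share of the top k results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


retrieved = ["doc1", "doc7", "doc3"]
relevant = ["doc1", "doc2"]
recall = retrieval_recall(retrieved, relevant, k=5)
precision = retrieval_precision(retrieved, relevant, k=5)
```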
Safety Judge
Gate on PII leakage, toxicity, and jailbreak resistance. Higher scores are safer:
judge:
  type: safety
  model: gpt-4o-mini
  criteria:
    - {name: pii_leakage, type: pii_leakage}
    - {name: jailbreak_resistance, type: jailbreak_resistance}
pii_leakage is deterministic — it scans for emails, phones, SSNs, credit cards, IPv4, and AWS keys. Narrow with categories: [email, ssn] or exempt known-safe values with allow_list: [support@acme.com] / allow_list: [regex:@example\.com$]. Generate adversarial inputs with llmci redteam generate (examples/15-redteam).
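A deterministic scan of this kind might look roughly like the following. The regular expressions, the two categories covered, and the 1.0 / 0.0 scoring are simplifying assumptions for illustration; they are not llmci's actual detectors. The allow_list handling mirrors the literal and regex: forms described above.

```python
import re

# Illustrative patterns only; real detectors are usually stricter.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def is_allowed(value, allow_list):
    """True when value matches a literal entry or a 'regex:' entry."""
    for entry in allow_list:
        if entry.startswith("regex:"):
            if re.search(entry[len("regex:"):], value):
                return True
        elif entry == value:
            return True
    return False


def pii_score(text, categories=("email", "ssn"), allow_list=()):
    """Return (score, findings); 1.0 means no PII found, so higher is safer."""
    findings = []
    for category in categories:
        for match in PATTERNS[category].findall(text):
            if not is_allowed(match, allow_list):
                findings.append((category, match))
    return (0.0 if findings else 1.0), findings


clean = pii_score("Contact support@acme.com", allow_list=["support@acme.com"])
leaky = pii_score("Customer SSN is 123-45-6789")
```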
Pairwise Judge
Compare each output against the baseline output for the same input and report a win_rate metric. Position-swap averaging controls LLM position bias by default. Requires --compare-to (or committed baselines with per-example outputs).
Structured-Output Judge
Validate JSON output against a JSON Schema (inline or a .json file). Deterministic, no API key. See examples/16-structured-output.
Metrics
Metrics aggregate per-example judge scores into a single number.
Score-Based
| Metric | Description | Best for |
|---|---|---|
| accuracy | Fraction of examples with score = 1.0 | Classification |
| pass_rate | Fraction of examples with score ≥ 0.5 | Open-ended tasks |
| mean_score | Average judge score across all examples | Rubric-based evaluation |
| median_score | Median judge score (robust to outliers) | Rubric-based evaluation |
| min_score | Lowest score in the dataset | Worst-case analysis |
| max_score | Highest score in the dataset | Sanity checks |
| error_rate | Fraction of examples that errored (timeout, API failure) | Reliability monitoring |
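The definitions in this table translate directly into a few lines of Python. The sketch below assumes a plain list of per-example judge scores; the function name score_metrics is hypothetical, and error handling is left out.

```python
import statistics


def score_metrics(scores):
    """Aggregate per-example judge scores using the table's definitions."""
    n = len(scores)
    return {
        "accuracy": sum(1 for s in scores if s == 1.0) / n,
        "pass_rate": sum(1 for s in scores if s >= 0.5) / n,
        "mean_score": statistics.mean(scores),
        "median_score": statistics.median(scores),
        "min_score": min(scores),
        "max_score": max(scores),
    }


metrics = score_metrics([1.0, 0.5, 0.0, 1.0])
```

With these four scores, two examples are exact passes (accuracy 0.5) while three clear the 0.5 bar (pass_rate 0.75).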
Classification
| Metric | Description | Best for |
|---|---|---|
| f1_macro | Macro-averaged F1 across categories | Balanced multi-class |
| f1_micro | Micro-averaged F1 (global TP/FP/FN) | Imbalanced datasets |
| f1_weighted | Weighted F1 by class support | Imbalanced datasets |
| precision_macro | Macro-averaged precision | When false positives are costly |
| precision_micro | Micro-averaged precision | Imbalanced datasets |
| precision_weighted | Weighted precision by class support | Imbalanced datasets |
| recall_macro | Macro-averaged recall | When false negatives are costly |
| recall_micro | Micro-averaged recall | Imbalanced datasets |
| recall_weighted | Weighted recall by class support | Imbalanced datasets |
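To show how the three F1 averages differ, here is a small standard-library sketch for single-label classification. It uses the usual textbook definitions and treats every label seen in either list as a class; whether llmci handles edge cases (such as labels that are only predicted) the same way is an assumption.

```python
from collections import Counter


def f1_averages(expected, predicted):
    """Return macro, micro, and support-weighted F1 for single-label data."""
    labels = sorted(set(expected) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for want, got in zip(expected, predicted):
        if want == got:
            tp[want] += 1
        else:
            fp[got] += 1   # predicted class gains a false positive
            fn[want] += 1  # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    per_class = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    support = Counter(expected)
    return {
        "f1_macro": sum(per_class.values()) / len(labels),
        "f1_micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "f1_weighted": sum(per_class[l] * support[l] for l in labels) / len(expected),
    }


result = f1_averages(
    ["billing", "billing", "hardware", "account"],
    ["billing", "hardware", "hardware", "account"],
)
```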
Similarity
| Metric | Description | Best for |
|---|---|---|
| cosine_similarity | Token-overlap cosine similarity (bag-of-words) | Text generation, translation |
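A bag-of-words cosine similarity can be sketched as follows. The tokenizer here (lowercase, split on whitespace) is an assumption; llmci's tokenization is not described in this section.

```python
import math
from collections import Counter


def cosine_similarity(expected, actual):
    """Cosine of the angle between two token-count vectors."""
    a = Counter(expected.lower().split())
    b = Counter(actual.lower().split())
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


same = cosine_similarity("the cat sat", "The cat sat")
half = cosine_similarity("a b", "a c")
none = cosine_similarity("a b", "c d")
```

Because it only counts shared tokens, this metric ignores word order and synonyms, which is worth keeping in mind when choosing a threshold.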
Latency
| Metric | Description | Best for |
|---|---|---|
| latency_mean | Average response time (ms) | Performance budgets |
| latency_p50 | Median response time (ms) | Typical performance |
| latency_p90 | 90th percentile response time (ms) | Tail latency |
| latency_p99 | 99th percentile response time (ms) | Worst-case latency |
Cost & Tokens (lower is better)
| Metric | Description | Best for |
|---|---|---|
| cost_total / cost_mean | Total and per-example cost (USD) | Cost regression gates |
| tokens_in_mean / tokens_out_mean | Average input/output tokens | Token budget monitoring |
| tokens_total_mean | Average combined token usage | Overall spend drivers |
Direct targets read usage from the provider. When litellm cannot price a model (internal proxies), set settings.price_overrides with per-model input_per_token / output_per_token USD rates. Command targets can opt in by adding usage and cost to output JSON. See examples/17-integrated-ci-gate for a stacked quality + cost + safety gate.
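As a worked example of per-token pricing, the sketch below computes one example's cost from a usage object and a price override. The model name internal-gpt and the helper name are hypothetical; the arithmetic (tokens times per-token USD rate, summed for input and output) is the assumption being illustrated.

```python
# Hypothetical override table, shaped like settings.price_overrides.
price_overrides = {
    "internal-gpt": {"input_per_token": 0.000001, "output_per_token": 0.000002},
}


def example_cost(model, usage, overrides):
    """Cost in USD for one example, given token usage and per-token rates."""
    rates = overrides[model]
    return (
        usage["tokens_in"] * rates["input_per_token"]
        + usage["tokens_out"] * rates["output_per_token"]
    )


cost = example_cost("internal-gpt", {"tokens_in": 120, "tokens_out": 45}, price_overrides)
```

A command target could compute a value like this itself and write it to the cost key of its output JSON.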
Judge sub-scores
RAG, safety, pairwise, and composite judges expose each criterion as a gateable metric by name — e.g. faithfulness, retrieval_recall, pii_leakage, win_rate.
Threshold Modes
Absolute
The metric must meet a fixed threshold:
- name: accuracy
  threshold: 0.90
  mode: absolute  # accuracy must be ≥ 0.90
Max Regression
The drop from the baseline must not exceed a percentage:
- name: accuracy
  threshold: 0.05
  mode: max_regression  # at most 5% drop from baseline
max_regression thresholds require a stored baseline. Run llmci run --update-baseline on your main branch first. For lower-is-better metrics (cost, tokens, latency, error_rate), absolute checks invert (value must be ≤ threshold) and max_regression fails on a rise past the threshold.
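The two modes and the lower-is-better inversion can be sketched as a single check. Two points here are assumptions rather than documented behavior: the regression is computed as a relative change against the baseline, and lower-is-better metrics are recognized by name prefix. The function names are hypothetical.

```python
def lower_is_better(name):
    """Cost, token, latency, and error metrics improve as they fall."""
    return name == "error_rate" or name.startswith(("cost", "tokens", "latency"))


def passes(name, value, threshold, mode, baseline=None):
    """Return True/False for a gate, or None when the check is skipped."""
    lower = lower_is_better(name)
    if mode == "absolute":
        return value <= threshold if lower else value >= threshold
    if mode == "max_regression":
        if baseline is None:
            return None  # no stored baseline: warn and skip
        if baseline == 0:
            return value <= 0 if lower else True
        # Relative worsening: a rise for lower-is-better, a drop otherwise.
        change = (value - baseline) / baseline if lower else (baseline - value) / baseline
        return change <= threshold
    raise ValueError(f"unknown mode: {mode}")


small_drop = passes("accuracy", 0.86, 0.05, "max_regression", baseline=0.90)
big_drop = passes("accuracy", 0.80, 0.05, "max_regression", baseline=0.90)
cost_rise = passes("cost_mean", 0.0012, 0.05, "max_regression", baseline=0.001)
```

Under the relative-change assumption, a fall from 0.90 to 0.86 is about a 4.4% drop and passes, while a fall to 0.80 is about 11% and fails.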
Monorepos and multiple configs
Use llmci discover to find every llmci.yaml in a repo, then run all discovered configs with one command:
llmci discover
llmci run --all
llmci run --all --root services/ticket-classifier
llmci run --all --include "services/**" --exclude "services/summarizer/llmci.yaml"
Filters are matched against discovered config paths and can be repeated when you need to include or exclude several service folders.
Baselines & CI
Store baseline scores and detect regressions in pull requests.
Storing baselines
Baselines live at .llmci/baselines/{eval_name}.json — one file per eval, containing aggregate metrics, per-example outputs/scores, timestamp, and commit SHA.
Initialize on main (after your eval passes):
llmci run --update-baseline
git add .llmci/baselines/ && git commit -m "Add eval baselines"
Comparing on pull requests
Three ways to load baselines (first match wins when multiple are available):
- --compare-to=origin/main: read baseline files from a git ref (typical CI: checkout with fetch-depth: 0).
- Committed files in .llmci/baselines/ on the current branch: loaded automatically when you omit --compare-to.
- No baseline: absolute thresholds still work; max_regression and pairwise judges warn and skip.
llmci run --compare-to=origin/main
On main pushes, re-run with --update-baseline to refresh committed baselines after intentional changes.
Baselines store per-example outputs; regressed examples show an Output Diffs vs Baseline section in markdown and HTML reports.
Output formats
PR comments stay markdown. For GitLab, Bitbucket, Azure DevOps, Jenkins, or artifacts, use --output-format:
llmci run --output-format junit --output results.xml
llmci run --output-format sarif --output results.sarif
llmci run --output-format html --output report.html
llmci run --output-format json --output results.json
Response caching
Direct API targets cache responses under .llmci/cache/responses/ (keyed on provider, model, prompt, input). Judge LLM calls for RAG, safety, and pairwise share .llmci/cache/judges/. Use --no-cache or --refresh-cache to bypass or rebuild.
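One way to picture a cache keyed on provider, model, prompt, and input is a hash over those four values. The sketch below is only an illustration of the idea; llmci's actual key format and file layout are not documented in this section, and the function name is hypothetical.

```python
import hashlib
import json


def response_cache_key(provider, model, prompt, input_text):
    """Stable hex digest derived from the four values that identify a call."""
    payload = json.dumps(
        {"provider": provider, "model": model, "prompt": prompt, "input": input_text},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = response_cache_key("openai", "gpt-4o-mini", "Ticket: {input}", "Refund please")
key_b = response_cache_key("openai", "gpt-4o", "Ticket: {input}", "Refund please")
```

Changing any of the four values yields a different key, which is why editing a prompt or switching models naturally triggers fresh calls.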
GitHub Actions
llmci auto-detects GitHub Actions and posts eval results as a PR comment.
Single job (composite action)
For one eval config per workflow run:
# .github/workflows/llmci.yml
name: llmci Evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # required for --compare-to git baselines
      - uses: llmci-cli/llmci@main
        with:
          compare-to: origin/main
          llmci-version: 0.4.1
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Matrix jobs (multiple services)
When parallel matrix jobs each run evals, set LLMCI_REPORT_SLICE so every job merges its report into one PR comment instead of overwriting the others:
strategy:
  matrix:
    include:
      - { service: ticket-classifier, config: llmci.yaml }
      - { service: ticket-classifier, config: llmci-gate.yaml }
      - { service: rag-qa, config: llmci.yaml }
steps:
  - uses: actions/checkout@v4
    with:
      fetch-depth: 0
  - run: pip install llmci
  - name: Run eval
    working-directory: services/${{ matrix.service }}
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      LLMCI_REPORT_SLICE: ${{ matrix.service }}/${{ matrix.config }}
    run: llmci run --config ${{ matrix.config }} --compare-to=origin/main
On main pushes, use --update-baseline instead of --compare-to. See the full pattern in llmci-testbed.
llmci uses a hidden HTML comment to find its own PR comments. Re-running the action updates the existing comment instead of creating duplicates. With LLMCI_REPORT_SLICE, each matrix job updates its own slice in that single comment.
Model Migration
Re-tune your prompt when switching models or providers.
When you move from one model to another — or across providers (OpenAI → Anthropic, etc.) — your prompt may need adjustments to maintain quality. llmci automates this:
llmci migrate \
  --from openai/gpt-4o-mini \
  --to anthropic/claude-3-haiku-20240307 \
  --eval ticket-classification \
  --optimizer-model openai/gpt-4o
Strategies
| --strategy | What it does |
|---|---|
| prompt (default) | Iteratively rewrite the prompt from failure examples |
| few_shot | Greedily add train examples as inline few-shot demos (--max-few-shot) |
Per-provider proxies: --from-base-url, --to-base-url, --optimizer-base-url. See examples/19-cross-provider-migration.
How it works
- Split — dataset is split into train (70%), validation (15%), holdout (15%), stratified by category
- Baseline — evaluate the original prompt on the holdout set with the source model (this is the quality bar)
- Optimize — iteratively improve the prompt (rewrite or few-shot selection), evaluated on train/validation with the target model
- Stop — early stopping when validation score plateaus
- Report — evaluate the best prompt on holdout with the target model for an honest final score
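The split step can be sketched as a per-category shuffle followed by 70/15/15 slicing. The use of a category key on each row, the fixed seed, and the rounding rule are assumptions for this example; the migrate command's real splitting code may differ.

```python
import random
from collections import defaultdict


def stratified_split(rows, seed=0):
    """Split rows into train/validation/holdout, keeping category ratios."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row.get("category")].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, validation, holdout = [], [], []
    for category in sorted(by_category, key=str):
        group = list(by_category[category])
        rng.shuffle(group)
        n_train = round(len(group) * 0.70)
        n_val = round(len(group) * 0.15)
        train += group[:n_train]
        validation += group[n_train:n_train + n_val]
        holdout += group[n_train + n_val:]  # remainder, roughly 15%
    return train, validation, holdout


rows = [{"input": f"ticket {i}", "category": c} for c in ("billing", "hardware") for i in range(20)]
train, validation, holdout = stratified_split(rows)
```

Keeping the holdout slice untouched during optimization is what makes the final score an honest comparison against the source model.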
Options
| Flag | Default | Description |
|---|---|---|
--patience | 3 | Iterations without improvement before stopping |
--max-iterations | 20 | Maximum optimization iterations |
--min-improvement | 0.005 | Minimum score improvement to reset patience |
--max-edit-distance | none | Reject prompts that change too much |
--max-few-shot | 5 | Cap for few_shot strategy |
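How --patience, --min-improvement, and --max-iterations interact can be sketched with a loop over validation scores. This assumes that an iteration counts as an improvement only when it beats the best score so far by at least min_improvement; the function name and return shape are hypothetical.

```python
def run_early_stopping(val_scores, patience=3, min_improvement=0.005, max_iterations=20):
    """Return (best_index, stopped_at) for a sequence of validation scores."""
    best_score = None
    best_index = None
    stale = 0
    stopped_at = None
    for index, score in enumerate(val_scores[:max_iterations]):
        stopped_at = index
        if best_score is None or score - best_score >= min_improvement:
            best_score, best_index, stale = score, index, 0  # reset patience
        else:
            stale += 1
            if stale >= patience:
                break  # validation score has plateaued
    return best_index, stopped_at


best_index, stopped_at = run_early_stopping([0.80, 0.85, 0.851, 0.852, 0.853, 0.90])
```

In this run the tiny gains after iteration 1 never reach 0.005, so patience runs out at iteration 4 and the later 0.90 is never evaluated.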
Writing changes safely
Migration is designed to be non-destructive by default:
- Read-only optimization: llmci reads target.prompt_file but never writes without confirmation.
- Report with diff: stdout includes scores, parity verdict, a unified diff of original vs optimized prompt, and iteration history.
- Confirm before write: only if the prompt changed, llmci prompts Write optimized prompt to disk? [y/N]. Answer N for a dry run.
- No automatic backup: commit your prompt to git before migrating; rollback is git checkout -- prompt.txt (or decline the write).
Requires target.prompt_file in direct API mode. The few_shot strategy inlines selected train examples into the prompt text — same confirm-before-write flow applies.
Agent Evaluation
Test tool-using and conversational agents with composite judging. Dataset and command I/O contracts: Contracts Reference — Agent command input.
Agent Scenarios
Agent eval datasets use a different format from standard JSONL. See the quick-reference table in Contracts Reference. Single-turn:
{"input": {"query": "What's the weather?"}, "expected": {"outcome": "weather info", "constraints": {"max_tool_calls": 3, "required_tools": ["get_weather"]}}}
Multi-turn:
{"turns": [{"user_message": "Check my order", "expected": {"outcome": "order status"}}, {"user_message": "Cancel it", "expected": {"outcome": "cancellation confirmation"}}]}
Agent Trace Format
Your agent command must output a trace JSON:
{
"final_output": "Your order has been cancelled.",
"trace": [
{"step": 1, "type": "tool_call", "tool": "cancel_order", "args": {"id": "1234"}},
{"step": 2, "type": "response", "content": "Order cancelled."}
],
"total_tool_calls": 1,
"total_tokens": 150
}
Building trace output
Agent evals invoke your agent as a command that reads input JSON and writes output JSON. Use TraceBuilder for mocks and custom frameworks, or the OpenAI Agents adapter for SDK runs.
TraceBuilder (any framework)
from llmci.trace import TraceBuilder

tb = TraceBuilder()
tb.tool("get_weather", {"city": "London"}, result="58°F cloudy", tokens=25)
tb.response("It's 58°F and cloudy in London.")
output = tb.to_dict()  # write to {output_file}
OpenAI Agents SDK adapter
from llmci.integrations.openai_agents import run_for_llmci_sync

result = run_for_llmci_sync(build_agent(), {"query": "Weather in Tokyo?"})
# result: final_output, trace, total_tool_calls, total_tokens
Requires pip install 'llmci[agents]' and OPENAI_API_KEY. For CI without an API key, use MOCK_LLM=1 with a TraceBuilder mock — see examples/10-agent-openai-agents.
Composite Judge Criteria
| Type | How it works | Requires LLM |
|---|---|---|
| constraint | Checks tool call budgets, required/forbidden tools, token limits | No |
| outcome | LLM evaluates if the final output matches the expected outcome | Yes |
| trajectory | LLM evaluates the execution path against a rubric | Yes |
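The constraint row is the only deterministic criterion, so it is easy to reason about by hand. Here is a minimal sketch of such a check. It assumes the trace step shape from Agent Trace Format and the constraint keys used in the dataset examples (max_tool_calls, required_tools, forbidden_tools). check_constraints is a hypothetical helper and not llmci's implementation, and token limits are left out because their key name is not shown here.

```python
def check_constraints(trace: list[dict], constraints: dict) -> list[str]:
    """Return a list of constraint violations for one agent run (empty if none)."""
    # Collect the tool names in the order they were called.
    called = [step["tool"] for step in trace if step.get("type") == "tool_call"]
    violations = []

    max_calls = constraints.get("max_tool_calls")
    if max_calls is not None and len(called) > max_calls:
        violations.append(f"{len(called)} tool calls exceeds max_tool_calls={max_calls}")

    for tool in constraints.get("required_tools", []):
        if tool not in called:
            violations.append(f"required tool not called: {tool}")

    for tool in constraints.get("forbidden_tools", []):
        if tool in called:
            violations.append(f"forbidden tool called: {tool}")

    return violations


trace = [
    {"step": 1, "type": "tool_call", "tool": "lookup_order", "args": {"id": "5678"}},
    {"step": 2, "type": "tool_call", "tool": "issue_refund", "args": {"id": "5678"}},
    {"step": 3, "type": "response", "content": "Refund issued."},
]
constraints = {
    "required_tools": ["lookup_order", "initiate_return"],
    "forbidden_tools": ["delete_account", "issue_refund"],
    "max_tool_calls": 4,
}

violations = check_constraints(trace, constraints)
print(violations)
```

In this run the agent stays within the call budget but skips a required tool and calls a forbidden one, so two violations are reported.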
Multi-Turn Modes
- full_replay: command is invoked once per turn with cumulative conversation history built from real prior outputs
- history_injection: command is invoked once; prior turns are injected with placeholder assistant replies
Exact JSON written to {input_file} for each mode is documented in Agent command input by mode.
Dataset Tools
Create, curate, and analyze eval datasets.
Initialize a dataset
llmci dataset init --name my-eval --type classification
Creates an empty evals/my-eval.jsonl file.
Add examples interactively
llmci dataset add --name my-eval
Prompts for input/expected pairs and appends them to the dataset.
Check dataset quality
llmci dataset check --name my-eval
Reports category distribution, underrepresented categories, duplicate inputs, class imbalance, and input length statistics.
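To get a feel for what these numbers mean, the same kind of summary can be computed by hand. The sketch below is a rough approximation using only the standard library. The summarize function and its output keys are hypothetical, and llmci's own report may define each statistic differently.

```python
import json
from collections import Counter


def summarize(lines: list[str]) -> dict:
    """Compute label counts, duplicate inputs, and mean input length for JSONL rows."""
    rows = [json.loads(line) for line in lines if line.strip()]
    labels = Counter(str(row.get("expected")) for row in rows)
    # Serialize inputs so that dict-shaped inputs can be counted as well.
    seen = Counter(json.dumps(row["input"], sort_keys=True) for row in rows)
    duplicates = [json.loads(key) for key, count in seen.items() if count > 1]
    lengths = [len(str(row["input"])) for row in rows]
    return {
        "rows": len(rows),
        "labels": dict(labels),
        "duplicates": duplicates,
        "mean_input_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }


lines = [
    '{"input": "My printer won\'t connect to wifi", "expected": "hardware"}',
    '{"input": "I need a refund for order #882", "expected": "billing"}',
    '{"input": "How do I reset my password?", "expected": "account"}',
    '{"input": "My printer won\'t connect to wifi", "expected": "hardware"}',
]

summary = summarize(lines)
print(summary["labels"])
print(summary["duplicates"])
```

A label that appears far less often than the others, or an input that appears twice, is the kind of signal the check command is meant to surface.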
Import from CSV or JSON
llmci dataset import --name my-eval --from data.csv
llmci dataset import --name my-eval --from data.json --input-column question --expected-column answer
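Conceptually, the column flags map two source columns onto the input and expected fields of each JSONL row. The following sketch shows that mapping for CSV with the standard library. csv_to_jsonl is a hypothetical helper, and the default column names ("input" and "expected") are an assumption rather than documented llmci behavior.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str, input_column: str = "input",
                 expected_column: str = "expected") -> list[str]:
    """Convert CSV text into JSONL lines with "input" and "expected" fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        json.dumps({"input": row[input_column], "expected": row[expected_column]})
        for row in reader
    ]


csv_text = "question,answer\nHow do I reset my password?,account\nI need a refund,billing\n"
rows = csv_to_jsonl(csv_text, input_column="question", expected_column="answer")
for line in rows:
    print(line)
```

Comparing a hand conversion like this with the imported file is a quick way to confirm the right columns were picked.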
Troubleshooting First Runs
Common setup errors and the fastest fix.
| Symptom | Likely cause | Fix |
|---|---|---|
| python3: can't open file 'run.py' | llmci init created a command target, but your adapter script does not exist yet. | Create the script referenced by target.command, or start from examples/01-ci-regression. |
| Provider auth error, such as missing OPENAI_API_KEY | You chose direct mode or an LLM judge. | Export the provider API key, or use a deterministic command-mode example first. |
| Dataset parse error | JSONL requires one complete JSON object per line. | Run llmci dataset check --name <eval-name> and fix the reported line. |
| Eval fails with a low score | The actual output does not match the expected value or threshold. | Inspect the per-example output in the report, then adjust the target, dataset, judge, or threshold. |
| max_regression is skipped | There is no baseline to compare against. | Run llmci run --update-baseline on your main branch, then compare PRs with --compare-to=origin/main. |
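For the dataset parse error row, it can help to see why a line fails: JSONL breaks as soon as one object spans two lines. The sketch below locates such lines with the standard library. find_bad_lines is a hypothetical helper, and skipping blank lines is an assumption, since llmci may treat them differently.

```python
import json


def find_bad_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) for every line that is not one JSON object."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            bad.append((number, exc.msg))
            continue
        if not isinstance(value, dict):
            bad.append((number, "not a JSON object"))
    return bad


# The second record was accidentally wrapped across lines 2 and 3.
text = (
    '{"input": "a", "expected": "x"}\n'
    '{"input": "b", "expected":\n'
    '"y"}\n'
)

bad = find_bad_lines(text)
for number, reason in bad:
    print(f"line {number}: {reason}")
```

Both halves of the wrapped record are reported, which points directly at the lines to rejoin.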
CLI Reference
| Command | Description |
|---|---|
| llmci run | Run evals and report results |
| llmci discover | List discovered llmci config files |
| llmci run --all | Run every discovered config |
| llmci run --all --include "services/**" | Run only discovered configs matching a glob |
| llmci run --all --exclude "legacy/**" | Skip discovered configs matching a glob |
| llmci run --smoke | Run on a subset of the dataset |
| llmci run --update-baseline | Save current scores as baseline |
| llmci run --compare-to=main | Compare against a baseline branch |
| llmci run --output report.md | Write report to a file |
| llmci run --output-format junit\|sarif\|json\|html | Machine-readable or shareable report formats |
| llmci run --no-cache / --refresh-cache | Bypass or rebuild response/judge caches |
| llmci run --samples N | Multi-sample runs with statistical aggregation |
| llmci migrate | Optimize a prompt for a new model or provider |
| llmci judge calibrate | Measure judge↔human agreement; detect drift |
| llmci redteam generate | Generate adversarial inputs for safety evals |
| llmci init | Generate llmci.yaml interactively |
| llmci dataset init | Create a new eval dataset |
| llmci dataset add | Add examples interactively |
| llmci dataset check | Analyze dataset coverage |
| llmci dataset import | Import from CSV/JSON |
| llmci import-promptfoo | Convert a Promptfoo config |
Global flags:
- -v / --verbose: Show progress during runs
- --debug: Full debug logging
- --version: Show version and exit
GitHub Action
Drop llmci into any GitHub Actions workflow.
- uses: llmci-cli/llmci@main
  with:
    compare-to: origin/main   # baseline branch
    smoke: false              # run full dataset
    working-directory: .      # dir with llmci.yaml
    llmci-version: 0.4.1      # exact package version
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
For monorepos, use config to point at one service config, or set all: "true" with optional include / exclude globs to run every discovered config.
| Input | Default | Description |
|---|---|---|
| compare-to | origin/main | Branch to load baselines from |
| smoke | false | Run on a dataset subset |
| update-baseline | false | Save current scores as baselines |
| config | (none) | Path to a specific llmci config file |
| all | false | Run all discovered config files |
| root | . | Directory to search when all is true |
| include | (none) | Newline-separated globs of discovered config paths to include |
| exclude | (none) | Newline-separated globs of discovered config paths to exclude |
| working-directory | . | Directory containing llmci.yaml |
| output | (none) | Write report to a file path |
| github-token | github.token | Token for posting PR comments |
| llmci-version | 0.4.1 | Exact llmci package version to install |
Migrating from Promptfoo
One command to convert an existing Promptfoo config.
llmci import-promptfoo promptfooconfig.yaml
This converts:
- providers → target (direct API mode)
- prompts → prompt_file
- tests[].assert → metrics with thresholds
- tests[].vars → JSONL dataset rows
Some Promptfoo features (red teaming plugins, custom providers, JavaScript assertions) are not supported. Warnings are printed during conversion.
FastAPI Classification Service
A common pattern: a FastAPI service that classifies customer support tickets using an LLM. The service has pre-processing (text cleaning, PII redaction) and post-processing (confidence thresholds, fallback routing) around the LLM call.
Full service example: llmci-testbed/services/ticket-classifier
The risk
Any change to the service can affect predictions — not just prompt edits. A developer updating the PII redaction regex might accidentally strip keywords the model relies on. A change to the confidence threshold logic could re-route tickets incorrectly. These bugs don't show up in unit tests.
Prompt-level gating
Test the LLM call in isolation, verifying that the prompt + model produce correct classifications:
version: 1
target:
  direct:
    provider: openai
    model: gpt-4o-mini
  prompt_file: prompts/classify.txt
evals:
  - name: prompt-classification
    level: prompt
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.95
        mode: absolute
This catches prompt regressions fast — no service startup needed, no HTTP overhead. But it misses bugs in the surrounding code.
Service-level gating
Test the full pipeline by hitting the actual FastAPI endpoint. A thin wrapper script calls the service and extracts the classification:
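# eval_service.py: llmci command-mode wrapper
import argparse
import json

import requests

parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--output", required=True)
args = parser.parse_args()

# Read the eval input that llmci wrote to {input_file}
with open(args.input) as f:
    data = json.load(f)

# Call the running service; fail loudly on HTTP errors rather than writing a bad result
resp = requests.post("http://localhost:8000/classify", json={"text": data["input"]})
resp.raise_for_status()
result = resp.json()

# Write the predicted category to {output_file} for the judge
with open(args.output, "w") as f:
    json.dump({"output": result["category"]}, f)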
version: 1
target:
  command: "python3 eval_service.py --input {input_file} --output {output_file}"
evals:
  - name: service-classification
    level: pipeline
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: accuracy
        threshold: 0.92
        mode: absolute
      - name: accuracy
        threshold: 0.03
        mode: max_regression
Now any change — pre-processing, post-processing, prompt, model config — is caught if it degrades the end-to-end classification quality.
Best practice: Run both levels. The prompt-level eval runs in seconds (no service startup). The service-level eval runs in CI after the service is built. Use max_regression mode on the service-level eval so the pipeline can tolerate minor drops from non-prompt changes while still catching significant regressions. See examples/08-fastapi-service for a runnable version of this pattern.
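The two metric modes answer different questions: absolute asks whether the score clears a fixed floor, while max_regression asks how far it fell from the baseline. The sketch below shows one plausible reading of that arithmetic. It assumes inclusive comparisons and an absolute (not relative) drop, neither of which is confirmed here, and both function names are hypothetical.

```python
def passes_absolute(score: float, threshold: float) -> bool:
    """Pass when the score is at or above a fixed floor."""
    return score >= threshold


def passes_max_regression(score: float, baseline: float, max_drop: float) -> bool:
    """Pass when the score fell by no more than max_drop relative to the baseline."""
    # A small epsilon avoids spurious failures from floating-point subtraction.
    return (baseline - score) <= max_drop + 1e-9


baseline = 0.95  # score saved on the main branch

for current in (0.93, 0.91):
    ok_floor = passes_absolute(current, threshold=0.92)
    ok_drop = passes_max_regression(current, baseline, max_drop=0.03)
    print(f"score={current}: absolute={ok_floor}, max_regression={ok_drop}")
```

With the thresholds from the service-level config, a score of 0.93 passes both gates against a 0.95 baseline, while 0.91 fails both: it is under the 0.92 floor and more than 0.03 below the baseline.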
CI workflow
jobs:
  prompt-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: llmci run  # prompt-level (fast, no service needed)
        working-directory: evals/prompt
  service-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker compose up -d
      - run: sleep 5  # wait for service startup
      - run: llmci run --compare-to=main
        working-directory: evals/service
RAG Pipeline with Retrieval + Generation
A retrieval-augmented generation pipeline where a user question is embedded, matched against a vector store, and the top documents are fed as context to an LLM for answer generation.
Full service example: llmci-testbed/services/rag-qa
The risk
Bugs can appear at any stage: the embedding model could be swapped (changing retrieval quality), the chunking strategy could be adjusted (changing what context the LLM sees), or the generation prompt could be edited. Each affects the final answer differently, and prompt-level testing alone won't catch retrieval-side regressions.
Pipeline-level testing with the built-in RAG judge
Test the full pipeline end-to-end. The built-in rag judge scores retrieval quality deterministically and can gate on faithfulness/relevance with an LLM:
target:
  command: "python3 pipeline/run.py --input {input_file} --output {output_file}"
evals:
  - name: rag-qa
    level: pipeline
    dataset: ./evals/qa.jsonl
    judge:
      type: rag
      criteria:
        - {name: retrieval_recall, type: retrieval_recall, k: 2}
        - {name: retrieval_precision, type: retrieval_precision, k: 2}
    metrics:
      - {name: retrieval_recall, threshold: 0.90, mode: absolute}
      - {name: retrieval_precision, threshold: 0.50, mode: absolute}
The pipeline writes structured output; gold labels use relevant_ids per row:
# output JSON from your command target
{"output": "...", "contexts": ["..."], "retrieved_ids": ["python", "docker"]}
# dataset row
{"input": "What is Python?", "relevant_ids": ["python"]}
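To see how retrieved_ids and relevant_ids turn into scores, here is a sketch of the textbook recall@k and precision@k definitions applied to the row above. This is an illustration only: llmci's rag judge may handle edge cases differently (for example, dividing precision by k rather than by the number of ids actually retrieved), and the function names are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(set(relevant))


def precision_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are relevant."""
    top = set(retrieved[:k])
    if not top:
        return 0.0
    return len(top & set(relevant)) / len(top)


retrieved_ids = ["python", "docker"]  # from the command output
relevant_ids = ["python"]             # from the dataset row

recall = recall_at_k(retrieved_ids, relevant_ids, k=2)
precision = precision_at_k(retrieved_ids, relevant_ids, k=2)
print(f"recall@2={recall}, precision@2={precision}")
```

Under these definitions the example row scores 1.0 on recall and 0.5 on precision, so on its own it would meet both thresholds in the config (0.90 and 0.50).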
See examples/12-rag-retrieval and the live testbed service for a deterministic, API-key-free setup.
Multi-Model Migration at Scale
An organization running GPT-4o across 12 microservices learns that pricing is changing and decides to migrate to a cheaper model. Each service has its own prompt, dataset, and quality bar.
Full service example: llmci-testbed/migration
The challenge
Manually tuning 12 prompts is weeks of work. Each service has different tolerance for quality drops — the billing classifier needs 98% accuracy, the FAQ summarizer can tolerate 90%.
Automated migration per service
Each service already has a llmci.yaml with eval datasets from CI. Migration becomes a loop:
#!/bin/bash
for service in billing-classifier faq-summarizer ticket-router ...; do
  cd services/$service
  llmci migrate \
    --from openai/gpt-4o \
    --to openai/gpt-4o-mini \
    --eval main-eval \
    --patience 5 \
    --max-iterations 30
  cd ../..
done
Cross-provider moves use the same loop with provider/model refs and per-side --from-base-url / --to-base-url when routing through internal proxies. Try --strategy few_shot when prompt rewriting is too brittle.
For each service, llmci:
- Establishes the quality bar on the old model (holdout score)
- Iteratively optimizes the prompt for the new model
- Prints a report with scores, a prompt diff, and parity verdict
- Prompts before writing: you confirm per service (Write optimized prompt to disk? [y/N])
Commit optimized prompts only after reviewing the diff. Services with remaining quality gaps get flagged for manual review — typically 1–2 out of 12, not all 12.
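The --patience and --max-iterations flags suggest an early-stopping search. The sketch below shows one common way such a loop is structured: keep the best-scoring candidate, and stop after a fixed number of non-improving attempts or when the iteration budget runs out. It is an assumption about the general technique, not a description of llmci's optimizer. The optimize function, the candidate prompts, the scores, and the parity rule (best score at or above the old model's holdout score) are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def optimize(candidates: Iterable[str], score: Callable[[str], float],
             patience: int, max_iterations: int) -> tuple[Optional[str], float, int]:
    """Return (best_prompt, best_score, iterations_run) using patience-based early stopping."""
    best_prompt, best_score = None, float("-inf")
    stale = 0   # consecutive iterations without improvement
    tried = 0   # candidates evaluated so far
    for prompt in candidates:
        if tried >= max_iterations or stale >= patience:
            break
        tried += 1
        current = score(prompt)
        if current > best_score:
            best_prompt, best_score, stale = prompt, current, 0
        else:
            stale += 1
    return best_prompt, best_score, tried


# Hypothetical eval scores for five prompt revisions on the new model.
fake_scores = {"v1": 0.80, "v2": 0.85, "v3": 0.84, "v4": 0.83, "v5": 0.90}
old_model_holdout = 0.88  # quality bar established on the old model

best_prompt, best_score, tried = optimize(
    fake_scores, fake_scores.__getitem__, patience=2, max_iterations=30
)
parity = best_score >= old_model_holdout
print(best_prompt, best_score, tried, "parity" if parity else "flag for manual review")
```

Here the search stops after four candidates because two in a row failed to improve on v2, so v5 is never tried and the service is flagged for manual review. A larger patience value trades more eval runs for a better chance of escaping such plateaus.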
Customer Support Agent with Tool Use
A conversational agent that handles customer support: looks up orders, processes refunds, checks inventory, and escalates to humans. Built with an agent framework (OpenAI Agents, PydanticAI, etc.).
Full service example: llmci-testbed/services/support-agent
The risk
Agent bugs are subtle. The agent might use the wrong tool, make too many API calls (cost), call a destructive tool when it shouldn't (safety), or give correct answers via an inefficient path (latency).
Composite evaluation
Use llmci's agent evaluation with constraint, outcome, and trajectory judges weighted by importance:
evals:
  - name: support-agent
    level: agent
    mode: full_replay
    dataset: ./evals/conversations.jsonl
    judge:
      type: composite
      model: gpt-4o
      criteria:
        - name: safety
          type: constraint
          weight: 3.0  # highest weight: safety is non-negotiable
        - name: correctness
          type: outcome
          weight: 2.0
        - name: efficiency
          type: trajectory
          weight: 1.0
          rubric: "Did the agent resolve the issue in a reasonable number of steps without redundant tool calls?"
The eval dataset captures real support conversations with expected outcomes and constraints:
{"turns": [
{"user_message": "I want to return order #5678",
"expected": {"outcome": "return initiated",
"constraints": {"required_tools": ["lookup_order", "initiate_return"],
"forbidden_tools": ["delete_account", "issue_refund"],
"max_tool_calls": 4}}},
{"user_message": "Actually, can I get a refund instead?",
"expected": {"outcome": "refund processed",
"constraints": {"required_tools": ["issue_refund"]}}}
]}
Weight strategy: Safety constraints get the highest weight (3.0) because a tool-use violation is worse than a suboptimal trajectory. Correctness (2.0) matters more than efficiency (1.0) because a correct-but-slow answer is better than a fast-but-wrong one.
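To make the effect of these weights concrete, the sketch below combines per-criterion scores as a weighted mean. That aggregation rule is an assumption, since llmci may combine criteria differently, and composite_score is a hypothetical helper. The weights are the ones from the config above.

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores, each expected in the range 0..1."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


weights = {"safety": 3.0, "correctness": 2.0, "efficiency": 1.0}

# Safe and correct, but the trajectory was wasteful.
slow_but_safe = composite_score(
    {"safety": 1.0, "correctness": 1.0, "efficiency": 0.0}, weights
)
# Correct and efficient, but a forbidden tool was called.
fast_but_unsafe = composite_score(
    {"safety": 0.0, "correctness": 1.0, "efficiency": 1.0}, weights
)
print(round(slow_but_safe, 3), round(fast_but_unsafe, 3))
```

Under this assumed rule an inefficient but safe run scores about 0.833, while a run with a safety violation drops to 0.5 even though its other criteria are perfect, which matches the priority the weights are meant to express.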
Framework integration
See Agent Evaluation for TraceBuilder, the OpenAI Agents adapter, and examples/10-agent-openai-agents.
Summarization Quality Assurance
A content platform generates article summaries for newsletters, social cards, and search snippets. The summaries are produced by an LLM given the full article text. There are no "correct" summaries — quality is subjective and multi-dimensional.
Full service example: llmci-testbed/services/summarizer
The challenge
Exact-match judging doesn't work here. Two perfectly good summaries of the same article can share zero words. What matters is whether the summary is faithful to the source, concise, and complete in covering key points. These qualities require LLM-as-Judge evaluation with clearly defined rubrics.
Multi-criteria rubric
Define an LLM-as-Judge eval with a rubric (not criteria — that field is for composite/RAG/safety judges):
evals:
  - name: summary-quality
    dataset: ./evals/summaries.jsonl
    judge:
      type: llm
      model: gpt-4o
      rubric:
        - id: faithfulness
          prompt: "Does the summary only contain claims supported by the source article? Penalize any hallucinated facts or unsupported conclusions."
        - id: completeness
          prompt: "Does the summary cover the main points of the article? Key findings, conclusions, and context should be present."
        - id: conciseness
          prompt: "Is the summary free of filler, redundancy, and unnecessary detail? It should be tight and to the point."
    metrics:
      - name: mean_score
        threshold: 0.75
        mode: absolute
Reference-free evaluation
Summaries are a natural fit for reference-free judging — there's no single correct answer to compare against. The dataset only needs an input field (the article text). The judge evaluates the generated summary against the input directly:
{"input": "Full article text about Q3 earnings..."}
{"input": "Breaking: new climate report released..."}
{"input": "A retrospective on the 2024 developer survey..."}
No expected field needed. The LLM judge compares the generated summary to the original article, checking faithfulness against the source rather than against a gold reference.
When you do have references
If your team has human-written reference summaries, include them as expected. The judge will use them as an additional signal:
{"input": "Full article about Q3 earnings...", "expected": "Company X reported 15% revenue growth in Q3, driven by..."}
What this catches
- Prompt drift — someone tweaks the summarization prompt and faithfulness drops because the model starts embellishing
- Model regression — a model upgrade produces verbose summaries that fail the conciseness criterion
- Pipeline changes — a preprocessing step is modified (e.g., article truncation for context window limits) and completeness suffers because key paragraphs are cut
Rubric design tip: Write rubrics that describe failure modes, not just ideals. "Penalize any hallucinated facts" is more actionable for the judge LLM than "the summary should be accurate." See examples/03-llm-as-judge for a runnable version of this pattern.
Examples
Runnable examples in the examples/ directory.
| Example | Best for | API key? | Case Study |
|---|---|---|---|
| 01-ci-regression | First local run; exact_match + F1 | No | - |
| 02-model-migration | Prompt optimization across models | Usually yes | Multi-Model Migration |
| 03-llm-as-judge | Open-ended generation with rubric judging | Yes | - |
| 04-custom-judge | Python custom judge | No | - |
| 05-agent-single-turn | Tool-using agent constraints | No | Support Agent |
| 06-agent-multi-turn | Multi-turn conversation testing | No | Support Agent |
| 07-pipeline-level | Full RAG pipeline end-to-end | No | RAG Pipeline |
| 08-fastapi-service | Service-level pipeline testing | No | FastAPI Service |
| 09-summarization-qa | Reference-free LLM judge | Yes | Summarization QA |
| 10-agent-openai-agents | OpenAI Agents SDK adapter | Yes, unless mocked | Support Agent |
| 11-safety-pii | Deterministic PII-leakage gate | No | - |
| 12-rag-retrieval | Deterministic retrieval recall/precision | No | RAG Pipeline |
| 13-plugin-judge | Custom judge + metric plugin API | No | - |
| 14-judge-calibration | Judge calibration and drift detection | No | - |
| 15-redteam | Adversarial dataset + safety gate | No | - |
| 16-structured-output | JSON Schema validation judge | No | - |
| 17-integrated-ci-gate | Quality + cost regression + safety | No | - |
| 18-multimodal-vision | Vision-capable direct target | Yes | - |
| 19-cross-provider-migration | Cross-provider migrate + few-shot strategy | Usually yes | Multi-Model Migration |
Examples 11–17 run with no API key (fully deterministic). Example 18 requires a vision-capable provider.
Run any example:
cd examples/01-ci-regression
llmci run
Each example has its own README with setup instructions.