Detailed walkthroughs for the most common ways teams use llmci, with config examples and realistic scenarios.
Gate every pull request with automated eval checks. When someone changes a prompt, model, model parameters, or any upstream code that affects LLM output, llmci detects quality drops before the change reaches production.
Someone on your team updates a prompt. The change looks fine in manual testing. The PR gets approved and merged. A week later, customer support tickets spike — the model is misclassifying 8% of inputs that it used to handle correctly. Nobody connected the dots until real users were affected.
This happens because there's no automated check that runs the full eval dataset against the change. Manual spot-checks miss edge cases. Without a regression gate, quality drift is invisible until production breaks.
Add an llmci.yaml file to your repo and point it at your eval dataset. llmci runs on every PR and compares results against the main branch baseline. If quality drops beyond your thresholds, the PR is blocked with a detailed report showing exactly which examples regressed.
llmci supports both prompt-level and pipeline-level testing:
| Level | Tests | Catches |
|---|---|---|
| Prompt-level | Prompt + model in isolation | Prompt edits, model changes, parameter changes |
| Pipeline-level | Full system (RAG, preprocessing, prompt, model) | Everything, including upstream code changes that alter what the model sees |
Pipeline-level is the default. It catches the sneaky case where the prompt is unchanged but someone refactored the retrieval logic, changed a preprocessing step, or modified the data that feeds into the prompt.
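To make the difference between the two levels concrete, here is a minimal sketch in which a toy function stands in for the LLM call. Every name in it (`fake_model`, `retrieve_v1`, `retrieve_v2`, `run_pipeline`) is hypothetical and not part of llmci; the point is only that a retrieval refactor can change the model's answer while the prompt stays byte-for-byte identical.

```python
# Toy illustration: the prompt and the "model" never change, but a retrieval
# refactor changes what the model sees. All names here are hypothetical.

DOCS = {
    "refund": "Refunds are issued within 5 days.",
    "shipping": "Orders ship within 2 days.",
}

PROMPT = "Answer using this context: {context}\nQuestion: {question}"


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call: answers only if the context holds the fact."""
    return "5 days" if "within 5 days" in prompt else "unknown"


def retrieve_v1(question: str) -> str:
    """Original retrieval: case-insensitive keyword match."""
    return next((text for key, text in DOCS.items() if key in question.lower()), "")


def retrieve_v2(question: str) -> str:
    """Refactored retrieval with a bug: the match is now case-sensitive."""
    return next((text for key, text in DOCS.items() if key in question), "")


def run_pipeline(question: str, retrieve) -> str:
    """Pipeline-level run: retrieval, then prompt, then model."""
    context = retrieve(question)
    return fake_model(PROMPT.format(context=context, question=question))


question = "How long does a Refund take?"

# Prompt-level check: the context is fixed, so the retrieval bug is invisible.
prompt_level = fake_model(PROMPT.format(context=DOCS["refund"], question=question))

# Pipeline-level check: retrieval is exercised too, so the bug surfaces.
before = run_pipeline(question, retrieve_v1)
after = run_pipeline(question, retrieve_v2)
```

Here `prompt_level` and `before` both come back correct, while `after` does not, even though nobody touched the prompt.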
llmci ships with judges for common evaluation patterns:
| Judge | Metrics | Best for |
|---|---|---|
| Exact match | Accuracy, F1 (macro/micro/weighted), precision, recall | Classification, extraction, routing |
| LLM-as-judge | Rubric pass rate (per-criterion or aggregate) | Summarization, conversation, creative writing |
| Custom (Python) | Any metric you define | JSON schema validation, business rules, regex, multi-field checks |
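For readers who want to see how the exact-match metrics relate, the sketch below computes accuracy, macro F1, and weighted F1 for single-label outputs in plain Python. It is an illustration of the standard definitions, not llmci's own implementation, and the function name `f1_scores` is an assumption.

```python
from collections import Counter


def f1_scores(expected: list[str], predicted: list[str]) -> dict[str, float]:
    """Accuracy plus macro- and weighted-averaged F1 for single-label outputs."""
    labels = sorted(set(expected) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            tp[exp] += 1
        else:
            fp[pred] += 1  # the predicted label gains a false positive
            fn[exp] += 1   # the expected label gains a false negative

    per_label = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_label[label] = 2 * tp[label] / denom if denom else 0.0

    support = Counter(expected)  # number of true examples per label
    return {
        "accuracy": sum(tp.values()) / len(expected),
        "macro_f1": sum(per_label.values()) / len(labels),
        "weighted_f1": sum(per_label[l] * support[l] for l in labels) / len(expected),
    }


# A classifier that always answers "billing" on a skewed dataset.
scores = f1_scores(
    expected=["billing", "billing", "billing", "hardware"],
    predicted=["billing", "billing", "billing", "billing"],
)
```

In this example accuracy is 0.75, but macro F1 is much lower because the missed "hardware" class counts as much as the majority class. That is why macro F1 is often the safer gate for imbalanced classification datasets.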
For domain-specific evaluation logic that the built-in judges can't cover, write a custom judge as a plain Python function.
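A custom judge might look something like the sketch below. The contract used here, `judge(output, expected)` returning a dict with `passed` and `reason`, is an assumption made for illustration; llmci's actual custom-judge signature may differ. The field names and the refund rule are also hypothetical.

```python
import json
import re

# Hypothetical contract: judge(output, expected) -> {"passed": bool, "reason": str}.
# llmci's real custom-judge interface may differ; treat this as a sketch.
ORDER_ID = re.compile(r"ORD-\d{6}")
REQUIRED_FIELDS = {"category", "order_id", "refund_amount"}


def judge(output: str, expected: dict) -> dict:
    """Deterministic checks on one model output; no LLM calls involved."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "output is not valid JSON"}
    if not isinstance(data, dict):
        return {"passed": False, "reason": "output is not a JSON object"}

    missing = REQUIRED_FIELDS - set(data)
    if missing:
        return {"passed": False, "reason": f"missing fields: {sorted(missing)}"}

    if not ORDER_ID.fullmatch(str(data["order_id"])):
        return {"passed": False, "reason": "order_id does not match ORD-NNNNNN"}

    amount = data["refund_amount"]
    if not isinstance(amount, (int, float)):
        return {"passed": False, "reason": "refund_amount is not a number"}

    # Illustrative business rule: a refund can never exceed the order total.
    if amount > expected["order_total"]:
        return {"passed": False, "reason": "refund exceeds order total"}

    if data["category"] != expected["category"]:
        return {"passed": False, "reason": "wrong category"}
    return {"passed": True, "reason": "ok"}
```

Returning a reason alongside the verdict keeps failures readable in a report: a reviewer sees which rule broke without re-running the example.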
Custom judges run locally — no LLM calls, no latency, no cost. Use them for JSON schema validation, business rule checks, regex patterns, or any evaluation logic specific to your domain.
"F1 must be above 0.93, period." A quality floor that every PR must meet regardless of the baseline.
"Score must not drop more than 5% from main branch." Answers the real question: did this PR make things worse?
Integration is a single step in your CI workflow file. llmci exits with code 0 (pass) or 1 (fail), so it works as a CI gate without any special integration.
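The two threshold types and the exit-code contract fit together roughly as follows. This is a sketch, not llmci's implementation: the `Thresholds` class and both function names are hypothetical, and it assumes "5%" means a drop relative to the baseline score rather than five absolute points.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    min_score: float = 0.93       # absolute floor, e.g. F1 of at least 0.93
    max_regression: float = 0.05  # largest allowed drop, as a fraction of baseline


def check_thresholds(score: float, baseline: float, t: Thresholds) -> list[str]:
    """Return failure messages; an empty list means the gate passes."""
    failures = []
    if score < t.min_score:
        failures.append(f"score {score:.3f} is below the floor {t.min_score:.3f}")
    # Assumption: regression is measured relative to the baseline score.
    drop = (baseline - score) / baseline if baseline > 0 else 0.0
    if drop > t.max_regression:
        failures.append(f"score dropped {drop:.1%} from baseline {baseline:.3f}")
    return failures


def exit_code(failures: list[str]) -> int:
    """Map the gate result to a process exit code: 0 passes, 1 fails."""
    return 1 if failures else 0


t = Thresholds()
ok = check_thresholds(score=0.95, baseline=0.96, t=t)
regressed = check_thresholds(score=0.94, baseline=0.99, t=t)
below_floor = check_thresholds(score=0.90, baseline=0.91, t=t)
```

The second case shows why both checks are useful: 0.94 clears the absolute floor, yet it is still more than 5% worse than what main already achieves, so the PR would be blocked.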
Your team runs a support ticket classifier. A developer updates the system prompt to improve handling of billing questions. The PR passes code review. llmci runs 243 eval examples and finds that while billing accuracy improved by 4%, hardware classification accuracy dropped 8%. The PR is blocked with a report showing the 7 specific examples that regressed, so the developer can fix the prompt before merging.
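The kind of report described in this scenario can be approximated with a per-category comparison of baseline and candidate results. The sketch below uses four made-up rows and hypothetical field names (`id`, `expected`, `baseline`, `candidate`); it is not llmci's report format.

```python
from collections import defaultdict


def accuracy_by_label(rows: list[dict], key: str) -> dict[str, float]:
    """Accuracy per expected label, for the run stored under `key`."""
    totals, correct = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row["expected"]] += 1
        correct[row["expected"]] += int(row[key] == row["expected"])
    return {label: correct[label] / totals[label] for label in totals}


def regressed_ids(rows: list[dict]) -> list[str]:
    """Examples the baseline got right and the candidate now gets wrong."""
    return [
        row["id"]
        for row in rows
        if row["baseline"] == row["expected"] and row["candidate"] != row["expected"]
    ]


# Made-up results: the new prompt fixes a billing case but breaks a hardware one.
rows = [
    {"id": "t1", "expected": "billing", "baseline": "other", "candidate": "billing"},
    {"id": "t2", "expected": "billing", "baseline": "billing", "candidate": "billing"},
    {"id": "t3", "expected": "hardware", "baseline": "hardware", "candidate": "billing"},
    {"id": "t4", "expected": "hardware", "baseline": "hardware", "candidate": "hardware"},
]

baseline_acc = accuracy_by_label(rows, "baseline")
candidate_acc = accuracy_by_label(rows, "candidate")
regressions = regressed_ids(rows)
```

An aggregate score would show no change here, because one category improved while another degraded. Breaking the score down by category and listing the regressed IDs is what makes the trade-off visible.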
Try it yourself: examples/01-ci-regression — a runnable ticket classification example with eval dataset, config, and prompt.
When you need to upgrade a model or switch providers, llmci automatically tunes your prompt to achieve parity on the new model. No manual prompt engineering. No guesswork. No fire drill.
Your model is being deprecated. Or a newer model offers better price/performance. You swap the model name in your config, run your evals, and scores drop 6%. Now you're in a multi-day prompt engineering cycle: tweak the prompt, re-run evals, check if you fixed one thing but broke another. Multiply this by every prompt in your system.
Migration is one command: llmci migrate. It runs an iterative optimization loop, analogous to gradient descent, that makes small, targeted edits to your prompt until the prompt reaches parity on the new model or improvement plateaus. It uses holdout validation and early stopping to prevent overfitting.
The algorithm is modeled on gradient descent, with controls borrowed from ML training:
| ML Concept | llmci Equivalent |
|---|---|
| Step size / learning rate | How many changes per iteration. Small steps aid debuggability. |
| Training set (70%) | Examples the optimizer uses to score each iteration |
| Validation set (15%) | Separate examples checked each iteration for early stopping |
| Holdout set (15%) | Only evaluated at the end — the honest final score |
| Early stopping | Halt when improvement plateaus for N iterations |
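Two of the controls in this table, the 70/15/15 split and early stopping, can be sketched in a few lines. The function names, the `patience` parameter, and the fixed seed are assumptions for illustration; llmci's optimizer is not shown here, only the bookkeeping around it.

```python
import random


def split_dataset(examples: list, seed: int = 0) -> tuple[list, list, list]:
    """Shuffle, then split into 70% train, 15% validation, 15% holdout."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    holdout = shuffled[n_train + n_val:]  # touched only once, at the very end
    return train, validation, holdout


def best_iteration(val_scores: list[float], patience: int = 3) -> int:
    """Early stopping: stop once validation has not improved for `patience`
    iterations, and return the index of the best iteration seen so far."""
    best_idx, best = 0, float("-inf")
    for i, score in enumerate(val_scores):
        if score > best:
            best_idx, best = i, score
        elif i - best_idx >= patience:
            break
    return best_idx


train, validation, holdout = split_dataset(list(range(200)))
chosen = best_iteration([0.80, 0.84, 0.86, 0.86, 0.85, 0.86, 0.90], patience=3)
```

With `patience=3`, the loop stops after three iterations without improvement on the score at index 2, so the later 0.90 is never reached. That is the usual trade-off of early stopping: fewer wasted iterations and less overfitting to the validation set, at the cost of occasionally stopping before a late gain.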
The optimizer LLM is constrained to make minimal changes per iteration. It's instructed to prefer rewording existing instructions over adding new ones, and to never rewrite from scratch. This makes each iteration's contribution clear and debuggable.
The output is a migration report containing the optimized prompt diff, per-iteration scores, holdout validation results, and a breakdown of any remaining regressions with failure pattern analysis. You review the diff, commit it, and your CI evals confirm the migration is clean.
Your company uses GPT-4o across 12 classification prompts. OpenAI announces GPT-4o is being retired in 60 days. Instead of a month of manual prompt engineering, you run llmci migrate for each prompt. 10 of 12 converge to within 1% of the original baseline. The remaining 2 require minor manual adjustments flagged in the migration report. The entire migration takes 2 days instead of 4 weeks.
Try it yourself: examples/02-model-migration — migrate a ticket classifier from GPT-4o to GPT-4.5.
Agents make sequences of decisions — tool calls, routing, branching — often across multi-turn conversations. llmci evaluates the full trajectory, not just the final output.
Your support agent handles subscription cancellations. Someone updates the refund tool's API response format. The agent's prompts haven't changed. But now it misinterprets the refund confirmation at step 3, tells the customer their refund failed when it actually succeeded, and tries the refund again. The duplicate refund costs you real money. An output-only test would have missed this — the final output was still "refund processed."
llmci evaluates agents with a composite judge that checks four dimensions:
| Judge type | Checks |
|---|---|
| LLM-as-judge | Did the agent achieve the correct final result? |
| LLM-as-judge | Was the execution path logical and efficient? |
| Deterministic | Tool call budgets, token budgets, required/forbidden tools |
| Deterministic | Total tokens and latency: did the agent stay within budget? |
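The deterministic checks are simple enough to sketch directly. The trace layout below (a `tool_calls` list plus `total_tokens`) and the keys of `limits` are hypothetical, chosen for illustration rather than taken from llmci's trace schema. The example trace mirrors the duplicate-refund failure described earlier.

```python
from collections import Counter


def check_constraints(trace: dict, limits: dict) -> list[str]:
    """Deterministic checks on one agent trace; returns violation messages."""
    names = [call["name"] for call in trace["tool_calls"]]
    counts = Counter(names)
    violations = []

    if len(names) > limits.get("max_tool_calls", float("inf")):
        violations.append(f"{len(names)} tool calls exceed the overall budget")
    for tool in limits.get("required_tools", []):
        if counts[tool] == 0:
            violations.append(f"required tool {tool} was never called")
    for tool in limits.get("forbidden_tools", []):
        if counts[tool] > 0:
            violations.append(f"forbidden tool {tool} was called")
    for tool, cap in limits.get("max_calls_per_tool", {}).items():
        if counts[tool] > cap:
            violations.append(f"{tool} called {counts[tool]} times (limit {cap})")
    if trace["total_tokens"] > limits.get("max_total_tokens", float("inf")):
        violations.append("token budget exceeded")
    return violations


# The final answer of this run looks fine, but the refund tool fired twice.
trace = {
    "tool_calls": [
        {"name": "lookup_subscription"},
        {"name": "issue_refund"},
        {"name": "issue_refund"},
        {"name": "cancel_subscription"},
    ],
    "total_tokens": 1800,
}
limits = {
    "max_tool_calls": 5,
    "required_tools": ["cancel_subscription"],
    "forbidden_tools": ["delete_account"],
    "max_calls_per_tool": {"issue_refund": 1},
    "max_total_tokens": 4000,
}
violations = check_constraints(trace, limits)
```

Because these checks read the trace rather than the final message, they catch the second `issue_refund` call even when the agent's last reply is correct.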
Many agent interactions are multi-turn. llmci supports conversation scenarios where each turn has its own expected outcome and constraints, plus conversation-level budgets across all turns.
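A multi-turn scenario can be pictured as a list of turns, each with its own expectation, plus a budget that spans the whole conversation. The sketch below uses a hypothetical structure (tool calls as `(name, plan)` tuples, a flat `budget` dict) and a deliberately simple per-turn check; it is not llmci's scenario format.

```python
def evaluate_conversation(turns: list[dict], budget: dict) -> dict:
    """Check each turn's expected tool call, then the conversation-level budget."""
    # Per-turn check: the expected call must appear among the calls actually made.
    turn_results = [turn["expected_call"] in turn["actual_calls"] for turn in turns]

    # Conversation-level budget: totals are summed across all turns.
    total_calls = sum(len(turn["actual_calls"]) for turn in turns)
    total_tokens = sum(turn["tokens"] for turn in turns)
    within_budget = (
        total_calls <= budget["max_tool_calls"]
        and total_tokens <= budget["max_tokens"]
    )
    return {
        "turns": turn_results,
        "within_budget": within_budget,
        "passed": all(turn_results) and within_budget,
    }


# The customer cancels, is refunded, then asks to re-subscribe on the basic plan.
turns = [
    {"expected_call": ("cancel_subscription", "premium"),
     "actual_calls": [("cancel_subscription", "premium")], "tokens": 420},
    {"expected_call": ("issue_refund", "premium"),
     "actual_calls": [("issue_refund", "premium")], "tokens": 380},
    {"expected_call": ("create_subscription", "basic"),
     "actual_calls": [("create_subscription", "premium")], "tokens": 510},
]
result = evaluate_conversation(turns, {"max_tool_calls": 6, "max_tokens": 2000})
```

Reporting results per turn, rather than one verdict for the whole conversation, points the developer at the exact turn where the agent lost context.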
| Level | How | Catches |
|---|---|---|
| Agent-level (unit) | Your command runs the agent with mocked tools | Prompt changes, model changes, routing logic changes |
| Pipeline-level (integration) | Your command runs the agent with real tools | Everything, including tool API changes, database schema changes |
llmci doesn't need to know the difference — it just evaluates the output trace. The mocked-vs-real distinction lives in your command, not in llmci's config.
Agent evals run your agent as a command that reads input JSON and writes trace JSON. Use TraceBuilder for mocks and custom frameworks, or run_for_llmci_sync for the OpenAI Agents SDK. Test-time only, never in production.
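The command contract can be illustrated without any framework. The sketch below is a toy agent with one mocked tool that reads a JSON object and writes a JSON trace. It assumes the input arrives on stdin and the trace goes to stdout, and it uses a made-up trace layout; it does not use TraceBuilder or run_for_llmci_sync, whose APIs are not shown here, so treat the structure as hypothetical.

```python
import io
import json
import sys


def mock_lookup_order(order_id: str) -> dict:
    """Mocked tool: returns a canned response instead of calling a real API."""
    return {"order_id": order_id, "status": "shipped"}


def run_agent(payload: dict) -> dict:
    """Toy agent: one tool call, then a final answer. Every step is recorded."""
    steps = []
    result = mock_lookup_order(payload["order_id"])
    steps.append({
        "type": "tool_call",
        "name": "lookup_order",
        "args": {"order_id": payload["order_id"]},
        "result": result,
    })
    answer = f"Order {result['order_id']} is {result['status']}."
    steps.append({"type": "final_output", "content": answer})
    return {"input": payload, "steps": steps}


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Entrypoint: read one JSON object, write the resulting trace as JSON."""
    json.dump(run_agent(json.load(stdin)), stdout)


# In-memory demonstration; a real script would simply call main().
out = io.StringIO()
main(io.StringIO('{"order_id": "A1"}'), out)
trace = json.loads(out.getvalue())
```

Swapping `mock_lookup_order` for the real tool turns the same entrypoint into a pipeline-level run, which is why the mocked-versus-real distinction can live entirely in your command.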
Reference adapter: OpenAI Agents SDK (pip install 'llmci[agents]'). Other frameworks integrate via TraceBuilder in your run_agent.py entrypoint.
Your team deploys a customer support agent built on the OpenAI Agents SDK. A developer updates the system prompt to be more concise. The change looks clean in testing. But on multi-turn conversations where the customer changes their mind mid-conversation ("actually, re-subscribe me on the basic plan"), the shorter prompt loses context and the agent re-creates the premium subscription instead. llmci's multi-turn eval catches this: turn 3's outcome judge fails, and the trajectory judge flags the incorrect tool call.
Try it yourself: examples/05-agent-single-turn, examples/06-agent-multi-turn, and examples/10-agent-openai-agents (OpenAI Agents SDK adapter).
Building eval datasets is the biggest adoption barrier. But these are CI gates, not training sets. 200 carefully chosen examples beat 2,000 auto-generated ones. llmci makes dataset creation a guided process, not a guessing game.
Manual curation is the default and often the best approach. A domain expert writes input/expected pairs focusing on coverage of important cases. You need breadth (cover the categories, edge cases, failure modes), not depth.
The llmci dataset check command is key. It analyzes coverage gaps and tells you exactly where to add more examples. This turns "write until you feel done" into a guided process with a clear finish line.
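The simplest form of a coverage check is a count per category against a minimum. The sketch below shows that idea only; it is not how llmci dataset check works internally, and the function name, the `expected` field, the category names other than "feature request" and "bug report", and the minimum of 18 are all assumptions.

```python
from collections import Counter


def coverage_gaps(examples: list[dict], categories: list[str],
                  min_per_category: int) -> dict[str, int]:
    """Return how many examples each underrepresented category still needs."""
    counts = Counter(example["expected"] for example in examples)
    return {
        category: min_per_category - counts[category]
        for category in categories
        if counts[category] < min_per_category
    }


# A made-up 80-example dataset, skewed away from two categories.
label_counts = {
    "praise": 19, "complaint": 19, "question": 18,
    "other": 18, "feature request": 3, "bug report": 3,
}
examples = [
    {"input": f"{label} example {i}", "expected": label}
    for label, n in label_counts.items()
    for i in range(n)
]
gaps = coverage_gaps(examples, list(label_counts), min_per_category=18)
```

Listing `categories` explicitly matters: a category with zero examples never appears in the data, so it can only be flagged if the check knows it should exist.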
Once you have a solid manual core, llmci offers three strategies to expand:
Export logs from Langfuse, Arize, or your own system. llmci identifies successful runs, selects diverse examples, and converts them into eval scenarios. llmci never touches production — it only consumes what you export.
Provide tool definitions and a system prompt. llmci generates realistic scenarios that exercise your full tool surface, with coverage targets to ensure edge cases are included.
Start with 20–30 hand-curated examples. llmci generates perturbations — rephrased inputs, changed parameters, combined intents, edge cases — to expand to a statistically meaningful dataset.
Your team is launching a new feature that uses an LLM to classify customer feedback. You have no eval dataset. On Monday, a product manager spends 2 hours writing 80 examples across the 6 feedback categories, focusing on edge cases they know are tricky. llmci dataset check flags that the "feature request" and "bug report" categories are underrepresented, so they add 30 more targeted examples, bringing the hand-written core to 110. Generated variations of that core expand it further, and by Tuesday afternoon you have a 200-example dataset with solid coverage and a gated CI pipeline.
After OpenAI acquired Promptfoo in March 2026, teams using multiple model providers need a neutral eval tool they can trust. llmci is provider-neutral, community-owned, and offers capabilities Promptfoo never had.
llmci can import your existing Promptfoo configuration with the llmci import-promptfoo command. The table below compares the two tools:
| Capability | Promptfoo | llmci |
|---|---|---|
| CI gating | Supported, but designed for comparison | Native — designed for pass/fail gating |
| Model migration | Not available | Automated with holdout validation |
| Pipeline-level testing | Primarily prompt-level | Full pipeline (catches upstream changes) |
| Agent evaluation | Not available | Trajectory + constraints + multi-turn |
| Dataset creation | Not available | Manual curation + automated generation |
| Relative thresholds | Limited | Configurable max-regression from baseline |
| Provider neutral | Owned by OpenAI | Community-owned, works with every provider |
Your team has been using Promptfoo for 8 months with 5 eval suites across 3 repos. After the OpenAI acquisition, your platform team decides to migrate. They run llmci import-promptfoo on each repo, review the generated configs, and switch the CI workflows to use llmci. The migration takes half a day per repo. By the end of the week, all 3 repos are running on llmci with the same eval coverage they had before — plus model migration and pipeline-level testing they didn't have before.