Use Case 01

CI Regression Testing

Gate every pull request with automated eval checks. When someone changes a prompt, model, model parameters, or any upstream code that affects LLM output, llmci detects quality drops before the change reaches production.

The problem

Someone on your team updates a prompt. The change looks fine in manual testing. The PR gets approved and merged. A week later, customer support tickets spike — the model is misclassifying 8% of inputs that it used to handle correctly. Nobody connected the dots until real users were affected.

This happens because there's no automated check that runs the full eval dataset against the change. Manual spot-checks miss edge cases. Without a regression gate, quality drift is invisible until production breaks.

How llmci solves it

Add a llmci.yaml to your repo. Point it at your eval dataset. llmci runs on every PR and compares results against the main branch baseline. If quality drops beyond your thresholds, the PR is blocked with a detailed report showing exactly which examples regressed.

llmci.yaml
version: 1
target:
  command: "python run_prompt.py --input {input_file} --output {output_file}"
evals:
  - name: ticket-classification
    level: pipeline
    dataset: ./evals/ticket_classification.jsonl
    judge: exact_match
    metrics:
      - name: f1_macro
        threshold: 0.93
        mode: absolute
      - name: accuracy
        threshold: 0.02
        mode: max_regression
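
The two metrics entries in the config above use different threshold modes. The following is a minimal sketch of how such entries could be checked, not llmci's actual implementation. It assumes (the source does not say) that max_regression compares the plain score difference against the baseline and that a score exactly at the threshold passes. The helper name passes is hypothetical.

```python
def passes(metric: dict, score: float, baseline: float) -> bool:
    """Apply one threshold entry shaped like the `metrics` items in llmci.yaml."""
    if metric["mode"] == "absolute":
        # Quality floor: the score itself must reach the threshold.
        return score >= metric["threshold"]
    if metric["mode"] == "max_regression":
        # Relative gate (assumed semantics): the drop from the baseline
        # must stay within the threshold.
        return (baseline - score) <= metric["threshold"]
    raise ValueError(f"unknown mode: {metric['mode']}")


f1_gate = {"name": "f1_macro", "threshold": 0.93, "mode": "absolute"}
accuracy_gate = {"name": "accuracy", "threshold": 0.02, "mode": "max_regression"}

print(passes(f1_gate, score=0.94, baseline=0.95))        # clears the floor
print(passes(accuracy_gate, score=0.92, baseline=0.95))  # drops too far
```

Under these assumptions a PR has to clear both gates: the floor catches a slow slide across many PRs, while the regression limit catches a single PR that makes things noticeably worse.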

Two testing levels

llmci supports both prompt-level and pipeline-level testing:

Level | Tests | Catches
Prompt-level | Prompt + model in isolation | Prompt edits, model changes, parameter changes
Pipeline-level | Full system (RAG, preprocessing, prompt, model) | Everything, including upstream code changes that alter what the model sees

Pipeline-level is the default. It catches the sneaky case where the prompt is unchanged but someone refactored the retrieval logic, changed a preprocessing step, or modified the data that feeds into the prompt.

Available judges and metrics

llmci ships with judges for common evaluation patterns:

Judge | Metrics | Best for
Exact match | Accuracy, F1 (macro/micro/weighted), precision, recall | Classification, extraction, routing
LLM-as-judge | Rubric pass rate (per-criterion or aggregate) | Summarization, conversation, creative writing
Custom (Python) | Any metric you define | JSON schema validation, business rules, regex, multi-field checks
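
To make the exact-match metrics concrete, here is a sketch of macro-averaged F1 using its standard definition: per-class F1, averaged without weighting. How llmci computes it internally is not stated in this document. The function name and the choice to average over every label seen in either list are assumptions.

```python
from collections import Counter


def f1_macro(expected: list[str], actual: list[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for want, got in zip(expected, actual):
        if want == got:
            tp[want] += 1
        else:
            fp[got] += 1   # predicted `got`, but it was wrong
            fn[want] += 1  # missed the true label `want`
    scores = []
    for label in set(expected) | set(actual):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


expected = ["billing", "billing", "hardware", "hardware"]
actual = ["billing", "hardware", "hardware", "hardware"]
print(round(f1_macro(expected, actual), 4))
```

Because every class counts equally, a small category that collapses pulls macro F1 down much harder than it pulls accuracy down, which is why it works well as a floor for classification evals.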

Custom judges

For domain-specific evaluation logic that built-in judges can't cover, write a Python function:

judges/schema_judge.py
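import json


def evaluate(input: str, expected: str, actual: str) -> dict:
    """Score 1.0 when the output is a JSON object with the required keys."""
    try:
        parsed = json.loads(actual)
    except json.JSONDecodeError:
        return {"score": 0.0, "reason": "Invalid JSON"}
    # A bare list, string, or number is valid JSON but not a valid response.
    if not isinstance(parsed, dict):
        return {"score": 0.0, "reason": "Output is not a JSON object"}
    missing = [k for k in ["category", "confidence"] if k not in parsed]
    if missing:
        return {"score": 0.0, "reason": f"Missing keys: {missing}"}
    return {"score": 1.0}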
llmci.yaml
evals:
  - name: schema-validation
    judge:
      type: custom
      module: ./judges/schema_judge.py
      function: evaluate

Custom judges run locally as plain Python, so a judge like the one above makes no LLM calls and adds negligible latency and cost. Use them for JSON schema validation, business rule checks, regex patterns, or any evaluation logic specific to your domain.
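
As a second illustration, here is a sketch of a regex-based judge that follows the same evaluate(input, expected, actual) signature as the schema judge. The business rule (an order ID shaped like ORD-123456) and the pattern are hypothetical and chosen only for the example.

```python
import re

# Hypothetical business rule: the reply must mention the expected order ID,
# formatted as "ORD-" followed by exactly six digits.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")


def evaluate(input: str, expected: str, actual: str) -> dict:
    """Score 1.0 when the output contains the expected order ID."""
    found = ORDER_ID.findall(actual)
    if not found:
        return {"score": 0.0, "reason": "No order ID in output"}
    if expected not in found:
        return {"score": 0.0, "reason": f"Expected {expected}, found {found}"}
    return {"score": 1.0}
```

Returning a reason on every failing path keeps the CI report actionable: the developer sees why an example failed, not just that it did.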

Two threshold modes

Absolute threshold

"F1 must be above 0.93, period." A quality floor that every PR must meet regardless of the baseline. In llmci.yaml this is mode: absolute.

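Relative threshold

"Score must not drop more than 5% from main branch." Answers the real question: did this PR make things worse? In llmci.yaml this is mode: max_regression, where the threshold is the largest drop you will tolerate.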

GitHub Actions setup

One step in your workflow file. llmci exits 0 (pass) or 1 (fail), so it works as a CI gate without any special integration.

.github/workflows/llmci.yml
name: LLM Content Tests
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so origin/main exists for --compare-to
      - run: pip install llmci
      - run: llmci run --compare-to=origin/main
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Real-world scenario

Your team runs a support ticket classifier. A developer updates the system prompt to improve handling of billing questions. The PR passes code review. llmci runs 243 eval examples and finds that while billing accuracy improved by 4%, hardware classification accuracy dropped 8%. The PR is blocked with a report showing the 7 specific examples that regressed, so the developer can fix the prompt before merging.
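
The scenario above depends on two views of the results: which individual examples regressed, and how accuracy moved per category. The sketch below shows one way to compute both. It assumes each example is a dict with an "expected" label and that outputs are aligned lists of predictions. The function names are hypothetical and this is not llmci's report format.

```python
from collections import defaultdict


def find_regressions(examples, baseline_outputs, candidate_outputs):
    """Return examples the baseline got right and the candidate gets wrong."""
    regressed = []
    for example, before, after in zip(examples, baseline_outputs, candidate_outputs):
        if before == example["expected"] and after != example["expected"]:
            regressed.append({**example, "baseline": before, "candidate": after})
    return regressed


def accuracy_by_category(examples, outputs):
    """Accuracy per expected label, so a gain in one class cannot hide a loss in another."""
    hits, totals = defaultdict(int), defaultdict(int)
    for example, output in zip(examples, outputs):
        totals[example["expected"]] += 1
        hits[example["expected"]] += output == example["expected"]
    return {label: hits[label] / totals[label] for label in totals}
```

Comparing accuracy_by_category for the baseline and the candidate surfaces the "billing up, hardware down" pattern, while find_regressions gives the developer the concrete inputs to look at.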

Try it yourself: examples/01-ci-regression — a runnable ticket classification example with eval dataset, config, and prompt.

Use Case 02

Model Migration

When you need to upgrade a model or switch providers, llmci automatically tunes your prompt to achieve parity on the new model. No manual prompt engineering. No guesswork. No fire drill.

The problem

Your model is being deprecated. Or a newer model offers better price/performance. You swap the model name in your config, run your evals, and scores drop 6%. Now you're in a multi-day prompt engineering cycle: tweak the prompt, re-run evals, check if you fixed one thing but broke another. Multiply this by every prompt in your system.

How llmci solves it

One command. llmci runs an iterative optimization loop, loosely analogous to gradient descent, that makes small, targeted edits to your prompt until its score on the new model converges toward the original baseline. It uses holdout validation and early stopping to prevent overfitting.

Terminal
$ llmci migrate --from gpt-4o --to gpt-4.5 --eval ticket-classification
Running baseline on gpt-4o... 0.952
Running on gpt-4.5 (current prompt)... 0.891
Optimizing prompt...
  Iteration 1: train 0.912 val 0.908
  Iteration 2: train 0.931 val 0.925
  Iteration 3: train 0.944 val 0.939
  Iteration 4: train 0.949 val 0.941
  Iteration 5: train 0.950 val 0.942 (early stop)
✓ Migration complete
Holdout score: 0.938 (baseline: 0.952)
Prompt diff written to ./prompts/classify.txt.migrated

How the optimization works

The algorithm is modeled on gradient descent, with controls borrowed from ML training:

ML Concept | llmci Equivalent
Step size / learning rate | How many changes per iteration. Small steps aid debuggability.
Training set (70%) | Examples the optimizer uses to score each iteration
Validation set (15%) | Separate examples checked each iteration for early stopping
Holdout set (15%) | Only evaluated at the end; the honest final score
Early stopping | Halt when improvement plateaus for N iterations
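
The split and the stopping rule in the table can be sketched in a few lines. This is an illustration under stated assumptions, not llmci's code: the function names, the fixed seed, patience=2, and min_delta=0.005 are all hypothetical choices.

```python
import random


def split_dataset(examples: list, seed: int = 0) -> tuple[list, list, list]:
    """Shuffle once, then cut into 70% train, 15% validation, and the rest as holdout."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    holdout = shuffled[n_train + n_val:]  # remainder, roughly 15%
    return train, val, holdout


def should_stop(val_scores: list[float], patience: int = 2, min_delta: float = 0.005) -> bool:
    """Stop when the last `patience` scores fail to beat the earlier best by `min_delta`."""
    if len(val_scores) <= patience:
        return False
    best_before = max(val_scores[:-patience])
    return max(val_scores[-patience:]) < best_before + min_delta
```

With these assumed settings, the validation scores from the migration transcript (0.908, 0.925, 0.939, 0.941, 0.942) would keep going through iteration 4 and stop at iteration 5. The holdout slice is never passed to should_stop, which is what keeps the final score honest.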

Step size control

The optimizer LLM is constrained to make minimal changes per iteration. It's instructed to prefer rewording existing instructions over adding new ones, and to never rewrite from scratch. This makes each iteration's contribution clear and debuggable.
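
The text above says the optimizer is instructed to keep edits small. One way to also measure edit size programmatically is to count changed lines with Python's difflib. This guard is an assumption added for illustration, and the source does not say llmci enforces step size this way. The function names and the max_changed_lines default are hypothetical.

```python
import difflib


def changed_lines(old_prompt: str, new_prompt: str) -> int:
    """Count lines added or removed between two prompt versions."""
    diff = difflib.ndiff(old_prompt.splitlines(), new_prompt.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are ignored.
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def within_step_size(old_prompt: str, new_prompt: str, max_changed_lines: int = 4) -> bool:
    """Reject a proposed edit that touches more lines than the step size allows."""
    return changed_lines(old_prompt, new_prompt) <= max_changed_lines
```

A reworded line counts as two changes (one removal, one addition), so a limit of 4 allows roughly two reworded instructions per iteration and rejects a rewrite from scratch.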

What you get

A migration report with the optimized prompt diff, per-iteration scores, holdout validation results, and a breakdown of any remaining regressions with failure pattern analysis. You review the diff, commit it, and your CI evals confirm the migration is clean.

Real-world scenario

Your company uses GPT-4o across 12 classification prompts. OpenAI announces GPT-4o is being retired in 60 days. Instead of a month of manual prompt engineering, you run llmci migrate for each prompt. 10 of 12 converge to within 1% of the original baseline. The remaining 2 require minor manual adjustments flagged in the migration report. The entire migration takes 2 days instead of 4 weeks.

Try it yourself: examples/02-model-migration — migrate a ticket classifier from GPT-4o to GPT-4.5.

Use Case 03

Agent Testing

Agents make sequences of decisions — tool calls, routing, branching — often across multi-turn conversations. llmci evaluates the full trajectory, not just the final output.

The problem

Your support agent handles subscription cancellations. Someone updates the refund tool's API response format. The agent's prompts haven't changed. But now it misinterprets the refund confirmation at step 3, tells the customer their refund failed when it actually succeeded, and tries the refund again. The duplicate refund costs you real money. An output-only test would have missed this — the final output was still "refund processed."

How llmci solves it

llmci evaluates agents with a composite judge that checks four dimensions:

Outcome correctness

LLM-as-judge: did the agent achieve the correct final result?

Trajectory efficiency

LLM-as-judge: was the execution path logical and efficient?

Constraint enforcement

Deterministic: tool call budgets, token budgets, required/forbidden tools.

Cost accounting

Deterministic: total tokens, latency. Did the agent stay within budget?
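
The two deterministic dimensions need no model call. Below is a minimal sketch of such checks over a list of tool names and a token count. The constraint keys max_tool_calls and required_tools appear in the example dataset in this document, while max_tokens and forbidden_tools are assumed names, as is the function itself.

```python
def check_constraints(tool_calls: list[str], total_tokens: int, constraints: dict) -> list[str]:
    """Return a list of violations; an empty list means every constraint held."""
    violations = []
    max_calls = constraints.get("max_tool_calls")
    if max_calls is not None and len(tool_calls) > max_calls:
        violations.append(f"{len(tool_calls)} tool calls exceed budget of {max_calls}")
    max_tokens = constraints.get("max_tokens")  # assumed key name
    if max_tokens is not None and total_tokens > max_tokens:
        violations.append(f"{total_tokens} tokens exceed budget of {max_tokens}")
    for tool in constraints.get("required_tools", []):
        if tool not in tool_calls:
            violations.append(f"required tool not called: {tool}")
    for tool in constraints.get("forbidden_tools", []):  # assumed key name
        if tool in tool_calls:
            violations.append(f"forbidden tool called: {tool}")
    return violations
```

In the duplicate-refund story above, a budget of two tool calls for the refund step would flag the repeated call even though the final message still read "refund processed".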

Multi-turn conversation testing

Many agent interactions are multi-turn. llmci supports conversation scenarios where each turn has its own expected outcome and constraints, plus conversation-level budgets across all turns.

evals/support_conversations.jsonl
// Each scenario is a multi-turn conversation (expanded here for readability; JSONL stores one scenario per line, without comments)
{
  "turns": [
    {
      "user_message": "Cancel my subscription",
      "expected": {
        "outcome": "Subscription cancelled, confirmation provided",
        "constraints": {"max_tool_calls": 3}
      }
    },
    {
      "user_message": "Do I get a refund for the remaining days?",
      "expected": {
        "outcome": "Prorated refund calculated and communicated",
        "constraints": {"required_tools": ["calculate_prorated_refund"]}
      }
    }
  ],
  "conversation_constraints": {"max_total_tool_calls": 12}
}
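
Before wiring a scenario file into CI, it can help to sanity-check its structure. The sketch below parses JSONL text and verifies the fields used in the example above (turns, user_message, expected.outcome). It assumes one scenario per line. The function name is hypothetical, and llmci may perform its own, different validation.

```python
import json


def load_scenarios(text: str) -> list[dict]:
    """Parse JSONL text (one scenario per line) and check the fields used above."""
    scenarios = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        scenario = json.loads(line)
        turns = scenario.get("turns")
        if not turns:
            raise ValueError(f"line {number}: scenario has no turns")
        for index, turn in enumerate(turns, start=1):
            if "user_message" not in turn:
                raise ValueError(f"line {number}, turn {index}: missing user_message")
            if "outcome" not in turn.get("expected", {}):
                raise ValueError(f"line {number}, turn {index}: missing expected.outcome")
        scenarios.append(scenario)
    return scenarios
```

Failing fast with a line and turn number is cheaper than discovering a malformed scenario halfway through an eval run.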

Two testing levels for agents

Level | How | Catches
Agent-level (unit) | Your command runs the agent with mocked tools | Prompt changes, model changes, routing logic changes
Pipeline-level (integration) | Your command runs the agent with real tools | Everything, including tool API changes, database schema changes

llmci doesn't need to know the difference — it just evaluates the output trace. The mocked-vs-real distinction lives in your command, not in llmci's config.
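
To illustrate how the mocked-vs-real choice can live entirely in your command, here is a toy entrypoint where the tools are injected. Everything in it is hypothetical: the agent logic, the tool name issue_refund, and especially the trace shape, which is not llmci's actual trace schema.

```python
import json
from typing import Callable

Tool = Callable[..., dict]


def mock_issue_refund(amount: float) -> dict:
    """Canned response standing in for the real refund API (agent-level runs)."""
    return {"status": "succeeded", "amount": amount}


def run_agent(input_data: dict, tools: dict[str, Tool]) -> dict:
    """Toy agent: issue one refund and record the call in an illustrative trace dict."""
    args = {"amount": input_data["amount"]}
    result = tools["issue_refund"](**args)
    succeeded = result.get("status") == "succeeded"
    return {
        "tool_calls": [{"name": "issue_refund", "args": args, "result": result}],
        "final_output": "refund processed" if succeeded else "refund failed",
    }


def main(raw_input: str, tools: dict[str, Tool]) -> str:
    """Parse input JSON, run the agent with the given tools, return trace JSON."""
    return json.dumps(run_agent(json.loads(raw_input), tools))


print(main('{"amount": 12.5}', {"issue_refund": mock_issue_refund}))
```

An agent-level command would pass the mock, and a pipeline-level command would pass a client for the real API. If that real API changes its response format, only the pipeline-level run sees the agent misread it.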

Framework adapters

Agent evals run your agent as a command that reads input JSON and writes trace JSON. Use TraceBuilder for mocks and custom frameworks, or run_for_llmci_sync for the OpenAI Agents SDK. Test-time only, never in production.

Python
from llmci.integrations.openai_agents import run_for_llmci_sync

# build_agent and input_data are defined elsewhere in your entrypoint (not shown).
# Convert RunResult to llmci trace JSON for eval
output = run_for_llmci_sync(build_agent(), input_data)

Reference adapter: OpenAI Agents SDK (pip install 'llmci[agents]'). Other frameworks integrate via TraceBuilder in your run_agent.py entrypoint.

Real-world scenario

Your team deploys a customer support agent built on the OpenAI Agents SDK. A developer updates the system prompt to be more concise. The change looks clean in testing. But on multi-turn conversations where the customer changes their mind mid-conversation ("actually, re-subscribe me on the basic plan"), the shorter prompt loses context and the agent re-creates the premium subscription instead. llmci's multi-turn eval catches this: turn 3's outcome judge fails, and the trajectory judge flags the incorrect tool call.

Try it yourself: examples/05-agent-single-turn, examples/06-agent-multi-turn, and examples/10-agent-openai-agents (OpenAI Agents SDK adapter)

Use Case 04

Dataset Creation

Building eval datasets is the biggest adoption barrier. But these are CI gates, not training sets. 200 carefully chosen examples beat 2,000 auto-generated ones. llmci makes dataset creation a guided process, not a guessing game.

Start with manual curation

The default and often best approach. A domain expert writes input/expected pairs focusing on coverage of important cases. You need breadth (cover the categories, edge cases, failure modes), not depth.

Terminal
$ llmci dataset init --name ticket-classification --type deterministic
✓ Created ./evals/ticket_classification.jsonl

$ llmci dataset add --name ticket-classification
Input: "My printer won't connect to wifi"
Expected: "hardware"
✓ Added. (47 examples, 6 categories covered)

$ llmci dataset check --name ticket-classification
203 examples across 8 categories
⚠ "returns" has only 4 examples (min recommended: 15)
⚠ No multilingual examples detected
✓ All other categories well-covered

The llmci dataset check command is key. It analyzes coverage gaps and tells you exactly where to add more examples. This turns "write until you feel done" into a guided process with a clear finish line.
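
The per-category part of such a coverage check is simple to picture. The sketch below counts examples per expected label and flags thin categories. It is not how llmci dataset check is implemented (the source does not say); the function name and the example schema are assumptions, and the default minimum of 15 is taken from the transcript above.

```python
from collections import Counter


def coverage_warnings(examples: list[dict], categories: list[str], min_per_category: int = 15) -> list[str]:
    """Flag categories with fewer examples than the recommended minimum."""
    counts = Counter(example["expected"] for example in examples)
    warnings = []
    for category in categories:
        if counts[category] < min_per_category:
            warnings.append(
                f'"{category}" has only {counts[category]} examples '
                f"(min recommended: {min_per_category})"
            )
    return warnings
```

Passing the full category list explicitly matters: a category with zero examples never shows up in the counts, so it can only be flagged if the check knows it should exist.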

Then expand with automated generation

Once you have a solid manual core, llmci offers three strategies to expand:

Import production logs

Export logs from Langfuse, Arize, or your own system. llmci identifies successful runs, selects diverse examples, and converts them into eval scenarios. llmci never touches production — it only consumes what you export.

Terminal
$ llmci generate --from-logs ./exported-logs/ --output evals/data/dataset.jsonl

Generate from specs

Provide tool definitions and a system prompt. llmci generates realistic scenarios that exercise your full tool surface, with coverage targets to ensure edge cases are included.

Terminal
$ llmci generate --from-spec agent_config.yaml --output evals/data/agent_scenarios.jsonl

Augment existing examples

Start with 20–30 hand-curated examples. llmci generates perturbations — rephrased inputs, changed parameters, combined intents, edge cases — to expand to a statistically meaningful dataset.

Terminal
$ llmci generate --augment evals/data/seed.jsonl --output evals/data/expanded.jsonl --target-size 200

Real-world scenario

Your team is launching a new feature that uses an LLM to classify customer feedback. You have no eval dataset. On Monday, a product manager spends 2 hours writing 80 examples across the 6 feedback categories, focusing on edge cases they know are tricky. llmci dataset check flags that the "feature request" and "bug report" categories are underrepresented. They add 30 more targeted examples, bringing the hand-written core to 110, then expand it with llmci generate --augment. By Tuesday afternoon, you have a 200-example dataset with solid coverage, and your CI pipeline is gated.

Use Case 05

Migrating from Promptfoo

After OpenAI acquired Promptfoo in March 2026, teams using multiple model providers need a neutral eval tool they can trust. llmci is provider-neutral, community-owned, and offers capabilities Promptfoo never had.

Why teams are migrating

Config import

llmci can import your existing Promptfoo configuration:

Terminal
$ llmci import-promptfoo promptfooconfig.yaml
Detected 3 test suites
Converted 2 deterministic evals (exact match + F1)
Converted 1 LLM-as-judge eval (rubric-based)
Migrated 847 test cases across all suites
✓ Written to ./llmci.yaml
Review the config and run: llmci run

What you gain

Capability | Promptfoo | llmci
CI gating | Supported, but designed for comparison | Native, designed for pass/fail gating
Model migration | Not available | Automated with holdout validation
Pipeline-level testing | Primarily prompt-level | Full pipeline (catches upstream changes)
Agent evaluation | Not available | Trajectory + constraints + multi-turn
Dataset creation | Not available | Manual curation + automated generation
Relative thresholds | Limited | Configurable max-regression from baseline
Provider neutral | Owned by OpenAI | Community-owned, works with every provider

Real-world scenario

Your team has been using Promptfoo for 8 months with 5 eval suites across 3 repos. After the OpenAI acquisition, your platform team decides to migrate. They run llmci import-promptfoo on each repo, review the generated configs, and switch the CI workflows to use llmci. The migration takes half a day per repo. By the end of the week, all 3 repos are running on llmci with the same eval coverage they had before — plus model migration and pipeline-level testing they didn't have before.

Ready to get started?

Five-minute setup. Open source. Provider neutral.