Use Cases

Use Case 01

CI Regression Testing

Gate every pull request with automated eval checks. When someone changes a prompt, model, model parameters, or any upstream code that affects LLM output, llmci detects quality drops before the change reaches production.

The problem

Someone on your team updates a prompt. The change looks fine in manual testing. The PR gets approved and merged. A week later, customer support tickets spike — the model is misclassifying 8% of inputs that it used to handle correctly. Nobody connected the dots until real users were affected.

This happens because there's no automated check that runs the full eval dataset against the change. Manual spot-checks miss edge cases. Without a regression gate, quality drift is invisible until production breaks.

How llmci solves it

Add an llmci.yaml to your repo and point it at your eval dataset. llmci runs on every PR and compares results against the main branch baseline. If quality drops beyond your thresholds, the PR is blocked with a detailed report showing exactly which examples regressed.

llmci.yaml

version: 1

target:
  command: "python run_prompt.py --input {input_file} --output {output_file}"

evals:
  - name: ticket-classification
    level: pipeline
    dataset: ./evals/ticket_classification.jsonl
    judge: exact_match
    metrics:
      - name: f1_macro
        threshold: 0.93
        mode: absolute
      - name: accuracy
        threshold: 0.02
        mode: max_regression
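
The two metric entries above use different threshold modes. The sketch below shows one way such a gate could be evaluated. It is an illustration, not llmci's implementation: the function name `passes_threshold` is hypothetical, and it assumes `max_regression` measures the drop in absolute score points against the baseline. A relative reading would divide the drop by the baseline instead.

```python
from typing import Optional

EPSILON = 1e-9  # tolerance for floating-point rounding at the boundary


def passes_threshold(
    score: float,
    threshold: float,
    mode: str,
    baseline: Optional[float] = None,
) -> bool:
    """Return True when a metric satisfies its threshold (hypothetical helper)."""
    if mode == "absolute":
        # Quality floor: the score must reach the threshold regardless of baseline.
        return score >= threshold - EPSILON
    if mode == "max_regression":
        if baseline is None:
            raise ValueError("max_regression needs a baseline score")
        # Assumption: the drop is measured in absolute score points.
        return (baseline - score) <= threshold + EPSILON
    raise ValueError(f"unknown threshold mode: {mode}")
```

With a baseline accuracy of 0.95, a PR scoring 0.94 stays within a 0.02 regression budget, while a PR scoring 0.90 does not.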

Two testing levels

llmci supports both prompt-level and pipeline-level testing:

Level	Tests	Catches
Prompt-level	Prompt + model in isolation	Prompt edits, model changes, parameter changes
Pipeline-level	Full system (RAG, preprocessing, prompt, model)	Everything, including upstream code changes that alter what the model sees

Pipeline-level is the default. It catches the sneaky case where the prompt is unchanged but someone refactored the retrieval logic, changed a preprocessing step, or modified the data that feeds into the prompt.

Available judges and metrics

llmci ships with judges for common evaluation patterns:

Judge	Metrics	Best for
Exact match	Accuracy, F1 (macro/micro/weighted), precision, recall	Classification, extraction, routing
LLM-as-judge	Rubric pass rate (per-criterion or aggregate)	Summarization, conversation, creative writing
Custom (Python)	Any metric you define	JSON schema validation, business rules, regex, multi-field checks
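
For reference, the accuracy and macro-averaged F1 metrics listed for the exact match judge follow the standard definitions, sketched below in plain Python. The function names are illustrative and this is not llmci's internal code.

```python
from collections import Counter


def accuracy(expected: list, actual: list) -> float:
    """Fraction of predictions that exactly match the expected label."""
    if not expected:
        return 0.0
    return sum(e == a for e, a in zip(expected, actual)) / len(expected)


def macro_f1(expected: list, actual: list) -> float:
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for exp, act in zip(expected, actual):
        if exp == act:
            tp[exp] += 1
        else:
            fp[act] += 1  # predicted label gains a false positive
            fn[exp] += 1  # expected label gains a false negative
    scores = []
    for label in sorted(set(expected) | set(actual)):
        predicted = tp[label] + fp[label]
        relevant = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / relevant if relevant else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Macro F1 is a reasonable choice for classification gates because a regression in a small category lowers the score as much as one in a large category.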

Custom judges

For domain-specific evaluation logic that built-in judges can't cover, write a Python function:

judges/schema_judge.py

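import json

REQUIRED_KEYS = ("category", "confidence")

def evaluate(input: str, expected: str, actual: str) -> dict:
    """Score 1.0 when the model output is a JSON object with every required key."""
    try:
        parsed = json.loads(actual)
    except json.JSONDecodeError:
        return {"score": 0.0, "reason": "Invalid JSON"}
    # Valid JSON can still be a list, string, or number; require an object.
    if not isinstance(parsed, dict):
        return {"score": 0.0, "reason": "Output is not a JSON object"}
    missing = [k for k in REQUIRED_KEYS if k not in parsed]
    if missing:
        return {"score": 0.0, "reason": f"Missing keys: {missing}"}
    return {"score": 1.0}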

llmci.yaml

evals:
  - name: schema-validation
    judge:
      type: custom
      module: ./judges/schema_judge.py
      function: evaluate

Custom judges run locally — no LLM calls, no latency, no cost. Use them for JSON schema validation, business rule checks, regex patterns, or any evaluation logic specific to your domain.
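
As a second illustration, a regex-based business rule can follow the same `evaluate(input, expected, actual)` signature. This is a sketch under assumptions: the `ORD-` followed by six digits order ID format is invented for the example, and the returned `reason` field mirrors the schema judge above.

```python
import re

# Hypothetical business rule: order IDs look like ORD-123456.
ORDER_ID = re.compile(r"ORD-\d{6}")


def evaluate(input: str, expected: str, actual: str) -> dict:
    """Pass only when the output is a well-formed order ID equal to the expected one."""
    candidate = actual.strip()
    if not ORDER_ID.fullmatch(candidate):
        return {"score": 0.0, "reason": f"Not a valid order ID: {candidate!r}"}
    if candidate != expected.strip():
        return {"score": 0.0, "reason": "Well-formed ID, but not the expected one"}
    return {"score": 1.0}
```

Separating the format check from the equality check keeps the failure reason specific, which helps when reading a regression report.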

Two threshold modes

Absolute threshold

"F1 must be above 0.93, period." A quality floor that every PR must meet regardless of the baseline.

Relative threshold

"Score must not drop more than 5% from main branch." Answers the real question: did this PR make things worse?

GitHub Actions setup

One step in your workflow file. llmci exits 0 (pass) or 1 (fail), so it works as a CI gate without any special integration.

.github/workflows/llmci.yml

name: LLM Content Tests
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
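      - uses: actions/checkout@v4
        with:
          # Assumption: llmci resolves origin/main from the local clone.
          # The default shallow checkout does not fetch that ref.
          fetch-depth: 0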
      - run: pip install llmci
      - run: llmci run --compare-to=origin/main
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Real-world scenario

Your team runs a support ticket classifier. A developer updates the system prompt to improve handling of billing questions. The PR passes code review. llmci runs 243 eval examples and finds that while billing accuracy improved by 4%, hardware classification accuracy dropped 8%. The PR is blocked with a report showing the 7 specific examples that regressed, so the developer can fix the prompt before merging.

Try it yourself: examples/01-ci-regression — a runnable ticket classification example with eval dataset, config, and prompt.

Use Case 02

Model Migration

When you need to upgrade a model or switch providers, llmci automatically tunes your prompt to achieve parity on the new model. No manual prompt engineering. No guesswork. No fire drill.

The problem

Your model is being deprecated. Or a newer model offers better price/performance. You swap the model name in your config, run your evals, and scores drop 6%. Now you're in a multi-day prompt engineering cycle: tweak the prompt, re-run evals, check if you fixed one thing but broke another. Multiply this by every prompt in your system.

How llmci solves it

One command. llmci runs an iterative optimization loop — analogous to gradient descent — that makes small, targeted edits to your prompt until it achieves parity on the new model. It uses holdout validation and early stopping to prevent overfitting.

Terminal

$ llmci migrate --from gpt-4o --to gpt-4.5 --eval ticket-classification

Running baseline on gpt-4o...          0.952
Running on gpt-4.5 (current prompt)...  0.891

Optimizing prompt...
  Iteration 1:  train 0.912  val 0.908
  Iteration 2:  train 0.931  val 0.925
  Iteration 3:  train 0.944  val 0.939
  Iteration 4:  train 0.949  val 0.941
  Iteration 5:  train 0.950  val 0.942  (early stop)

✓ Migration complete
  Holdout score:  0.938  (baseline: 0.952)
  Prompt diff written to ./prompts/classify.txt.migrated

How the optimization works

The algorithm is modeled on gradient descent, with controls borrowed from ML training:

ML Concept	llmci Equivalent
Step size / learning rate	How many changes per iteration. Small steps aid debuggability.
Training set (70%)	Examples the optimizer uses to score each iteration
Validation set (15%)	Separate examples checked each iteration for early stopping
Holdout set (15%)	Only evaluated at the end — the honest final score
Early stopping	Halt when improvement plateaus for N iterations
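
The split and the early-stopping rule in the table can be sketched as follows. This is a minimal illustration, not llmci's optimizer: the function names, the `patience` and `min_delta` parameters, and their default values are assumptions chosen for the example.

```python
import random


def split_dataset(examples, seed=0):
    """Shuffle once, then split 70% train / 15% validation / 15% holdout."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    holdout = shuffled[n_train + n_val:]  # remainder; touched only at the end
    return train, val, holdout


def early_stop_index(val_scores, patience=2, min_delta=0.005):
    """Return the 0-based iteration at which to stop.

    Stops once the validation score has failed to beat the best score by
    more than min_delta for `patience` consecutive iterations.
    """
    best = float("-inf")
    stale = 0
    for i, score in enumerate(val_scores):
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return len(val_scores) - 1
```

Applied to the validation scores in the terminal output above (0.908, 0.925, 0.939, 0.941, 0.942) with these assumed settings, the last two iterations gain less than 0.005 over 0.939, so the loop stops at the fifth iteration.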

Step size control

The optimizer LLM is constrained to make minimal changes per iteration. It's instructed to prefer rewording existing instructions over adding new ones, and to never rewrite from scratch. This makes each iteration's contribution clear and debuggable.

What you get

A migration report with the optimized prompt diff, per-iteration scores, holdout validation results, and a breakdown of any remaining regressions with failure pattern analysis. You review the diff, commit it, and your CI evals confirm the migration is clean.

Real-world scenario

Your company uses GPT-4o across 12 classification prompts. OpenAI announces GPT-4o is being retired in 60 days. Instead of a month of manual prompt engineering, you run llmci migrate for each prompt. 10 of 12 converge to within 1% of the original baseline. The remaining 2 require minor manual adjustments flagged in the migration report. The entire migration takes 2 days instead of 4 weeks.

Try it yourself: examples/02-model-migration — migrate a ticket classifier from GPT-4o to GPT-4.5.

Use Case 03

Agent Testing

Agents make sequences of decisions — tool calls, routing, branching — often across multi-turn conversations. llmci evaluates the full trajectory, not just the final output.

The problem

Your support agent handles subscription cancellations. Someone updates the refund tool's API response format. The agent's prompts haven't changed. But now it misinterprets the refund confirmation at step 3, tells the customer their refund failed when it actually succeeded, and tries the refund again. The duplicate refund costs you real money. An output-only test would have missed this — the final output was still "refund processed."

How llmci solves it

llmci evaluates agents with a composite judge that checks four dimensions:

Outcome correctness

LLM-as-judge: did the agent achieve the correct final result?

Trajectory efficiency

LLM-as-judge: was the execution path logical and efficient?

Constraint enforcement

Deterministic: tool call budgets, token budgets, required/forbidden tools.

Cost accounting

Deterministic: total tokens, latency. Did the agent stay within budget?
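
The deterministic half of the composite judge needs no LLM call. A sketch of a constraint check over the tool names recorded in a trace is shown below. The `max_tool_calls` and `required_tools` keys match the scenario format shown later in this section; `forbidden_tools` and the function name `check_constraints` are assumptions made for the example.

```python
def check_constraints(tool_calls: list, constraints: dict) -> list:
    """Return a list of human-readable violations (empty list means pass)."""
    violations = []

    # Budget check: total number of tool calls in the trace.
    max_calls = constraints.get("max_tool_calls")
    if max_calls is not None and len(tool_calls) > max_calls:
        violations.append(
            f"tool call budget exceeded: {len(tool_calls)} > {max_calls}"
        )

    used = set(tool_calls)
    # Every required tool must appear at least once.
    for tool in constraints.get("required_tools", []):
        if tool not in used:
            violations.append(f"required tool not called: {tool}")
    # No forbidden tool may appear at all (assumed key name).
    for tool in constraints.get("forbidden_tools", []):
        if tool in used:
            violations.append(f"forbidden tool called: {tool}")
    return violations
```

In the duplicate-refund story above, a budget of two tool calls would flag the repeated refund call even though the final message looked correct.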

Multi-turn conversation testing

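Many agent interactions are multi-turn. llmci supports conversation scenarios where each turn has its own expected outcome and constraints, plus conversation-level budgets across all turns. Each scenario in the dataset is one conversation. The example is pretty-printed here for readability; JSON Lines files conventionally store one object per line.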

evals/support_conversations.jsonl

{
  "turns": [
    {
      "user_message": "Cancel my subscription",
      "expected": {
        "outcome": "Subscription cancelled, confirmation provided",
        "constraints": {"max_tool_calls": 3}
      }
    },
    {
      "user_message": "Do I get a refund for the remaining days?",
      "expected": {
        "outcome": "Prorated refund calculated and communicated",
        "constraints": {"required_tools": ["calculate_prorated_refund"]}
      }
    }
  ],
  "conversation_constraints": {"max_total_tool_calls": 12}
}
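
Because JSON Lines files conventionally hold one JSON object per line, a scenario written in the readable multi-line form above is usually serialized compactly before it goes into the dataset. The helpers below are a sketch using only the standard library. The names `dump_jsonl` and `load_jsonl` are illustrative, and the check for a `turns` list reflects the example format rather than a documented llmci schema.

```python
import json


def dump_jsonl(scenarios, path):
    """Write each scenario as a single compact line."""
    with open(path, "w", encoding="utf-8") as f:
        for scenario in scenarios:
            f.write(json.dumps(scenario, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read scenarios back, skipping blank lines and checking the basic shape."""
    scenarios = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            scenario = json.loads(line)
            # Assumed shape: every scenario carries a non-empty list of turns.
            if not scenario.get("turns"):
                raise ValueError(f"line {line_no}: scenario has no turns")
            scenarios.append(scenario)
    return scenarios
```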

Two testing levels for agents

Level	How	Catches
Agent-level (unit)	Your command runs the agent with mocked tools	Prompt changes, model changes, routing logic changes
Pipeline-level (integration)	Your command runs the agent with real tools	Everything, including tool API changes, database schema changes

llmci doesn't need to know the difference — it just evaluates the output trace. The mocked-vs-real distinction lives in your command, not in llmci's config.
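
One way to keep that distinction inside your own command is to pass the tool registry into the agent, so the same code path runs with mocked or real tools. The sketch below is purely illustrative: the toy agent, the tool name `cancel_subscription`, and the trace fields `tool_calls` and `final_output` are assumptions, not llmci's trace schema.

```python
import json
from typing import Callable, Dict

Tool = Callable[[dict], dict]


def mock_cancel_subscription(args: dict) -> dict:
    """Canned response standing in for the real subscription API."""
    return {"status": "cancelled", "user_id": args["user_id"]}


# Unit-level runs use this registry; integration runs would pass real clients.
MOCK_TOOLS: Dict[str, Tool] = {"cancel_subscription": mock_cancel_subscription}


def run_agent(input_data: dict, tools: Dict[str, Tool]) -> dict:
    """Toy agent: makes one tool call and records it in a trace dict."""
    args = {"user_id": input_data["user_id"]}
    result = tools["cancel_subscription"](args)
    return {
        "tool_calls": [
            {"name": "cancel_subscription", "args": args, "result": result}
        ],
        "final_output": f"Subscription {result['status']}",
    }


# The trace is plain data, so it serializes directly to JSON.
trace_json = json.dumps(run_agent({"user_id": "u-1"}, MOCK_TOOLS))
```

Swapping `MOCK_TOOLS` for a registry of real clients changes the test level without touching the agent code or the eval config.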

Framework adapters

Agent evals run your agent as a command that reads input JSON and writes trace JSON. Use TraceBuilder for mocks and custom frameworks, or run_for_llmci_sync for the OpenAI Agents SDK. Test-time only, never in production.

Python

from llmci.integrations.openai_agents import run_for_llmci_sync

# Convert RunResult to llmci trace JSON for eval
output = run_for_llmci_sync(build_agent(), input_data)

Reference adapter: OpenAI Agents SDK (pip install 'llmci[agents]'). Other frameworks integrate via TraceBuilder in your run_agent.py entrypoint.

Real-world scenario

Your team deploys a customer support agent built on the OpenAI Agents SDK. A developer updates the system prompt to be more concise. The change looks clean in testing. But on multi-turn conversations where the customer changes their mind mid-conversation ("actually, re-subscribe me on the basic plan"), the shorter prompt loses context and the agent re-creates the premium subscription instead. llmci's multi-turn eval catches this: turn 3's outcome judge fails, and the trajectory judge flags the incorrect tool call.

Try it yourself: examples/05-agent-single-turn, examples/06-agent-multi-turn, and examples/10-agent-openai-agents (OpenAI Agents SDK adapter).

Use Case 04

Dataset Creation

Building eval datasets is the biggest adoption barrier. But these are CI gates, not training sets. 200 carefully chosen examples beat 2,000 auto-generated ones. llmci makes dataset creation a guided process, not a guessing game.

Start with manual curation

The default and often best approach. A domain expert writes input/expected pairs focusing on coverage of important cases. You need breadth (cover the categories, edge cases, failure modes), not depth.

Terminal

$ llmci dataset init --name ticket-classification --type deterministic
✓ Created ./evals/ticket_classification.jsonl

$ llmci dataset add --name ticket-classification
Input: "My printer won't connect to wifi"
Expected: "hardware"
✓ Added. (47 examples, 6 categories covered)

$ llmci dataset check --name ticket-classification
203 examples across 8 categories
⚠ "returns" has only 4 examples (min recommended: 15)
⚠ No multilingual examples detected
✓ All other categories well-covered

The llmci dataset check command is key. It analyzes coverage gaps and tells you exactly where to add more examples. This turns "write until you feel done" into a guided process with a clear finish line.
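
The per-category part of such a coverage check is simple to reason about. The sketch below counts examples per expected label and flags thin categories. It is an illustration only: the `input` and `expected` field names are assumed from the `dataset add` prompts above, the function name is hypothetical, and the minimum of 15 mirrors the warning in the sample output.

```python
from collections import Counter


def coverage_warnings(examples, min_per_category=15):
    """Return one warning per category that has too few examples."""
    counts = Counter(example["expected"] for example in examples)
    return [
        f'"{category}" has only {count} examples '
        f"(min recommended: {min_per_category})"
        for category, count in sorted(counts.items())
        if count < min_per_category
    ]
```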

Then expand with automated generation

Once you have a solid manual core, llmci offers three strategies to expand:

Import production logs

Export logs from Langfuse, Arize, or your own system. llmci identifies successful runs, selects diverse examples, and converts them into eval scenarios. llmci never touches production — it only consumes what you export.

Terminal

$ llmci generate --from-logs ./exported-logs/ --output evals/data/dataset.jsonl

Generate from specs

Provide tool definitions and a system prompt. llmci generates realistic scenarios that exercise your full tool surface, with coverage targets to ensure edge cases are included.

Terminal

$ llmci generate --from-spec agent_config.yaml --output evals/data/agent_scenarios.jsonl

Augment existing examples

Start with 20–30 hand-curated examples. llmci generates perturbations — rephrased inputs, changed parameters, combined intents, edge cases — to expand to a statistically meaningful dataset.

Terminal

$ llmci generate --augment evals/data/seed.jsonl --output evals/data/expanded.jsonl --target-size 200

Real-world scenario

Your team is launching a new feature that uses an LLM to classify customer feedback. You have no eval dataset. On Monday, a product manager spends 2 hours writing 80 examples across the 6 feedback categories, focusing on edge cases they know are tricky. llmci dataset check flags that the "feature request" and "bug report" categories are underrepresented. They add 30 more targeted examples, then expand the set with llmci generate --augment. By Tuesday afternoon, you have a 200-example dataset with solid coverage, and your CI pipeline is gated.

Use Case 05

Migrating from Promptfoo

After OpenAI acquired Promptfoo in March 2026, teams using multiple model providers need a neutral eval tool they can trust. llmci is provider-neutral, community-owned, and offers capabilities Promptfoo never had.

Why teams are migrating

Provider neutrality. Promptfoo is now owned by OpenAI. Teams using Anthropic, Google, Mistral, or open-source models may not trust an OpenAI-owned eval tool to remain unbiased.
Strategic direction. OpenAI likely acquired Promptfoo for red-teaming/security capabilities. General eval and CI gating may be deprioritized.
Feature gaps. Promptfoo never offered automated model migration, agentic trajectory evaluation, or eval dataset generation.

Config import

llmci can import your existing Promptfoo configuration:

Terminal

$ llmci import-promptfoo promptfooconfig.yaml

Detected 3 test suites
Converted 2 deterministic evals (exact match + F1)
Converted 1 LLM-as-judge eval (rubric-based)
Migrated 847 test cases across all suites

✓ Written to ./llmci.yaml
  Review the config and run: llmci run

What you gain

Capability	Promptfoo	llmci
CI gating	Supported, but designed for comparison	Native — designed for pass/fail gating
Model migration	Not available	Automated with holdout validation
Pipeline-level testing	Primarily prompt-level	Full pipeline (catches upstream changes)
Agent evaluation	Not available	Trajectory + constraints + multi-turn
Dataset creation	Not available	Manual curation + automated generation
Relative thresholds	Limited	Configurable max-regression from baseline
Provider neutral	Owned by OpenAI	Community-owned, works with every provider

Real-world scenario

Your team has been using Promptfoo for 8 months with 5 eval suites across 3 repos. After the OpenAI acquisition, your platform team decides to migrate. They run llmci import-promptfoo on each repo, review the generated configs, and switch the CI workflows to use llmci. The migration takes half a day per repo. By the end of the week, all 3 repos are running on llmci with the same eval coverage they had before — plus model migration and pipeline-level testing they didn't have before.
