The missing CI quality gate for LLM-powered services

Stop merging code that silently degrades your LLM app

llmci runs black-box evals against your app in CI, compares behavior against a baseline, and blocks PRs when quality regresses.

The prompt did not change. A refactor changed context formatting. The model still answered, but the product got worse.

Pre-merge protection  ·  Baseline comparison  ·  Not a platform  ·  Any stack
pip install llmci

CLI command: llmci  ·  config: llmci.yaml

Pull request #1: demo — break classifier billing keywords
llmci-testbed · 8 eval configs · compares to origin/main
Merge blocked
github-actions bot commented just now
ticket-classifier · llmci.yaml
llmci Eval Report
Eval                    Metric            Score  Threshold  Status
service-classification  accuracy          0.583  ≥ 0.75
service-classification  f1_weighted       0.650  ≥ 0.70
rag-qa                  retrieval_recall  1.000  ≥ 0.90
ticket-safety           pii_leakage       1.000  ≥ 1.0
Regressions Detected
service-classification / accuracy: Score 0.583 < threshold 0.75
service-classification: 10 failed
Input (truncated)                                   Expected  Got
I was charged $49.99 twice for the same subscri...  billing   general
I want to cancel my subscription and get a pror...  billing   general

A CI check, not an eval platform

CI quality gate  ·  Not a platform  ·  Not observability  ·  Not vendor-owned

Works with any provider via litellm. Your keys, your infra, your repo.

01

LLM regressions do not only come from prompt changes.

LLM behavior can degrade from ordinary engineering work: the kind of changes that look safe in code review and still change what users experience.

Retrieval changes: Different chunks, ranking, filters, or missing documents.
Context formatting: Small template shifts that change what the model attends to.
Truncation: Token limits silently drop the evidence the answer needs.
Preprocessing: Normalization and cleaning change the task before the prompt.
Output parsing: A stricter parser turns acceptable answers into failures.
Thresholds: Confidence, routing, or fallback thresholds drift over time.
Model swaps: The provider or model changes, but product behavior must hold.
Refactors: Dependencies and app code move, and nobody reruns behavioral checks.
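As a minimal illustration of the truncation and context-formatting cases, the sketch below uses a hypothetical `build_context` helper with a character budget standing in for a token limit. None of these names come from llmci; they only show how a harmless-looking template change can push evidence out of the prompt.

```python
def build_context(chunks, header, budget):
    """Join a header and retrieved chunks, then cut to a fixed budget.

    `budget` is a character count standing in for a token limit (assumption).
    """
    text = header + "\n" + "\n".join(chunks)
    return text[:budget]


chunks = [
    "Plan: Pro, billed monthly.",
    "Invoice 1: $49.99 on May 1.",
    "Invoice 2: $49.99 on May 1 (duplicate charge).",
]

# Original template: everything fits inside the budget.
before = build_context(chunks, header="Context:", budget=120)

# A refactor makes the header more descriptive; the prompt itself is untouched,
# but the last chunk no longer fits and is cut mid-sentence.
after = build_context(
    chunks,
    header="Context (retrieved from the billing knowledge base):",
    budget=120,
)

print("duplicate charge" in before)
print("duplicate charge" in after)
```

The model still receives a well-formed prompt in both cases, which is why this kind of change tends to pass code review.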
02

Set up in five minutes.
Protected on every PR.

Add a YAML file and one CI step. llmci runs your evals, compares against your main-branch baseline, and exits 0 or 1. In monorepos, llmci discover and llmci run --all find and run every service config.

01

Define your evals

Add a llmci.yaml to your repo. Point it at your dataset, target command, metrics, and thresholds.

02

Run in CI

One step in your GitHub Action or CI pipeline. llmci runs evals and compares against your main branch baseline.

03

Gate every PR

If behavior degrades beyond your thresholds, the required check fails and the PR is blocked with a regression report.

GitHub PR → CI check → llmci run → Baseline diff → PR gate (✓ Merge / ✗ Block)
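The gate itself reduces to a comparison and an exit code. The sketch below is an assumption about the general shape, not llmci's implementation: `gate` and the result tuples are hypothetical, and the scores are taken from the sample eval report earlier on this page.

```python
def gate(results):
    """Return (exit_code, failures) for (eval, metric, score, floor) tuples."""
    failures = [
        f"{name} / {metric}: score {score:.3f} < threshold {floor}"
        for name, metric, score, floor in results
        if score < floor
    ]
    return (1 if failures else 0), failures


results = [
    ("service-classification", "accuracy", 0.583, 0.75),
    ("rag-qa", "retrieval_recall", 1.000, 0.90),
    ("ticket-safety", "pii_leakage", 1.000, 1.0),
]

exit_code, failures = gate(results)
for line in failures:
    print(line)
print("exit code:", exit_code)
# In CI the process would finish with sys.exit(exit_code),
# so a required check fails whenever any metric is below its floor.
```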
03

Test the system boundary your users actually hit.

llmci does not care whether your app uses RAG, agents, LangChain, custom services, OpenAI, Anthropic, or local models. Give it inputs, run your existing app, judge the outputs.

eval input

JSONL examples from your repo or object storage.

system boundary

your app / script / service

The real code path, wrapped by a thin command contract.

actual output

What the branch produced for each example.

judge + metrics

Exact match, F1, accuracy, or LLM-as-judge rubrics.

pass/fail CI

Block unsafe regressions before merge.
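For classification evals, the judge and metrics step can be pictured with plain Python. This is a sketch under assumptions: `exact_match` is treated as strict string equality, and `f1_macro` as the unweighted mean of per-class F1. The function names mirror the metric names used on this page but are not llmci's code.

```python
from collections import Counter


def exact_match(expected, got):
    """Judge one example: strict string equality (assumed behavior)."""
    return expected == got


def accuracy(pairs):
    """Fraction of (expected, got) pairs the judge accepts."""
    return sum(exact_match(e, g) for e, g in pairs) / len(pairs)


def f1_macro(pairs):
    """Unweighted mean of per-class F1 over every label that appears."""
    labels = {e for e, _ in pairs} | {g for _, g in pairs}
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, got in pairs:
        if exact_match(expected, got):
            tp[expected] += 1
        else:
            fp[got] += 1       # predicted label gets a false positive
            fn[expected] += 1  # true label gets a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# (expected, got) pairs as produced by running the app on each example.
pairs = [
    ("billing", "general"),
    ("billing", "billing"),
    ("hardware", "hardware"),
    ("general", "general"),
]
print(accuracy(pairs), f1_macro(pairs))
```

One misrouted billing ticket lowers F1 for both `billing` (a missed case) and `general` (a false hit), which is why macro F1 can move more than accuracy on small datasets.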

04

Ticket classifier CI in four files.

This is not a new platform. These are just files added to your existing repo. Your app code stays unchanged, wrapped by a thin eval script.

Your repo
  • your-repo/
    • src/
      • classifier.py
    • pyproject.toml
    • llmci.yaml
    • evals/
      • tickets.jsonl
    • run_prompt.py
    • .github/
      • workflows/
        • llmci.yml

src/ and pyproject.toml are your existing app. Run llmci init to scaffold the llmci additions.

evals/tickets.jsonl
{"input": "My printer won't connect to wifi", "expected": "hardware"}
{"input": "I need a refund for order #882", "expected": "billing"}
... 18 more examples
run_prompt.py
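# Your existing logic: llmci just calls this.
# `args` (parsed CLI arguments) and `classify` come from your own code.
import json
with open(args.input) as f:
    data = json.load(f)
result = classify(data["input"])
with open(args.output, "w") as f:
    json.dump({"output": result}, f)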
llmci.yaml
target:
  command: "python3 run_prompt.py ..."
evals:
  - name: ticket-classification
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - { name: f1_macro, threshold: 0.85 }
      - { name: accuracy, threshold: 0.90 }
.github/workflows/llmci.yml
name: llmci Evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: llmci-cli/llmci@main
        with:
          compare-to: origin/main

Commit these four files. On every pull request, CI runs llmci run against your dataset, compares scores to the main branch baseline, and posts an eval report as a PR comment — metrics, thresholds, and any failed examples. If quality drops, the check fails and the PR is blocked.
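The run_prompt.py excerpt leaves argument parsing to you. A fuller sketch of the wrapper might look like the following. The `--input` and `--output` flag names are assumptions (the command in llmci.yaml is elided on this page), and `classify` here is a keyword stand-in for the real function in src/classifier.py.

```python
import argparse
import json
import os
import tempfile


def classify(text):
    """Stand-in for the real classifier in src/classifier.py (hypothetical)."""
    lowered = text.lower()
    if "refund" in lowered or "charged" in lowered:
        return "billing"
    if "printer" in lowered or "wifi" in lowered:
        return "hardware"
    return "general"


def run(input_path, output_path):
    """Read one example, classify it, and write {"output": ...} as JSON."""
    with open(input_path) as f:
        data = json.load(f)
    with open(output_path, "w") as f:
        json.dump({"output": classify(data["input"])}, f)


def main(argv=None):
    parser = argparse.ArgumentParser()
    # Flag names are assumptions; match whatever your llmci.yaml command passes.
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    run(args.input, args.output)


# Demo: call main() with explicit arguments, as the CI command would.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "in.json")
    out_path = os.path.join(tmp, "out.json")
    with open(in_path, "w") as f:
        json.dump({"input": "I need a refund for order #882"}, f)
    main(["--input", in_path, "--output", out_path])
    with open(out_path) as f:
        result = json.load(f)
print(result)
```

In a real script, the demo at the bottom would be replaced by a call to `main()` under an `if __name__ == "__main__":` guard, and `classify` would be imported from your app.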

05

Not a platform.
A CI check.

llmci does not want to own your prompts, traces, datasets, or production logs. It lives in your repo, runs in your CI, and exits 0 or 1.

llmci owns the CI gate

  • A required check for pull requests
  • A config file committed with your code
  • A regression report posted where engineers review changes
  • An exit code your CI already understands

You keep your repo

  • Your app architecture stays in your framework and deployment path
  • Your eval data stays in your repo, bucket, or private storage
  • Your production logs stay in your observability stack
  • Your model providers stay with your keys and infrastructure
06

Your LLM quality, version-controlled

One YAML file defines the eval gate: quality floors, cost/token regression vs baseline, RAG retrieval metrics, and deterministic safety checks—all in the same PR comment.

  • Black-box testing

    Runs your pipeline as a subprocess. llmci doesn't need to know your stack—just inputs and outputs.

  • Built-in judges

    Exact match and F1 for classification; RAG retrieval; safety (PII scan); pairwise vs baseline; structured JSON Schema validation.

  • Upstream-aware

    Pipeline-level evals catch regressions from retrieval, preprocessing, and data changes—not just prompt edits.

  • Flexible thresholds

    Absolute quality floors or max-regression limits on accuracy, cost, tokens, and safety. JUnit/SARIF/HTML for any CI provider.

  • Local or remote datasets

    ./evals/tickets.jsonl in your repo, or s3:// and https:// URIs with optional caching.
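A deterministic safety check of the kind listed above can be as simple as a pattern scan. The sketch below is illustrative only: the two regular expressions and the `pii_score` helper are assumptions, not llmci's built-in judge, and it assumes a score of 1.0 means no output leaked PII, which is how the sample report's threshold reads.

```python
import re

# Deliberately simple patterns; a real scanner would cover far more cases.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def is_clean(output):
    """True when the output contains no email address or phone number."""
    return not (EMAIL.search(output) or PHONE.search(output))


def pii_score(outputs):
    """Fraction of outputs with no detected PII (1.0 means nothing leaked)."""
    return sum(is_clean(o) for o in outputs) / len(outputs)


outputs = [
    "Your refund was issued to the original payment method.",
    "Contact jane.doe@example.com for help.",
    "Call 555-123-4567 to speak with an agent.",
    "Please restart the printer and try again.",
]
print(pii_score(outputs))
```

Because the check is deterministic, the same outputs always produce the same score, which makes a threshold of 1.0 practical as a hard gate.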

llmci.yaml
version: 1
target:
  command: "python run_prompt.py ..."
evals:
  - name: ticket-classification
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: f1_macro
        threshold: 0.93
        mode: absolute
  - name: response-quality
    dataset: ./evals/responses.jsonl
    judge:
      type: llm
      rubric:
        - id: factual_accuracy
          prompt: "Is the response correct?"
    metrics:
      - name: pass_rate
        threshold: 0.03
        mode: max_regression
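The two `mode` values in this config can be read as two different comparisons. The sketch below assumes that `absolute` is a floor on the score and that `max_regression` caps how far the score may drop below the main-branch baseline. The `passes` function is hypothetical and only illustrates that reading.

```python
def passes(score, threshold, mode="absolute", baseline=None):
    """Decide whether one metric passes under the assumed mode semantics."""
    if mode == "absolute":
        # Quality floor: the score must reach the threshold.
        return score >= threshold
    if mode == "max_regression":
        # Regression limit: the drop from baseline must stay within threshold.
        if baseline is None:
            raise ValueError("max_regression needs a baseline score")
        return (baseline - score) <= threshold
    raise ValueError(f"unknown mode: {mode}")


# f1_macro with an absolute floor of 0.93
print(passes(0.95, 0.93))
print(passes(0.90, 0.93))

# pass_rate allowed to drop at most 0.03 below a baseline of 0.90
print(passes(0.88, 0.03, mode="max_regression", baseline=0.90))
print(passes(0.85, 0.03, mode="max_regression", baseline=0.90))
```

An absolute floor suits metrics with a known acceptable level, while a regression limit suits metrics such as LLM-judged pass rates where the baseline itself may shift over time.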
07

Built on the same eval gate.

Once your eval suite is running in CI, you can reuse it for migration and dataset workflows.

Coming from Promptfoo?

When you need migration details, see the Promptfoo migration guide.

08

Use the focused tool for the job.

llmci is not trying to replace observability, prompt workbenches, RAG evaluators, or Python test libraries. It is best when the job is blocking LLM regressions before merge.

Need                                                Best fit
Block PRs when LLM behavior regresses               llmci
Trace production LLM calls                          Langfuse / Opik / Braintrust
Compare prompts and models interactively            Promptfoo / Braintrust
Red-team for jailbreaks                             Promptfoo / Giskard
Evaluate RAG quality                                Ragas / DeepEval
Write Pythonic LLM tests                            DeepEval
Run black-box evals against any app boundary in CI  llmci
09

Notes from building LLM regression gates.

Practical guides for teams moving evals from notebooks into pull requests.

Make LLM regressions a failed check

Run black-box evals in CI, compare against baseline, and block unsafe PRs before they merge.