Missing CI quality gate for LLM-powered services

Stop merging code that silently
degrades your LLM app

llmci runs black-box evals against your app in CI, compares behavior against a baseline, and blocks PRs when quality regresses.

The prompt did not change. A refactor changed context formatting. The model still answered, but the product got worse.

Pre-merge protection · Baseline comparison · Not a platform · Any stack

pip install llmci

CLI command: llmci · config: llmci.yaml

Pull request #1: demo — break classifier billing keywords

llmci-testbed · 8 eval configs · compares to origin/main

Merge blocked

github-actions bot commented just now

...

ticket-classifier · llmci.yaml

llmci Eval Report

Eval	Metric	Score	Threshold	Status
service-classification	accuracy	0.583	≥ 0.75	❌
service-classification	f1_weighted	0.650	≥ 0.70	❌
rag-qa	retrieval_recall	1.000	≥ 0.90	✅
ticket-safety	pii_leakage	1.000	≥ 1.0	✅

Regressions Detected

service-classification / accuracy: Score 0.583 < threshold 0.75

▼ service-classification: 10 failed

Input (truncated)	Expected	Got
I was charged $49.99 twice for the same subscri...	billing	general
I want to cancel my subscription and get a pror...	billing	general
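
The Expected and Got columns above are all that accuracy and weighted F1 need. The following is a minimal sketch of how such scores can be computed from label pairs; it is not llmci's implementation, and the sample labels are hypothetical rather than the data behind the report.

```python
from collections import Counter


def accuracy(expected: list[str], got: list[str]) -> float:
    """Fraction of examples where the predicted label equals the expected one."""
    return sum(e == g for e, g in zip(expected, got)) / len(expected)


def f1_weighted(expected: list[str], got: list[str]) -> float:
    """Per-label F1, averaged with weights proportional to each label's support."""
    support = Counter(expected)
    total = 0.0
    for label, n in support.items():
        tp = sum(e == label and g == label for e, g in zip(expected, got))
        fp = sum(e != label and g == label for e, g in zip(expected, got))
        fn = n - tp
        # n > 0 for every label in `support`, so the denominator is never zero.
        total += (2 * tp / (2 * tp + fp + fn)) * n
    return total / len(expected)


# Hypothetical labels for illustration only.
expected = ["billing", "billing", "general", "hardware"]
got = ["general", "billing", "general", "hardware"]

print("accuracy:", accuracy(expected, got))
print("f1_weighted:", f1_weighted(expected, got))
```

A single billing ticket misrouted to general lowers both scores, which is the kind of shift a threshold in the report is meant to catch.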

Why llmci

LLM regressions do not only come from prompt changes.

LLM behavior can degrade from ordinary engineering work: the kind of changes that look safe in code review and still change what users experience.

Retrieval changes

Different chunks, ranking, filters, or missing documents.

Context formatting

Small template shifts that change what the model attends to.

Truncation

Token limits silently drop the evidence the answer needs.

Preprocessing

Normalization and cleaning change the task before the prompt.

Output parsing

A stricter parser turns acceptable answers into failures.

Thresholds

Confidence, routing, or fallback thresholds drift over time.

Model swaps

The provider or model changes, but product behavior must hold.

Refactors

Dependencies and app code move, and nobody reruns behavioral checks.
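
To make the context-formatting case concrete, here is a small hypothetical sketch: the prompt template is untouched, but a refactored formatter changes what the model actually receives. Every name in it (PROMPT_TEMPLATE, the two formatter functions, the sample documents) is invented for illustration.

```python
PROMPT_TEMPLATE = (
    "Classify the ticket using the context.\n\n"
    "Context:\n{context}\n\n"
    "Ticket: {ticket}"
)


def format_context_before(docs: list[dict]) -> str:
    # Original formatter: one titled block per document.
    return "\n\n".join(f"[{d['title']}]\n{d['body']}" for d in docs)


def format_context_after(docs: list[dict]) -> str:
    # Refactor that looks safe in review: bodies joined on one line, titles dropped.
    return " ".join(d["body"] for d in docs)


docs = [
    {"title": "Billing policy", "body": "Duplicate charges are refunded within 5 days."},
    {"title": "Hardware FAQ", "body": "Reset the printer before reconnecting to wifi."},
]
ticket = "I was charged twice for the same subscription"

before = PROMPT_TEMPLATE.format(context=format_context_before(docs), ticket=ticket)
after = PROMPT_TEMPLATE.format(context=format_context_after(docs), ticket=ticket)

print("prompt text identical:", before == after)
print("billing title still present:", "Billing policy" in after)
```

A diff of the template shows nothing, yet the rendered prompt has lost the titles that marked which passage was about billing. Only a behavioral check on outputs would notice.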

How It Works

Set up in five minutes.
Protected on every PR.

Add a YAML file and one CI step. llmci runs your evals, compares against your main-branch baseline, and exits 0 or 1. In monorepos, llmci discover and llmci run --all find and run every service config.

Define your evals

Add a llmci.yaml to your repo. Point it at your dataset, target command, metrics, and thresholds.

Run in CI

One step in your GitHub Action or CI pipeline. llmci runs evals and compares against your main branch baseline.

Gate every PR

If behavior degrades beyond your thresholds, the required check fails and the PR is blocked with a regression report.
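
The gate step reduces to a small decision: every metric either meets its threshold or it does not, and one miss fails the check. The sketch below illustrates that idea with two rows from the sample report; the MetricResult class and both functions are hypothetical and do not describe llmci's internals.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    eval_name: str
    metric: str
    score: float
    threshold: float  # treated here as an absolute floor: score must be >= threshold

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def render_report(results: list[MetricResult]) -> str:
    """Build a tab-separated report with one row per metric."""
    lines = ["Eval\tMetric\tScore\tThreshold\tStatus"]
    for r in results:
        status = "✅" if r.passed else "❌"
        lines.append(f"{r.eval_name}\t{r.metric}\t{r.score:.3f}\t≥ {r.threshold}\t{status}")
    return "\n".join(lines)


def exit_code(results: list[MetricResult]) -> int:
    """0 when every metric passes, 1 as soon as any metric fails."""
    return 0 if all(r.passed for r in results) else 1


results = [
    MetricResult("service-classification", "accuracy", 0.583, 0.75),
    MetricResult("rag-qa", "retrieval_recall", 1.000, 0.90),
]
print(render_report(results))
print("exit code:", exit_code(results))
```

Because the outcome is a plain process exit code, any CI system that can mark a step as required can enforce it.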

Black-Box Evaluation

Test the system boundary your users actually hit.

llmci does not care whether your app uses RAG, agents, LangChain, custom services, OpenAI, Anthropic, or local models. Give it inputs, run your existing app, judge the outputs.

eval input

JSONL examples from your repo or object storage.

system boundary

your app / script / service

The real code path, wrapped by a thin command contract.

actual output

What the branch produced for each example.

judge + metrics

Exact match, F1, accuracy, or LLM-as-judge rubrics.

pass/fail CI

Block unsafe regressions before merge.
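
The flow above can be sketched as a small harness that treats the app as a subprocess. This page elides the real target command, so the contract used here is an assumption: the "{input}" and "{output}" placeholders, the JSON file shapes, and the function names are all hypothetical, and STUB_APP is a stand-in for a real application.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_target(command: list[str], example: dict) -> str:
    """Run the app once for one example and return what it produced.

    Assumed contract: "{input}" and "{output}" in the command are replaced with
    paths to JSON files; the app reads {"input": ...} and writes {"output": ...}.
    """
    with tempfile.TemporaryDirectory() as tmp:
        in_path = Path(tmp) / "in.json"
        out_path = Path(tmp) / "out.json"
        in_path.write_text(json.dumps({"input": example["input"]}), encoding="utf-8")
        argv = [
            arg.replace("{input}", str(in_path)).replace("{output}", str(out_path))
            for arg in command
        ]
        subprocess.run(argv, check=True, timeout=60)
        return json.loads(out_path.read_text(encoding="utf-8"))["output"]


def exact_match_accuracy(command: list[str], dataset_path: str) -> float:
    """Score a JSONL dataset: the share of examples whose output equals "expected"."""
    lines = Path(dataset_path).read_text(encoding="utf-8").splitlines()
    examples = [json.loads(line) for line in lines if line.strip()]
    hits = sum(run_target(command, ex) == ex["expected"] for ex in examples)
    return hits / len(examples)


# Stand-in "app": labels anything mentioning "refund" as billing.
STUB_APP = (
    "import json, sys\n"
    "with open(sys.argv[1]) as f:\n"
    "    text = json.load(f)['input']\n"
    "label = 'billing' if 'refund' in text else 'hardware'\n"
    "with open(sys.argv[2], 'w') as f:\n"
    "    json.dump({'output': label}, f)\n"
)

# Example command; replace with the real entry point of the app under test.
STUB_COMMAND = [sys.executable, "-c", STUB_APP, "{input}", "{output}"]
```

Calling exact_match_accuracy(STUB_COMMAND, "evals/tickets.jsonl") would score the stub against a dataset. Nothing in the harness knows whether the command wraps RAG, an agent, or a single model call.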

Real Example

Ticket classifier CI in four files.

This is not a new platform. These are just files added to your existing repo. Your app code stays unchanged, wrapped by a thin eval script.

Your repo

your-repo/
- src/
  - classifier.py
- pyproject.toml
- …
- llmci.yaml
- evals/
  - tickets.jsonl
- run_prompt.py
- .github/
  - workflows/
    - llmci.yml

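The files under src/, pyproject.toml, and the rest of the existing tree are your app. The four additions are llmci.yaml, evals/tickets.jsonl, run_prompt.py, and .github/workflows/llmci.yml. Run llmci init to scaffold the llmci additions.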

evals/tickets.jsonl

{"input": "My printer won't connect to wifi", "expected": "hardware"}
{"input": "I need a refund for order #882", "expected": "billing"}
... 18 more examples

run_prompt.py

# Your existing logic — llmci just calls this
data = json.loads(open(args.input).read())
result = classify(data["input"])
json.dump({"output": result}, open(args.output, "w"))
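
The excerpt above omits its imports and argument parsing. A fuller version might look like the sketch below. The --input and --output flag names are assumptions, because the target command is elided on this page, and the classify function here is a stand-in for your real classifier (for example, the one in src/classifier.py).

```python
import argparse
import json


def classify(text: str) -> str:
    """Stand-in for the real classifier; replace with an import from your app."""
    return "billing" if "refund" in text.lower() else "hardware"


def main(argv: list[str]) -> None:
    parser = argparse.ArgumentParser(description="Thin eval wrapper around the app.")
    parser.add_argument("--input", required=True)   # assumed flag name
    parser.add_argument("--output", required=True)  # assumed flag name
    args = parser.parse_args(argv)

    # Read one example, run the existing logic, write one result.
    with open(args.input, encoding="utf-8") as f:
        data = json.load(f)
    result = classify(data["input"])
    with open(args.output, "w", encoding="utf-8") as f:
        json.dump({"output": result}, f)
```

Saved as a script, this would call main(sys.argv[1:]) from an if __name__ == "__main__" guard. Keeping the wrapper this thin means the eval exercises the same code path users hit.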

llmci.yaml

target:
  command: "python3 run_prompt.py ..."
evals:
  - name: ticket-classification
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - { name: f1_macro, threshold: 0.85 }
      - { name: accuracy, threshold: 0.90 }

.github/workflows/llmci.yml

name: llmci Evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: llmci-cli/llmci@main
        with:
          compare-to: origin/main

Commit these four files. On every pull request, CI runs llmci run against your dataset, compares scores to the main branch baseline, and posts an eval report as a PR comment — metrics, thresholds, and any failed examples. If quality drops, the check fails and the PR is blocked.

Not A Platform

Not a platform.
A CI check.

llmci does not want to own your prompts, traces, datasets, or production logs. It lives in your repo, runs in your CI, and exits 0 or 1.

llmci owns the CI gate

A required check for pull requests
A config file committed with your code
A regression report posted where engineers review changes
An exit code your CI already understands

You keep your repo

Your app architecture stays in your framework and deployment path
Your eval data stays in your repo, bucket, or private storage
Your production logs stay in your observability stack
Your model providers stay with your keys and infrastructure

Config as Code

Your LLM quality,
version-controlled

One YAML file defines the eval gate: quality floors, cost/token regression vs baseline, RAG retrieval metrics, and deterministic safety checks—all in the same PR comment.

Black-box testing

Runs your pipeline as a subprocess. llmci doesn't need to know your stack—just inputs and outputs.

Built-in judges

Exact match and F1 for classification; RAG retrieval; safety (PII scan); pairwise vs baseline; structured JSON Schema validation.

Upstream-aware

Pipeline-level evals catch regressions from retrieval, preprocessing, and data changes—not just prompt edits.

Flexible thresholds

Absolute quality floors or max-regression limits on accuracy, cost, tokens, and safety. JUnit/SARIF/HTML for any CI provider.

Local or remote datasets

./evals/tickets.jsonl in your repo, or s3:// and https:// URIs with optional caching.
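
A deterministic safety check such as the PII scan mentioned above can be as simple as pattern matching over outputs. The sketch below is illustrative only: the two patterns are deliberately narrow, the function names are hypothetical, and it assumes the score is the fraction of outputs with no match, so 1.0 means nothing leaked. How llmci computes pii_leakage is not specified on this page.

```python
import re

# Deliberately simple patterns; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text: str) -> list[str]:
    """Return the names of the patterns that match anywhere in the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]


def pii_free_rate(outputs: list[str]) -> float:
    """Fraction of outputs in which no pattern matched."""
    return sum(not find_pii(o) for o in outputs) / len(outputs)


# Hypothetical model outputs.
outputs = [
    "Your refund was issued to the original payment method.",
    "Please contact jane@example.com for details.",
]
print(pii_free_rate(outputs))
```

Because the check is a pure function of the output text, it gives the same verdict on every run, which is what makes it suitable as a hard gate.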

llmci.yaml

version: 1

target:
  command: "python3 run_prompt.py ..."

evals:
  - name: ticket-classification
    dataset: ./evals/tickets.jsonl
    judge: exact_match
    metrics:
      - name: f1_macro
        threshold: 0.93
        mode: absolute

  - name: response-quality
    dataset: ./evals/responses.jsonl
    judge:
      type: llm
      rubric:
        - id: factual_accuracy
          prompt: "Is the response correct?"
    metrics:
      - name: pass_rate
        threshold: 0.03
        mode: max_regression
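
The two modes in this config answer different questions: absolute asks whether the score is high enough, while max_regression asks whether it dropped too far from the baseline. The sketch below shows one plausible reading. It assumes higher scores are better and that the max_regression threshold is a drop in absolute score points, so 0.03 means three points of pass rate. llmci's exact semantics may differ.

```python
from typing import Optional


def passes(score: float, threshold: float, mode: str,
           baseline: Optional[float] = None) -> bool:
    """Decide whether one metric clears its gate under the given mode."""
    if mode == "absolute":
        # Quality floor: the score itself must reach the threshold.
        return score >= threshold
    if mode == "max_regression":
        # Relative gate: the drop from the baseline must not exceed the threshold.
        if baseline is None:
            raise ValueError("max_regression needs a baseline score")
        return baseline - score <= threshold
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical scores for illustration.
print(passes(0.95, 0.93, "absolute"))
print(passes(0.88, 0.03, "max_regression", baseline=0.90))
print(passes(0.85, 0.03, "max_regression", baseline=0.90))
```

A max-regression gate suits metrics like LLM-judged pass rate, where the absolute level is noisy but a sudden drop against main is a meaningful signal.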
            

Compare

Use the focused tool for the job.

llmci is not trying to replace observability, prompt workbenches, RAG evaluators, or Python test libraries. It is best when the job is blocking LLM regressions before merge.

Need	Best fit
Block PRs when LLM behavior regresses	llmci
Trace production LLM calls	Langfuse / Opik / Braintrust
Compare prompts and models interactively	Promptfoo / Braintrust
Red-team for jailbreaks	Promptfoo / Giskard
Evaluate RAG quality	Ragas / DeepEval
Write Pythonic LLM tests	DeepEval
Run black-box evals against any app boundary in CI	llmci


Built on the same eval gate.

Model/prompt migration →

Agent trajectory checks →

Live demo PRs →

Dataset augmentation →

Coming from Promptfoo?

Notes from building
LLM regression gates.

A prompt change that regresses an LLM classifier

The prompt did not change. The classifier still regressed.

A retrieval change that regresses a RAG app

An agent tool registry change routes users to the wrong action

Make LLM regressions
a failed check
