llmci runs black-box evals against your app in CI, compares behavior against a baseline, and blocks PRs when quality regresses.
The prompt did not change. A refactor changed context formatting. The model still answered, but the product got worse.
pip install llmci
CLI command: llmci · config: llmci.yaml
LLM behavior can degrade from ordinary engineering work: the kind of changes that look safe in code review and still change what users experience.
Add a YAML file and one CI step. llmci runs your evals, compares against your main-branch baseline, and exits 0 or 1. In monorepos, llmci discover and llmci run --all find and run every service config.
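For monorepos, the discovery step can be pictured as a walk over the repository looking for config files. The sketch below is an assumption about the general idea only, not how `llmci discover` is implemented; the function name `discover_configs` and the skipped directory names are hypothetical.

```python
from pathlib import Path

# Directories that rarely hold service configs (an illustrative choice).
SKIP_DIRS = {".git", "node_modules", ".venv"}


def discover_configs(root, filename="llmci.yaml"):
    """Return every config file under root, ignoring vendored directories."""
    found = []
    for path in Path(root).rglob(filename):
        # Drop matches that sit inside any skipped directory.
        if not SKIP_DIRS.intersection(path.relative_to(root).parts):
            found.append(path)
    return sorted(found)


if __name__ == "__main__":
    for config in discover_configs("."):
        print(config)
```

Each path found this way would correspond to one service whose evals run independently.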
Add an llmci.yaml to your repo. Point it at your dataset, target command, metrics, and thresholds.
Add one step to your GitHub Actions workflow or other CI pipeline. llmci runs your evals and compares the results against your main-branch baseline.
If behavior degrades beyond your thresholds, the required check fails and the PR is blocked with a regression report.
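The gate described in these steps comes down to comparing branch scores against thresholds and a baseline, then returning a process exit code. Here is a minimal sketch of that logic, assuming higher-is-better metrics. The `Threshold` shape and the `gate` function are hypothetical illustrations, not llmci's actual config schema or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Hypothetical threshold shape; the real config schema may differ."""
    floor: Optional[float] = None           # absolute minimum acceptable score
    max_regression: Optional[float] = None  # largest allowed drop vs. baseline


def gate(branch, baseline, thresholds):
    """Return (exit_code, failures) for higher-is-better metrics."""
    failures = []
    for metric, rule in thresholds.items():
        score = branch[metric]
        if rule.floor is not None and score < rule.floor:
            failures.append(f"{metric}: {score:.3f} is below floor {rule.floor:.3f}")
        base = baseline.get(metric)
        # Skip the regression check when no baseline score exists yet.
        if rule.max_regression is not None and base is not None:
            drop = base - score
            if drop > rule.max_regression:
                failures.append(
                    f"{metric}: dropped {drop:.3f} vs baseline "
                    f"(limit {rule.max_regression:.3f})"
                )
    return (1 if failures else 0), failures


if __name__ == "__main__":
    rules = {"accuracy": Threshold(floor=0.80, max_regression=0.03)}
    code, problems = gate({"accuracy": 0.85}, {"accuracy": 0.92}, rules)
    print(code, problems)
```

A CI step would pass the returned code to `sys.exit`, so a non-zero result fails the required check.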
llmci does not care whether your app uses RAG, agents, LangChain, custom services, OpenAI, Anthropic, or local models. Give it inputs, run your existing app, judge the outputs.
JSONL examples from your repo or object storage.
The real code path, wrapped by a thin command contract.
What the branch produced for each example.
Exact match, F1, accuracy, or LLM-as-judge rubrics.
Block unsafe regressions before merge.
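To make the flow above concrete, the following sketch runs a target command as a separate process for each JSONL example and scores the outputs with exact match. It assumes a contract of JSON on stdin and JSON on stdout, which is an illustrative guess rather than llmci's documented contract. The names `run_target`, `exact_match_accuracy`, `STUB`, and the `input`/`expected` fields are hypothetical.

```python
import json
import subprocess
import sys


def run_target(command, example):
    """Send one input to the target as JSON on stdin; parse JSON from stdout."""
    proc = subprocess.run(
        command,
        input=json.dumps(example["input"]),
        capture_output=True,
        text=True,
        check=True,   # a crashing target raises instead of scoring silently
        timeout=60,
    )
    return json.loads(proc.stdout)


def exact_match_accuracy(command, jsonl_text):
    """Fraction of examples whose output equals the expected value."""
    examples = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    hits = sum(run_target(command, ex) == ex["expected"] for ex in examples)
    return hits / len(examples)


# Stand-in target: a toy keyword classifier run as its own process.
STUB = (
    "import json, sys; text = json.load(sys.stdin); "
    "print(json.dumps('billing' if 'invoice' in text else 'other'))"
)

DATASET = """
{"input": "My invoice is wrong", "expected": "billing"}
{"input": "The app crashes on login", "expected": "technical"}
"""

if __name__ == "__main__":
    print(exact_match_accuracy([sys.executable, "-c", STUB], DATASET))
```

Because the runner only sees a command, its inputs, and its outputs, the same loop works whether the target wraps a RAG pipeline, an agent, or a single model call.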
This is not a new platform. These are just files added to your existing repo. Your app code stays unchanged, wrapped by a thin eval script.
Gray files are your existing app. Run llmci init to scaffold the llmci additions.
Commit these four files. On every pull request, CI runs llmci run against your dataset, compares scores to the main branch baseline, and posts an eval report as a PR comment — metrics, thresholds, and any failed examples. If quality drops, the check fails and the PR is blocked.
llmci does not want to own your prompts, traces, datasets, or production logs. It lives in your repo, runs in your CI, and exits 0 or 1.
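A thin eval script can be as small as an adapter that reads examples, calls the existing app entry point, and writes results. The sketch below assumes a JSONL-in, JSONL-out shape with `id`, `input`, and `output` fields. Those field names and the `classify_ticket` stand-in are hypothetical, not part of llmci.

```python
import io
import json
import sys


def classify_ticket(text):
    """Stand-in for the existing app entry point (hypothetical)."""
    return "billing" if "refund" in text.lower() else "general"


def main(stream_in, stream_out):
    """Read one JSON example per line and write one JSON result per line."""
    for line in stream_in:
        if not line.strip():
            continue  # tolerate blank lines in the dataset
        example = json.loads(line)
        result = {"id": example["id"], "output": classify_ticket(example["input"])}
        stream_out.write(json.dumps(result) + "\n")


if __name__ == "__main__":
    # Demo input; in CI this would be main(sys.stdin, sys.stdout).
    demo = io.StringIO('{"id": 1, "input": "Please refund my order"}\n')
    main(demo, sys.stdout)
```

The adapter imports the app rather than changing it, so the code path under test is the one users hit.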
One YAML file defines the eval gate: quality floors, cost/token regression vs baseline, RAG retrieval metrics, and deterministic safety checks—all in the same PR comment.
Runs your pipeline as a subprocess. llmci doesn't need to know your stack—just inputs and outputs.
Metrics cover exact match and F1 for classification, RAG retrieval, safety (PII scan), pairwise comparison against the baseline, and structured-output validation with JSON Schema.
Pipeline-level evals catch regressions from retrieval, preprocessing, and data changes—not just prompt edits.
Set absolute quality floors or max-regression limits on accuracy, cost, tokens, and safety. Reports export as JUnit, SARIF, or HTML for any CI provider.
./evals/tickets.jsonl in your repo, or s3:// and https:// URIs with optional caching.
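Two of the deterministic checks named above, classification F1 and a PII scan, are simple enough to sketch directly. The source does not say which F1 averaging or which PII patterns llmci uses, so macro averaging and the two regular expressions here are assumptions chosen for illustration. `macro_f1` and `contains_pii` are hypothetical names.

```python
import re


def macro_f1(predicted, expected):
    """Unweighted mean of per-label F1 over every label seen in either list."""
    labels = sorted(set(predicted) | set(expected))
    pairs = list(zip(predicted, expected))
    scores = []
    for label in labels:
        tp = sum(p == label and e == label for p, e in pairs)
        fp = sum(p == label and e != label for p, e in pairs)
        fn = sum(p != label and e == label for p, e in pairs)
        # Each label appears at least once, so the denominator is never zero.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Illustrative patterns only: an email address and a US-style SSN.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def contains_pii(text):
    """Return True when the output matches any of the patterns above."""
    return bool(EMAIL.search(text) or SSN.search(text))


if __name__ == "__main__":
    expected = ["billing", "billing", "tech", "tech"]
    predicted = ["billing", "tech", "tech", "tech"]
    print(round(macro_f1(predicted, expected), 3))
    print(contains_pii("Contact jane@example.com for help"))
```

Checks like these give the same result on every run, which keeps them stable as merge gates compared with rubric-based judging.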
Once your eval suite is running in CI, you can reuse it for migration and dataset workflows without changing the core adoption story.
Use the same protected eval suite to compare models, tune prompts, and validate migration candidates.
Evaluate tool use, cost budgets, and outcomes with the same black-box command contract.
Open the focused ticket-classifier demo PR to watch the prompt regression gate fail.
Start with hand-picked examples, then grow coverage with import, check, and augmentation tools.
llmci is not trying to replace observability, prompt workbenches, RAG evaluators, or Python test libraries. It is best when the job is blocking LLM regressions before merge.
| Need | Best fit |
|---|---|
| Block PRs when LLM behavior regresses | llmci |
| Trace production LLM calls | Langfuse / Opik / Braintrust |
| Compare prompts and models interactively | Promptfoo / Braintrust |
| Red-team for jailbreaks | Promptfoo / Giskard |
| Evaluate RAG quality | Ragas / DeepEval |
| Write Pythonic LLM tests | DeepEval |
| Run black-box evals against any app boundary in CI | llmci |
Practical guides for teams moving evals from notebooks into pull requests.
Walk through a public llmci-testbed classifier, the prompt diff that breaks billing routes, and the CI gate that catches it.
Use baseline comparisons, absolute floors, and focused metrics to avoid noisy gates while still blocking meaningful behavior drops.
Wrap the system boundary your users hit, evaluate tool calls and outcomes, and keep agent checks versioned with the code.
Run black-box evals in CI, compare against baseline, and block unsafe PRs before they merge.