Evals
Evals are a core mechanism for validating Serenity's output quality. They help track accuracy, identify weak spots, and ensure performance stays consistent across product updates.
Serenity automatically evaluates its responses to eval questions, checking for correctness using predefined criteria.
You can use Evals to:
- Track baseline performance over time
- Validate new releases as part of CI/CD
- Identify inconsistent or problematic cases
- Ensure responses follow specific instructions
Defining an Eval
Evals are defined in simple .yaml files containing questions, expected answers, and metadata.
The example file below demonstrates the structure of an eval, including:
- `instructions` or `target_url` (or both) can be provided as the correct answer
- the `conversation` field can be used to test follow-up questions within the same conversation
- the `filters` field can be used to specify a filter for retrieving documents when answering the question
questions:
  snapweaver:
    - conversation: # test conversation mode
        - question: What is SnapWeaver?
          target_url: https://snapweaver.staging.serenitygpt.com/introduction.html
        - question: How to install it?
          target_url: https://snapweaver.staging.serenitygpt.com/installation.html
    - question: what is the capital of China? # test idk for out of topic questions
      instructions: should mention a lack of knowledge
    - question: How to get early beta history? Give me an exact url to it # test absolute urls
      instructions: should mention this url https://github.com/SerenityGPT/snapweaver/releasesadfs
    - question: How to add sh? # test synonyms
      instructions: should mention "Ctrl + Shift + S" for Windows and "Cmd + Shift + S" for a Mac. Mentioning "⌘⇧ + S" is acceptable as well.
    - question: was ist Snapweaver? # test language detection/translation
      instructions: should give an answer in German
      status: failed
    - question: snapweaverとは何ですか # test if Japanese characters are translated correctly
      instructions: should give an answer in Japanese
    - question: Where can I read about overview of how the engine works? Give me a url. # test filters
      filters:
        type: eq
        property: document_type
        value: FAQ
      instructions: should provide this url https://snapweaver.staging.serenitygpt.com/introduction.html
    - question: QZ91 # test keyword search
      target_url: https://snapweaver.staging.serenitygpt.com/release-notes.html
      instructions: should mention v1.0.2 release
Status Field
Each question has an optional status field:
| Status | Meaning |
|---|---|
| (blank) / `baseline` | Baseline. Serenity is expected to get this right. |
| `maybe` | Maybe. Sometimes correct, sometimes incorrect. Uncertain. |
| `false` | False. Known weak spot: Serenity does not get this right. |
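For example, a question that Serenity sometimes gets right and a known weak spot could be marked as follows. This is an illustrative snippet in the file format shown above; the questions themselves are made up.

```yaml
questions:
  snapweaver:
    - question: Does SnapWeaver support plugins? # hypothetical question, for illustration only
      instructions: should describe the plugin system
      status: maybe # sometimes correct, sometimes incorrect
    - question: How do I export a project? # hypothetical question, for illustration only
      instructions: should describe the export workflow
      status: false # known weak spot; omitting status marks a question as baseline
```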
How Evals Work
For each question, Serenity generates an answer. The system automatically applies evaluation logic:
✅ URL Check: if the question includes a relevant URL, we verify whether the URL was used to support the answer.
✅ Instruction Check: if the question has instructions, an LLM checks whether the answer conforms to them.
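Both checks can apply to the same question when both fields are present. A minimal illustration, reusing a URL from the example above (the question and instruction text are made up):

```yaml
- question: Where are the release notes? # hypothetical question, for illustration only
  target_url: https://snapweaver.staging.serenitygpt.com/release-notes.html # exercises the URL check
  instructions: should link to the release notes page # exercises the instruction check
```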
Running Evals
Evals can be executed from the command line:
Example:
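The exact flags depend on your setup; the basic form, which also appears in the statistical-tests section below, is:

```sh
./manage.py eval --file <eval-file-name> --times <number-of-times-to-run-the-eval>
```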
The results will be displayed directly in the console:

Interpreting the Results
The output highlights the status of each question:
- ❌ Wrong answers for baseline (B) questions: these indicate regressions
- 🤔 Correct answers for false (F) questions: good news, and a possible area for improvement

The other notation used in the eval output includes:
- the question ID, which can be looked up in the admin view
- B/F/M for the status of the question: baseline, false, or maybe
- ✅/⏹️ for whether the answer is correct or incorrect
- AUI for Answer (legacy)/URL/Instructions, color coded:
  - grey: none provided
  - green: correct
  - red: incorrect
- a number or '-' following AUI: the position of the target URL in the search results if one was provided, or '0'
- the question text
Evals in CI/CD
Running Evals is integrated into the CI/CD pipeline. This ensures:
- Every new release maintains or improves answer quality
- Regression detection happens automatically
- Weak spots are tracked over time
Failures in baseline Evals will cause the CI pipeline to fail, preventing faulty builds from shipping.
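How this is wired up depends on your CI system. As a sketch only, a GitHub Actions step (assumed here purely for illustration, not a documented integration) could run the same eval command and rely on its exit code to fail the job:

```yaml
# illustrative CI step; adapt the command and placeholders to your own pipeline
- name: Run Serenity evals
  run: ./manage.py eval --file <eval-file-name> --times <number-of-times-to-run-the-eval>
  # the job is assumed to fail when baseline questions regress
```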
Reviewing Eval Details
Results of all eval runs can be sent to the eval server, where they can be reviewed and compared.

Clicking "View Details" takes you to the particular run.

You can pick a question within the run to see its details, including the LLM-as-a-judge verdict on whether the answer conforms to the instructions.

Best Practices
✅ Keep Evals updated as product requirements evolve
✅ Add new questions when bugs are fixed or new features ship
✅ Review "maybe" and "false" questions regularly — improvements may shift their status
✅ Use Evals during both local development and automated CI
Eval server
We use an eval server to keep all evaluations in one place, making it easier to track quality trends and compare results.
You can either use our eval server or run your own in a separate Docker container.
Eval server configuration
First, you need to configure the Serenity side.
In your main `docker-compose.yml` file, set up the following lines:
services:
  backend:
    ...
    environment:
      ...
      EVAL_STORE: true # whether to save evals to the server
      EVAL_STORE_URL: http://<yourhost>:8765/evals/ # the url of the eval server
      EVAL_STORE_KEY: <your-eval-key> # the key to access the eval server
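After editing `docker-compose.yml`, the backend container has to pick up the new variables. With a standard Compose setup (an assumption; adapt this to how you normally deploy), recreating the service is enough:

```sh
# recreate the backend so it picks up the new EVAL_STORE_* variables
docker compose up -d backend
```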
Self-hosted eval server
If you want to use a self-hosted eval server, you need to do the following:
- Log in to the Docker registry (this should already be done when you set up Serenity).
- Configure the Docker container. Here is a `docker-compose.yml` example:
services:
  eval-store:
    image: infra.serenitygpt.com/serenity_eval:latest
    container_name: eval-store
    restart: unless-stopped
    volumes:
      - /srv/data/eval-store:/app/files
    environment:
      EVAL_SECRET_KEY: $EVAL_SECRET_KEY
    ports:
      - 8765:8000 # change the output port if needed
- Put a `.env` file with `EVAL_SECRET_KEY=<your-eval-key>` in the same directory as `docker-compose.yml` (see the combined snippet after this list).
- Run `docker compose up -d` to start the eval server.
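The last two steps amount to something like the following, run in the directory containing `docker-compose.yml` (the key value is a placeholder):

```sh
# create the .env file next to docker-compose.yml
echo "EVAL_SECRET_KEY=<your-eval-key>" > .env
# start the eval server in the background
docker compose up -d
```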
Statistical tests
One of the main issues with our evals is their randomness. We may run the same eval multiple times and get different results. The question is: is a given eval run significantly better or worse than it was before?
To answer this question, we use statistical tests.
Statistical tests are not a silver bullet, but they can help answer this question with a certain level of confidence.
To run the statistical tests, you need to annotate the evals first.
Eval annotation
At this stage, the goal is to calculate the statistics (the estimated probabilities of the questions being answered correctly) for the given evals.
To do this, run the command: `./manage.py annotate_eval --file <eval-file-name> --times <number-of-times-to-run-the-eval>`
All new eval runs will later be compared with this baseline. The eval files in the `evals` folder will be modified.
If the eval file is already annotated, you can either aggregate the statistics (the default) or use `--force` to re-annotate the eval.
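For example, to build a baseline from ten runs of an eval file (the run count of 10 is only an illustration):

```sh
# estimate per-question correctness probabilities over 10 runs
./manage.py annotate_eval --file <eval-file-name> --times 10
```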
Run stat tests
After the evals are annotated, you can run the statistical tests.
To do this, run the usual eval command: `./manage.py eval --file <eval-file-name> --times <number-of-times-to-run-the-eval>`.
Questions with critical statistics (an estimated probability of 0.0 or 1.0) will not be used in the statistical tests.
So, if you want to run only the non-critical questions, add the `--tests-only` argument.
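For example, to run only the non-critical questions four times each and apply the statistical tests (the file name is a placeholder; the run count follows the rule of thumb in the cautions below):

```sh
./manage.py eval --file <eval-file-name> --times 4 --tests-only
```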
After that, the stat-test results will be displayed: a p-value and a verdict (significant or not) for each test.
If the p-value is low enough (less than 0.05), the results are significantly better than the baseline.
If the p-value is high enough (greater than 0.95), the results are significantly worse.
In other cases, the results are not significantly different from the baseline.
Cautions
- Statistical tests work better with a large number of questions in the evals. There is little point in running stat tests with fewer than 20 different questions (with non-critical statistics).
- The rule of thumb from our Monte Carlo simulations is that there should be at least 200 questions in total to detect significant differences with high confidence. So, if you have 50 non-critical questions in your evals, you should run the eval with `--times 4`.
- Tests may be wrong: they may miss significant differences, or they may report significant differences when there are none. If you are unsure, try increasing the number of times the eval is run. Also look at the p-values manually: if they are very close to the boundaries (0 and 1), that is evidence that the changes are significant.
The formal math setup is described here: Setup