
Evals

Evals are a core mechanism for validating Serenity's output quality. They help track accuracy, identify weak spots, and ensure performance stays consistent across product updates.

Serenity automatically evaluates its responses to eval questions, checking for correctness using predefined criteria.

You can use Evals to:

  • Track baseline performance over time
  • Validate new releases as part of CI/CD
  • Identify inconsistent or problematic cases
  • Ensure responses follow specific instructions

Defining an Eval

Evals are defined in simple .yaml files containing questions, expected answers, and metadata.

The example file below demonstrates the structure of an eval. Note that:

  • "instructions" or "target_url" (or both) can be provided as the correct answer (see Evals)

  • "conversation" field can be used to test follow-up questions in the same conversation

  • "filters" field can be used to specify a filter for retrieving documents when answering the question

questions:
  snapweaver:
    - conversation:  # test conversation mode
      - question: What is SnapWeaver?
        target_url: https://snapweaver.staging.serenitygpt.com/introduction.html
      - question: How to install it?
        target_url: https://snapweaver.staging.serenitygpt.com/installation.html
    - question: what is the capital of China?  # test idk for out of topic questions
      instructions: should mention a lack of knowledge
    - question: How to get early beta history? Give me an exact url to it  # test absolute urls
      instructions: should mention this url https://github.com/SerenityGPT/snapweaver/releasesadfs
    - question: How to add sh?  # test synonyms
      instructions: should mention "Ctrl + Shift + S" for Windows and "Cmd + Shift + S" for a Mac. Mentioning "⌘⇧ + S" is acceptable as well.
    - question: was ist Snapweaver?  # test language detection/translation
      instructions: should give an answer in German
      status: false  # known weak spot
    - question: snapweaverとは何ですか  # test if Japanese characters are translated correctly
      instructions: should give an answer in Japanese
    - question: Where can I read about overview of how the engine works? Give me a url.  # test filters
      filters:
        type: eq
        property: document_type
        value: FAQ
      instructions: should provide this url https://snapweaver.staging.serenitygpt.com/introduction.html
    - question: QZ91  # test keyword search
      target_url: https://snapweaver.staging.serenitygpt.com/release-notes.html
      instructions: should mention v1.0.2 release

Status Field

Each question has an optional status field:

Status               Meaning
(blank) / baseline   Baseline: Serenity is expected to get this right.
maybe                Sometimes correct, sometimes incorrect; uncertain.
false                Known weak spot: Serenity does not get this right.
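
For example, a known weak spot can be marked with status: false directly in the eval file. The question and instructions below are purely illustrative; only the meaning of the status field comes from the table above:

questions:
  snapweaver:
    - question: How do I export a project as a PDF?  # illustrative question
      instructions: should describe the export dialog
      status: false  # known weak spot; a correct answer will be highlighted as 🤔 in the output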

How Evals Work

For each question, Serenity generates an answer, and the system then automatically applies the following checks:

URL check: if the question specifies a target URL, the system verifies whether that URL was used to support the answer.

Instruction check: if the question has instructions, an LLM judge checks whether the answer conforms to them.
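
For instance, the QZ91 entry from the example above exercises both checks:

    - question: QZ91  # keyword search
      target_url: https://snapweaver.staging.serenitygpt.com/release-notes.html  # URL check: was this page used?
      instructions: should mention v1.0.2 release  # instruction check: judged by an LLM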


Running Evals

Evals can be executed from the command line:

cd backend
./manage.py eval -t <tenant>

Example:

./manage.py eval -t snapweaver

The results will be displayed directly in the console:

[Screenshot: eval output example]


Interpreting the Results

The output highlights the status of each question:

  • ❌ Wrong answers for baseline (B) questions — these indicate regressions
  • 🤔 Correct answers for F (false) questions — good news, possible area of improvement

The other notation used in the eval output includes:

  • the question ID, which can be looked up in the admin view
  • B/F/M for the status of the question: baseline, false, maybe
  • ✅/⏹️ for whether the answer is correct/incorrect
  • AUI for the Answer (legacy), URL, and Instructions checks, color coded as follows:
    • grey: none provided
    • green: correct
    • red: incorrect
  • a number or '-' following AUI: the position of the target URL (if provided) in the search results, or '0'
  • the question text

Evals in CI/CD

Running Evals is integrated into the CI/CD pipeline. This ensures:

  • Every new release maintains or improves answer quality
  • Regression detection happens automatically
  • Weak spots are tracked over time

Failures in baseline Evals will cause the CI pipeline to fail, preventing faulty builds from shipping.
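
As a sketch, the eval command can be wired into an existing pipeline as a test step. The job below is a hypothetical GitLab CI example; only the eval command itself comes from this page, and it is assumed to exit with a non-zero status when a baseline question fails:

run-evals:
  stage: test
  script:
    - cd backend
    - ./manage.py eval -t snapweaver  # a failing baseline question fails this job, and thus the pipeline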


Reviewing Eval Details

Results of all eval runs can be sent to the eval server, where they can be reviewed and compared.

Clicking "View Details" takes you to the particular run. eval-server

You can pick a question within the run to see its details, including the LLM-as-a-judge verdict on whether the answer conforms to the instructions.


Best Practices

✅ Keep Evals updated as product requirements evolve

✅ Add new questions when bugs are fixed or new features ship

✅ Review "maybe" and "false" questions regularly — improvements may shift their status

✅ Use Evals during both local development and automated CI

Eval server

We use an eval server to keep all evaluations in one place, making it easier to track quality trends and compare results.

You can either use our eval server or run your own in a separate docker container.

Eval server configuration

First, configure the Serenity side.

In your main docker-compose.yml file, set the following environment variables:

services:
  backend:
    ...
    environment:
      ...
      EVAL_STORE: true  # whether to save evals to the server
      EVAL_STORE_URL: http://<yourhost>:8765/evals/  # the url of the eval server
      EVAL_STORE_KEY: <your-eval-key>  # the key to access the eval server

Self-hosted eval server

If you want to use a self-hosted eval server, do the following:

  1. Log in to the Docker registry (this should already have been done when you set up Serenity).

  2. Configure the Docker container. Here is a docker-compose.yml example:

services:
  eval-store:
    image: infra.serenitygpt.com/serenity_eval:latest
    container_name: eval-store
    restart: unless-stopped
    volumes:
      - /srv/data/eval-store:/app/files
    environment:
      EVAL_SECRET_KEY: $EVAL_SECRET_KEY
    ports:
      - 8765:8000  # change the output port if needed

  3. Put a .env file with EVAL_SECRET_KEY=<your-eval-key> in the same directory as docker-compose.yml.

  4. Run docker compose up -d to start the eval server.
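
The .env file itself only needs this one variable; the value below is a placeholder:

EVAL_SECRET_KEY=replace-with-a-long-random-string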

Statistical tests

One of the main issues with our evals is their randomness: running the same eval multiple times can produce different results. The question is whether a given eval run is significantly better or worse than before.

To answer this question, we use statistical tests.

Statistical tests are not a silver bullet, but they can help answer this question with a certain level of confidence.

To run the statistical tests, you need to annotate the evals first.

Eval annotation

At this stage the goal is to calculate baseline statistics (the estimated probability of each question being answered correctly) for the given evals.

To do this, run the following command:

./manage.py annotate_eval --file <eval-file-name> --times <number-of-times-to-run-the-eval>
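
For example (the file name is illustrative):

./manage.py annotate_eval --file snapweaver.yaml --times 10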

All future eval runs will be compared against this baseline. The eval files in the evals folder will be modified.

If the eval file is already annotated, you can either aggregate the statistics (default) or use --force to re-annotate the eval.

Run stat tests

After the evals are annotated, you can run the statistical tests. To do this, run the usual eval command:

./manage.py eval --file <eval-file-name> --times <number-of-times-to-run-the-eval>

Questions with critical statistics (an estimated probability of 0.0 or 1.0) are not used in the statistical tests. If you want to run only the non-critical questions, add the --tests-only argument, as shown in the example below.
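
For example, to run only the non-critical questions four times (the file name is illustrative):

./manage.py eval --file snapweaver.yaml --times 4 --tests-only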

After the run, the stat-test results are displayed: a p-value and a verdict (significant or not) for each test.

If the p-value is LOW enough (less than 0.05), the results are significantly BETTER.

If the p-value is HIGH enough (greater than 0.95), the results are significantly WORSE.

In other cases, the results are not significantly different from the baseline.

Cautions

  • Statistical tests work better with a large number of questions. There is little point in running stat tests with fewer than 20 DIFFERENT questions (with non-critical statistics).
  • The rule of thumb from our Monte Carlo simulations is that there should be at least 200 question runs in total to detect significant differences with high confidence. So, if you have 50 non-critical questions in your evals, run the eval with --times 4.
  • Tests may be wrong: they may miss significant differences or report significance when there is none. If you are unsure, increase the number of times the eval is run. Also look at the p-values manually: if they are very close to the boundaries (0 and 1), that is evidence that the changes are significant.

The formal math setup is described here: Setup