Evals
Evals are a core mechanism for validating Serenity's output quality. They help track accuracy, identify weak spots, and ensure performance stays consistent across product updates.
Serenity automatically evaluates its responses to eval questions, checking for correctness using predefined criteria.
You can use Evals to:
- Track baseline performance over time
- Validate new releases as part of CI/CD
- Identify inconsistent or problematic cases
- Ensure responses follow specific instructions
Defining an Eval
Evals are defined in simple .yaml files containing questions, expected answers, and metadata.
The example file below demonstrates the structure of an eval, including:
- `instructions` or `target_url` (or both) can be provided as the correct answer
- the `conversation` field can be used to test follow-up questions within the same conversation
- the `filters` field can be used to specify a filter for retrieving documents when answering the question
questions:
  snapweaver:
    - conversation: # test conversation mode
        - question: What is SnapWeaver?
          target_url: https://snapweaver.staging.serenitygpt.com/introduction.html
        - question: How to install it?
          target_url: https://snapweaver.staging.serenitygpt.com/installation.html
    - question: what is the capital of China? # test idk for out of topic questions
      instructions: should mention a lack of knowledge
    - question: How to get early beta history? Give me an exact url to it # test absolute urls
      instructions: should mention this url https://github.com/SerenityGPT/snapweaver/releasesadfs
    - question: How to add sh? # test synonyms
      instructions: should mention "Ctrl + Shift + S" for Windows and "Cmd + Shift + S" for a Mac. Mentioning "⌘⇧ + S" is acceptable as well.
    - question: was ist Snapweaver? # test language detection/translation
      instructions: should give an answer in German
      status: failed
    - question: snapweaverとは何ですか # test if Japanese characters are translated correctly
      instructions: should give an answer in Japanese
    - question: Where can I read about overview of how the engine works? Give me a url. # test filters
      filters:
        type: eq
        property: document_type
        value: FAQ
      instructions: should provide this url https://snapweaver.staging.serenitygpt.com/introduction.html
    - question: QZ91 # test keyword search
      target_url: https://snapweaver.staging.serenitygpt.com/release-notes.html
      instructions: should mention v1.0.2 release
Status Field
Each question has an optional status field:
| Status | Meaning |
|---|---|
| (blank) / `baseline` | Baseline. Serenity is expected to get this right. |
| `maybe` | Maybe. Sometimes correct, sometimes incorrect. Uncertain. |
| `false` | False. Known weak spot: Serenity does not get this right. |
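For example, a question that Serenity sometimes gets right and a known weak spot could be marked as follows. This is an illustrative snippet in the file format shown above; the questions themselves are made up.

```yaml
questions:
  snapweaver:
    - question: Does SnapWeaver support plugins? # hypothetical question, for illustration only
      instructions: should describe the plugin system
      status: maybe # sometimes correct, sometimes incorrect
    - question: How do I export a project? # hypothetical question, for illustration only
      instructions: should describe the export workflow
      status: false # known weak spot; omitting status marks a question as baseline
```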
How Evals Work
For each question, Serenity generates an answer. The system automatically applies evaluation logic:
✅ URL Check: if the question includes a relevant URL, we verify whether the URL was used to support the answer.
✅ Instruction Check: if the question has instructions, an LLM checks whether the answer conforms to them.
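Both checks can apply to the same question when both fields are present. A minimal illustration, reusing a URL from the example above (the question and instruction text are made up):

```yaml
- question: Where are the release notes? # hypothetical question, for illustration only
  target_url: https://snapweaver.staging.serenitygpt.com/release-notes.html # exercises the URL check
  instructions: should link to the release notes page # exercises the instruction check
```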
Running Evals
Evals can be executed from the command line:
Example:
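The exact flags depend on your setup; the basic form, which also appears in the statistical-tests section below, is:

```sh
./manage.py eval --file <eval-file-name> --times <number-of-times-to-run-the-eval>
```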
The results will be displayed directly in the console:

Interpreting the Results
The output highlights the status of each question:
- ❌ Wrong answers for baseline (B) questions: these indicate regressions
- 🤔 Correct answers for false (F) questions: good news, and a possible area for improvement

The other notation used in the eval output includes:
- the question ID, which can be looked up in the admin view
- B/F/M for the status of the question: baseline, false, or maybe
- ✅/⏹️ for whether the answer is correct or incorrect
- AUI for Answer (legacy)/URL/Instructions, color coded:
  - grey: none provided
  - green: correct
  - red: incorrect
- a number or '-' following AUI: the position of the target URL in the search results if one was provided, or '0'
- the question text
Evals in CI/CD
Running Evals is integrated into the CI/CD pipeline. This ensures:
- Every new release maintains or improves answer quality
- Regression detection happens automatically
- Weak spots are tracked over time
Failures in baseline Evals will cause the CI pipeline to fail, preventing faulty builds from shipping.
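How this is wired up depends on your CI system. As a sketch only, a GitHub Actions step (assumed here purely for illustration, not a documented integration) could run the same eval command and rely on its exit code to fail the job:

```yaml
# illustrative CI step; adapt the command and placeholders to your own pipeline
- name: Run Serenity evals
  run: ./manage.py eval --file <eval-file-name> --times <number-of-times-to-run-the-eval>
  # the job is assumed to fail when baseline questions regress
```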
Reviewing Eval Details
Results of all eval runs can be sent to the eval server, where they can be reviewed and compared.

Clicking "View Details" takes you to the particular run.

You can pick a question within the run to see its details, including the LLM-as-a-judge verdict on whether the answer conforms to the instructions.

Best Practices
✅ Keep Evals updated as product requirements evolve
✅ Add new questions when bugs are fixed or new features ship
✅ Review "maybe" and "false" questions regularly — improvements may shift their status
✅ Use Evals during both local development and automated CI
Eval server
We use an eval server to keep all evaluations in one place, making it easier to track quality trends and compare results.
You can either use our eval server or run your own in a separate Docker container.
Eval server configuration
First, you need to configure the Serenity side.
In your main `docker-compose.yml` file, set up the following lines:
services:
  backend:
    ...
    environment:
      ...
      EVAL_STORE: true # whether to save evals to the server
      EVAL_STORE_URL: http://<yourhost>:8765/evals/ # the url of the eval server
      EVAL_STORE_KEY: <your-eval-key> # the key to access the eval server
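After editing `docker-compose.yml`, the backend container has to pick up the new variables. With a standard Compose setup (an assumption; adapt this to how you normally deploy), recreating the service is enough:

```sh
# recreate the backend so it picks up the new EVAL_STORE_* variables
docker compose up -d backend
```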
Self-hosted eval server
If you want to use a self-hosted eval server, you need to do the following:
- Log in to the Docker registry (this should already be done when you set up Serenity).
- Configure the Docker container. Here is a `docker-compose.yml` example:
services:
  eval-store:
    image: infra.serenitygpt.com/serenity_eval:latest
    container_name: eval-store
    restart: unless-stopped
    volumes:
      - /srv/data/eval-store:/app/files
    environment:
      EVAL_SECRET_KEY: $EVAL_SECRET_KEY
    ports:
      - 8765:8000 # change the output port if needed
- Put a `.env` file with `EVAL_SECRET_KEY=<your-eval-key>` in the same directory as `docker-compose.yml` (see the combined snippet after this list).
- Run `docker compose up -d` to start the eval server.
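The last two steps amount to something like the following, run in the directory containing `docker-compose.yml` (the key value is a placeholder):

```sh
# create the .env file next to docker-compose.yml
echo "EVAL_SECRET_KEY=<your-eval-key>" > .env
# start the eval server in the background
docker compose up -d
```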
Statistical tests
One of the main issues with our evals is their randomness. We may run the same eval multiple times and get different results. The question is: is a given eval run significantly better or worse than it was before?
To answer this question, we use statistical tests.
Statistical tests are not a silver bullet, but they can help answer this question with a certain level of confidence.
To run the statistical tests, you need to annotate the evals first.
Eval annotation
At this stage, the goal is to calculate the statistics (the estimated probabilities of the questions being answered correctly) for the given evals.
To do this, run the command: `./manage.py annotate_eval --file <eval-file-name> --times <number-of-times-to-run-the-eval>`
All new eval runs will later be compared with this baseline. The eval files in the `evals` folder will be modified.
If the eval file is already annotated, you can either aggregate the statistics (the default) or use `--force` to re-annotate the eval.
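For example, to build a baseline from ten runs of an eval file (the run count of 10 is only an illustration):

```sh
# estimate per-question correctness probabilities over 10 runs
./manage.py annotate_eval --file <eval-file-name> --times 10
```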
Run stat tests
After the evals are annotated, you can run the statistical tests.
To do this, run the usual eval command: `./manage.py eval --file <eval-file-name> --times <number-of-times-to-run-the-eval>`.
Questions with critical statistics (an estimated probability of 0.0 or 1.0) will not be used in the statistical tests.
So, if you want to run only the non-critical questions, add the `--tests-only` argument.
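For example, to run only the non-critical questions four times each and apply the statistical tests (the file name is a placeholder; the run count follows the rule of thumb in the cautions below):

```sh
./manage.py eval --file <eval-file-name> --times 4 --tests-only
```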
After that, the stat-test results will be displayed: a p-value and a verdict (significant or not) for each test.
If the p-value is low enough (less than 0.05), the results are significantly better than the baseline.
If the p-value is high enough (greater than 0.95), the results are significantly worse.
In other cases, the results are not significantly different from the baseline.
Cautions
- Statistical tests work better with a large number of questions in the evals. There is little point in running stat tests with fewer than 20 different questions (with non-critical statistics).
- The rule of thumb from our Monte Carlo simulations is that there should be at least 200 questions in total to detect significant differences with high confidence. So, if you have 50 non-critical questions in your evals, you should run the eval with `--times 4`.
- Tests may be wrong: they may miss significant differences, or they may report significant differences when there are none. If you are unsure, try increasing the number of times the eval is run. Also look at the p-values manually: if they are very close to the boundaries (0 and 1), that is evidence that the changes are significant.
The formal math setup is described here: Setup