Evals
Evals are a core mechanism for validating Serenity's output quality. They help track accuracy, identify weak spots, and ensure performance stays consistent across product updates.
What is an Eval?
An Eval is a structured set of questions with expected answers, defined in a .yaml file. Serenity automatically evaluates its responses to these questions, checking for correctness using predefined criteria.
You can use Evals to:
- Track baseline performance over time
- Validate new releases as part of CI/CD
- Identify inconsistent or problematic cases
- Ensure responses follow specific instructions
Defining an Eval
Evals are defined in simple .yaml files containing questions, expected answers, and metadata.
The example file below demonstrates the structure of an eval, including:
- "instructions" or "target_url" (or both) can be provided as the correct answer (see Evals)
- the "conversation" field can be used to test follow-up questions in the same conversation
- the "filters" field can be used to specify a filter for retrieving documents when answering the question
```yaml
questions:
  snapweaver:
    - conversation: # test conversation mode
        - question: What is SnapWeaver?
          target_url: https://snapweaver.staging.serenitygpt.com/introduction.html
        - question: How to install it?
          target_url: https://snapweaver.staging.serenitygpt.com/installation.html
    - question: what is the capital of China? # test idk for out of topic questions
      instructions: should mention a lack of knowledge
    - question: How to get early beta history? Give me an exact url to it # test absolute urls
      instructions: should mention this url https://github.com/SerenityGPT/snapweaver/releasesadfs
    - question: How to add sh? # test synonyms
      instructions: should mention "Ctrl + Shift + S" for Windows and "Cmd + Shift + S" for a Mac. Mentioning "⌘⇧ + S" is acceptable as well.
    - question: was ist Snapweaver? # test language detection/translation
      instructions: should give an answer in German
      status: false
    - question: snapweaverとは何ですか # test if Japanese characters are translated correctly
      instructions: should give an answer in Japanese
    - question: Where can I read about overview of how the engine works? Give me a url. # test filters
      filters:
        type: eq
        property: document_type
        value: FAQ
      instructions: should provide this url https://snapweaver.staging.serenitygpt.com/introduction.html
    - question: QZ91 # test keyword search
      target_url: https://snapweaver.staging.serenitygpt.com/release-notes.html
      instructions: should mention v1.0.2 release
```
Status Field
Each question has an optional status field:

| Status | Meaning |
|---|---|
| (blank) / baseline | Baseline. Serenity is expected to get this right. |
| maybe | Maybe. Sometimes correct, sometimes incorrect. Uncertain. |
| false | False. Known weak spot; Serenity does not get this right. |
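In the eval file, the status is set per question. The sketch below reuses entries from the example above: the German-language question is recorded as a known weak spot, while a question without a status field counts as baseline.

```yaml
- question: was ist Snapweaver? # test language detection/translation
  instructions: should give an answer in German
  status: false # known weak spot; not expected to pass yet
- question: what is the capital of China? # no status field, so this is a baseline question
  instructions: should mention a lack of knowledge
```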
How Evals Work
For each question, Serenity generates an answer. The system automatically applies evaluation logic:
✅ URL Check: If the question specifies a target URL, we verify whether that URL was used to support the answer.
✅ Instruction Check: If the question has instructions, an LLM checks whether the answer conforms to them.
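Both checks can apply to the same question: when an entry specifies both a target_url and instructions, as the keyword-search question in the example above does, the answer is verified against each.

```yaml
- question: QZ91 # test keyword search
  target_url: https://snapweaver.staging.serenitygpt.com/release-notes.html # URL check
  instructions: should mention v1.0.2 release # instruction check (LLM-judged)
```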
Running Evals
Evals are executed from the command line, and the results are displayed directly in the console.
Interpreting the Results
The output highlights the status of each question:
- ❌ Wrong answers for baseline (B) questions: these indicate regressions
- 🤔 Correct answers for false (F) questions: good news and a possible area of improvement
The other notation used in the eval output includes:
- the question ID, which can be looked up in the admin view
- B/F/M for the status of the question: baseline, false, or maybe
- ✅/⏹️ for whether the answer is correct or incorrect
- AUI for Answer (legacy)/URL/Instructions, color coded as follows:
  - grey: none provided
  - green: correct
  - red: incorrect
- a number or '-' following AUI: the position of the target URL in the search results, if provided, or '0'
- the question text
Evals in CI/CD
Running Evals is integrated into the CI/CD pipeline. This ensures:
- Every new release maintains or improves answer quality
- Regression detection happens automatically
- Weak spots are tracked over time
Failures in baseline Evals will cause the CI pipeline to fail, preventing faulty builds from shipping.
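The exact pipeline configuration depends on your CI system and on how the eval runner is invoked in your project; the snippet below is only a sketch, assuming GitHub Actions and a hypothetical serenity-evals command.

```yaml
# Hypothetical CI job: the workflow layout, the `serenity-evals` command,
# and the eval file path are illustrative, not the documented interface.
name: evals
on: [push]
jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run baseline Evals
        # A non-zero exit code on failed baseline questions fails the pipeline,
        # preventing faulty builds from shipping.
        run: serenity-evals run evals/snapweaver.yaml
```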
Reviewing Eval Details
Results of all eval runs can be sent to the eval server, where they can be reviewed and compared.
Clicking "View Details" takes you to the particular run.
You can pick a question within the run to see its details, including the LLM-as-a-judge verdict on whether the answer conforms to the instructions.
Best Practices
✅ Keep Evals updated as product requirements evolve
✅ Add new questions when bugs are fixed or new features ship
✅ Review "maybe" and "false" questions regularly — improvements may shift their status
✅ Use Evals during both local development and automated CI