
Evals

What is an Eval?

An eval (short for evaluation set) is a collection of questions, each with specific criteria for what constitutes a correct or high-quality answer. We use evals to automatically measure the quality of our system’s responses. By running our system on an evalset, we can assess how well it meets the defined criteria for each question.

All decisions about rolling out new features or models are based on the results of evals.

Supported Criteria

Currently, we support the following criteria for evaluating answers:

  • Target URL: The answer should rely on information from a specific reference document, provided as a URL.
  • Instructions: The answer should follow specific human-language instructions, such as including or excluding certain information, or following a particular style.

Below is a conceptual example of an evalset, showing how questions can be paired with criteria:

  • Question: How do I install SerenityGPT?
    Target URL: https://docs.serenitygpt.com/deployment/overview/

  • Question: What main features does SerenityGPT have?
    Instructions: The answer should mention custom integration and security.


More Details

Target URL

How the criterion is checked:

SerenityGPT strictly relies on documentation to answer questions, and every answer contains references to the documentation. If at least one of these references matches the target URL, the criterion is met.

Supported formats:

  • An exact URL to be matched
  • A regular expression to be matched
  • A list of exact URLs or regular expressions to be matched ("or" logic is used, so if any of the URLs or regexes match, the criterion is met)
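The matching logic above can be sketched in Python. This is an illustrative sketch only; the function name and the exact matching rules (e.g. how exact URLs are distinguished from regexes) are assumptions, not SerenityGPT's implementation:

```python
import re

def target_url_met(references, target):
    """Check the target-URL criterion (illustrative sketch).

    `references` are the documentation URLs cited in an answer.
    `target` may be an exact URL, a regular expression, or a list of
    either; list entries use "or" logic, so one match is enough.
    """
    targets = target if isinstance(target, list) else [target]
    for t in targets:
        for ref in references:
            if ref == t:  # exact URL match
                return True
            try:
                if re.search(t, ref):  # treat the target as a regex
                    return True
            except re.error:
                pass  # not a valid regex; exact match only
    return False
```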

Instructions

How the criterion is checked:

We use an LLM-as-judge to check if the answer follows the instructions. This means we provide the original question, Serenity's answer, and the instructions to a judge LLM. The judge LLM then determines if the answer follows the instructions.
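As a rough sketch, the judge input can be assembled like this. The prompt wording and the YES/NO reply protocol below are hypothetical illustrations, not SerenityGPT's actual prompt:

```python
def build_judge_prompt(question, answer, instructions):
    """Assemble the judge LLM input (hypothetical example prompt)."""
    return (
        "You are evaluating an answer against instructions.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Instructions: {instructions}\n"
        "Does the answer follow the instructions? Reply YES or NO."
    )

def instructions_met(judge_reply):
    """Interpret the judge's verdict (assumes a YES/NO protocol)."""
    return judge_reply.strip().upper().startswith("YES")
```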

Some requirements for writing effective instructions:

  • Instructions should not be tied to one specific answer or phrased from a personal perspective. For example:

    • ✗ “I don’t like that this answer mentions ABC.”
    • ✓ “The answer should not mention ABC.”
  • Instructions should be as human-readable and concrete as possible. For example, if you see emojis in answers and don’t like it, simply write: “The answer should not contain any emojis.” Don’t try to explain why it happens; just clearly state what you expect from the answer as a user.


Best Practices and Concerns

  • These evals are automated, so try to be as specific as possible — this will lead to more accurate results.
  • Serenity's answers are not deterministic (nor are LLM-as-judge's answers), so even the same version may give slightly different scores. This is fine. The more specific the instructions, the less volatility in scores.
  • Always try to use the "target URL" criterion. Usually, it is much easier to copy-paste the URL from the documentation than to write instructions. On our side, "target URL" is also preferred because:
    • it is a deterministic criterion
    • usually, if the answer relies on the correct document, it is conceptually correct
  • Use the "instructions" criterion only when it’s clear how to determine whether the answer follows it. For example:
    • “The answer should start with 'Here is the way to install...'” - is a good instruction
    • “The answer should not be too vague” - is not

Eval formats

We support all the usual formats for evalsets:

  • CSV
  • YAML
  • JSON

The only requirement is that the file should contain the following fields (columns):

  • question (values required)
  • target_url (values optional)
  • instructions (values optional)

While we support all of the listed formats, .yaml is the system’s native format. Any other format will be automatically converted to .yaml.
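A quick way to sanity-check a CSV evalset before using it is to verify the documented fields. This is an illustrative helper, not part of SerenityGPT; it assumes only `question` must have a value in every row:

```python
import csv
import io

REQUIRED_COLUMNS = ("question", "target_url", "instructions")

def validate_evalset_csv(text):
    """Return a list of problems found in a CSV evalset.

    An empty list means the file looks valid: all three documented
    columns are present, and every row has a non-empty question.
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS
                if c not in header]
    if problems:
        return problems
    for lineno, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row["question"] or "").strip():
            problems.append(f"line {lineno}: question is required")
    return problems
```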

Here is an example of a YAML file we use for evals.

questions:
  tenant-name:
    - question: How do I install SerenityGPT?
      target_url: https://docs.serenitygpt.com/deployment/overview/
    - question: What is SerenityGPT?
      instructions: The answer should mention that SerenityGPT is a powerful enterprise search solution designed to revolutionize how organizations access and utilize their internal knowledge
      target_url:
        - https://docs.serenitygpt.com/
        - https://docs.serenitygpt.com/product/overview/
    - question: What main features does SerenityGPT have?
      instructions: The answer should mention custom integration and security.
      target_url: ^docs.serenitygpt.com/.*

Here is the same configuration in a human-readable form:

  • Tenant: tenant-name
    Question: How do I install SerenityGPT?
    Target URL(s): https://docs.serenitygpt.com/deployment/overview/

  • Tenant: tenant-name
    Question: What is SerenityGPT?
    Target URL(s): https://docs.serenitygpt.com/, https://docs.serenitygpt.com/product/overview/
    Instructions: The answer should mention that SerenityGPT is a powerful enterprise search solution designed to revolutionize how organizations access and utilize their internal knowledge

  • Tenant: tenant-name
    Question: What main features does SerenityGPT have?
    Target URL(s): ^docs.serenitygpt.com/.*
    Instructions: The answer should mention custom integration and security.

Advanced usage

Filters usage

You can configure documentation filters that are applied when answering. For more information about filters, see https://docs.serenitygpt.com/how-to/search/custom-embedding/?h=filters#with-filters

To use filters, you need to add a filter field to the evalset. YAML example:

questions:
  tenant-name:
    - question: How do I install SerenityGPT?
      filter:
        - type: eq
          property: document_type
          value: DOC
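Conceptually, an eq filter keeps only the documents whose property equals the given value. Below is a minimal sketch, assuming filters combine with AND logic and handling only the eq type from the example above; the real semantics are defined by SerenityGPT:

```python
def apply_filters(documents, filters):
    """Keep only documents that satisfy every filter.

    AND logic across filters is an assumption here. Each filter is a
    dict with `type`, `property`, and `value` keys, as in the evalset
    example; only the `eq` type is handled in this sketch.
    """
    def matches(doc):
        for f in filters:
            if f["type"] == "eq" and doc.get(f["property"]) != f["value"]:
                return False
        return True
    return [doc for doc in documents if matches(doc)]
```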

Conversation

A conversation is a list of a user's questions within one chat. We don't support evaluating a conversation as a whole, but we do support evaluating every question in the conversation.

To use conversation, you need to add a conversation field to the evalset. YAML example:

questions:
  tenant-name:
    - conversation:
        - question: What is SerenityGPT?
          instructions: The answer should mention that SerenityGPT is a powerful enterprise search solution designed to revolutionize how organizations access and utilize their internal knowledge
        - question: How do I install it?
          target_url: https://docs.serenitygpt.com/deployment/overview/
    - question: What is eval?  # different chat

Here is a human-readable form of the above conversation example:

  • Tenant: tenant-name (chat 1)
    Question: What is SerenityGPT?
    Instructions: The answer should mention that SerenityGPT is a powerful enterprise search solution designed to revolutionize how organizations access and utilize their internal knowledge

  • Tenant: tenant-name (chat 1)
    Question: How do I install it?
    Target URL(s): https://docs.serenitygpt.com/deployment/overview/

  • Tenant: tenant-name (chat 2)
    Question: What is eval?

Eval server

We use an eval server to keep all evaluations in one place, making it easier to track quality trends and compare results.

You can either use our eval server or run your own in a separate docker container.

Eval server configuration

First, you need to configure the Serenity side.

In your main docker-compose.yml file, set the following environment variables:

services:
  backend:
    ...
    environment:
      ...
      EVAL_STORE: true  # whether to save evals to the server
      EVAL_STORE_URL: http://<yourhost>:8765/evals/  # the url of the eval server
      EVAL_STORE_KEY: <your-eval-key>  # the key to access the eval server

Self-hosted eval server

If you want to use a self-hosted eval server, you need to do the following:

  1. Log in to the Docker registry (this should already be done from when you set up Serenity).

  2. Configure the Docker container. Here is a docker-compose.yml example:

services:
  eval-store:
    image: infra.serenitygpt.com/serenity_eval:latest
    container_name: eval-store
    restart: unless-stopped
    volumes:
      - /srv/data/eval-store:/app/files
    environment:
      EVAL_SECRET_KEY: $EVAL_SECRET_KEY
    ports:
      - 8765:8000  # change the host port if needed
  3. Put a .env file with EVAL_SECRET_KEY=<your-eval-key> in the same directory as docker-compose.yml.

  4. Run docker compose up -d to start the eval server.