Eval Sets

Evaluating the quality of Serenity's configuration is what lets us achieve exceptional accuracy. At the heart of this evaluation process are eval sets: structured collections of questions paired with correct answers, against which accuracy is measured.

What is an Eval?

An eval is a curated dataset consisting of:

  • A list of questions that are likely to be asked or have been asked before.
  • Corresponding correct answers.

During evaluation, SerenityGPT is prompted with each question. Its response is then compared to the correct answer by an LLM judge (LLM-as-a-judge), which assesses how well the generated answer aligns with the expected response.

The performance on these eval sets drives all configuration decisions.
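
To make this concrete, here is a minimal sketch of such an evaluation loop in Python. It assumes a hypothetical ask_serenity callable for querying the system and uses an OpenAI-style chat API for the judge; the function names, judge prompt, and model choice are all illustrative, not SerenityGPT's actual internals.

# Hypothetical eval loop: ask_serenity, the judge prompt, and the
# model choice are illustrative stand-ins, not SerenityGPT's real API.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Expected answer: {expected}
Generated answer: {generated}
Reply PASS if the generated answer aligns with the expected answer, otherwise FAIL."""

def judge(question, expected, generated):
    # One judge call per question; returns True if the judge says PASS.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, expected=expected, generated=generated
            ),
        }],
    )
    return response.choices[0].message.content.strip().startswith("PASS")

def run_eval(eval_set, ask_serenity):
    # eval_set: list of {"question": ..., "expected": ...} records.
    passed = sum(
        judge(q["question"], q["expected"], ask_serenity(q["question"]))
        for q in eval_set
    )
    return passed / len(eval_set)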

Formats for Correct Answers

We support two formats for providing correct answers in an eval set:

  1. Reference Link (preferred and easiest)
    Link to the source document or page where the correct answer can be found.

  2. Answer Highlights
    A high-level summary of the answer that can also mention what the answer should not contain.

Feel free to mix and match the formats, or specify both for the same question. You can provide the eval as a CSV file or as a YAML file.

CSV Example

Question,Link,Answer
Can I travel with a dog,https://help.lyft.com/hc/en-us/all/articles/8559088908-pet-rides-for-riders,
How do I install Puppet Enterprise,,mention 2 installation modes: tarball and installation manager

YAML Example

questions:
  <tenant_name>:
    - question: Can I travel with a dog
      target_url: https://help.lyft.com/hc/en-us/all/articles/8559088908-pet-rides-for-riders
    - question: How do I install Puppet Enterprise
      target_answer: "mention 2 installation modes: tarball and installation manager"
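
Note that the target_answer value is quoted because it contains a colon, which plain YAML scalars do not allow. A small loader for this format might look like the sketch below; it assumes PyYAML is installed, and load_eval_set is a hypothetical helper name rather than part of SerenityGPT.

# Sketch of reading the YAML eval format above; assumes PyYAML.
import yaml

def load_eval_set(path, tenant):
    # Returns the list of question records for one tenant key.
    with open(path) as f:
        data = yaml.safe_load(f)
    records = data["questions"][tenant]
    for record in records:
        # Every question needs a target_url, a target_answer, or both.
        if "target_url" not in record and "target_answer" not in record:
            raise ValueError(f"missing answer for: {record['question']}")
    return records

For example, load_eval_set("evals.yaml", "acme") would return the two records above for a tenant named acme.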