Data source and prompt configuration format

The system is configured using the config.yaml file, which is located in the installation folder. This file serves as company configuration settings, including document extraction parameters, indexing options, tenant-specific setups, and data source integrations. By modifying the config.yaml, administrators can customize how documents are sourced, processed, and indexed across different tenants and integration types. Once saved and validated the system will start indexing data automatically.

Example of the config.yaml:

name: Company Name

prompt:
  product_name: My Product


vector_store: pgvector


manage_documents_api:
  enabled: true


tenants:
  mytenant1:
    name: Product 1
      crawlers:
        online_docs:
          module: crawler.web.main.run
          parameters:
            start_urls:
              - https://product.mycompany.com/
            text_selectors:
              - article
            breadcrumb_selector: .md-nav__item--active>label
        files_on_server:
          module: crawler.files.main.run
          parameters:
            source: path/to/files/on/disk
            public_url: https://mycompany.com/product1/
            sanitize_url: false
            text_selectors:
              - main article
            breadcrumb_selector: nav.locatordiv li > a::text
            metadata:
              product: Product 1

This YAML file configures the document extraction, indexing, and prompting system for a company. It defines global settings, tenant-specific configurations, and crawler integrations to process documents from various sources.

1. Global Configuration

name:

The human-readable name of the company. Example: "Company name"

prompt:

Contains settings used for prompt customization in the application. See Prompt Configuration for more details.

vector_store: Default: pgvector

Specifies the backend to be used for vector-based document indexing and search. Possible values include: iris, pgvector.

tenants:

List of tenants (at least one required) configuration. See Tenant Configuration for more details.

manage_documents_api: Default: {enabled: false}

Enables or disables the document ingestion API.

2. Tenant Configuration

Tenants represent isolated groups of documents for different user groups or regions. Each tenant is defined under the tenants section with its own unique configuration.

name:

The display name for the tenant.

crawlers:

A collection of crawler configurations for data extraction within the tenant. See Crawler Configuration for more details.

search:

Configuration for API search. See API Search Configuration for more details.

prompt:

Configuration for prompt customization. See Prompt Configuration for more details.

3. Crawler Configuration

Overview

Each crawler in a tenant is responsible for integrating and extracting documents from a specific data source. The document should contain the following attribues:

title: (str)

The title of the document. It is used to display the document in the UI. Moreover, it is encoded with another document content to create a vector, so title affects the search results.

text: (str)

The main content of the document. It is used to search the document, so try to make it as relevant and clear as possible.

url: (str)

The url of the document. It is used to reference the document in the UI. So, it should be a valid and clickable url.

breadcrumbs: (list[str])

The breadcrumbs of the document or the path to the document in the source. It is used in enocding along with title, so it affects the search results. Moreover, it is used by the LLM to understand the document structure.

metadata: (dict[str, Any])

The metadata of the document. It is used to attach additional information to the document. We can use metadata keys as filters in the search.

The crawler configuration includes:

module:

A Python path to the function that performs document extraction.

parameters:

Detailed settings specific to the crawler type:

3.1 Web crawler

This section provides technical guidance for configuring data ingestion from web pages. Web crawler starts from start_urls and crawls all match_urls recursively.

Basic Configuration Example

name_of_source:
  module: crawler.web.main.run
  parameters:
    start_urls:
      - https://example.com/index.html
    match_urls:
      - https://example.com/.*
    metadata:
      document_type: 'DOC'

Parameters

start_urls (list of strings) Required The URLs to start crawling from.

match_urls (list of strings) Default: [] A list of regular expressions to match URLs that should be crawled. If not provided, all URLs will be crawled with start_urls prefix.

text_selectors (list of strings) Default: ['body'] A list of CSS selectors to extract text content from the crawled pages. The crawler will stop after the first selector matches. Examples: 'div.content', 'div.main'

breadcrumb_selector (string) Default: '' A CSS selector to extract breadcrumb navigation from the crawled pages. Examples: 'nav.breadcrumb', 'ol.breadcrumb'

title_selector (string) Default: 'title::text' A CSS selector to extract the document title from the crawled pages. Examples: 'h1::text', td.header::text

metadata (dictionary) Default: {} Additional metadata to be attached to all documents from this source.

match_content_types (list of strings) Default: ['text/html', 'application/pdf'] A list of content types to match.

3.2 Files crawler

This section provides technical guidance for configuring data ingestion from files stored on disk.

Basic Configuration Example

name_of_source:
  module: crawler.files.main.run
  parameters:
    source: path/to/source
    public_url: file://url/prefix
    title_selector: 'title::text'
    metadata:
      document_type: "FILE"

Parameters

source (string) Required The path to the directory containing the files to be crawled. This path should be relative to the files volume (serenity/data/backend-files by default) Example: source: documents/2025 means that the crawler will traverse through the directory serenity/data/backend-files/documents/2025 and all subdirectories recursively.

public_url (string) Required The base URL that will be used to construct public-facing URLs for the ingested content. For local files, use file:// as the protocol.

sanitize_url (boolean)
Default: true
Determines whether URLs should be sanitized during processing. Set to false if you want to preserve the original URL format.

extensions (list of strings)
Default: ['.html', '.htm', '.pdf']
Specifies which file extensions should be processed by the crawler. Only files matching these extensions will be ingested.

metadata (dictionary)
Default: {}
Additional metadata to be attached to all documents from this source.

ignore_regex (string or list of strings)
Default: ''
Regular expression pattern(s) to exclude files or directories from crawling. Files matching these patterns will be skipped.

text_selectors (list of strings)
Default: ['body']
CSS selectors used to extract text content from HTML files. The crawler will stop after the first selector matches. Examples: ['div.content', 'div.main']

title_selector (string)
Default: 'title::text'
CSS selector used to extract the document title from HTML files. The ::text suffix indicates that only the text content should be extracted. Examples: 'h1::text', td.header::text

breadcrumb_selector (string)
Default: ''
CSS selector for extracting breadcrumb navigation from HTML files. Leave empty if breadcrumbs are not available or needed. Examples: 'nav.breadcrumb', 'ol.breadcrumb'

encoding (string)
Default: 'utf-8'
Character encoding for reading HTML files. Adjust if your files use a different encoding.

3.3 Custom crawler

The default file crawler supports only HTML, HTM, and PDF files. If you want to crawl other file types, you can create a custom crawler.

To do this, you need to do 2 things:

implement custom run function in a separate file.
configure crawler file path in config.yaml.

See the examples here: SerenityGPT-examples

4. API Search Configuration

By default, SerenityGPT retrieves documents for a given tenant from a local database without any API. However, there are scenarios when you may want to use distributed RAG. The API search feature allows you to achieve this.

Why use API search?

Your documentation database is hosted on a different machine or service.
You want to share the same documentation across multiple SerenityGPT instances (e.g., for different projects or teams).
You need to perform cross-tenant search (e.g., when a question about tenant A should also search documents from tenant B).

How it works

Each tenant can be configured to use local search, remote API search, or both. When remote sources are configured, SerenityGPT will query the specified API endpoints for relevant documents in addition to (or instead of) the local database.

Configuration

API search is configured under the search section of each tenant in your config.yaml. The relevant model is SearchConfig in ai/conf.py:

tenants:
  mytenant1:
    name: Product 1
    crawlers:
      # ... crawler configs ...
    search:
      use_local_search: true  # Whether to use the local database (default: true)
      remote_sources:
        - url: "https://other-serenity-instance.com/api/v2/"
          token: <tenant_token>
        # You can add multiple remote sources if needed

use_local_search: (bool, default true) If set to false, only remote sources will be used for document retrieval.
remote_sources: A list of remote API endpoints to query for documents. Each entry requires:
url: The full URL to the remote search API (should point to /api/v2/ on the remote instance).
token: The token of tenant on the remote instance.

Example use cases: - To use only local documentation, jsut skip the remote_sources section. - To use only remote documentation, set use_local_search: false and provide one or more remote_sources. - To combine local and remote search, set use_local_search: true and provide remote_sources.

Note: The remote API must be compatible with the /api/v2/rag/search/ endpoint, see Search API for Distributed RAG for more details.

5. Prompt Configuration

Overview

The prompt section in your config.yaml controls how the language model (LLM) interacts with users, customizes responses, and integrates with external tools. This section allows you to fine-tune the assistant’s behavior, language handling, and prompt context.

Example:

prompt:
  product_name: My Product
  llm: azure:gpt-4o-20241120:2024-10-21
  idk_phrases:
    en: "Sorry, I don't know that."
    es: "Lo siento, no lo sé."
  pii_redaction: true
  source_validation: true
  documents_language: en
  reply_language: null
  synonyms:
    FAQ: Frequently Asked Questions
  tools:
    documents_search: agent.tools
    web_search: agent.web
  system_prompt_template: "{% extends 'system_prompt.txt' %}"
  examples:
    - part_kind: user-prompt
      content: how do i install sample_software_name?
    - part_kind: text
      content: to install sample_software_name, follow these steps...

Fields

product_name: Sets the product name for branding and context in prompts if you want to mention it in the prompt.
llm: Specifies which LLM and which API to use. See LLM configuration for more details.
idk_phrases: Allows customization of fallback responses (no information found) in multiple languages.
pii_redaction: When enabled, ensures sensitive information is not included in responses (beta feature).
source_validation: Enforces that only validated sources are used for answers (beta feature).
documents_language: (default null) The language of the indexed content (aka knowledge). Should be set if you expect queries in other languages: the queries will be translated to the documents_language for retrieval.
reply_language: (default en) Sets the language of the response. Defaults to English. If set to null, the response language is determined by detecting the language of the query.
- Common configurations:
  - Do not detect the language of the query: documents_language: null and reply_language: <language_code>
  - Respond in the language of the query: documents_language: <language_code> and reply_language: null
synonyms: Improves search by including specified synonyms in the prompt.
tools: Integrates external tools for enhanced capabilities in format of tool_name->module_path. See Tools configuration for more details.
system_prompt_template: Customizes the system prompt using Django templates. See System prompt configuration for more details.
examples: Provides few-shot learning examples to improve LLM response quality. Examples are inserted directly into the system prompt.

LLM configuration

You can use openai, azure, groq providers or your own custom LLM. The format is <provider>:<model_name>(:<api_vection>)

Tools configuration

The custom tools may be defined in this section. The tool is the function that meets the following requirements: - it should have a comprehensive description of what it does, when it should be used and what arguments are supported. - arguments and return value should be typed. - the first argument should be the pydantic_ai RunContext object. See pydantic_ai tools for more details.

Example:

def tool(context: RunContext, query: str) -> str:
    """
    Do something.
    Used for ...
    Args:
        query: The query of user.
    """
    return "Hello, world!"

System prompt configuration

One may configure the system prompt for the LLM. It consists of 3 configurable sections: main, tools, rules.

main:

Here you can define the main role of the LLM.

tools:

Here you can define instructions for the LLM to use the tools.

rules:

Here you can define the rules for the LLM to follow.

Example of system_prompt_template:

{% extends "system_prompt.txt" %}
{% block main %}
You are a helpful assistant.
{% endblock %}

{% block tools %}
Use `tool` tool to do something whenever you need to.
{% endblock %}

{% block rules %}
Rules:
• think carefully before answering.
{% endblock %}