Data source and prompt configuration format
The system is configured using the config.yaml file, which is located in the installation folder. This file holds the company-wide configuration, including document extraction parameters, indexing options, tenant-specific setups, and data source integrations. By modifying config.yaml, administrators can customize how documents are sourced, processed, and indexed across different tenants and integration types.
Once the file is saved and validated, the system starts indexing data automatically.
Example of the config.yaml:
name: Company Name
prompt:
  product_name: My Product
vector_store: pgvector
manage_documents_api:
  enabled: true
tenants:
  mytenant1:
    name: Product 1
    crawlers:
      online_docs:
        module: crawler.web.main.run
        parameters:
          start_urls:
            - https://product.mycompany.com/
          text_selectors:
            - article
          breadcrumb_selector: .md-nav__item--active>label
      files_on_server:
        module: crawler.files.main.run
        parameters:
          source: path/to/files/on/disk
          public_url: https://mycompany.com/product1/
          sanitize_url: false
          text_selectors:
            - main article
          breadcrumb_selector: nav.locatordiv li > a::text
          metadata:
            product: Product 1
This YAML file configures the document extraction, indexing, and prompting system for a company. It defines global settings, tenant-specific configurations, and crawler integrations to process documents from various sources.
1. Global Configuration
name:
The human-readable name of the company. Example: "Company Name"
prompt:
Contains settings used for prompt customization in the application. See Prompt Configuration for more details.
vector_store:
Default: pgvector
Specifies the backend to be used for vector-based document indexing and search. Possible values: iris, pgvector.
tenants:
Tenant configurations (at least one tenant is required). See Tenant Configuration for more details.
manage_documents_api:
Default: {enabled: false}
Enables or disables the document ingestion API.
2. Tenant Configuration
Tenants represent isolated groups of documents for different user groups or regions. Each tenant is defined under the tenants section with its own unique configuration.
name:
The display name for the tenant.
crawlers:
A collection of crawler configurations for data extraction within the tenant. See Crawler Configuration for more details.
search:
Configuration for API search. See API Search Configuration for more details.
prompt:
Configuration for prompt customization. See Prompt Configuration for more details.
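For instance, a tenant that overrides search and prompt settings might look like this (a minimal sketch; the values are illustrative):

tenants:
  mytenant1:
    name: Product 1
    crawlers:
      # ... crawler configs ...
    search:
      use_local_search: true
    prompt:
      product_name: Product 1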
3. Crawler Configuration
Each crawler in a tenant is responsible for integrating and extracting documents from a specific data source. The configuration includes:
module:
A Python path to the function that performs document extraction.
parameters:
Detailed settings specific to the crawler type:
3.1 Web crawler
This section provides technical guidance for configuring data ingestion from web pages.
The web crawler starts from the start_urls and recursively crawls all URLs that match match_urls.
Basic Configuration Example
name_of_source:
  module: crawler.web.main.run
  parameters:
    start_urls:
      - https://example.com/index.html
    match_urls:
      - https://example.com/.*
    metadata:
      document_type: 'DOC'
Parameters
start_urls (list of strings)
Required
The URLs to start crawling from.
match_urls (list of strings)
Default: []
A list of regular expressions to match URLs that should be crawled. If not provided, all URLs prefixed with one of the start_urls will be crawled.
text_selectors (list of strings)
Default: ['body']
A list of CSS selectors to extract text content from the crawled pages. The crawler will stop after the first selector that matches.
Examples: 'div.content', 'div.main'
breadcrumb_selector (string)
Default: ''
A CSS selector to extract breadcrumb navigation from the crawled pages.
Examples: 'nav.breadcrumb', 'ol.breadcrumb'
title_selector (string)
Default: 'title::text'
A CSS selector to extract the document title from the crawled pages.
Examples: 'h1::text', 'td.header::text'
metadata (dictionary)
Default: {}
Additional metadata to be attached to all documents from this source.
match_content_types (list of strings)
Default: ['text/html', 'application/pdf']
A list of content types to match; only pages with these content types are ingested.
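A fuller example that combines the optional parameters described above (the selector values are illustrative):

online_docs:
  module: crawler.web.main.run
  parameters:
    start_urls:
      - https://docs.example.com/
    match_urls:
      - https://docs.example.com/.*
    text_selectors:
      - article
      - div.content
    title_selector: 'h1::text'
    breadcrumb_selector: 'nav.breadcrumb'
    match_content_types:
      - text/html
      - application/pdf
    metadata:
      document_type: 'DOC'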
3.2 Files crawler
This section provides technical guidance for configuring data ingestion from files stored on disk.
Basic Configuration Example
name_of_source:
  module: crawler.files.main.run
  parameters:
    source: path/to/source
    public_url: file://url/prefix
    title_selector: 'title::text'
    metadata:
      document_type: "FILE"
Parameters
source (string)
Required
The path to the directory containing the files to be crawled. This path should be relative to the files volume (serenity/data/backend-files by default).
Example: source: documents/2025 means that the crawler will recursively traverse the directory serenity/data/backend-files/documents/2025 and all of its subdirectories.
public_url (string)
Required
The base URL that will be used to construct public-facing URLs for the ingested content. For local files, use file:// as the protocol.
sanitize_url (boolean)
Default: true
Determines whether URLs should be sanitized during processing. Set to false if you want to preserve the original URL format.
extensions (list of strings)
Default: ['.html', '.htm', '.pdf']
Specifies which file extensions should be processed by the crawler. Only files matching these extensions will be ingested.
metadata (dictionary)
Default: {}
Additional metadata to be attached to all documents from this source.
ignore_regex (string or list of strings)
Default: ''
Regular expression pattern(s) to exclude files or directories from crawling. Files matching these patterns will be skipped.
text_selectors (list of strings)
Default: ['body']
CSS selectors used to extract text content from HTML files. The crawler will stop after the first selector that matches.
Examples: ['div.content', 'div.main']
title_selector (string)
Default: 'title::text'
CSS selector used to extract the document title from HTML files. The ::text suffix indicates that only the text content should be extracted.
Examples: 'h1::text', 'td.header::text'
breadcrumb_selector (string)
Default: ''
CSS selector for extracting breadcrumb navigation from HTML files. Leave empty if breadcrumbs are not available or needed.
Examples: 'nav.breadcrumb', 'ol.breadcrumb'
encoding (string)
Default: 'utf-8'
Character encoding used when reading HTML files. Adjust this if your files use a different encoding.
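A fuller example that combines the optional parameters described above (the paths and values are illustrative):

files_on_server:
  module: crawler.files.main.run
  parameters:
    source: documents/2025
    public_url: https://mycompany.com/docs/
    sanitize_url: false
    extensions:
      - .html
      - .pdf
    ignore_regex:
      - '.*/drafts/.*'
    text_selectors:
      - main article
    title_selector: 'h1::text'
    encoding: utf-8
    metadata:
      document_type: "FILE"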
4. API Search Configuration
By default, SerenityGPT retrieves documents for a given tenant from a local database without any API. However, there are scenarios when you may want to use distributed RAG. The API search feature allows you to achieve this.
Why use API search?
- Your documentation database is hosted on a different machine or service.
- You want to share the same documentation across multiple SerenityGPT instances (e.g., for different projects or teams).
- You need to perform cross-tenant search (e.g., when a question about tenant A should also search documents from tenant B).
How it works
Each tenant can be configured to use local search, remote API search, or both. When remote sources are configured, SerenityGPT will query the specified API endpoints for relevant documents in addition to (or instead of) the local database.
Configuration
API search is configured under the search section of each tenant in your config.yaml. The relevant model is SearchConfig in ai/conf.py:
tenants:
  mytenant1:
    name: Product 1
    crawlers:
      # ... crawler configs ...
    search:
      use_local_search: true  # Whether to use the local database (default: true)
      remote_sources:
        - url: "https://other-serenity-instance.com/api/v2/"
          token: <tenant_token>
        # You can add multiple remote sources if needed
- use_local_search (bool, default true): If set to false, only remote sources will be used for document retrieval.
- remote_sources: A list of remote API endpoints to query for documents. Each entry requires:
  - url: The full URL to the remote search API (should point to /api/v2/ on the remote instance).
  - token: The token of the tenant on the remote instance.
Example use cases:
- To use only local documentation, simply omit the remote_sources section.
- To use only remote documentation, set use_local_search: false and provide one or more remote_sources.
- To combine local and remote search, set use_local_search: true and provide remote_sources.
Note: The remote API must be compatible with the /api/v2/rag/search/ endpoint; see Search API for Distributed RAG for more details.
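For example, the search section of a tenant that relies entirely on a remote instance would look like this (the URL is illustrative):

search:
  use_local_search: false
  remote_sources:
    - url: "https://docs-hub.mycompany.com/api/v2/"
      token: <tenant_token>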
5. Prompt Configuration
Overview
The prompt section in your config.yaml controls how the language model (LLM) interacts with users, customizes responses, and integrates with external tools. This section allows you to fine-tune the assistant's behavior, language handling, and prompt context.
Example:
prompt:
  product_name: My Product
  llm: azure:gpt-4o-20241120:2024-10-21
  idk_phrases:
    en: "Sorry, I don't know that."
    es: "Lo siento, no lo sé."
  pii_redaction: true
  source_validation: true
  use_translator: true
  target_language: en
  synonyms:
    FAQ: Frequently Asked Questions
  tools:
    documents_search: agent.tools
    web_search: agent.web
  system_prompt_template: "{% extends 'system_prompt.txt' %}"
  examples:
    - part_kind: user-prompt
      content: how do i install sample_software_name?
    - part_kind: text
      content: to install sample_software_name, follow these steps...
Fields
- product_name: Sets the product name so it can be referenced in prompts for branding and context.
- llm: Specifies which LLM and which API to use. See LLM configuration for more details.
- idk_phrases: Allows customization of fallback responses (no information found) in multiple languages.
- pii_redaction: When enabled, ensures sensitive information is not included in responses (beta feature).
- source_validation: Enforces that only validated sources are used for answers (beta feature).
- use_translator: Enables or disables query translation to the target language.
- target_language: The documentation language. It is used to translate the query to the documentation language if use_translator is enabled.
- synonyms: Improves search by including the specified synonyms in the prompt.
- tools: Integrates external tools for enhanced capabilities, in the form tool_name: module_path. See Tools configuration for more details.
- system_prompt_template: Customizes the system prompt using Django templates. See System prompt configuration for more details.
- examples: Provides few-shot learning examples to improve LLM response quality. Examples are inserted directly into the system prompt.
LLM configuration
You can use the openai, azure, or groq providers, or your own custom LLM.
The format is <provider>:<model_name>(:<api_version>), where the :<api_version> part is optional.
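For instance, the value azure:gpt-4o-20241120:2024-10-21 from the example above selects the azure provider, the gpt-4o-20241120 model, and the 2024-10-21 API version; a plain OpenAI model might be selected with something like openai:gpt-4o.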
Tools configuration
Custom tools may be defined in this section.
A tool is a function that meets the following requirements:
- it should have a comprehensive description of what it does, when it should be used and what arguments are supported.
- arguments and return value should be typed.
- the first argument should be the pydantic_ai RunContext object.
See pydantic_ai tools for more details.
Example:
from pydantic_ai import RunContext

def tool(context: RunContext, query: str) -> str:
    """
    Do something.

    Used for ...

    Args:
        query: The user's query.
    """
    return "Hello, world!"
System prompt configuration
You can configure the system prompt for the LLM. It consists of three configurable sections: main, tools, and rules.
main:
Here you can define the main role of the LLM.
tools:
Here you can define instructions for the LLM to use the tools.
rules:
Here you can define the rules for the LLM to follow.
Example of system_prompt_template:
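A minimal sketch, assuming the base template exposes one overridable block per section named after the three sections above (the block names and wording are illustrative):

{% extends 'system_prompt.txt' %}

{% block main %}
You are a helpful assistant for My Product.
{% endblock %}

{% block tools %}
Use the documents_search tool to look up documentation before answering.
{% endblock %}

{% block rules %}
- Answer only from the retrieved documents.
- If the answer is not in the documents, say you don't know.
{% endblock %}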