Data Sources Configuration
This page documents all supported data sources and their configuration parameters. Each data source can be configured in the config.yaml file under the crawlers section of a tenant.
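For orientation, here is a minimal sketch of a crawlers section with two sources configured side by side. The crawler names (docs_site, local_docs) are arbitrary labels, and where the crawlers section nests inside your tenant configuration depends on your deployment.

```yaml
crawlers:
  docs_site:                          # arbitrary name for this source
    module: crawler.web.main.run      # which crawler implementation to run
    parameters:                       # crawler-specific parameters (see below)
      start_urls:
        - https://docs.example.com
  local_docs:
    module: crawler.files.main.run
    parameters:
      source: docs/
      public_url: https://mycompany.com/docs/
```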
1. Web Crawler
Crawls web pages starting from specified URLs and follows links recursively.
Example:
```yaml
crawlers:
  docs_site:
    module: crawler.web.main.run
    parameters:
      start_urls:
        - https://docs.example.com
      match_urls:
        - https://docs.example.com/.*
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| start_urls | [] | List of URLs to start crawling from |
| match_urls | [] | Regular expressions to match URLs that should be crawled |
| sitemap_urls | [] | List of sitemap URLs to parse for additional pages |
| match_content_types | ['text/html', 'application/pdf'] | Content types to process |
| text_selectors | [] | CSS selectors to extract text (first match used) |
| breadcrumb_selector | '' | CSS selector for breadcrumb navigation |
| title_selector | 'title::text' | CSS selector for document title |
| metadata | {} | Additional metadata for all documents |
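As a sketch of how the optional parameters combine, the configuration below restricts crawling to HTML pages, pulls additional URLs from a sitemap, and narrows text extraction. The selector values and metadata here are illustrative assumptions, not required values.

```yaml
crawlers:
  docs_site:
    module: crawler.web.main.run
    parameters:
      start_urls:
        - https://docs.example.com
      match_urls:
        - https://docs.example.com/.*
      sitemap_urls:
        - https://docs.example.com/sitemap.xml
      match_content_types:
        - text/html
      text_selectors:
        - main article          # first matching selector is used
        - body
      breadcrumb_selector: 'nav.breadcrumbs a::text'
      title_selector: 'h1::text'
      metadata:
        source_system: docs     # attached to every crawled document
```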
2. Local Files
Indexes files from local filesystem directories.
Example:
```yaml
crawlers:
  local_docs:
    module: crawler.files.main.run
    parameters:
      source: docs/
      public_url: https://mycompany.com/docs/
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| source | Required | Path to source directory (relative to files volume) |
| public_url | Required | Base URL for constructing document links |
| sanitize_url | true | Whether to sanitize URLs |
| extensions | ['.html', '.htm', '.pdf'] | File extensions to process |
| ignore_regex | '' | Regex pattern to exclude files/directories |
| text_selectors | ['body'] | CSS selectors for HTML text extraction |
| title_selector | 'title::text' | CSS selector for document title |
| breadcrumb_selector | '' | CSS selector for breadcrumbs |
| encoding | 'utf-8' | Character encoding for HTML files |
| metadata | {} | Additional metadata for all documents |
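A fuller local-files sketch, assuming you want to skip a drafts/ subdirectory and index HTML only; the regex, selector, and metadata values are illustrative.

```yaml
crawlers:
  local_docs:
    module: crawler.files.main.run
    parameters:
      source: docs/
      public_url: https://mycompany.com/docs/
      extensions:
        - .html
        - .htm
      ignore_regex: '.*/drafts/.*'   # exclude anything under drafts/
      text_selectors:
        - main
        - body
      title_selector: 'h1::text'
      encoding: utf-8
      metadata:
        document_type: manual
```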
3. Azure Files
Syncs and indexes files from Azure Blob Storage.
Example:
```yaml
crawlers:
  azure_docs:
    module: crawler.azurefiles.main.run
    parameters:
      source: https://myaccount.blob.core.windows.net/container
      public_url: https://docs.mycompany.com/
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| source | Required | Azure Blob Storage URL |
| public_url | Required | Base URL for public document links |
| ignore_regex | [] | List of regex patterns to ignore files |
| sync_ignore_regex | [] | Regex patterns for sync exclusion (defaults to ignore_regex) |
| extensions | [] | File extensions to process (empty = all) |
| target | ENV.FILES_PATH / 'azuresync' | Local sync target directory |
| text_selectors | [] | CSS selectors for text extraction |
| title_selector | 'title::text' | CSS selector for document title |
| breadcrumb_selector | '' | CSS selector for breadcrumbs |
| metadata | {'document_type': 'DOC'} | Additional metadata |
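For example, you might exclude temporary blobs from both syncing and indexing and restrict processing to PDFs. The regex patterns below are assumptions used for illustration.

```yaml
crawlers:
  azure_docs:
    module: crawler.azurefiles.main.run
    parameters:
      source: https://myaccount.blob.core.windows.net/container
      public_url: https://docs.mycompany.com/
      extensions:
        - .pdf
      ignore_regex:
        - '.*\.tmp$'           # never index temporary blobs
      sync_ignore_regex:
        - '.*\.tmp$'           # and do not sync them locally either
      metadata:
        document_type: DOC
```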
4. Confluence
Indexes content from Atlassian Confluence spaces.
Example:
```yaml
crawlers:
  confluence:
    module: crawler.confluence.main.run
    parameters:
      space_key: DOCS
      credentials:
        base_url: https://mycompany.atlassian.net
        username: user@example.com
        api_token: your-api-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| space_key | Required | Confluence space key to crawl |
| credentials | {} | Authentication credentials (base_url, username, api_token) |
| metadata | {} | Additional metadata for all documents |
5. Document360
Indexes content from Document360 knowledge bases.
Example:
```yaml
crawlers:
  knowledge_base:
    module: crawler.document360.main.run
    parameters:
      credentials:
        api_token: your-api-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| base_url | "https://apihub.document360.io/v2" | Document360 API endpoint |
| project_version_id | None | Specific project version (None = all versions) |
| credentials | {} | API authentication (api_token required) |
| metadata | {} | Additional metadata for all documents |
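If you only want a single project version, you can pin it explicitly, as sketched below. The version ID is a placeholder, and base_url normally only needs to be set when overriding the default API endpoint.

```yaml
crawlers:
  knowledge_base:
    module: crawler.document360.main.run
    parameters:
      base_url: https://apihub.document360.io/v2     # default endpoint
      project_version_id: your-project-version-id   # placeholder; omit to crawl all versions
      credentials:
        api_token: your-api-token
      metadata:
        source_system: document360
```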
6. Excel Files
Extracts content from Excel spreadsheets.
Example:
```yaml
crawlers:
  excel_data:
    module: crawler.excel.main.run
    parameters:
      path: /data/spreadsheets
      title_column: A
      text_column: B
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| path | Required | Path to Excel files directory |
| title_column | Required | Column for document titles (name or letter) |
| text_column | Required | Column for document content (name or letter) |
| extensions | ['.xlsx', '.xls'] | Excel file extensions to process |
| skip_empty_text | true | Skip rows with empty text column |
| metadata_columns | None | Additional columns to include as metadata |
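A sketch that also captures extra columns as metadata; the column names here are assumptions about your spreadsheet layout.

```yaml
crawlers:
  excel_data:
    module: crawler.excel.main.run
    parameters:
      path: /data/spreadsheets
      title_column: Title          # columns can be given by name or letter
      text_column: Description
      extensions:
        - .xlsx
      skip_empty_text: true        # drop rows with an empty Description cell
      metadata_columns:
        - Category
        - Owner
```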
7. PDF Directory
Indexes PDF files from a directory.
Example:
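A minimal configuration would look like the following. The module path crawler.pdf.main.run is assumed by analogy with the other crawlers, so verify it against your deployment.

```yaml
crawlers:
  pdf_docs:
    module: crawler.pdf.main.run   # assumed module path; check your installation
    parameters:
      path: /data/pdfs
```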
Parameters:
| Parameter | Default | Description |
|---|---|---|
| path | Required | Path to directory containing PDF files |
8. Helpdesk
Indexes tickets and articles from helpdesk systems.
Example:
```yaml
crawlers:
  support_tickets:
    module: crawler.helpdesk.main.run
    parameters:
      credentials:
        base_url: https://support.example.com
        token: your-api-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | {} | Authentication (base_url and token required) |
| max_tickets | None | Maximum number of tickets to fetch |
| include_articles | true | Include ticket articles/comments |
| proxies | None | Proxy configuration (e.g., {"https": "socks5://localhost:1337"}) |
| query_params | None | Additional API query parameters |
| metadata | {} | Additional metadata for all documents |
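To limit the crawl and route traffic through a proxy, the parameters compose as in this sketch; the limit, proxy, and query-parameter values are illustrative assumptions.

```yaml
crawlers:
  support_tickets:
    module: crawler.helpdesk.main.run
    parameters:
      credentials:
        base_url: https://support.example.com
        token: your-api-token
      max_tickets: 5000            # stop after 5000 tickets
      include_articles: true       # also index ticket articles/comments
      proxies:
        https: socks5://localhost:1337
      query_params:
        state: open                # illustrative extra API query parameter
```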
9. Jira
Indexes issues from Atlassian Jira projects.
Example:
```yaml
crawlers:
  jira_issues:
    module: crawler.jira.main.run
    parameters:
      projects: [PROJ1, PROJ2]
      credentials:
        server: https://mycompany.atlassian.net
        bearer_token: your-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| projects | [] | List of project keys (empty = all projects) |
| credentials | Required | Authentication (see below) |
| lookback_period_days | 1095 (3 years) | How many days to look back for issues |
| max_issues | 10000 | Maximum number of issues to fetch |
Credentials Options:
- Bearer token: {server, bearer_token}
- Basic auth: {server, basic_username, basic_password}
- Optional: proxies for proxy configuration
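For instance, the basic-auth variant combined with tighter limits would be configured as sketched below; the username, password, and limit values are placeholders.

```yaml
crawlers:
  jira_issues:
    module: crawler.jira.main.run
    parameters:
      projects: [PROJ1, PROJ2]
      lookback_period_days: 365        # only fetch issues from the last year
      max_issues: 5000
      credentials:
        server: https://mycompany.atlassian.net
        basic_username: user@example.com
        basic_password: your-password
        proxies:
          https: socks5://localhost:1337   # optional proxy configuration
```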
10. Slack
Indexes conversations from Slack workspaces.
Example:
```yaml
crawlers:
  slack_history:
    module: crawler.slack.main.run
    parameters:
      workspace_url: https://mycompany.slack.com
      bot_token: xoxb-your-bot-token
      channels: [general, support]
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| workspace_url | Required | Slack workspace URL |
| bot_token | Required | Slack bot token (xoxb-...) |
| channels | [] | Specific channels to crawl (empty = all accessible) |
| exclude_channels | [] | Channels to exclude from crawling |
| grouping_mode | 'hybrid' | Message grouping: 'thread', 'time', or 'hybrid' |
| inactivity_threshold_hours | 24 | Hours of inactivity before a new document is started (time-based grouping) |
| max_messages_per_document | 200 | Maximum messages per document |
| lookback_days | None | Days to look back (None = all history) |
| exclude_archived | true | Skip archived channels |
| page_size | 200 | Messages per API call (max 1000) |
| metadata | {} | Additional metadata for all documents |
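Combining the grouping and scoping options, a configuration might look like the sketch below; the channel names, thresholds, and limits are illustrative.

```yaml
crawlers:
  slack_history:
    module: crawler.slack.main.run
    parameters:
      workspace_url: https://mycompany.slack.com
      bot_token: xoxb-your-bot-token
      channels: []                     # empty = all channels the bot can access
      exclude_channels: [random, off-topic]
      grouping_mode: hybrid            # one of 'thread', 'time', or 'hybrid'
      inactivity_threshold_hours: 12   # start a new document after 12 quiet hours
      max_messages_per_document: 100
      lookback_days: 365               # only index the last year
      exclude_archived: true
```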
11. YouTrack
Indexes issues from JetBrains YouTrack.
Example:
```yaml
crawlers:
  youtrack:
    module: crawler.youtrack.main.run
    parameters:
      credentials:
        base_url: https://youtrack.example.com
        token: your-api-token
      projects: [PROJ1, PROJ2]
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | {} | Authentication (base_url and token required) |
| projects | None | Project IDs to crawl (None = all projects) |
| max_issues_per_project | None | Limit issues per project |
| max_comments_per_issues | None | Limit comments per issue |
| proxies | None | Proxy configuration |
| metadata | {} | Additional metadata for all documents |
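To cap the amount of data pulled per project, the limits combine as in this sketch; the numeric values and metadata are illustrative.

```yaml
crawlers:
  youtrack:
    module: crawler.youtrack.main.run
    parameters:
      credentials:
        base_url: https://youtrack.example.com
        token: your-api-token
      projects: [PROJ1, PROJ2]
      max_issues_per_project: 2000   # stop after 2000 issues per project
      max_comments_per_issues: 50    # keep at most 50 comments per issue
      metadata:
        source_system: youtrack
```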
12. Salesforce
Indexes objects from Salesforce CRM.
Example:
```yaml
crawlers:
  salesforce:
    module: crawler.salesforce.client
    parameters:
      credentials:
        base_url: https://mycompany.my.salesforce.com
        client_id: your-client-id
        client_secret: your-client-secret
        version: v59.0
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | OAuth credentials (see below) |
Credentials:
- base_url: Salesforce instance URL
- client_id: OAuth client ID
- client_secret: OAuth client secret
- version: API version (e.g., v59.0)
Custom Data Sources
To create a custom data source, implement a run function in a Python module and reference it in the configuration:
```python
# my_custom_crawler.py
from crawler.struct import RunCrawlerConfig, ParsedDocument

def run(conf: RunCrawlerConfig):
    # Your crawler implementation
    for doc in your_data_source:
        yield ParsedDocument(
            title=doc.title,
            text=doc.content,
            url=doc.url,
            breadcrumbs=doc.path,
            metadata=doc.metadata,
        )
```
```yaml
# config.yaml
crawlers:
  custom_source:
    module: my_custom_crawler.run
    parameters:
      # Your custom parameters
```
For more examples, see: SerenityGPT-examples
Don't see your data source?
More data sources are available beyond those documented here. Please contact us for documentation and enablement of additional integrations tailored to your needs.