
Data Sources Configuration

This page documents all supported data sources and their configuration parameters. Each data source can be configured in the config.yaml file under the crawlers section of a tenant.

1. Web Crawler

Crawls web pages starting from specified URLs and follows links recursively.

Example:

crawlers:
  docs_site:
    module: crawler.web.main.run
    parameters:
      start_urls:
        - https://docs.example.com
      match_urls:
        - https://docs.example.com/.*

Parameters:

| Parameter | Default | Description |
|---|---|---|
| start_urls | [] | List of URLs to start crawling from |
| match_urls | [] | Regular expressions to match URLs that should be crawled |
| sitemap_urls | [] | List of sitemap URLs to parse for additional pages |
| match_content_types | ['text/html', 'application/pdf'] | Content types to process |
| text_selectors | [] | CSS selectors to extract text (first match used) |
| breadcrumb_selector | '' | CSS selector for breadcrumb navigation |
| title_selector | 'title::text' | CSS selector for document title |
| metadata | {} | Additional metadata for all documents |
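
For reference, a fuller configuration exercising the remaining parameters might look like the sketch below; the sitemap URL, selector values, and metadata are illustrative placeholders, not defaults or required values:

crawlers:
  docs_site:
    module: crawler.web.main.run
    parameters:
      start_urls:
        - https://docs.example.com
      match_urls:
        - https://docs.example.com/.*
      sitemap_urls:
        - https://docs.example.com/sitemap.xml
      match_content_types:
        - text/html
        - application/pdf
      text_selectors:
        - main.content
      breadcrumb_selector: nav.breadcrumb
      title_selector: title::text
      metadata:
        document_type: DOC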

2. Local Files

Indexes files from local filesystem directories.

Example:

crawlers:
  local_docs:
    module: crawler.files.main.run
    parameters:
      source: docs/
      public_url: https://mycompany.com/docs/

Parameters:

| Parameter | Default | Description |
|---|---|---|
| source | Required | Path to source directory (relative to files volume) |
| public_url | Required | Base URL for constructing document links |
| sanitize_url | true | Whether to sanitize URLs |
| extensions | ['.html', '.htm', '.pdf'] | File extensions to process |
| ignore_regex | '' | Regex pattern to exclude files/directories |
| text_selectors | ['body'] | CSS selectors for HTML text extraction |
| title_selector | 'title::text' | CSS selector for document title |
| breadcrumb_selector | '' | CSS selector for breadcrumbs |
| encoding | 'utf-8' | Character encoding for HTML files |
| metadata | {} | Additional metadata for all documents |
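
A more complete configuration that restricts which files are indexed and overrides the extraction selectors might look as follows; the ignore pattern and selector values are illustrative, not defaults:

crawlers:
  local_docs:
    module: crawler.files.main.run
    parameters:
      source: docs/
      public_url: https://mycompany.com/docs/
      extensions:
        - .html
        - .pdf
      ignore_regex: '.*/drafts/.*'
      text_selectors:
        - main.content
      title_selector: h1::text
      encoding: utf-8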

3. Azure Files

Syncs and indexes files from Azure Blob Storage.

Example:

crawlers:
  azure_docs:
    module: crawler.azurefiles.main.run
    parameters:
      source: https://myaccount.blob.core.windows.net/container
      public_url: https://docs.mycompany.com/

Parameters:

| Parameter | Default | Description |
|---|---|---|
| source | Required | Azure Blob Storage URL |
| public_url | Required | Base URL for public document links |
| ignore_regex | [] | List of regex patterns to ignore files |
| sync_ignore_regex | [] | Regex patterns for sync exclusion (defaults to ignore_regex) |
| extensions | [] | File extensions to process (empty = all) |
| target | ENV.FILES_PATH / 'azuresync' | Local sync target directory |
| text_selectors | [] | CSS selectors for text extraction |
| title_selector | 'title::text' | CSS selector for document title |
| breadcrumb_selector | '' | CSS selector for breadcrumbs |
| metadata | {'document_type': 'DOC'} | Additional metadata |
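
A fuller configuration might look like the sketch below; the ignore pattern, extensions, and selector are illustrative placeholders, not defaults:

crawlers:
  azure_docs:
    module: crawler.azurefiles.main.run
    parameters:
      source: https://myaccount.blob.core.windows.net/container
      public_url: https://docs.mycompany.com/
      ignore_regex:
        - '.*/archive/.*'
      extensions:
        - .html
        - .pdf
      text_selectors:
        - main.content
      metadata:
        document_type: DOC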

4. Confluence

Indexes content from Atlassian Confluence spaces.

Example:

crawlers:
  confluence:
    module: crawler.confluence.main.run
    parameters:
      space_key: DOCS
      credentials:
        base_url: https://mycompany.atlassian.net
        username: user@example.com
        api_token: your-api-token

Parameters:

| Parameter | Default | Description |
|---|---|---|
| space_key | Required | Confluence space key to crawl |
| credentials | {} | Authentication credentials (base_url, username, api_token) |
| metadata | {} | Additional metadata for all documents |

5. Document360

Indexes content from Document360 knowledge bases.

Example:

crawlers:
  knowledge_base:
    module: crawler.document360.main.run
    parameters:
      credentials:
        api_token: your-api-token

Parameters:

| Parameter | Default | Description |
|---|---|---|
| base_url | "https://apihub.document360.io/v2" | Document360 API endpoint |
| project_version_id | None | Specific project version (None = all versions) |
| credentials | {} | API authentication (api_token required) |
| metadata | {} | Additional metadata for all documents |
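
To restrict the crawl to a single project version, set project_version_id; in the sketch below the version ID is a placeholder and base_url simply repeats the default:

crawlers:
  knowledge_base:
    module: crawler.document360.main.run
    parameters:
      base_url: https://apihub.document360.io/v2
      project_version_id: your-project-version-id
      credentials:
        api_token: your-api-token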

6. Excel Files

Extracts content from Excel spreadsheets.

Example:

crawlers:
  excel_data:
    module: crawler.excel.main.run
    parameters:
      path: /data/spreadsheets
      title_column: A
      text_column: B

Parameters:

| Parameter | Default | Description |
|---|---|---|
| path | Required | Path to Excel files directory |
| title_column | Required | Column for document titles (name or letter) |
| text_column | Required | Column for document content (name or letter) |
| extensions | ['.xlsx', '.xls'] | Excel file extensions to process |
| skip_empty_text | true | Skip rows with empty text column |
| metadata_columns | None | Additional columns to include as metadata |
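
A configuration that also carries extra columns through as metadata might look as follows; the column letters are illustrative:

crawlers:
  excel_data:
    module: crawler.excel.main.run
    parameters:
      path: /data/spreadsheets
      title_column: A
      text_column: B
      metadata_columns:
        - C
        - D
      skip_empty_text: true
      extensions:
        - .xlsx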

7. PDF Directory

Indexes PDF files from a directory.

Example:

crawlers:
  pdf_library:
    module: crawler.pdfdir.main.run
    parameters:
      path: /data/pdfs

Parameters:

| Parameter | Default | Description |
|---|---|---|
| path | Required | Path to directory containing PDF files |

8. Helpdesk

Indexes tickets and articles from helpdesk systems.

Example:

crawlers:
  support_tickets:
    module: crawler.helpdesk.main.run
    parameters:
      credentials:
        base_url: https://support.example.com
        token: your-api-token

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | {} | Authentication (base_url and token required) |
| max_tickets | None | Maximum number of tickets to fetch |
| include_articles | true | Include ticket articles/comments |
| proxies | None | Proxy configuration (e.g., {"https": "socks5://localhost:1337"}) |
| query_params | None | Additional API query parameters |
| metadata | {} | Additional metadata for all documents |
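
A configuration that caps the number of tickets and routes requests through a proxy might look like the sketch below; the ticket limit is illustrative, and the proxy address follows the format shown in the table above:

crawlers:
  support_tickets:
    module: crawler.helpdesk.main.run
    parameters:
      credentials:
        base_url: https://support.example.com
        token: your-api-token
      max_tickets: 5000
      include_articles: true
      proxies:
        https: socks5://localhost:1337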

9. Jira

Indexes issues from Atlassian Jira projects.

Example:

crawlers:
  jira_issues:
    module: crawler.jira.main.run
    parameters:
      projects: [PROJ1, PROJ2]
      credentials:
        server: https://mycompany.atlassian.net
        bearer_token: your-token

Parameters:

| Parameter | Default | Description |
|---|---|---|
| projects | [] | List of project keys (empty = all projects) |
| credentials | Required | Authentication (see below) |
| lookback_period_days | 1095 (3 years) | How many days to look back for issues |
| max_issues | 10000 | Maximum number of issues to fetch |

Credentials options:

- Bearer token: {server, bearer_token}
- Basic auth: {server, basic_username, basic_password}
- Optional: proxies for proxy configuration
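
With basic authentication instead of a bearer token, the credentials block changes as sketched below; the lookback and issue-limit values are illustrative overrides of the defaults:

crawlers:
  jira_issues:
    module: crawler.jira.main.run
    parameters:
      projects: [PROJ1, PROJ2]
      lookback_period_days: 365
      max_issues: 5000
      credentials:
        server: https://mycompany.atlassian.net
        basic_username: user@example.com
        basic_password: your-password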

10. Slack

Indexes conversations from Slack workspaces.

Example:

crawlers:
  slack_history:
    module: crawler.slack.main.run
    parameters:
      workspace_url: https://mycompany.slack.com
      bot_token: xoxb-your-bot-token
      channels: [general, support]

Parameters:

| Parameter | Default | Description |
|---|---|---|
| workspace_url | Required | Slack workspace URL |
| bot_token | Required | Slack bot token (xoxb-...) |
| channels | [] | Specific channels to crawl (empty = all accessible) |
| exclude_channels | [] | Channels to exclude from crawling |
| grouping_mode | 'hybrid' | Message grouping: 'thread', 'time', or 'hybrid' |
| inactivity_threshold_hours | 24 | Hours of inactivity before creating a new document (time-based grouping) |
| max_messages_per_document | 200 | Maximum messages per document |
| lookback_days | None | Days to look back (None = all history) |
| exclude_archived | true | Skip archived channels |
| page_size | 200 | Messages per API call (max 1000) |
| metadata | {} | Additional metadata for all documents |
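
A configuration that limits history and tunes message grouping might look like this; the excluded channel, grouping mode, and numeric values are illustrative:

crawlers:
  slack_history:
    module: crawler.slack.main.run
    parameters:
      workspace_url: https://mycompany.slack.com
      bot_token: xoxb-your-bot-token
      channels: [general, support]
      exclude_channels: [random]
      grouping_mode: thread
      lookback_days: 90
      max_messages_per_document: 200
      exclude_archived: true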

11. YouTrack

Indexes issues from JetBrains YouTrack.

Example:

crawlers:
  youtrack:
    module: crawler.youtrack.main.run
    parameters:
      credentials:
        base_url: https://youtrack.example.com
        token: your-api-token
      projects: [PROJ1, PROJ2]

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | {} | Authentication (base_url and token required) |
| projects | None | Project IDs to crawl (None = all projects) |
| max_issues_per_project | None | Limit issues per project |
| max_comments_per_issues | None | Limit comments per issue |
| proxies | None | Proxy configuration |
| metadata | {} | Additional metadata for all documents |
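
A configuration that limits the volume fetched per project might look as follows; the numeric limits are illustrative:

crawlers:
  youtrack:
    module: crawler.youtrack.main.run
    parameters:
      credentials:
        base_url: https://youtrack.example.com
        token: your-api-token
      projects: [PROJ1, PROJ2]
      max_issues_per_project: 2000
      max_comments_per_issues: 50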

12. Salesforce

Indexes objects from Salesforce CRM.

Example:

crawlers:
  salesforce:
    module: crawler.salesforce.client
    parameters:
      credentials:
        base_url: https://mycompany.my.salesforce.com
        client_id: your-client-id
        client_secret: your-client-secret
        version: v59.0

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | Required | OAuth credentials (see below) |

Credentials:

- base_url: Salesforce instance URL
- client_id: OAuth client ID
- client_secret: OAuth client secret
- version: API version (e.g., v59.0)

Custom Data Sources

To create a custom data source, implement a run function in a Python module and reference it in the configuration:

# my_custom_crawler.py
from crawler.struct import RunCrawlerConfig, ParsedDocument

def run(conf: RunCrawlerConfig):
    # Your crawler implementation
    for doc in your_data_source:
        yield ParsedDocument(
            title=doc.title,
            text=doc.content,
            url=doc.url,
            breadcrumbs=doc.path,
            metadata=doc.metadata
        )

# config.yaml
crawlers:
  custom_source:
    module: my_custom_crawler.run
    parameters:
      # Your custom parameters

For more examples, see: SerenityGPT-examples

Don't see your data source?

More data sources are available than are documented here. Please contact us to enable, and get documentation for, additional integrations tailored to your needs.