Data Sources Configuration

This page documents all supported data sources and their configuration parameters. Each data source can be configured in the config.yaml file under the crawlers section of a tenant.

Adobe Marketo

Indexes marketing content assets from Adobe Marketo Engage, including emails, landing pages, snippets, and forms. The connector retrieves assets through the Marketo REST API.

How it connects: The connector authenticates using the OAuth 2.0 client credentials grant. It first requests an access token from the Marketo identity endpoint using the client ID and client secret, then uses that token to call the Marketo Asset API to retrieve marketing content. Access tokens expire after one hour and are refreshed automatically.

Example:

crawlers:
  marketo_assets:
    module: crawler.marketo.main.run
    parameters:
      credentials:
        base_url: https://123-ABC-456.mktorest.com
        identity_url: https://123-ABC-456.mktorest.com/identity
        client_id: your-client-id
        client_secret: your-client-secret

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | Required | OAuth credentials (see below) |
| asset_types | ['emails', 'landing_pages'] | Types of assets to retrieve |
| metadata | {} | Additional metadata for all documents |

Credentials:

  • base_url: Marketo REST API base URL (found in Admin > Integration > Web Services)
  • identity_url: Marketo identity endpoint for token retrieval
  • client_id: OAuth client ID (found in Admin > Integration > LaunchPoint > View Details)
  • client_secret: OAuth client secret (found alongside client ID)

Authentication flow:

  1. The connector requests an access token: GET <identity_url>/oauth/token?grant_type=client_credentials&client_id=<id>&client_secret=<secret>
  2. The token is included in subsequent API calls as Authorization: Bearer <access_token>
  3. Tokens expire after 3,600 seconds (one hour) and are renewed automatically
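
The same exchange can be sketched in a few lines of Python. This is illustrative only (the connector handles token refresh and error cases internally); the asset endpoint shown is the standard Marketo Asset API path for emails:

import requests

identity_url = "https://123-ABC-456.mktorest.com/identity"
base_url = "https://123-ABC-456.mktorest.com"

# Step 1: exchange client credentials for an access token (valid for one hour)
token = requests.get(
    f"{identity_url}/oauth/token",
    params={
        "grant_type": "client_credentials",
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
    },
).json()["access_token"]

# Step 2: call the Asset API with the bearer token
emails = requests.get(
    f"{base_url}/rest/asset/v1/emails.json",
    headers={"Authorization": f"Bearer {token}"},
).json()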

Note

The LaunchPoint custom service must be created with an API-only user that has the appropriate asset access permissions. See the Marketo REST API documentation for setup instructions.

Azure Files

Syncs and indexes files from Azure Blob Storage.

Example:

crawlers:
  azure_docs:
    module: crawler.azurefiles.main.run
    parameters:
      source: https://myaccount.blob.core.windows.net/container
      public_url: https://docs.mycompany.com/

Parameters:

| Parameter | Default | Description |
|---|---|---|
| source | Required | Azure Blob Storage URL |
| public_url | Required | Base URL for public document links |
| ignore_regex | [] | List of regex patterns to ignore files |
| sync_ignore_regex | [] | Regex patterns for sync exclusion (defaults to ignore_regex) |
| extensions | [] | File extensions to process (empty = all) |
| target | ENV.FILES_PATH / 'azuresync' | Local sync target directory |
| text_selectors | [] | CSS selectors for text extraction |
| title_selector | 'title::text' | CSS selector for document title |
| breadcrumb_selector | '' | CSS selector for breadcrumbs |
| metadata | {'document_type': 'DOC'} | Additional metadata |
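
A fuller configuration showing the optional filters and selectors (all values below are illustrative):

crawlers:
  azure_docs:
    module: crawler.azurefiles.main.run
    parameters:
      source: https://myaccount.blob.core.windows.net/container
      public_url: https://docs.mycompany.com/
      extensions: ['.html', '.pdf']
      ignore_regex: ['.*/drafts/.*']
      text_selectors: ['main', 'article']
      title_selector: 'h1::text'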

Confluence

Indexes content from Atlassian Confluence spaces.

How it connects: The connector authenticates using HTTP Basic Authentication with a username and API token, then paginates through all pages in the specified space. Page content is converted from Confluence HTML to Markdown, with Confluence-specific macros stripped automatically.

Example:

crawlers:
  confluence:
    module: crawler.confluence.main.run
    parameters:
      space_key: DOCS
      credentials:
        server: https://mycompany.atlassian.net
        basic_username: user@example.com
        basic_password: your-api-token

Parameters:

| Parameter | Default | Description |
|---|---|---|
| space_key | Required | Confluence space key to crawl |
| credentials | Required | Authentication credentials (see below) |
| metadata | {} | Additional metadata for all documents |

Credentials:

  • server: Confluence instance URL (e.g., https://mycompany.atlassian.net)
  • basic_username: Email address of the Confluence user or service account
  • basic_password: API token generated from Atlassian account settings
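
For reference, the retrieval described above can be sketched with the standard Confluence Cloud REST API and the requests library. This illustrates the API mechanics only; it is not the connector's actual code:

import requests

server = "https://mycompany.atlassian.net"
auth = ("user@example.com", "your-api-token")  # Basic auth: email + API token

start, limit = 0, 25
while True:
    # Page through all pages in the space, including the stored HTML body
    resp = requests.get(
        f"{server}/wiki/rest/api/content",
        params={"spaceKey": "DOCS", "type": "page",
                "expand": "body.storage", "start": start, "limit": limit},
        auth=auth,
    ).json()
    for page in resp["results"]:
        html = page["body"]["storage"]["value"]  # converted to Markdown downstream
    if len(resp["results"]) < limit:
        break
    start += limit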

Setting up access:

SerenityGPT connects to Confluence as a specific user account. That account determines which spaces and pages the connector can index. To follow the principle of least privilege, create a dedicated account and grant it read-only access to only the spaces you want indexed.

Creating a service account (recommended):

Atlassian provides service accounts that are not tied to a person and do not consume a Confluence license seat. Every organization gets 5 free service accounts (more with Atlassian Guard Standard).

  1. Go to admin.atlassian.com and select your organization.
  2. Select Directory > Service accounts > Create a service account.
  3. Enter a descriptive name (e.g., serenity-crawler) and grant the account access to Confluence.

If service accounts are not available, create a dedicated regular Atlassian user account instead. This consumes a Confluence license seat.

Generating an API token:

  • Service accounts: In admin.atlassian.com > Directory > Service accounts, select the account, then Create credentials > API token. Select read-only scopes such as read:confluence-content.all and read:confluence-space.summary.
  • Regular users: Go to id.atlassian.com/manage-profile/security/api-tokens and select Create API token.

API tokens expire after at most one year. Plan for periodic rotation, and store the token securely: it cannot be retrieved again after creation.

Restricting access to specific spaces:

The Confluence REST API respects the same permissions as the web interface. If the account cannot view a space in the browser, it cannot access that space through the API.

  1. Create a dedicated Confluence group (e.g., serenity-api-readers) and add the service account to it.
  2. For each space the connector should index, go to Space settings > Space access, add the group, and grant only the View permission.
  3. Verify the account is not a member of broad-access groups (such as the default confluence-users group) that would grant unintended access to other spaces.

Note

Confluence permissions are additive. The account receives the union of all permissions from all its groups. The only way to prevent access to a space is to ensure the account has no path to the View permission on that space.

Document360

Indexes content from Document360 knowledge bases.

Example:

crawlers:
  knowledge_base:
    module: crawler.document360.main.run
    parameters:
      credentials:
        api_token: your-api-token

Parameters:

| Parameter | Default | Description |
|---|---|---|
| base_url | "https://apihub.document360.io/v2" | Document360 API endpoint |
| project_version_id | None | Specific project version (None = all versions) |
| credentials | {} | API authentication (api_token required) |
| metadata | {} | Additional metadata for all documents |

Excel Files

Extracts content from Excel spreadsheets.

Example:

crawlers:
  excel_data:
    module: crawler.excel.main.run
    parameters:
      path: /data/spreadsheets
      title_column: A
      text_column: B

Parameters:

| Parameter | Default | Description |
|---|---|---|
| path | Required | Path to Excel files directory |
| title_column | Required | Column for document titles (name or letter) |
| text_column | Required | Column for document content (name or letter) |
| extensions | ['.xlsx', '.xls'] | Excel file extensions to process |
| skip_empty_text | true | Skip rows with empty text column |
| metadata_columns | None | Additional columns to include as metadata |
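
For example, a sheet with named header columns might be configured as follows. This assumes metadata_columns accepts a list of column names; the column names themselves are illustrative:

crawlers:
  excel_data:
    module: crawler.excel.main.run
    parameters:
      path: /data/spreadsheets
      title_column: Title
      text_column: Body
      metadata_columns: [Category, Owner]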

Helpdesk

Indexes tickets and articles from helpdesk systems.

Example:

crawlers:
  support_tickets:
    module: crawler.helpdesk.main.run
    parameters:
      credentials:
        base_url: https://support.example.com
        token: your-api-token

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | {} | Authentication (base_url and token required) |
| max_tickets | None | Maximum number of tickets to fetch |
| include_articles | true | Include ticket articles/comments |
| proxies | None | Proxy configuration (e.g., {"https": "socks5://localhost:1337"}) |
| query_params | None | Additional API query parameters |
| metadata | {} | Additional metadata for all documents |

Jira

Indexes issues from Atlassian Jira projects.

Example:

crawlers:
  jira_issues:
    module: crawler.jira.main.run
    parameters:
      projects: [PROJ1, PROJ2]
      credentials:
        server: https://mycompany.atlassian.net
        bearer_token: your-token

Parameters:

| Parameter | Default | Description |
|---|---|---|
| projects | [] | List of project keys (empty = all projects) |
| credentials | Required | Authentication (see below) |
| lookback_period_days | 1095 (3 years) | How many days to look back for issues |
| max_issues | 10000 | Maximum number of issues to fetch |

Credentials options:

  • Bearer token: {server, bearer_token}
  • Basic auth: {server, basic_username, basic_password}
  • Optional: proxies for proxy configuration
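
For Jira Cloud, basic authentication with an email address and API token is the usual choice:

crawlers:
  jira_issues:
    module: crawler.jira.main.run
    parameters:
      projects: [PROJ1, PROJ2]
      credentials:
        server: https://mycompany.atlassian.net
        basic_username: user@example.com
        basic_password: your-api-token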

Local Files

Indexes files from local filesystem directories.

Example:

crawlers:
  local_docs:
    module: crawler.files.main.run
    parameters:
      source: docs/
      public_url: https://mycompany.com/docs/

Parameters:

| Parameter | Default | Description |
|---|---|---|
| source | Required | Path to source directory (relative to files volume) |
| public_url | Required | Base URL for constructing document links |
| sanitize_url | true | Whether to sanitize URLs |
| extensions | ['.html', '.htm', '.pdf'] | File extensions to process |
| ignore_regex | '' | Regex pattern to exclude files/directories |
| text_selectors | ['body'] | CSS selectors for HTML text extraction |
| title_selector | 'title::text' | CSS selector for document title |
| breadcrumb_selector | '' | CSS selector for breadcrumbs |
| encoding | 'utf-8' | Character encoding for HTML files |
| metadata | {} | Additional metadata for all documents |
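
For HTML sources, the selector parameters control what gets extracted. A configuration with custom selectors (selector values are illustrative):

crawlers:
  local_docs:
    module: crawler.files.main.run
    parameters:
      source: docs/
      public_url: https://mycompany.com/docs/
      ignore_regex: '.*/internal/.*'
      text_selectors: ['main.content', 'article']
      title_selector: 'h1::text'
      breadcrumb_selector: 'nav.breadcrumbs a::text'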

PDF Directory

Indexes PDF files from a directory.

Example:

crawlers:
  pdf_library:
    module: crawler.pdfdir.main.run
    parameters:
      path: /data/pdfs

Parameters:

| Parameter | Default | Description |
|---|---|---|
| path | Required | Path to directory containing PDF files |

Salesforce

Indexes objects from Salesforce CRM.

Example:

crawlers:
  salesforce:
    module: crawler.salesforce.client
    parameters:
      credentials:
        base_url: https://mycompany.my.salesforce.com
        client_id: your-client-id
        client_secret: your-client-secret
        version: v59.0

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | Required | OAuth credentials (see below) |

Credentials:

  • base_url: Salesforce instance URL
  • client_id: OAuth client ID
  • client_secret: OAuth client secret
  • version: API version (e.g., v59.0)
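
As a sketch, the token exchange looks like this, assuming the connected app is enabled for the OAuth 2.0 client credentials flow (illustrative; the query shown is a generic SOQL example, not necessarily what the connector runs):

import requests

base_url = "https://mycompany.my.salesforce.com"

# Exchange client credentials for an access token
token = requests.post(
    f"{base_url}/services/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
    },
).json()["access_token"]

# Query records through the REST API with the bearer token
accounts = requests.get(
    f"{base_url}/services/data/v59.0/query",
    params={"q": "SELECT Id, Name FROM Account"},
    headers={"Authorization": f"Bearer {token}"},
).json()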

SharePoint

Indexes documents and files from SharePoint Online document libraries.

Example:

crawlers:
  sharepoint_docs:
    module: crawler.sharepoint.main.run
    parameters:
      site_url: https://mycompany.sharepoint.com/sites/docs
      document_library: Shared Documents
      credentials:
        tenant_id: your-tenant-id
        client_id: your-client-id
        client_secret: your-client-secret

Parameters:

| Parameter | Default | Description |
|---|---|---|
| site_url | Required | SharePoint site URL |
| document_library | 'Shared Documents' | Name of the document library to crawl |
| folder_path | '' | Subfolder path within the library (empty = root) |
| credentials | Required | Azure AD app credentials (see below) |
| extensions | [] | File extensions to process (empty = all supported) |
| ignore_regex | [] | Regex patterns to exclude files or folders |
| metadata | {} | Additional metadata for all documents |

Credentials:

  • tenant_id: Azure AD tenant ID
  • client_id: Azure AD application (client) ID
  • client_secret: Azure AD client secret

Permissions:

The Azure AD app registration requires the following Microsoft Graph API application permissions:

  • Sites.Read.All — read items in all site collections
  • Files.Read.All — read all files that the app has access to

Grant admin consent for these permissions in the Azure portal under App registrations > API permissions. Application permissions allow the crawler to access SharePoint without a signed-in user, so restrict the client secret to authorized personnel only.
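
A minimal sketch of app-only authentication and site access using the msal library and Microsoft Graph (illustrative; the connector's internals may differ):

import msal
import requests

tenant_id = "your-tenant-id"
app = msal.ConfidentialClientApplication(
    "your-client-id",
    client_credential="your-client-secret",
    authority=f"https://login.microsoftonline.com/{tenant_id}",
)
# Client credentials flow: no signed-in user; application permissions apply
token = app.acquire_token_for_client(
    scopes=["https://graph.microsoft.com/.default"]
)["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# Resolve the site, then enumerate its drives (document libraries)
site = requests.get(
    "https://graph.microsoft.com/v1.0/sites/mycompany.sharepoint.com:/sites/docs",
    headers=headers,
).json()
drives = requests.get(
    f"https://graph.microsoft.com/v1.0/sites/{site['id']}/drives",
    headers=headers,
).json()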

Skilljar

Indexes courses and lesson content from the Skilljar customer education platform. Skilljar is a learning management system (LMS) used to deliver training programs, certifications, and onboarding content. The connector retrieves course catalogs, lesson details, and associated learning materials through the Skilljar REST API.

How it connects: The connector authenticates using an API key generated from the Skilljar dashboard. Skilljar uses HTTP Basic authentication where the API key is passed as the username with no password. The connector calls the Skilljar API at api.skilljar.com to list published courses, retrieve lesson content, and extract associated metadata.

Example:

crawlers:
  skilljar_courses:
    module: crawler.skilljar.main.run
    parameters:
      credentials:
        api_key: your-api-key

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (api_key required) |
| domain | None | Skilljar domain to crawl (None = all domains) |
| metadata | {} | Additional metadata for all documents |

Credentials:

  • api_key: Skilljar API key, generated from Organization Settings > API Credentials in the Skilljar dashboard
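
The authentication scheme is easy to verify with a sketch like the following; the published-courses path is an assumption based on Skilljar's public API and may differ for your account:

import requests

api_key = "your-api-key"

# HTTP Basic auth: the API key is the username, the password is left empty
courses = requests.get(
    "https://api.skilljar.com/v1/published-courses",  # assumed endpoint path
    auth=(api_key, ""),
).json()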

Note

Skilljar offers read-only and standard API keys. A read-only key is sufficient for the connector. The API enforces a rate limit of 5,000 requests per hour per organization.

Slack

Indexes conversations from Slack workspaces.

Example:

crawlers:
  slack_history:
    module: crawler.slack.main.run
    parameters:
      workspace_url: https://mycompany.slack.com
      bot_token: xoxb-your-bot-token
      channels: [general, support]

Parameters:

| Parameter | Default | Description |
|---|---|---|
| workspace_url | Required | Slack workspace URL |
| bot_token | Required | Slack bot token (xoxb-...) |
| channels | [] | Specific channels to crawl (empty = all accessible) |
| exclude_channels | [] | Channels to exclude from crawling |
| grouping_mode | 'hybrid' | Message grouping: 'thread', 'time', or 'hybrid' |
| inactivity_threshold_hours | 24 | Hours before creating new document (time-based) |
| max_messages_per_document | 200 | Maximum messages per document |
| lookback_days | None | Days to look back (None = all history) |
| exclude_archived | true | Skip archived channels |
| page_size | 200 | Messages per API call (max 1000) |
| metadata | {} | Additional metadata for all documents |
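
To illustrate time-based grouping, here is a simplified sketch using the official slack_sdk client (an assumption about tooling; the connector's hybrid mode additionally follows thread replies). A new document starts whenever the gap between consecutive messages exceeds the inactivity threshold or the per-document cap is reached:

from slack_sdk import WebClient

client = WebClient(token="xoxb-your-bot-token")

INACTIVITY_SECONDS = 24 * 3600  # inactivity_threshold_hours
MAX_PER_DOC = 200               # max_messages_per_document
documents, current, last_ts = [], [], None

# conversations_history returns messages newest-first; iterating the
# response follows the pagination cursor automatically
for page in client.conversations_history(channel="C0123456789",  # placeholder channel ID
                                          limit=200):
    for msg in page["messages"]:
        ts = float(msg["ts"])
        if current and (last_ts - ts > INACTIVITY_SECONDS
                        or len(current) >= MAX_PER_DOC):
            documents.append(current)
            current = []
        current.append(msg)
        last_ts = ts
if current:
    documents.append(current)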

Vanilla Forums

Indexes discussions, comments, and knowledge base articles from Vanilla Forums (Higher Logic Vanilla) community platforms. The connector retrieves community content through the Vanilla API v2.

How it connects: The connector authenticates using a personal access token and calls the Vanilla API v2 endpoints at https://<your-community>/api/v2/. It retrieves discussions, comments, categories, and knowledge base articles. The token is passed in the Authorization: Bearer <token> header with each request.

Example:

crawlers:
  vanilla_community:
    module: crawler.vanilla.main.run
    parameters:
      credentials:
        base_url: https://community.example.com
        access_token: your-personal-access-token

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (base_url and access_token required) |
| categories | [] | Category IDs to crawl (empty = all categories) |
| include_knowledge_base | true | Include knowledge base articles |
| metadata | {} | Additional metadata for all documents |

Credentials:

  • base_url: URL of your Vanilla Forums community (e.g., https://community.example.com)
  • access_token: Personal access token, generated from your Vanilla Forums profile under Edit Profile > Access Tokens
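
A minimal sketch of the API access, using the standard Vanilla API v2 endpoints (illustrative only):

import requests

base_url = "https://community.example.com"
headers = {"Authorization": "Bearer your-personal-access-token"}

# List discussions, then fetch the comments attached to each one
discussions = requests.get(f"{base_url}/api/v2/discussions", headers=headers).json()
for discussion in discussions:
    comments = requests.get(
        f"{base_url}/api/v2/comments",
        params={"discussionID": discussion["discussionID"]},
        headers=headers,
    ).json()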

Note

The access token inherits the permissions of the user who generated it. Use an administrator account to ensure full access to all community content.

Vimeo

Indexes video content from Vimeo by downloading and transcribing video audio. The connector retrieves video metadata through the Vimeo API and uses speech-to-text transcription to convert spoken content into searchable text.

How it connects: The connector authenticates using a personal access token from the Vimeo Developer portal. It calls the Vimeo API (api.vimeo.com) to list videos from a user account, channel, or folder. For each video, the connector first checks for existing text tracks (captions or subtitles) through the /videos/{video_id}/texttracks endpoint. If captions are available, the connector downloads them directly. If no captions exist, the connector downloads the video audio and runs it through a speech-to-text transcription service to generate a text representation of the content.

Example:

crawlers:
  vimeo_videos:
    module: crawler.vimeo.main.run
    parameters:
      credentials:
        access_token: your-personal-access-token
      user_id: your-user-id

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (access_token required) |
| user_id | 'me' | Vimeo user ID or 'me' for the token owner |
| folder_id | None | Specific folder/project to crawl (None = all videos) |
| prefer_captions | true | Use existing captions instead of transcribing when available |
| metadata | {} | Additional metadata for all documents |

Credentials:

  • access_token: Vimeo personal access token, generated from the Vimeo Developer portal under My Apps > Authentication

Transcription process:

  1. The connector lists all videos using the Vimeo API (GET /me/videos or GET /users/{user_id}/videos)
  2. For each video, it checks for existing text tracks (GET /videos/{video_id}/texttracks)
  3. If captions or subtitles exist, the connector downloads them as WebVTT files using the temporary download link provided in the API response
  4. If no text tracks are available, the connector downloads the video audio and transcribes it using an integrated speech-to-text service
  5. The transcribed text is indexed alongside video metadata (title, description, duration, tags)
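
Steps 2 and 3 can be sketched as follows (illustrative; the video ID is a placeholder):

import requests

headers = {"Authorization": "Bearer your-personal-access-token"}
video_id = "123456789"  # placeholder

# Step 2: check for existing text tracks
tracks = requests.get(
    f"https://api.vimeo.com/videos/{video_id}/texttracks",
    headers=headers,
).json()

if tracks.get("total", 0) > 0:
    # Step 3: download the first track as WebVTT via its temporary link
    vtt = requests.get(tracks["data"][0]["link"]).text
else:
    # Steps 4-5: download audio and transcribe instead (not shown)
    ...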

Note

The personal access token must be generated by the account owner of the videos. Ensure the token has the private and video_files scopes enabled to allow access to video content and text tracks. Auto-generated captions are available on Vimeo Plus plans and above.

Web Crawler

Crawls web pages starting from specified URLs and follows links recursively.

Example:

crawlers:
  docs_site:
    module: crawler.web.main.run
    parameters:
      start_urls:
        - https://docs.example.com
      match_urls:
        - https://docs.example.com/.*

Parameters:

| Parameter | Default | Description |
|---|---|---|
| start_urls | [] | List of URLs to start crawling from |
| match_urls | [] | Regular expressions to match URLs that should be crawled |
| sitemap_urls | [] | List of sitemap URLs to parse for additional pages |
| match_content_types | ['text/html', 'application/pdf'] | Content types to process |
| text_selectors | [] | CSS selectors to extract text (first match used) |
| breadcrumb_selector | '' | CSS selector for breadcrumb navigation |
| title_selector | 'title::text' | CSS selector for document title |
| metadata | {} | Additional metadata for all documents |
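
A more complete configuration combining sitemap discovery with extraction selectors (selector values are illustrative):

crawlers:
  docs_site:
    module: crawler.web.main.run
    parameters:
      start_urls:
        - https://docs.example.com
      match_urls:
        - https://docs.example.com/.*
      sitemap_urls:
        - https://docs.example.com/sitemap.xml
      text_selectors: ['main.doc-content', 'article']
      title_selector: 'h1::text'
      breadcrumb_selector: 'nav.breadcrumb a::text'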

YouTube

Indexes video content from YouTube channels or playlists by downloading and transcribing video audio. The connector retrieves video metadata through the YouTube Data API v3 and converts spoken content into searchable text.

How it connects: The connector uses the YouTube Data API v3 to list videos from a channel or playlist. An API key is sufficient for listing public video metadata. For each video, the connector retrieves available caption tracks. If auto-generated or manually uploaded captions exist, the connector extracts the transcript text. If captions are not available, the connector downloads the video audio and runs it through a speech-to-text transcription service.

Example:

crawlers:
  youtube_videos:
    module: crawler.youtube.main.run
    parameters:
      credentials:
        api_key: your-youtube-api-key
      channel_id: UCxxxxxxxxxxxxxxxx

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (see below) |
| channel_id | None | YouTube channel ID to crawl |
| playlist_id | None | YouTube playlist ID to crawl |
| prefer_captions | true | Use existing captions instead of transcribing when available |
| max_videos | None | Maximum number of videos to process |
| metadata | {} | Additional metadata for all documents |

Credentials:

  • api_key: YouTube Data API v3 key, generated from the Google Cloud Console under APIs & Services > Credentials
  • oauth_client_id (optional): Required only for downloading caption tracks from the Captions API
  • oauth_client_secret (optional): Required alongside oauth_client_id

Transcription process:

  1. The connector retrieves the channel's uploads playlist using the Channels API, then lists all videos through the PlaylistItems API (playlistItems.list). Alternatively, it lists videos from a specific playlist directly.
  2. For each video, the connector retrieves available caption tracks and extracts the transcript text
  3. If auto-generated or manual captions are available, the connector downloads the caption content in SRT or VTT format
  4. If captions are not available, the connector downloads the video audio and transcribes it using an integrated speech-to-text service
  5. The transcribed or captioned text is indexed alongside video metadata (title, description, channel, publish date)
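
Step 1 can be sketched with an API key and plain HTTP calls against the YouTube Data API v3 (illustrative):

import requests

API = "https://www.googleapis.com/youtube/v3"
key = "your-youtube-api-key"
channel_id = "UCxxxxxxxxxxxxxxxx"

# Resolve the channel's uploads playlist (1 quota unit)
channel = requests.get(
    f"{API}/channels",
    params={"part": "contentDetails", "id": channel_id, "key": key},
).json()
uploads = channel["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

# Page through the uploads playlist (1 unit per page of up to 50 items)
videos, page_token = [], None
while True:
    resp = requests.get(
        f"{API}/playlistItems",
        params={"part": "snippet", "playlistId": uploads,
                "maxResults": 50, "pageToken": page_token, "key": key},
    ).json()
    videos.extend(resp["items"])
    page_token = resp.get("nextPageToken")
    if not page_token:
        break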

Note

The YouTube Data API v3 must be enabled in your Google Cloud project. The API enforces a default quota of 10,000 units per day. Listing videos costs 1 unit per request, while search operations cost 100 units each.

YouTrack

Indexes issues from JetBrains YouTrack.

Example:

crawlers:
  youtrack:
    module: crawler.youtrack.main.run
    parameters:
      credentials:
        base_url: https://youtrack.example.com
        token: your-api-token
      projects: [PROJ1, PROJ2]

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | {} | Authentication (base_url and token required) |
| projects | None | Project IDs to crawl (None = all projects) |
| max_issues_per_project | None | Limit issues per project |
| max_comments_per_issues | None | Limit comments per issue |
| proxies | None | Proxy configuration |
| metadata | {} | Additional metadata for all documents |

Zoomin

Indexes technical documentation from the Zoomin knowledge delivery platform. Zoomin aggregates content from multiple authoring tools and content management systems into a unified, searchable documentation portal. The connector uses the Zoomin API to retrieve published content.

How it connects: The connector authenticates against the Zoomin API using an API token provided by your Zoomin account team. It calls the search and content retrieval endpoints to enumerate and download documents, topics, and knowledge articles from your Zoomin portal. The API host follows the pattern api.<your-portal>.zoominsoftware.com.

Example:

crawlers:
  zoomin_docs:
    module: crawler.zoomin.main.run
    parameters:
      credentials:
        base_url: https://api.docs.example.zoominsoftware.com
        api_token: your-api-token

Parameters:

| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (base_url and api_token required) |
| metadata | {} | Additional metadata for all documents |

Credentials:

  • base_url: Zoomin API base URL (provided by your Zoomin account team)
  • api_token: API token for authentication (passed as Authorization: Bearer <token>)

Note

Zoomin API credentials and endpoint details are provisioned per customer. Contact your Zoomin account representative to obtain the API base URL and authentication token for your portal.

Custom Data Sources

To create a custom data source, implement a run function in a Python module and reference it in the configuration:

# my_custom_crawler.py
from crawler.struct import RunCrawlerConfig, ParsedDocument

def run(conf: RunCrawlerConfig):
    # Iterate over your data source and yield one ParsedDocument per item
    for doc in your_data_source:
        yield ParsedDocument(
            title=doc.title,
            text=doc.content,
            url=doc.url,
            breadcrumbs=doc.path,
            metadata=doc.metadata,
        )

Then reference the module in config.yaml:

# config.yaml
crawlers:
  custom_source:
    module: my_custom_crawler.run
    parameters:
      # Your custom parameters

For more examples, see: SerenityGPT-examples

Don't see your data source?

More data sources are available beyond those documented here. Please contact us for documentation and enablement of additional integrations tailored to your needs.