Data Sources Configuration
This page documents all supported data sources and their configuration parameters. Each data source can be configured in the config.yaml file under the crawlers section of a tenant.
Adobe Marketo
Indexes marketing content assets from Adobe Marketo Engage, including emails, landing pages, snippets, and forms. The connector retrieves assets through the Marketo REST API.
How it connects: The connector authenticates using the OAuth 2.0 client credentials grant. It first requests an access token from the Marketo identity endpoint using the client ID and client secret, then uses that token to call the Marketo Asset API to retrieve marketing content. Access tokens expire after one hour and are refreshed automatically.
Example:
```yaml
crawlers:
  marketo_assets:
    module: crawler.marketo.main.run
    parameters:
      credentials:
        base_url: https://123-ABC-456.mktorest.com
        identity_url: https://123-ABC-456.mktorest.com/identity
        client_id: your-client-id
        client_secret: your-client-secret
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | OAuth credentials (see below) |
| asset_types | ['emails', 'landing_pages'] | Types of assets to retrieve |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `base_url`: Marketo REST API base URL (found in Admin > Integration > Web Services)
- `identity_url`: Marketo identity endpoint for token retrieval
- `client_id`: OAuth client ID (found in Admin > Integration > LaunchPoint > View Details)
- `client_secret`: OAuth client secret (found alongside client ID)
Authentication flow:
- The connector requests an access token: `GET <identity_url>/oauth/token?grant_type=client_credentials&client_id=<id>&client_secret=<secret>`
- The token is included in subsequent API calls as `Authorization: Bearer <access_token>`
- Tokens expire after 3,600 seconds (one hour) and are renewed automatically
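As an illustration, the token request above reduces to a few lines of Python. This is a minimal sketch using the `requests` library; the connector additionally handles caching and automatic renewal.

```python
import requests

# Identity endpoint from the credentials block above
identity_url = "https://123-ABC-456.mktorest.com/identity"

resp = requests.get(
    f"{identity_url}/oauth/token",
    params={
        "grant_type": "client_credentials",
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
    },
)
resp.raise_for_status()
token = resp.json()["access_token"]  # valid for 3,600 seconds

# Subsequent Asset API calls carry the token as a Bearer header
headers = {"Authorization": f"Bearer {token}"}
```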
Note
The LaunchPoint custom service must be created with an API-only user that has the appropriate asset access permissions. See the Marketo REST API documentation for setup instructions.
Azure Files
Syncs and indexes files from Azure Blob Storage.
Example:
```yaml
crawlers:
  azure_docs:
    module: crawler.azurefiles.main.run
    parameters:
      source: https://myaccount.blob.core.windows.net/container
      public_url: https://docs.mycompany.com/
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| source | Required | Azure Blob Storage URL |
| public_url | Required | Base URL for public document links |
| ignore_regex | [] | List of regex patterns to ignore files |
| sync_ignore_regex | [] | Regex patterns for sync exclusion (defaults to ignore_regex) |
| extensions | [] | File extensions to process (empty = all) |
| target | ENV.FILES_PATH / 'azuresync' | Local sync target directory |
| text_selectors | [] | CSS selectors for text extraction |
| title_selector | 'title::text' | CSS selector for document title |
| breadcrumb_selector | '' | CSS selector for breadcrumbs |
| metadata | {'document_type': 'DOC'} | Additional metadata |
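The selector parameters (`text_selectors`, `title_selector`, `breadcrumb_selector`) use Scrapy-style CSS selectors, where the `::text` pseudo-element extracts text content. A minimal illustration, assuming the `parsel` library that implements this syntax:

```python
from parsel import Selector

html = "<html><head><title>Quarterly Report</title></head><body><p>Hello</p></body></html>"
sel = Selector(text=html)

# 'title::text' matches the text node inside <title>
print(sel.css("title::text").get())  # -> "Quarterly Report"
```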
Confluence
Indexes content from Atlassian Confluence spaces.
How it connects: The connector authenticates using HTTP Basic Authentication with a username and API token, then paginates through all pages in the specified space. Page content is converted from Confluence HTML to Markdown, with Confluence-specific macros stripped automatically.
Example:
```yaml
crawlers:
  confluence:
    module: crawler.confluence.main
    parameters:
      space_key: DOCS
      credentials:
        server: https://mycompany.atlassian.net
        basic_username: user@example.com
        basic_password: your-api-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| space_key | Required | Confluence space key to crawl |
| credentials | Required | Authentication credentials (see below) |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `server`: Confluence instance URL (e.g., https://mycompany.atlassian.net)
- `basic_username`: Email address of the Confluence user or service account
- `basic_password`: API token generated from Atlassian account settings
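With these credentials, the connection described above can be sketched as follows. This assumes a Confluence Cloud instance (where the REST API lives under `/wiki/rest/api`) and the `requests` library; the connector's actual implementation may differ.

```python
import requests

server = "https://mycompany.atlassian.net"
auth = ("user@example.com", "your-api-token")  # HTTP Basic: email + API token

start, limit = 0, 25
while True:
    resp = requests.get(
        f"{server}/wiki/rest/api/content",
        params={"spaceKey": "DOCS", "expand": "body.storage", "start": start, "limit": limit},
        auth=auth,
    )
    resp.raise_for_status()
    results = resp.json()["results"]
    for page in results:
        # Page HTML lives in page["body"]["storage"]["value"];
        # the connector converts it to Markdown before indexing.
        print(page["title"])
    if len(results) < limit:  # last page reached
        break
    start += limit
```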
Setting up access:
SerenityGPT connects to Confluence as a specific user account. That account determines which spaces and pages the connector can index. To follow the principle of least privilege, create a dedicated account and grant it read-only access to only the spaces you want indexed.
Creating a service account (recommended):
Atlassian provides service accounts that are not tied to a person and do not consume a Confluence license seat. Every organization gets 5 free service accounts (more with Atlassian Guard Standard).
- Go to admin.atlassian.com and select your organization.
- Select Directory > Service accounts > Create a service account.
- Enter a descriptive name (e.g., `serenity-crawler`) and grant the account access to Confluence.
If service accounts are not available, create a dedicated regular Atlassian user account instead. This consumes a Confluence license seat.
Generating an API token:
- Service accounts: In admin.atlassian.com > Directory > Service accounts, select the account, then Create credentials > API token. Select read-only scopes such as `read:confluence-content.all` and `read:confluence-space.summary`.
- Regular users: Go to id.atlassian.com/manage-profile/security/api-tokens and select Create API token.

API tokens have a maximum lifetime of one year. Plan for periodic rotation, and store the token securely; it cannot be recovered after creation.
Restricting access to specific spaces:
The Confluence REST API respects the same permissions as the web interface. If the account cannot view a space in the browser, it cannot access that space through the API.
- Create a dedicated Confluence group (e.g., `serenity-api-readers`) and add the service account to it.
- For each space the connector should index, go to Space settings > Space access, add the group, and grant only the View permission.
- Verify the account is not a member of broad-access groups (such as the default `confluence-users` group) that would grant unintended access to other spaces.
Note
Confluence permissions are additive. The account receives the union of all permissions from all its groups. The only way to prevent access to a space is to ensure the account has no path to the View permission on that space.
Document360
Indexes content from Document360 knowledge bases.
Example:
```yaml
crawlers:
  knowledge_base:
    module: crawler.document360.main.run
    parameters:
      credentials:
        api_token: your-api-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| base_url | "https://apihub.document360.io/v2" | Document360 API endpoint |
| project_version_id | None | Specific project version (None = all versions) |
| credentials | {} | API authentication (api_token required) |
| metadata | {} | Additional metadata for all documents |
Excel Files
Extracts content from Excel spreadsheets.
Example:
```yaml
crawlers:
  excel_data:
    module: crawler.excel.main.run
    parameters:
      path: /data/spreadsheets
      title_column: A
      text_column: B
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| path | Required | Path to Excel files directory |
| title_column | Required | Column for document titles (name or letter) |
| text_column | Required | Column for document content (name or letter) |
| extensions | ['.xlsx', '.xls'] | Excel file extensions to process |
| skip_empty_text | true | Skip rows with empty text column |
| metadata_columns | None | Additional columns to include as metadata |
Helpdesk
Indexes tickets and articles from helpdesk systems.
Example:
```yaml
crawlers:
  support_tickets:
    module: crawler.helpdesk.main.run
    parameters:
      credentials:
        base_url: https://support.example.com
        token: your-api-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | {} | Authentication (base_url and token required) |
| max_tickets | None | Maximum number of tickets to fetch |
| include_articles | true | Include ticket articles/comments |
| proxies | None | Proxy configuration (e.g., {"https": "socks5://localhost:1337"}) |
| query_params | None | Additional API query parameters |
| metadata | {} | Additional metadata for all documents |
Jira
Indexes issues from Atlassian Jira projects.
Example:
```yaml
crawlers:
  jira_issues:
    module: crawler.jira.main.run
    parameters:
      projects: [PROJ1, PROJ2]
      credentials:
        server: https://mycompany.atlassian.net
        bearer_token: your-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| projects | [] | List of project keys (empty = all projects) |
| credentials | Required | Authentication (see below) |
| lookback_period_days | 1095 (3 years) | How many days to look back for issues |
| max_issues | 10000 | Maximum number of issues to fetch |
Credentials Options:
- Bearer token: {server, bearer_token}
- Basic auth: {server, basic_username, basic_password}
- Optional: proxies for proxy configuration
Local Files
Indexes files from local filesystem directories.
Example:
```yaml
crawlers:
  local_docs:
    module: crawler.files.main.run
    parameters:
      source: docs/
      public_url: https://mycompany.com/docs/
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| source | Required | Path to source directory (relative to files volume) |
| public_url | Required | Base URL for constructing document links |
| sanitize_url | true | Whether to sanitize URLs |
| extensions | ['.html', '.htm', '.pdf'] | File extensions to process |
| ignore_regex | '' | Regex pattern to exclude files/directories |
| text_selectors | ['body'] | CSS selectors for HTML text extraction |
| title_selector | 'title::text' | CSS selector for document title |
| breadcrumb_selector | '' | CSS selector for breadcrumbs |
| encoding | 'utf-8' | Character encoding for HTML files |
| metadata | {} | Additional metadata for all documents |
PDF Directory
Indexes PDF files from a directory.
Example:
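A minimal configuration sketch; the crawler name and module path (`crawler.pdf.main.run`) are hypothetical, following the naming pattern of the other crawlers, and should be confirmed for your deployment:

```yaml
crawlers:
  pdf_docs:
    module: crawler.pdf.main.run
    parameters:
      path: /data/pdfs
```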
Parameters:
| Parameter | Default | Description |
|---|---|---|
| path | Required | Path to directory containing PDF files |
Salesforce
Indexes objects from Salesforce CRM.
Example:
```yaml
crawlers:
  salesforce:
    module: crawler.salesforce.client
    parameters:
      credentials:
        base_url: https://mycompany.my.salesforce.com
        client_id: your-client-id
        client_secret: your-client-secret
        version: v59.0
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | OAuth credentials (see below) |
Credentials:
- base_url: Salesforce instance URL
- client_id: OAuth client ID
- client_secret: OAuth client secret
- version: API version (e.g., v59.0)
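For reference, a minimal sketch of an OAuth 2.0 client-credentials token request against a Salesforce instance. It assumes the connected app has the client credentials flow enabled and uses the `requests` library; it is not necessarily the connector's exact flow.

```python
import requests

base_url = "https://mycompany.my.salesforce.com"

resp = requests.post(
    f"{base_url}/services/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
    },
)
resp.raise_for_status()
token = resp.json()["access_token"]

# The token is then sent as a Bearer header to the REST API, e.g.:
# GET {base_url}/services/data/v59.0/sobjects
```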
SharePoint
Indexes documents and files from SharePoint Online document libraries.
Example:
```yaml
crawlers:
  sharepoint_docs:
    module: crawler.sharepoint.main.run
    parameters:
      site_url: https://mycompany.sharepoint.com/sites/docs
      document_library: Shared Documents
      credentials:
        tenant_id: your-tenant-id
        client_id: your-client-id
        client_secret: your-client-secret
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| site_url | Required | SharePoint site URL |
| document_library | 'Shared Documents' | Name of the document library to crawl |
| folder_path | '' | Subfolder path within the library (empty = root) |
| credentials | Required | Azure AD app credentials (see below) |
| extensions | [] | File extensions to process (empty = all supported) |
| ignore_regex | [] | Regex patterns to exclude files or folders |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `tenant_id`: Azure AD tenant ID
- `client_id`: Azure AD application (client) ID
- `client_secret`: Azure AD client secret
Permissions:
The Azure AD app registration requires the following Microsoft Graph API application permissions:
- `Sites.Read.All`: read items in all site collections
- `Files.Read.All`: read all files that the app has access to
Grant admin consent for these permissions in the Azure portal under App registrations > API permissions. Application permissions allow the crawler to access SharePoint without a signed-in user, so restrict the client secret to authorized personnel only.
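For reference, app-only authentication against Microsoft Graph with these credentials looks like the following sketch, assuming the `msal` library (the `.default` scope resolves to the application permissions granted above); the connector's internals may differ.

```python
import msal

app = msal.ConfidentialClientApplication(
    client_id="your-client-id",
    client_credential="your-client-secret",
    authority="https://login.microsoftonline.com/your-tenant-id",
)
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
access_token = result["access_token"]

# The token authorizes Graph calls such as listing a site's drives:
# GET https://graph.microsoft.com/v1.0/sites/{site-id}/drives
```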
Skilljar
Indexes courses and lesson content from the Skilljar customer education platform. Skilljar is a learning management system (LMS) used to deliver training programs, certifications, and onboarding content. The connector retrieves course catalogs, lesson details, and associated learning materials through the Skilljar REST API.
How it connects: The connector authenticates using an API key generated from the Skilljar dashboard. Skilljar uses HTTP Basic authentication where the API key is passed as the username with no password. The connector calls the Skilljar API at api.skilljar.com to list published courses, retrieve lesson content, and extract associated metadata.
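A minimal sketch of this authentication scheme with the `requests` library; the `/v1/courses` path is illustrative, so consult the Skilljar API reference for exact endpoints.

```python
import requests

# HTTP Basic auth: the API key is the username, the password is empty
resp = requests.get(
    "https://api.skilljar.com/v1/courses",
    auth=("your-api-key", ""),
)
resp.raise_for_status()
print(resp.json())
```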
Example:
```yaml
crawlers:
  skilljar_courses:
    module: crawler.skilljar.main.run
    parameters:
      credentials:
        api_key: your-api-key
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (api_key required) |
| domain | None | Skilljar domain to crawl (None = all domains) |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `api_key`: Skilljar API key, generated from Organization Settings > API Credentials in the Skilljar dashboard
Note
Skilljar offers read-only and standard API keys. A read-only key is sufficient for the connector. The API enforces a rate limit of 5,000 requests per hour per organization.
Slack
Indexes conversations from Slack workspaces.
Example:
```yaml
crawlers:
  slack_history:
    module: crawler.slack.main.run
    parameters:
      workspace_url: https://mycompany.slack.com
      bot_token: xoxb-your-bot-token
      channels: [general, support]
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| workspace_url | Required | Slack workspace URL |
| bot_token | Required | Slack bot token (xoxb-...) |
| channels | [] | Specific channels to crawl (empty = all accessible) |
| exclude_channels | [] | Channels to exclude from crawling |
| grouping_mode | 'hybrid' | Message grouping: 'thread', 'time', or 'hybrid' |
| inactivity_threshold_hours | 24 | Hours before creating new document (time-based) |
| max_messages_per_document | 200 | Maximum messages per document |
| lookback_days | None | Days to look back (None = all history) |
| exclude_archived | true | Skip archived channels |
| page_size | 200 | Messages per API call (max 1000) |
| metadata | {} | Additional metadata for all documents |
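To make the grouping parameters concrete, here is a simplified, hypothetical sketch of time-based splitting: a new document starts when the gap between messages exceeds the inactivity threshold or the per-document cap is reached. The connector's real 'hybrid' mode additionally respects thread boundaries; this is an illustration, not its source.

```python
from datetime import timedelta

def group_messages(messages, threshold=timedelta(hours=24), max_per_doc=200):
    """messages: list of (timestamp: datetime, text: str), sorted by time."""
    docs, current = [], []
    for ts, text in messages:
        gap_exceeded = current and ts - current[-1][0] > threshold
        if gap_exceeded or len(current) >= max_per_doc:
            docs.append(current)  # close the current document
            current = []
        current.append((ts, text))
    if current:
        docs.append(current)
    return docs
```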
Vanilla Forums
Indexes discussions, comments, and knowledge base articles from Vanilla Forums (Higher Logic Vanilla) community platforms. The connector retrieves community content through the Vanilla API v2.
How it connects: The connector authenticates using a personal access token and calls the Vanilla API v2 endpoints at https://<your-community>/api/v2/. It retrieves discussions, comments, categories, and knowledge base articles. The token is passed in the Authorization: Bearer <token> header with each request.
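A minimal sketch of this request pattern with the `requests` library. The `/api/v2/discussions` endpoint and the `name` field reflect the Vanilla API v2 as documented publicly; treat both as illustrative.

```python
import requests

base_url = "https://community.example.com"
headers = {"Authorization": "Bearer your-personal-access-token"}

resp = requests.get(f"{base_url}/api/v2/discussions", headers=headers)
resp.raise_for_status()
for discussion in resp.json():
    print(discussion["name"])  # discussion title
```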
Example:
```yaml
crawlers:
  vanilla_community:
    module: crawler.vanilla.main.run
    parameters:
      credentials:
        base_url: https://community.example.com
        access_token: your-personal-access-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (base_url and access_token required) |
| categories | [] | Category IDs to crawl (empty = all categories) |
| include_knowledge_base | true | Include knowledge base articles |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `base_url`: URL of your Vanilla Forums community (e.g., https://community.example.com)
- `access_token`: Personal access token, generated from your Vanilla Forums profile under Edit Profile > Access Tokens
Note
The access token inherits the permissions of the user who generated it. Use an administrator account to ensure full access to all community content.
Vimeo
Indexes video content from Vimeo by downloading and transcribing video audio. The connector retrieves video metadata through the Vimeo API and uses speech-to-text transcription to convert spoken content into searchable text.
How it connects: The connector authenticates using a personal access token from the Vimeo Developer portal. It calls the Vimeo API (api.vimeo.com) to list videos from a user account, channel, or folder. For each video, the connector first checks for existing text tracks (captions or subtitles) through the /videos/{video_id}/texttracks endpoint. If captions are available, the connector downloads them directly. If no captions exist, the connector downloads the video audio and runs it through a speech-to-text transcription service to generate a text representation of the content.
Example:
```yaml
crawlers:
  vimeo_videos:
    module: crawler.vimeo.main.run
    parameters:
      credentials:
        access_token: your-personal-access-token
      user_id: your-user-id
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (access_token required) |
| user_id | 'me' | Vimeo user ID or 'me' for the token owner |
| folder_id | None | Specific folder/project to crawl (None = all videos) |
| prefer_captions | true | Use existing captions instead of transcribing when available |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `access_token`: Vimeo personal access token, generated from the Vimeo Developer portal under My Apps > Authentication
Transcription process:
- The connector lists all videos using the Vimeo API (`GET /me/videos` or `GET /users/{user_id}/videos`)
- For each video, it checks for existing text tracks (`GET /videos/{video_id}/texttracks`)
- If captions or subtitles exist, the connector downloads them as WebVTT files using the temporary download link provided in the API response
- If no text tracks are available, the connector downloads the video audio and transcribes it using an integrated speech-to-text service
- The transcribed text is indexed alongside video metadata (title, description, duration, tags)
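A sketch of the caption check described above, assuming the `requests` library. The texttracks response carries a temporary `link` for each track that can be downloaded without further authentication; the video ID is illustrative.

```python
import requests

headers = {"Authorization": "Bearer your-personal-access-token"}
video_id = "123456789"  # illustrative

resp = requests.get(
    f"https://api.vimeo.com/videos/{video_id}/texttracks",
    headers=headers,
)
resp.raise_for_status()
tracks = resp.json().get("data", [])

if tracks:
    # Download the first caption track as WebVTT
    vtt = requests.get(tracks[0]["link"]).text
else:
    pass  # fall back to downloading audio and transcribing
```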
Note
The personal access token must be generated by the account owner of the videos. Ensure the token has the `private` and `video_files` scopes enabled to allow access to video content and text tracks. Auto-generated captions are available on Vimeo Plus plans and above.
Web Crawler
Crawls web pages starting from specified URLs and follows links recursively.
Example:
```yaml
crawlers:
  docs_site:
    module: crawler.web.main.run
    parameters:
      start_urls:
        - https://docs.example.com
      match_urls:
        - https://docs.example.com/.*
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| start_urls | [] | List of URLs to start crawling from |
| match_urls | [] | Regular expressions to match URLs that should be crawled |
| sitemap_urls | [] | List of sitemap URLs to parse for additional pages |
| match_content_types | ['text/html', 'application/pdf'] | Content types to process |
| text_selectors | [] | CSS selectors to extract text (first match used) |
| breadcrumb_selector | '' | CSS selector for breadcrumb navigation |
| title_selector | 'title::text' | CSS selector for document title |
| metadata | {} | Additional metadata for all documents |
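A hypothetical illustration of how `match_urls` constrains the crawl: a discovered link is followed only if it matches one of the patterns. This mirrors the parameter's description above, not the crawler's source.

```python
import re

match_urls = [r"https://docs\.example\.com/.*"]

def should_crawl(url: str) -> bool:
    # Follow a link only when some pattern matches it
    return any(re.match(pattern, url) for pattern in match_urls)

print(should_crawl("https://docs.example.com/guide"))  # True
print(should_crawl("https://blog.example.com/post"))   # False
```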
YouTube
Indexes video content from YouTube channels or playlists by downloading and transcribing video audio. The connector retrieves video metadata through the YouTube Data API v3 and converts spoken content into searchable text.
How it connects: The connector uses the YouTube Data API v3 to list videos from a channel or playlist. An API key is sufficient for listing public video metadata. For each video, the connector retrieves available caption tracks. If auto-generated or manually uploaded captions exist, the connector extracts the transcript text. If captions are not available, the connector downloads the video audio and runs it through a speech-to-text transcription service.
Example:
```yaml
crawlers:
  youtube_videos:
    module: crawler.youtube.main.run
    parameters:
      credentials:
        api_key: your-youtube-api-key
      channel_id: UCxxxxxxxxxxxxxxxx
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (see below) |
| channel_id | None | YouTube channel ID to crawl |
| playlist_id | None | YouTube playlist ID to crawl |
| prefer_captions | true | Use existing captions instead of transcribing when available |
| max_videos | None | Maximum number of videos to process |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `api_key`: YouTube Data API v3 key, generated from the Google Cloud Console under APIs & Services > Credentials
- `oauth_client_id` (optional): Required only for downloading caption tracks from the Captions API
- `oauth_client_secret` (optional): Required alongside `oauth_client_id`
Transcription process:
- The connector retrieves the channel's uploads playlist using the Channels API, then lists all videos through the PlaylistItems API (`playlistItems.list`). Alternatively, it lists videos from a specific playlist directly.
- For each video, the connector retrieves available caption tracks and extracts the transcript text
- If auto-generated or manual captions are available, the connector downloads the caption content in SRT or VTT format
- If captions are not available, the connector downloads the video audio and transcribes it using an integrated speech-to-text service
- The transcribed or captioned text is indexed alongside video metadata (title, description, channel, publish date)
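A sketch of the listing flow above using plain HTTP calls to the YouTube Data API v3 with the `requests` library (the google-api-python-client library wraps the same endpoints); the connector's internals may differ.

```python
import requests

API = "https://www.googleapis.com/youtube/v3"
api_key = "your-youtube-api-key"
channel_id = "UCxxxxxxxxxxxxxxxx"

# 1. Resolve the channel's uploads playlist (costs 1 quota unit).
ch = requests.get(f"{API}/channels", params={
    "part": "contentDetails", "id": channel_id, "key": api_key,
}).json()
uploads = ch["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

# 2. Page through the playlist items (1 unit per request).
params = {"part": "snippet", "playlistId": uploads, "maxResults": 50, "key": api_key}
while True:
    page = requests.get(f"{API}/playlistItems", params=params).json()
    for item in page["items"]:
        print(item["snippet"]["title"])
    if "nextPageToken" not in page:
        break
    params["pageToken"] = page["nextPageToken"]
```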
Note
The YouTube Data API v3 must be enabled in your Google Cloud project. The API enforces a default quota of 10,000 units per day. Listing videos costs 1 unit per request, while search operations cost 100 units each.
YouTrack
Indexes issues from JetBrains YouTrack.
Example:
```yaml
crawlers:
  youtrack:
    module: crawler.youtrack.main.run
    parameters:
      credentials:
        base_url: https://youtrack.example.com
        token: your-api-token
      projects: [PROJ1, PROJ2]
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | {} | Authentication (base_url and token required) |
| projects | None | Project IDs to crawl (None = all projects) |
| max_issues_per_project | None | Limit issues per project |
| max_comments_per_issues | None | Limit comments per issue |
| proxies | None | Proxy configuration |
| metadata | {} | Additional metadata for all documents |
Zoomin
Indexes technical documentation from the Zoomin knowledge delivery platform. Zoomin aggregates content from multiple authoring tools and content management systems into a unified, searchable documentation portal. The connector uses the Zoomin API to retrieve published content.
How it connects: The connector authenticates against the Zoomin API using an API token provided by your Zoomin account team. It calls the search and content retrieval endpoints to enumerate and download documents, topics, and knowledge articles from your Zoomin portal. The API host follows the pattern api.<your-portal>.zoominsoftware.com.
Example:
```yaml
crawlers:
  zoomin_docs:
    module: crawler.zoomin.main.run
    parameters:
      credentials:
        base_url: https://api.docs.example.zoominsoftware.com
        api_token: your-api-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (base_url and api_token required) |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `base_url`: Zoomin API base URL (provided by your Zoomin account team)
- `api_token`: API token for authentication (passed as `Authorization: Bearer <token>`)
Note
Zoomin API credentials and endpoint details are provisioned per customer. Contact your Zoomin account representative to obtain the API base URL and authentication token for your portal.
Custom Data Sources
To create a custom data source, implement a run function in a Python module and reference it in the configuration:
```python
# my_custom_crawler.py
from crawler.struct import RunCrawlerConfig, ParsedDocument

def run(conf: RunCrawlerConfig):
    # Your crawler implementation
    for doc in your_data_source:
        yield ParsedDocument(
            title=doc.title,
            text=doc.content,
            url=doc.url,
            breadcrumbs=doc.path,
            metadata=doc.metadata,
        )
```
```yaml
# config.yaml
crawlers:
  custom_source:
    module: my_custom_crawler.run
    parameters:
      # Your custom parameters
```
For more examples, see: SerenityGPT-examples
Do not see your data source?
There are more data sources available beyond those documented here. Please contact us for documentation and enablement of additional integrations tailored to your specific needs.