Data Sources Configuration
This page documents all supported data sources and their configuration parameters. Each data source can be configured in the config.yaml file under the crawlers section of a tenant.
Adobe Marketo
Indexes marketing content assets from Adobe Marketo Engage, including emails, landing pages, snippets, and forms. The connector retrieves assets through the Marketo REST API.
How it connects: The connector authenticates using the OAuth 2.0 client credentials grant. It first requests an access token from the Marketo identity endpoint using the client ID and client secret, then uses that token to call the Marketo Asset API to retrieve marketing content. Access tokens expire after one hour and are refreshed automatically.
Example:
```yaml
crawlers:
  marketo_assets:
    module: crawler.marketo.main.run
    parameters:
      credentials:
        base_url: https://123-ABC-456.mktorest.com
        identity_url: https://123-ABC-456.mktorest.com/identity
        client_id: your-client-id
        client_secret: your-client-secret
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | OAuth credentials (see below) |
| asset_types | ['emails', 'landing_pages'] | Types of assets to retrieve |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `base_url`: Marketo REST API base URL (found in Admin > Integration > Web Services)
- `identity_url`: Marketo identity endpoint for token retrieval
- `client_id`: OAuth client ID (found in Admin > Integration > LaunchPoint > View Details)
- `client_secret`: OAuth client secret (found alongside client ID)
Authentication flow:
- The connector requests an access token: `GET <identity_url>/oauth/token?grant_type=client_credentials&client_id=<id>&client_secret=<secret>`
- The token is included in subsequent API calls as `Authorization: Bearer <access_token>`
- Tokens expire after 3,600 seconds (one hour) and are renewed automatically
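As an illustration, the token request above reduces to a few lines of Python. This is a minimal sketch using the `requests` library; the connector additionally handles caching and automatic renewal.

```python
import requests

# Identity endpoint from the credentials block above
identity_url = "https://123-ABC-456.mktorest.com/identity"

resp = requests.get(
    f"{identity_url}/oauth/token",
    params={
        "grant_type": "client_credentials",
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
    },
)
resp.raise_for_status()
token = resp.json()["access_token"]  # valid for 3,600 seconds

# Subsequent Asset API calls carry the token as a Bearer header
headers = {"Authorization": f"Bearer {token}"}
```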
Note
The LaunchPoint custom service must be created with an API-only user that has the appropriate asset access permissions. See the Marketo REST API documentation for setup instructions.
Azure Files
Syncs and indexes files from Azure Blob Storage.
Example:
```yaml
crawlers:
  azure_docs:
    module: crawler.azurefiles.main.run
    parameters:
      source: https://myaccount.blob.core.windows.net/container
      public_url: https://docs.mycompany.com/
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| source | Required | Azure Blob Storage URL |
| public_url | Required | Base URL for public document links |
| ignore_regex | [] | List of regex patterns to ignore files |
| sync_ignore_regex | [] | Regex patterns for sync exclusion (defaults to ignore_regex) |
| extensions | [] | File extensions to process (empty = all) |
| target | ENV.FILES_PATH / 'azuresync' | Local sync target directory |
| text_selectors | [] | CSS selectors for text extraction |
| title_selector | 'title::text' | CSS selector for document title |
| breadcrumb_selector | '' | CSS selector for breadcrumbs |
| metadata | {'document_type': 'DOC'} | Additional metadata |
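The selector parameters (`text_selectors`, `title_selector`, `breadcrumb_selector`) use Scrapy-style CSS selectors, where the `::text` pseudo-element extracts text content. A minimal illustration, assuming the `parsel` library that implements this syntax:

```python
from parsel import Selector

html = "<html><head><title>Quarterly Report</title></head><body><p>Hello</p></body></html>"
sel = Selector(text=html)

# 'title::text' matches the text node inside <title>
print(sel.css("title::text").get())  # -> "Quarterly Report"
```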
Confluence
Indexes content from Atlassian Confluence spaces.
How it connects: The connector authenticates using HTTP Basic Authentication with a username and API token, then paginates through all pages in the specified space. Page content is converted from Confluence HTML to Markdown, with Confluence-specific macros stripped automatically.
Example:
```yaml
crawlers:
  confluence:
    module: crawler.confluence.main
    parameters:
      space_key: DOCS
      credentials:
        server: https://mycompany.atlassian.net
        basic_username: user@example.com
        basic_password: your-api-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| space_key | Required | Confluence space key to crawl |
| credentials | Required | Authentication credentials (see below) |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `server`: Confluence instance URL (e.g., https://mycompany.atlassian.net)
- `basic_username`: Email address of the Confluence user or service account
- `basic_password`: API token generated from Atlassian account settings
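With these credentials, the connection described above can be sketched as follows. This assumes a Confluence Cloud instance (where the REST API lives under `/wiki/rest/api`) and the `requests` library; the connector's actual implementation may differ.

```python
import requests

server = "https://mycompany.atlassian.net"
auth = ("user@example.com", "your-api-token")  # HTTP Basic: email + API token

start, limit = 0, 25
while True:
    resp = requests.get(
        f"{server}/wiki/rest/api/content",
        params={"spaceKey": "DOCS", "expand": "body.storage", "start": start, "limit": limit},
        auth=auth,
    )
    resp.raise_for_status()
    results = resp.json()["results"]
    for page in results:
        # Page HTML lives in page["body"]["storage"]["value"];
        # the connector converts it to Markdown before indexing.
        print(page["title"])
    if len(results) < limit:  # last page reached
        break
    start += limit
```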
Setting up access:
SerenityGPT connects to Confluence as a specific user account. That account determines which spaces and pages the connector can index. To follow the principle of least privilege, create a dedicated account and grant it read-only access to only the spaces you want indexed.
Creating a service account (recommended):
Atlassian provides service accounts that are not tied to a person and do not consume a Confluence license seat. Every organization gets 5 free service accounts (more with Atlassian Guard Standard).
- Go to admin.atlassian.com and select your organization.
- Select Directory > Service accounts > Create a service account.
- Enter a descriptive name (e.g., `serenity-crawler`) and grant the account access to Confluence.
If service accounts are not available, create a dedicated regular Atlassian user account instead. This consumes a Confluence license seat.
Generating an API token:
- Service accounts: In admin.atlassian.com > Directory > Service accounts, select the account, then Create credentials > API token. Select read-only scopes such as `read:confluence-content.all` and `read:confluence-space.summary`.
- Regular users: Go to id.atlassian.com/manage-profile/security/api-tokens and select Create API token.

API tokens have a maximum lifetime of one year. Plan for periodic rotation, and store the token securely; it cannot be recovered after creation.
Restricting access to specific spaces:
The Confluence REST API respects the same permissions as the web interface. If the account cannot view a space in the browser, it cannot access that space through the API.
- Create a dedicated Confluence group (e.g., `serenity-api-readers`) and add the service account to it.
- For each space the connector should index, go to Space settings > Space access, add the group, and grant only the View permission.
- Verify the account is not a member of broad-access groups (such as the default `confluence-users` group) that would grant unintended access to other spaces.
Note
Confluence permissions are additive. The account receives the union of all permissions from all its groups. The only way to prevent access to a space is to ensure the account has no path to the View permission on that space.
Document360
Indexes content from Document360 knowledge bases.
Example:
```yaml
crawlers:
  knowledge_base:
    module: crawler.document360.main.run
    parameters:
      credentials:
        api_token: your-api-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| base_url | "https://apihub.document360.io/v2" | Document360 API endpoint |
| project_version_id | None | Specific project version (None = all versions) |
| credentials | {} | API authentication (api_token required) |
| metadata | {} | Additional metadata for all documents |
Excel Files
Extracts content from Excel spreadsheets.
Example:
```yaml
crawlers:
  excel_data:
    module: crawler.excel.main.run
    parameters:
      path: /data/spreadsheets
      title_column: A
      text_column: B
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| path | Required | Path to Excel files directory |
| title_column | Required | Column for document titles (name or letter) |
| text_column | Required | Column for document content (name or letter) |
| extensions | ['.xlsx', '.xls'] | Excel file extensions to process |
| skip_empty_text | true | Skip rows with empty text column |
| metadata_columns | None | Additional columns to include as metadata |
Helpdesk
Indexes tickets and articles from helpdesk systems.
Example:
```yaml
crawlers:
  support_tickets:
    module: crawler.helpdesk.main.run
    parameters:
      credentials:
        base_url: https://support.example.com
        token: your-api-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | {} | Authentication (base_url and token required) |
| max_tickets | None | Maximum number of tickets to fetch |
| include_articles | true | Include ticket articles/comments |
| proxies | None | Proxy configuration (e.g., {"https": "socks5://localhost:1337"}) |
| query_params | None | Additional API query parameters |
| metadata | {} | Additional metadata for all documents |
Jira
Indexes issues from Atlassian Jira projects.
Example:
```yaml
crawlers:
  jira_issues:
    module: crawler.jira.main.run
    parameters:
      projects: [PROJ1, PROJ2]
      credentials:
        server: https://mycompany.atlassian.net
        bearer_token: your-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| projects | [] | List of project keys (empty = all projects) |
| credentials | Required | Authentication (see below) |
| lookback_period_days | 1095 (3 years) | How many days to look back for issues |
| max_issues | 10000 | Maximum number of issues to fetch |
Credentials Options:
- Bearer token: {server, bearer_token}
- Basic auth: {server, basic_username, basic_password}
- Optional: proxies for proxy configuration
Local Files
Indexes files from local filesystem directories.
Example:
```yaml
crawlers:
  local_docs:
    module: crawler.files.main.run
    parameters:
      source: docs/
      public_url: https://mycompany.com/docs/
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| source | Required | Path to source directory (relative to files volume) |
| public_url | Required | Base URL for constructing document links |
| sanitize_url | true | Whether to sanitize URLs |
| extensions | ['.html', '.htm', '.pdf'] | File extensions to process |
| ignore_regex | '' | Regex pattern to exclude files/directories |
| text_selectors | ['body'] | CSS selectors for HTML text extraction |
| title_selector | 'title::text' | CSS selector for document title |
| breadcrumb_selector | '' | CSS selector for breadcrumbs |
| encoding | 'utf-8' | Character encoding for HTML files |
| metadata | {} | Additional metadata for all documents |
PDF Directory
Indexes PDF files from a directory.
Example:
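A minimal configuration sketch; the crawler name and module path (`crawler.pdf.main.run`) are hypothetical, following the naming pattern of the other crawlers, and should be confirmed for your deployment:

```yaml
crawlers:
  pdf_docs:
    module: crawler.pdf.main.run
    parameters:
      path: /data/pdfs
```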
Parameters:
| Parameter | Default | Description |
|---|---|---|
| path | Required | Path to directory containing PDF files |
Salesforce
Indexes objects from Salesforce CRM.
Example:
```yaml
crawlers:
  salesforce:
    module: crawler.salesforce.client
    parameters:
      credentials:
        base_url: https://mycompany.my.salesforce.com
        client_id: your-client-id
        client_secret: your-client-secret
        version: v59.0
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | OAuth credentials (see below) |
Credentials:
- base_url: Salesforce instance URL
- client_id: OAuth client ID
- client_secret: OAuth client secret
- version: API version (e.g., v59.0)
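For reference, a minimal sketch of an OAuth 2.0 client-credentials token request against a Salesforce instance. It assumes the connected app has the client credentials flow enabled and uses the `requests` library; it is not necessarily the connector's exact flow.

```python
import requests

base_url = "https://mycompany.my.salesforce.com"

resp = requests.post(
    f"{base_url}/services/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
    },
)
resp.raise_for_status()
token = resp.json()["access_token"]

# The token is then sent as a Bearer header to the REST API, e.g.:
# GET {base_url}/services/data/v59.0/sobjects
```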
SharePoint
Indexes documents and files from SharePoint Online document libraries.
Example:
```yaml
crawlers:
  sharepoint_docs:
    module: crawler.sharepoint.main.run
    parameters:
      site_url: https://mycompany.sharepoint.com/sites/docs
      document_library: Shared Documents
      credentials:
        tenant_id: your-tenant-id
        client_id: your-client-id
        client_secret: your-client-secret
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| site_url | Required | SharePoint site URL |
| document_library | 'Shared Documents' | Name of the document library to crawl |
| folder_path | '' | Subfolder path within the library (empty = root) |
| credentials | Required | Azure AD app credentials (see below) |
| extensions | [] | File extensions to process (empty = all supported) |
| ignore_regex | [] | Regex patterns to exclude files or folders |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `tenant_id`: Azure AD tenant ID
- `client_id`: Azure AD application (client) ID
- `client_secret`: Azure AD client secret
Permissions:
The Azure AD app registration requires the following Microsoft Graph API application permissions:
- `Sites.Read.All`: read items in all site collections
- `Files.Read.All`: read all files that the app has access to
Grant admin consent for these permissions in the Azure portal under App registrations > API permissions. Application permissions allow the crawler to access SharePoint without a signed-in user, so restrict the client secret to authorized personnel only.
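For reference, app-only authentication against Microsoft Graph with these credentials looks like the following sketch, assuming the `msal` library (the `.default` scope resolves to the application permissions granted above); the connector's internals may differ.

```python
import msal

app = msal.ConfidentialClientApplication(
    client_id="your-client-id",
    client_credential="your-client-secret",
    authority="https://login.microsoftonline.com/your-tenant-id",
)
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
access_token = result["access_token"]

# The token authorizes Graph calls such as listing a site's drives:
# GET https://graph.microsoft.com/v1.0/sites/{site-id}/drives
```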
Skilljar
Indexes courses and lesson content from the Skilljar customer education platform. Skilljar is a learning management system (LMS) used to deliver training programs, certifications, and onboarding content. The connector retrieves course catalogs, lesson details, and associated learning materials through the Skilljar REST API.
How it connects: The connector authenticates using an API key generated from the Skilljar dashboard. Skilljar uses HTTP Basic authentication where the API key is passed as the username with no password. The connector calls the Skilljar API at api.skilljar.com to list published courses, retrieve lesson content, and extract associated metadata.
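A minimal sketch of this authentication scheme with the `requests` library; the `/v1/courses` path is illustrative, so consult the Skilljar API reference for exact endpoints.

```python
import requests

# HTTP Basic auth: the API key is the username, the password is empty
resp = requests.get(
    "https://api.skilljar.com/v1/courses",
    auth=("your-api-key", ""),
)
resp.raise_for_status()
print(resp.json())
```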
Example:
```yaml
crawlers:
  skilljar_courses:
    module: crawler.skilljar.main.run
    parameters:
      credentials:
        api_key: your-api-key
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (api_key required) |
| domain | None | Skilljar domain to crawl (None = all domains) |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `api_key`: Skilljar API key, generated from Organization Settings > API Credentials in the Skilljar dashboard
Note
Skilljar offers read-only and standard API keys. A read-only key is sufficient for the connector. The API enforces a rate limit of 5,000 requests per hour per organization.
Slack
Indexes conversations from Slack workspaces.
Example:
```yaml
crawlers:
  slack_history:
    module: crawler.slack.main.run
    parameters:
      workspace_url: https://mycompany.slack.com
      bot_token: xoxb-your-bot-token
      channels: [general, support]
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| workspace_url | Required | Slack workspace URL |
| bot_token | Required | Slack bot token (xoxb-...) |
| channels | [] | Specific channels to crawl (empty = all accessible) |
| exclude_channels | [] | Channels to exclude from crawling |
| grouping_mode | 'hybrid' | Message grouping: 'thread', 'time', or 'hybrid' |
| inactivity_threshold_hours | 24 | Hours before creating new document (time-based) |
| max_messages_per_document | 200 | Maximum messages per document |
| lookback_days | None | Days to look back (None = all history) |
| exclude_archived | true | Skip archived channels |
| page_size | 200 | Messages per API call (max 1000) |
| metadata | {} | Additional metadata for all documents |
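To make the grouping parameters concrete, here is a simplified, hypothetical sketch of time-based splitting: a new document starts when the gap between messages exceeds the inactivity threshold or the per-document cap is reached. The connector's real 'hybrid' mode additionally respects thread boundaries; this is an illustration, not its source.

```python
from datetime import timedelta

def group_messages(messages, threshold=timedelta(hours=24), max_per_doc=200):
    """messages: list of (timestamp: datetime, text: str), sorted by time."""
    docs, current = [], []
    for ts, text in messages:
        gap_exceeded = current and ts - current[-1][0] > threshold
        if gap_exceeded or len(current) >= max_per_doc:
            docs.append(current)  # close the current document
            current = []
        current.append((ts, text))
    if current:
        docs.append(current)
    return docs
```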
Vanilla Forums
Indexes discussions, comments, and knowledge base articles from Vanilla Forums (Higher Logic Vanilla) community platforms. The connector retrieves community content through the Vanilla API v2.
How it connects: The connector authenticates using a personal access token and calls the Vanilla API v2 endpoints at https://<your-community>/api/v2/. It retrieves discussions, comments, categories, and knowledge base articles. The token is passed in the Authorization: Bearer <token> header with each request.
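A minimal sketch of this request pattern with the `requests` library. The `/api/v2/discussions` endpoint and the `name` field reflect the Vanilla API v2 as documented publicly; treat both as illustrative.

```python
import requests

base_url = "https://community.example.com"
headers = {"Authorization": "Bearer your-personal-access-token"}

resp = requests.get(f"{base_url}/api/v2/discussions", headers=headers)
resp.raise_for_status()
for discussion in resp.json():
    print(discussion["name"])  # discussion title
```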
Example:
```yaml
crawlers:
  vanilla_community:
    module: crawler.vanilla.main.run
    parameters:
      credentials:
        base_url: https://community.example.com
        access_token: your-personal-access-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (base_url and access_token required) |
| categories | [] | Category IDs to crawl (empty = all categories) |
| include_knowledge_base | true | Include knowledge base articles |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `base_url`: URL of your Vanilla Forums community (e.g., https://community.example.com)
- `access_token`: Personal access token, generated from your Vanilla Forums profile under Edit Profile > Access Tokens
Note
The access token inherits the permissions of the user who generated it. Use an administrator account to ensure full access to all community content.
Vimeo
Indexes video content from Vimeo by downloading and transcribing video audio. The connector retrieves video metadata through the Vimeo API and uses speech-to-text transcription to convert spoken content into searchable text.
How it connects: The connector authenticates using a personal access token from the Vimeo Developer portal. It calls the Vimeo API (api.vimeo.com) to list videos from a user account, channel, or folder. For each video, the connector first checks for existing text tracks (captions or subtitles) through the /videos/{video_id}/texttracks endpoint. If captions are available, the connector downloads them directly. If no captions exist, the connector downloads the video audio and runs it through a speech-to-text transcription service to generate a text representation of the content.
Example:
```yaml
crawlers:
  vimeo_videos:
    module: crawler.vimeo.main.run
    parameters:
      credentials:
        access_token: your-personal-access-token
      user_id: your-user-id
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (access_token required) |
| user_id | 'me' | Vimeo user ID or 'me' for the token owner |
| folder_id | None | Specific folder/project to crawl (None = all videos) |
| prefer_captions | true | Use existing captions instead of transcribing when available |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `access_token`: Vimeo personal access token, generated from the Vimeo Developer portal under My Apps > Authentication
Transcription process:
- The connector lists all videos using the Vimeo API (`GET /me/videos` or `GET /users/{user_id}/videos`)
- For each video, it checks for existing text tracks (`GET /videos/{video_id}/texttracks`)
- If captions or subtitles exist, the connector downloads them as WebVTT files using the temporary download link provided in the API response
- If no text tracks are available, the connector downloads the video audio and transcribes it using an integrated speech-to-text service
- The transcribed text is indexed alongside video metadata (title, description, duration, tags)
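A sketch of the caption check described above, assuming the `requests` library. The texttracks response carries a temporary `link` for each track that can be downloaded without further authentication; the video ID is illustrative.

```python
import requests

headers = {"Authorization": "Bearer your-personal-access-token"}
video_id = "123456789"  # illustrative

resp = requests.get(
    f"https://api.vimeo.com/videos/{video_id}/texttracks",
    headers=headers,
)
resp.raise_for_status()
tracks = resp.json().get("data", [])

if tracks:
    # Download the first caption track as WebVTT
    vtt = requests.get(tracks[0]["link"]).text
else:
    pass  # fall back to downloading audio and transcribing
```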
Note
The personal access token must be generated by the account owner of the videos. Ensure the token has the `private` and `video_files` scopes enabled to allow access to video content and text tracks. Auto-generated captions are available on Vimeo Plus plans and above.
Web Crawler
Crawls web pages starting from specified URLs and follows links recursively.
Example:
```yaml
crawlers:
  docs_site:
    module: crawler.web.main.run
    parameters:
      start_urls:
        - https://docs.example.com
      match_urls:
        - https://docs.example.com/.*
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| start_urls | [] | List of URLs to start crawling from |
| match_urls | [] | Regular expressions to match URLs that should be crawled |
| sitemap_urls | [] | List of sitemap URLs to parse for additional pages |
| match_content_types | ['text/html', 'application/pdf'] | Content types to process |
| text_selectors | [] | CSS selectors to extract text (first match used) |
| breadcrumb_selector | '' | CSS selector for breadcrumb navigation |
| title_selector | 'title::text' | CSS selector for document title |
| metadata | {} | Additional metadata for all documents |
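A hypothetical illustration of how `match_urls` constrains the crawl: a discovered link is followed only if it matches one of the patterns. This mirrors the parameter's description above, not the crawler's source.

```python
import re

match_urls = [r"https://docs\.example\.com/.*"]

def should_crawl(url: str) -> bool:
    # Follow a link only when some pattern matches it
    return any(re.match(pattern, url) for pattern in match_urls)

print(should_crawl("https://docs.example.com/guide"))  # True
print(should_crawl("https://blog.example.com/post"))   # False
```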
YouTube
Indexes video content from YouTube channels or playlists by downloading and transcribing video audio. The connector retrieves video metadata through the YouTube Data API v3 and converts spoken content into searchable text.
How it connects: The connector uses the YouTube Data API v3 to list videos from a channel or playlist. An API key is sufficient for listing public video metadata. For each video, the connector retrieves available caption tracks. If auto-generated or manually uploaded captions exist, the connector extracts the transcript text. If captions are not available, the connector downloads the video audio and runs it through a speech-to-text transcription service.
Example:
```yaml
crawlers:
  youtube_videos:
    module: crawler.youtube.main.run
    parameters:
      credentials:
        api_key: your-youtube-api-key
      channel_id: UCxxxxxxxxxxxxxxxx
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (see below) |
| channel_id | None | YouTube channel ID to crawl |
| playlist_id | None | YouTube playlist ID to crawl |
| prefer_captions | true | Use existing captions instead of transcribing when available |
| max_videos | None | Maximum number of videos to process |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `api_key`: YouTube Data API v3 key, generated from the Google Cloud Console under APIs & Services > Credentials
- `oauth_client_id` (optional): Required only for downloading caption tracks from the Captions API
- `oauth_client_secret` (optional): Required alongside `oauth_client_id`
Transcription process:
- The connector retrieves the channel's uploads playlist using the Channels API, then lists all videos through the PlaylistItems API (`playlistItems.list`). Alternatively, it lists videos from a specific playlist directly.
- For each video, the connector retrieves available caption tracks and extracts the transcript text
- If auto-generated or manual captions are available, the connector downloads the caption content in SRT or VTT format
- If captions are not available, the connector downloads the video audio and transcribes it using an integrated speech-to-text service
- The transcribed or captioned text is indexed alongside video metadata (title, description, channel, publish date)
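A sketch of the listing flow above using plain HTTP calls to the YouTube Data API v3 with the `requests` library (the google-api-python-client library wraps the same endpoints); the connector's internals may differ.

```python
import requests

API = "https://www.googleapis.com/youtube/v3"
api_key = "your-youtube-api-key"
channel_id = "UCxxxxxxxxxxxxxxxx"

# 1. Resolve the channel's uploads playlist (costs 1 quota unit).
ch = requests.get(f"{API}/channels", params={
    "part": "contentDetails", "id": channel_id, "key": api_key,
}).json()
uploads = ch["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

# 2. Page through the playlist items (1 unit per request).
params = {"part": "snippet", "playlistId": uploads, "maxResults": 50, "key": api_key}
while True:
    page = requests.get(f"{API}/playlistItems", params=params).json()
    for item in page["items"]:
        print(item["snippet"]["title"])
    if "nextPageToken" not in page:
        break
    params["pageToken"] = page["nextPageToken"]
```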
Note
The YouTube Data API v3 must be enabled in your Google Cloud project. The API enforces a default quota of 10,000 units per day. Listing videos costs 1 unit per request, while search operations cost 100 units each.
YouTrack
Indexes issues from JetBrains YouTrack.
Example:
```yaml
crawlers:
  youtrack:
    module: crawler.youtrack.main.run
    parameters:
      credentials:
        base_url: https://youtrack.example.com
        token: your-api-token
      projects: [PROJ1, PROJ2]
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | {} | Authentication (base_url and token required) |
| projects | None | Project IDs to crawl (None = all projects) |
| max_issues_per_project | None | Limit issues per project |
| max_comments_per_issues | None | Limit comments per issue |
| proxies | None | Proxy configuration |
| metadata | {} | Additional metadata for all documents |
Zoomin
Indexes technical documentation from the Zoomin knowledge delivery platform. Zoomin aggregates content from multiple authoring tools and content management systems into a unified, searchable documentation portal. The connector uses the Zoomin API to retrieve published content.
How it connects: The connector authenticates against the Zoomin API using an API token provided by your Zoomin account team. It calls the search and content retrieval endpoints to enumerate and download documents, topics, and knowledge articles from your Zoomin portal. The API host follows the pattern api.<your-portal>.zoominsoftware.com.
Example:
```yaml
crawlers:
  zoomin_docs:
    module: crawler.zoomin.main.run
    parameters:
      credentials:
        base_url: https://api.docs.example.zoominsoftware.com
        api_token: your-api-token
```
Parameters:
| Parameter | Default | Description |
|---|---|---|
| credentials | Required | Authentication (base_url and api_token required) |
| metadata | {} | Additional metadata for all documents |
Credentials:
- `base_url`: Zoomin API base URL (provided by your Zoomin account team)
- `api_token`: API token for authentication (passed as `Authorization: Bearer <token>`)
Note
Zoomin API credentials and endpoint details are provisioned per customer. Contact your Zoomin account representative to obtain the API base URL and authentication token for your portal.
Custom Data Sources
To create a custom data source, implement a run function in a Python module and reference it in the configuration:
```python
# my_custom_crawler.py
from crawler.struct import RunCrawlerConfig, ParsedDocument

def run(conf: RunCrawlerConfig):
    # Your crawler implementation
    for doc in your_data_source:
        yield ParsedDocument(
            title=doc.title,
            text=doc.content,
            url=doc.url,
            breadcrumbs=doc.path,
            metadata=doc.metadata,
        )
```
```yaml
# config.yaml
crawlers:
  custom_source:
    module: my_custom_crawler.run
    parameters:
      # Your custom parameters
```
For more examples, see: SerenityGPT-examples
Do not see your data source?
There are more data sources available beyond those documented here. Please contact us for documentation and enablement of additional integrations tailored to your specific needs.