Overview
The RSSConnector reads items from an RSS or Atom feed and publishes a Document for each item. It supports recency filtering, incremental refresh, and configurable ID generation.
Class: com.kmwllc.lucille.connector.RSSConnector
Extends: AbstractConnector
Key Features
- Read RSS 2.0 and Atom feeds
- Filter by publication date recency
- Incremental refresh with configurable intervals
- Use GUID or generate UUID for Document IDs
- Extract all standard RSS item fields
- Handle enclosures (media attachments)
- Prevent duplicate processing in refresh mode
Class Signature
Configuration Parameters
Required Parameters
rssURL
URL of the RSS or Atom feed to read. Supports both HTTP and HTTPS.
Example: "https://example.com/feed.xml", "http://news.site.com/rss"
Optional Parameters
useGuidForDocID
Whether to use the RSS item’s GUID as the Document ID.
- true: Use the <guid> element value as the Document ID (falls back to a UUID if missing)
- false: Generate a new UUID for each item
Use true for stable IDs across runs.
pubDateCutoff
Duration string to include only items with a recent publication date. Uses HOCON duration format (e.g., "1h", "2d", "30m"). Only items published within this duration before the current time are included.
Examples:
- "1h": Only items from the last hour
- "24h" or "1d": Only items from the last day
- "7d": Only items from the last week
If an item has no pubDate, it is included regardless of this setting.
runDuration
Total duration to run when using incremental refresh. Must be used with refreshIncrement. Uses HOCON duration format.
Example: "1h" (run for 1 hour), "30m" (run for 30 minutes)
refreshIncrement
Interval between feed refreshes in incremental mode. Must be used with runDuration. Uses HOCON duration format.
Example: "5m" (refresh every 5 minutes), "30s" (refresh every 30 seconds)
Document Fields
Each published Document includes these fields extracted from RSS items:
- Author of the item (from <author> or <dc:creator>).
- List of categories (from <category> elements).
- URL to the comments page (from <comments>).
- Full content of the item (from <content:encoded> or <content>).
- Description or summary (from <description> or <summary>).
- List of enclosures (media attachments). Each enclosure is a JSON object with:
  - type (String): MIME type
  - url (String): URL to the media
  - length (Long, optional): Size in bytes
- Globally unique identifier (from <guid> or <id>).
- Whether the GUID is a permalink (from <guid isPermaLink="...">).
- URL to the item (from <link>).
- Title of the item (from <title>).
- Publication date and time (from <pubDate> or <published>).
Fields are only added if present in the RSS item. Missing fields are omitted from the Document.
Configuration Examples
Basic RSS Feed
Recent Items Only
Incremental Refresh
Podcast Feed
UUID-Based IDs
Feed Reading Behavior
Single Read Mode (Default)
When runDuration and refreshIncrement are not configured:
1. Fetch the feed once
2. Filter items by pubDateCutoff (if configured)
3. Publish each item as a Document
4. Complete
Incremental Refresh Mode
When runDuration and refreshIncrement are configured:
1. Fetch the feed
2. Filter items by pubDateCutoff (if configured)
3. Publish new items (not seen in the previous refresh)
4. Wait for the refreshIncrement duration
5. Repeat from step 1 until runDuration has elapsed
The pubDateCutoff is recalculated on each refresh, so each item is judged against the time of that refresh rather than the connector's start time. An item that was within the window on one refresh may be too old on a later one.
Publication Date Filtering
The pubDateCutoff parameter filters items by comparing their publication date to the current time. An item is included only if it was published within the configured duration before now.
Examples
pubDateCutoff = "1h", current time = 2024-01-15 10:00:00
- Item A published at 09:30:00 → Included (30 minutes ago)
- Item B published at 08:45:00 → Excluded (75 minutes ago)
- Item C with no pubDate → Included (always included)
With a longer cutoff such as "7d" and the same current date:
- Item A published on 2024-01-10 → Included (5 days ago)
- Item B published on 2024-01-01 → Excluded (14 days ago)
Missing Publication Dates
If an RSS item has no pubDate or published element:
- The item is always included regardless of pubDateCutoff
- A warning is logged if pubDateCutoff is configured
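The filtering rules above can be sketched in a few lines of Java. This is a minimal illustration, not the connector's actual code: the class and method names are hypothetical, and treating an item published exactly at the cutoff as included is an assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of RSSConnector.
class PubDateFilter {

  /**
   * Decides whether an item passes the recency filter.
   * pubDate is empty when the item has no publication date;
   * cutoff is empty when pubDateCutoff is not configured.
   */
  static boolean include(Optional<Instant> pubDate, Optional<Duration> cutoff, Instant now) {
    if (cutoff.isEmpty() || pubDate.isEmpty()) {
      return true; // no filter configured, or nothing to compare: always include
    }
    Instant earliestAllowed = now.minus(cutoff.get());
    // Assumption: an item published exactly at the cutoff is still included.
    return !pubDate.get().isBefore(earliestAllowed);
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2024-01-15T10:00:00Z");
    Optional<Duration> oneHour = Optional.of(Duration.ofHours(1));
    // Item A, 30 minutes old
    System.out.println(include(Optional.of(Instant.parse("2024-01-15T09:30:00Z")), oneHour, now)); // true
    // Item B, 75 minutes old
    System.out.println(include(Optional.of(Instant.parse("2024-01-15T08:45:00Z")), oneHour, now)); // false
    // Item C, no publication date
    System.out.println(include(Optional.empty(), oneHour, now)); // true
  }
}
```

Passing the current time in as a parameter keeps the check easy to test and mirrors the way the cutoff is recalculated on every refresh.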
Document ID Generation
The connector determines Document IDs based on useGuidForDocID:
useGuidForDocID = true (Default)
- If the item has a <guid> or <id>, use it as the Document ID
- If the item has no GUID, generate a UUID and log a warning
- Apply docIdPrefix if configured
Example: GUID "article-123" with docIdPrefix: "rss-" → Document ID "rss-article-123"
useGuidForDocID = false
- Generate a new UUID for each item
- Apply docIdPrefix if configured
Example: UUID "f47ac10b-58cc-4372-a567-0e02b2c3d479" with docIdPrefix: "rss-" → Document ID "rss-f47ac10b-58cc-4372-a567-0e02b2c3d479"
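As a rough sketch of the ID rules above, the following Java shows one way the choice between GUID and UUID could be expressed. The class and method names are hypothetical, and an unconfigured docIdPrefix is modeled here as an empty string, which is an assumption of this example.

```java
import java.util.Optional;
import java.util.UUID;

// Hypothetical helper for illustration only; not part of RSSConnector.
class DocIdSketch {

  /**
   * Builds a Document ID from an optional GUID.
   * docIdPrefix is "" when no prefix is configured (assumption of this sketch).
   */
  static String docId(Optional<String> guid, boolean useGuidForDocID, String docIdPrefix) {
    String base;
    if (useGuidForDocID && guid.isPresent()) {
      base = guid.get();
    } else {
      // GUIDs are disabled, or the item has none: fall back to a random UUID.
      base = UUID.randomUUID().toString();
    }
    return docIdPrefix + base;
  }

  public static void main(String[] args) {
    System.out.println(docId(Optional.of("article-123"), true, "rss-")); // rss-article-123
    // Random on every call, so the ID is not stable across runs.
    System.out.println(docId(Optional.of("article-123"), false, "rss-"));
  }
}
```

Because the UUID branch produces a different value on every call, re-reading the same feed with useGuidForDocID = false yields new Document IDs each time.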
Enclosure Handling
RSS enclosures (typically used for podcasts and media) are converted to JSON objects and stored in the enclosures field.
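To illustrate the shape of one enclosure entry, the sketch below builds the three documented keys (type, url, and the optional length) as a plain map. The class and method names are hypothetical, the sample URL is made up, and the connector's real JSON representation may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of RSSConnector.
class EnclosureSketch {

  /** Builds the key/value pairs for one enclosure; length may be null. */
  static Map<String, Object> toFieldValue(String type, String url, Long length) {
    Map<String, Object> json = new LinkedHashMap<>();
    json.put("type", type);
    json.put("url", url);
    if (length != null) {
      json.put("length", length); // optional: left out when the feed gives no size
    }
    return json;
  }

  public static void main(String[] args) {
    // Sample values are invented for the example.
    System.out.println(toFieldValue("audio/mpeg", "https://example.com/episode1.mp3", 12345678L));
    System.out.println(toFieldValue("audio/mpeg", "https://example.com/episode2.mp3", null));
  }
}
```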
Incremental Refresh Details
Tracking Seen Items
In incremental mode, the connector maintains a set of items from the previous refresh:
- Refresh 1: Process items A, B, C → Store in memory
- Wait: Sleep for refreshIncrement
- Refresh 2: Fetch feed, items B, C, D → Skip B and C (seen before), process D → Store B, C, D
- Wait: Sleep for refreshIncrement
- Refresh 3: Continue…
Item Comparison
Items are compared using the RSS library’s Item.equals() method, which typically compares:
- GUID
- Title
- Link
- Publication date
The set of seen items is reset on each refresh to prevent unbounded memory growth. Only items from the immediately previous refresh are tracked.
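The "previous refresh only" tracking described above can be sketched as follows. This is an illustrative model rather than the connector's code: the SeenTracker class is hypothetical, and plain strings stand in for the RSS library's Item objects, relying on their equals() and hashCode().

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of RSSConnector.
class SeenTracker<T> {
  private Set<T> previous = new HashSet<>();

  /** Returns items absent from the previous refresh, then remembers only the current items. */
  List<T> newItems(List<T> current) {
    List<T> fresh = new ArrayList<>();
    for (T item : current) {
      if (!previous.contains(item)) {
        fresh.add(item);
      }
    }
    previous = new HashSet<>(current); // replace rather than accumulate, so memory stays bounded
    return fresh;
  }

  public static void main(String[] args) {
    SeenTracker<String> tracker = new SeenTracker<>();
    System.out.println(tracker.newItems(List.of("A", "B", "C"))); // [A, B, C]
    System.out.println(tracker.newItems(List.of("B", "C", "D"))); // [D]
  }
}
```

One consequence of this design, at least in the sketch: an item that drops out of the feed for one refresh and reappears later is treated as new again, because only the immediately previous refresh is remembered.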
Graceful Interruption
The connector can be interrupted during the wait period.
Performance Considerations
Feed Fetch Frequency
In incremental mode, choose refreshIncrement based on:
- Feed update frequency
- Network latency
- Processing time per item
- Politeness to the feed provider
Respect the feed's own refresh guidance (check the <ttl> element if present).
Memory Usage
In incremental mode, the connector stores items from the previous refresh in memory.
Network Errors
The connector does not retry failed feed fetches. If a fetch fails:
- In single read mode, the connector throws ConnectorException
- In incremental mode, the error is thrown and the connector stops
Error Handling
Malformed Feed URL
If rssURL is not a valid URL, an IllegalArgumentException is thrown during connector construction.
Feed Fetch Errors
If fetching the feed fails (network error, 404, etc.), a ConnectorException is thrown.
Invalid runDuration/refreshIncrement
If only one of runDuration or refreshIncrement is configured, an IllegalArgumentException is thrown.
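A minimal sketch of the "both or neither" rule is shown below. The class name, method name, and exception message are illustrative assumptions; only the exception type comes from the description above.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not part of RSSConnector.
class RefreshConfigCheck {

  /** Accepts both values set or both null; rejects exactly one being set. */
  static void validate(Duration runDuration, Duration refreshIncrement) {
    if ((runDuration == null) != (refreshIncrement == null)) {
      // Message text is made up for this example.
      throw new IllegalArgumentException(
          "runDuration and refreshIncrement must be configured together");
    }
  }

  public static void main(String[] args) {
    validate(null, null);                                    // single read mode: accepted
    validate(Duration.ofHours(1), Duration.ofMinutes(5));    // incremental mode: accepted
    try {
      validate(Duration.ofHours(1), null);                   // only one set: rejected
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```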
Lifecycle Methods
execute(Publisher publisher)
Single Read Mode:
1. Fetch RSS feed
2. For each item:
   - Check pubDateCutoff (skip if too old)
   - Create Document
   - Publish Document
3. Complete
Incremental Refresh Mode:
1. Record start time
2. Loop:
   - Recalculate pubDateCutoff (if configured)
   - Fetch RSS feed
   - For each item:
     - Check pubDateCutoff (skip if too old)
     - Check if seen in previous refresh (skip if yes)
     - Create Document
     - Publish Document
   - Store items for next iteration
   - Check if runDuration exceeded → exit if yes
   - Sleep for refreshIncrement
   - Repeat
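The incremental loop can be modeled roughly as below. This is a simplified sketch under several assumptions: the class and parameter names are hypothetical, strings stand in for feed items, the fetch and publish steps are passed in as functions, pubDateCutoff filtering is left out for brevity, and an interrupt during the sleep is treated as a request to stop.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical model of the loop for illustration only; not RSSConnector's code.
class IncrementalLoopSketch {

  /** Runs fetch/publish cycles until runDuration has elapsed or the thread is interrupted. */
  static void run(Supplier<List<String>> fetchFeed, Consumer<String> publish,
                  Duration runDuration, Duration refreshIncrement) {
    Instant start = Instant.now();
    Set<String> previous = new HashSet<>();
    while (true) {
      List<String> items = fetchFeed.get();
      for (String item : items) {
        if (!previous.contains(item)) {
          publish.accept(item); // only items not seen in the previous refresh
        }
      }
      previous = new HashSet<>(items); // store items for the next iteration
      if (Duration.between(start, Instant.now()).compareTo(runDuration) >= 0) {
        return; // runDuration exceeded
      }
      try {
        Thread.sleep(refreshIncrement.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag for callers
        return; // assumption: an interrupt ends the run
      }
    }
  }

  public static void main(String[] args) {
    // A feed that never changes: each item is published once, on the first refresh.
    run(() -> List.of("A", "B"), item -> System.out.println("published " + item),
        Duration.ofMillis(300), Duration.ofMillis(100));
  }
}
```

Checking the elapsed time before sleeping, as in the step list, means the loop does not wait out a final refreshIncrement once runDuration has already passed.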
Use Cases
News Aggregation
Podcast Indexing
Real-Time Monitoring
Blog Content Sync
Limitations
- Only supports RSS 2.0 and Atom feeds (RSS 1.0 may work but is not explicitly supported)
- No authentication support (feeds must be publicly accessible)
- No support for feed autodiscovery
- Does not follow redirects automatically (ensure rssURL is the final URL)
- Cannot process multiple feeds in a single connector (create separate connectors)
- Item comparison in incremental mode depends on the RSS library’s equals() implementation