Overview
A Connector is the entry point for data into Lucille. Connectors connect to source systems, read data, and generate Documents that will be processed through pipelines and eventually indexed into search engines. Connectors are responsible for data extraction only; all data transformation happens in the Pipeline through Stages.
Connector Lifecycle
Every Connector follows a well-defined four-phase lifecycle that ensures reliable data extraction.

Lifecycle Phases
1. preExecute(runId)
Performs setup operations before document generation begins.
Common uses:
- Validate connection to source system
- Initialize resources (database connections, file handles)
- Record start time or checkpoint
- Verify required permissions
Important: If preExecute() throws an exception, the run fails immediately and execute() is never called.
2. execute(publisher)
The main execution phase where documents are generated and published.
Responsibilities:
- Read data from the source system
- Create Document objects with appropriate IDs and fields
- Publish documents via the provided Publisher
- Handle pagination, batching, or streaming as needed
Important: Only called if preExecute() succeeds. Exceptions cause the run to fail.
3. postExecute(runId)
Performs cleanup or finalization after all documents have been published.
Common uses:
- Update checkpoints or high-water marks
- Record completion time or statistics
- Trigger downstream processes
- Move or archive processed files
Important: Only called if both preExecute() and execute() succeed.
4. close()
Always called to release resources, even if earlier phases failed.
Responsibilities:
- Close database connections
- Release file handles
- Clean up temporary resources
- Close network connections
Important: Always called in a finally block, regardless of success or failure.

Configuration
Connectors are configured in the connectors array in your Lucille configuration:
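For example, a minimal entry might look like the following sketch (the connector class path and its connector-specific properties are illustrative; consult each connector's documentation for its actual options):

```hocon
connectors: [
  {
    class: "com.kmwllc.lucille.connector.FileConnector"
    name: "connector1"
    pipeline: "pipeline1"
    # Optional: prefix every document ID generated by this connector
    docIdPrefix: "file_"
  }
]
```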
Required Properties
- class: Fully-qualified Java class name of the Connector implementation
- name: Unique identifier for this connector (auto-generated if omitted)
- pipeline: Name of the pipeline to send documents to
Optional Properties
- docIdPrefix: String prepended to all document IDs generated by this connector
- collapse: Whether to use a collapsing publisher (combines documents with same ID)
Common Patterns
Publishing Documents
The Publisher interface provides the mechanism for sending documents into the pipeline:
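The sketch below shows the shape of a typical publish loop, including setting the run ID on each document. It uses simplified stand-in types (a Map-backed Document and a list-backed Publisher), not the real Lucille API; the `run_id` field name is also an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PublishSketch {
    // Stand-in for Lucille's Document: an ID plus arbitrary fields.
    static class Document {
        final String id;
        final Map<String, Object> fields = new HashMap<>();
        Document(String id) { this.id = id; }
        void setField(String name, Object value) { fields.put(name, value); }
    }

    // Stand-in for Lucille's Publisher; the real one hands documents
    // to the pipeline, this one just needs a publish method.
    interface Publisher {
        void publish(Document doc) throws Exception;
    }

    // Typical shape of an execute() body: read source records, build a
    // document per record with a stable ID and the run ID, then publish.
    static void execute(String runId, List<Map<String, Object>> rows, Publisher publisher)
            throws Exception {
        for (Map<String, Object> row : rows) {
            Document doc = new Document("row_" + row.get("id")); // stable, meaningful ID
            row.forEach(doc::setField);
            doc.setField("run_id", runId); // hypothetical field name, for illustration
            publisher.publish(doc);
        }
    }

    public static void main(String[] args) throws Exception {
        List<Document> published = new ArrayList<>();
        List<Map<String, Object>> rows = List.of(
            Map.<String, Object>of("id", 1, "title", "first"),
            Map.<String, Object>of("id", 2, "title", "second"));
        execute("run1", rows, published::add);
        System.out.println(published.size());    // 2
        System.out.println(published.get(0).id); // row_1
    }
}
```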
Setting Run ID
Documents should include the run ID to enable tracking and correlation:
Collapsing Publisher
Some connectors publish multiple documents with the same ID (e.g., one per field value). A collapsing publisher combines these into a single multi-valued document:
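Collapsing is enabled through configuration. A sketch, reusing the collapse property described under Optional Properties (the connector class shown is illustrative):

```hocon
connectors: [
  {
    class: "com.kmwllc.lucille.connector.DatabaseConnector"
    name: "dbConnector"
    pipeline: "pipeline1"
    # Successive documents published with the same ID are merged
    # into one document with multi-valued fields.
    collapse: true
  }
]
```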
Pagination
For large datasets, implement pagination to avoid memory issues:
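A minimal sketch of the pattern, using a hypothetical in-memory paged source in place of a real paged query or API call, so that only one page of results is held in memory at a time:

```java
import java.util.List;
import java.util.function.Consumer;

public class PaginationSketch {
    // Stand-in for a paged source system: returns up to pageSize items
    // starting at offset. A real connector would issue a paged SQL
    // query or API request here.
    static List<String> fetchPage(List<String> source, int offset, int pageSize) {
        int end = Math.min(offset + pageSize, source.size());
        return offset >= end ? List.of() : source.subList(offset, end);
    }

    // Process the whole dataset one page at a time; returns the number
    // of items published.
    static int executePaged(List<String> source, int pageSize, Consumer<String> publish) {
        int offset = 0;
        int published = 0;
        while (true) {
            List<String> page = fetchPage(source, offset, pageSize);
            if (page.isEmpty()) break; // no more pages
            for (String item : page) {
                publish.accept(item);
                published++;
            }
            offset += pageSize;
        }
        return published;
    }

    public static void main(String[] args) {
        List<String> data = List.of("a", "b", "c", "d", "e");
        int n = executePaged(data, 2, item -> System.out.println(item));
        System.out.println(n); // 5
    }
}
```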
Checkpointing
Track what has been processed to enable incremental updates:
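One common form is a high-water mark persisted to a file: read it before processing, skip anything at or below it, and advance it afterward (typically in postExecute() so it only moves after a fully successful run). A self-contained sketch under those assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CheckpointSketch {
    // Read the last processed high-water mark (here: a numeric record ID),
    // or 0 if no checkpoint exists yet.
    static long readCheckpoint(Path file) throws IOException {
        return Files.exists(file) ? Long.parseLong(Files.readString(file).trim()) : 0L;
    }

    // Persist the new high-water mark after a successful run.
    static void writeCheckpoint(Path file, long mark) throws IOException {
        Files.writeString(file, Long.toString(mark));
    }

    // Process only records newer than the checkpoint; returns the new mark.
    static long processIncremental(List<Long> recordIds, Path checkpoint) throws IOException {
        long mark = readCheckpoint(checkpoint);
        long newMark = mark;
        for (long id : recordIds) {
            if (id > mark) {
                // ... build and publish a document for this record ...
                newMark = Math.max(newMark, id);
            }
        }
        writeCheckpoint(checkpoint, newMark);
        return newMark;
    }

    public static void main(String[] args) throws IOException {
        Path cp = Files.createTempFile("checkpoint", ".txt");
        Files.delete(cp); // start with no checkpoint
        System.out.println(processIncremental(List.of(1L, 2L, 3L), cp)); // 3
        System.out.println(processIncremental(List.of(2L, 3L, 4L), cp)); // 4 (only 4 is new)
    }
}
```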
Built-in Connectors
Lucille provides several ready-to-use connectors.

Core Connectors
DatabaseConnector
Execute SQL queries and generate documents from result sets. Supports JDBC-compatible databases.
FileConnector
Read files from local or remote filesystems. Supports various file formats through FileHandlers.
SolrConnector
Query Solr and ingest existing documents. Useful for Solr-to-Solr migrations.
SequenceConnector
Generate a sequence of empty documents. Useful for testing and performance benchmarking.
RSSConnector
Fetch and parse RSS/Atom feeds into documents.
Plugin Connectors
ParquetConnector
Read Apache Parquet files and generate documents from rows. Available in the lucille-parquet plugin.

Creating Custom Connectors
Extend AbstractConnector to create your own connector:
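The skeleton below illustrates the four lifecycle phases and the contract between them. It uses stand-in types rather than Lucille's actual AbstractConnector and Publisher, so that the phase ordering (and the guarantee that close() always runs) can be shown in a self-contained way:

```java
import java.util.ArrayList;
import java.util.List;

public class LifecycleSketch {
    // Stand-in for Lucille's Publisher (hypothetical; the real interface
    // publishes Document objects, not plain strings).
    interface Publisher { void publish(String docId) throws Exception; }

    // Stand-in connector recording which phases ran, in order.
    static class MyConnector {
        final List<String> log = new ArrayList<>();

        void preExecute(String runId) { log.add("preExecute"); /* validate source, open resources */ }
        void execute(Publisher publisher) throws Exception {
            log.add("execute");
            for (int i = 1; i <= 3; i++) publisher.publish("doc_" + i); // generate and publish
        }
        void postExecute(String runId) { log.add("postExecute"); /* update checkpoints, stats */ }
        void close() { log.add("close"); /* release connections, file handles */ }
    }

    // The runner enforces the lifecycle contract: each phase runs only if
    // the previous one succeeded, and close() always runs.
    static List<String> run(MyConnector c, String runId, Publisher publisher) throws Exception {
        try {
            c.preExecute(runId);
            c.execute(publisher);
            c.postExecute(runId);
        } finally {
            c.close(); // always called, even on failure
        }
        return c.log;
    }

    public static void main(String[] args) throws Exception {
        List<String> published = new ArrayList<>();
        List<String> phases = run(new MyConnector(), "run1", published::add);
        System.out.println(phases);    // [preExecute, execute, postExecute, close]
        System.out.println(published); // [doc_1, doc_2, doc_3]
    }
}
```

In Lucille itself, AbstractConnector provides the configuration plumbing, and the framework runner (not your code) drives the phases in this order.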
Validation with Spec
All connectors must declare a public static final Spec SPEC that defines their configuration properties. The Spec enables Lucille to:
- Validate configuration at startup
- Generate helpful error messages for missing/invalid properties
- Document available configuration options
Error Handling
Connector-Level Failures
Exceptions thrown from lifecycle methods cause the entire run to fail.

Document-Level Failures
Failures publishing individual documents should typically be logged but not fail the entire connector.

A connector is not considered failed if individual documents encounter errors during pipeline processing or indexing. Only exceptions from connector lifecycle methods cause run failure.
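A minimal sketch of this log-and-continue pattern (stand-in types; a real connector would log via its logging framework and publish Lucille Documents):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class PerDocErrorSketch {
    // Publish each document, logging failures instead of aborting.
    // Rethrowing from here would fail the whole connector run, which is
    // usually not what you want for a single bad record.
    static int publishAll(List<String> docIds, Consumer<String> publish, List<String> errorLog) {
        int ok = 0;
        for (String id : docIds) {
            try {
                publish.accept(id);
                ok++;
            } catch (Exception e) {
                errorLog.add("failed to publish " + id + ": " + e.getMessage());
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        // Simulated publisher that rejects one document.
        Consumer<String> flaky = id -> {
            if (id.equals("doc_2")) throw new RuntimeException("bad record");
        };
        int ok = publishAll(List.of("doc_1", "doc_2", "doc_3"), flaky, errors);
        System.out.println(ok);            // 2
        System.out.println(errors.size()); // 1
    }
}
```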
Best Practices
- Fail Fast in preExecute: Validate connections and required resources early
- Use Checkpoints: Track progress to enable incremental processing
- Handle Large Datasets: Use pagination or streaming to avoid memory issues
- Set Meaningful IDs: Document IDs should be stable and meaningful
- Include Run ID: Always set the run ID on generated documents
- Log Progress: Log milestones to aid debugging and monitoring
- Clean Up Resources: Always close connections and file handles in close()
- Document Configuration: Use Spec to declare and validate all properties
Next Steps
Documents
Learn about Document structure and fields
Pipelines
Understand how documents flow through pipelines
Built-in Connectors
Explore available connector implementations
Custom Connectors
Build your own connector