Overview
Lucille is a production-grade Search ETL framework designed to extract data from source systems, transform it through configurable pipelines, and load it into search engines. The architecture is built around a clear separation of concerns with four primary components working together to process documents from source to destination.Search ETL is a category of ETL where data must be extracted, transformed, and loaded into a search engine. Lucille speaks the language of search, representing data as search-engine-ready Documents with support for enrichment, batching, routing, and versioning.
Core Components
Lucille’s architecture consists of four main components that form a document processing pipeline:Connectors
Extract data from source systems and generate Documents
Pipelines
Orchestrate sequences of transformation Stages
Stages
Perform individual transformations on Documents
Indexers
Batch and send processed Documents to search engines
Document Flow
Documents flow through the system in a well-defined sequence:- Connector reads data from a source and creates Documents
- Publisher sends Documents to a message queue or in-memory buffer
- Worker polls for Documents and processes them through the Pipeline
- Pipeline applies each Stage in sequence to transform the Document
- Indexer receives completed Documents and batches them
- Search Engine receives batched Documents for indexing
Child documents generated during pipeline processing are automatically sent downstream and handled by the same components.
Deployment Modes
Lucille supports multiple deployment modes to accommodate different scalability and complexity requirements:Local Mode
In Local mode, all components run as threads within a single JVM process. Communication happens through in-memory queues with no external dependencies. Characteristics:- All components (Connector, Worker, Indexer) run in the same JVM
- In-memory message passing (no Kafka required)
- Ideal for development, testing, and small-scale deployments
- Simple setup with minimal infrastructure
Test Mode
Test mode is similar to Local mode but bypasses the actual search engine and captures all message traffic for inspection. Characteristics:- Same threading model as Local mode
- Indexer bypasses actual search engine writes
- All messages stored in a
TestMessengerfor inspection - Perfect for unit and integration testing
Distributed Mode (Kafka)
In distributed mode, components run as separate processes and communicate through Apache Kafka. This enables horizontal scalability and fault tolerance. Characteristics:- Connector, Workers, and Indexers run as independent processes
- Communication via Kafka topics
- Multiple Workers can process documents in parallel
- Multiple Indexers can write to different search engines
- Supports high-volume, production deployments
Kafka Local Mode
A hybrid mode that runs all components as threads in one JVM but uses Kafka for communication. Useful for testing distributed behavior locally. Characteristics:- Threading model similar to Local mode
- Uses actual Kafka for message passing
- Tests distributed behavior without deploying separate processes
- Requires local Kafka instance
Messaging Architecture
Lucille uses a messenger abstraction to enable different communication strategies:LocalMessenger
LocalMessenger
In-memory queues for single-JVM deployments. Provides thread-safe concurrent access to documents and events without external dependencies.
KafkaMessenger
KafkaMessenger
Kafka-based messaging for distributed deployments. Documents and events flow through Kafka topics, enabling independent scaling of each component type.Key topics:
topic.documents.{pipeline}- Documents to be processedtopic.toindex.{pipeline}- Documents ready for indexingtopic.events.{pipeline}- Lifecycle events (CREATE, DROP, FINISH, FAIL)
TestMessenger
TestMessenger
Captures and stores all messages for testing purposes. Extends LocalMessenger with recording capabilities to enable assertions about document flow and transformations.
Component Lifecycle
Each component follows a well-defined lifecycle:Connector Lifecycle
Worker Lifecycle
Indexer Lifecycle
Run Management
A “Run” in Lucille represents a complete execution of one or more Connectors. Each run:- Has a unique Run ID (UUID) attached to all documents
- Executes Connectors sequentially (one must complete before next starts)
- Tracks completion of all documents (including generated children)
- Fails if any Connector lifecycle method throws an exception
- Succeeds only when all documents reach an end state (indexed or failed)
A Connector is not considered failed if individual documents encounter errors during pipeline processing or indexing. Only failures in
preExecute(), execute(), or postExecute() cause the run to fail.Parallelism and Concurrency
Lucille achieves parallelism at multiple levels:Worker-Level Parallelism
Multiple Worker threads or processes can consume documents from the same queue/topic, processing them through the pipeline concurrently.Stage-Level Concurrency
Each document flows through stages sequentially, but different documents can be in different stages simultaneously. Child documents generated by a stage flow through downstream stages only.Batch-Level Efficiency
Indexers batch multiple documents together before sending to the search engine, optimizing network utilization and search engine performance.Metrics and Observability
Lucille collects detailed metrics using the Dropwizard Metrics library:- Stage metrics: Processing time, document count, children generated, errors
- Worker metrics: Document processing time distribution
- Indexer metrics: Batching time, documents indexed, indexing rate
- Run metrics: Total time, connector duration, success/failure status
Configuration
Lucille uses HOCON (Human-Optimized Config Object Notation) for configuration, supporting:- Substitution variables and environment variables
- Includes and inheritance
- Type safety and validation
- Comments and readability
Next Steps
Documents
Learn about the Document data structure
Connectors
Understand how data is ingested
Pipelines
Explore pipeline architecture
Deployment
Deploy Lucille in production