Execution Modes
Lucille supports three execution modes:

Local Mode
All components (connectors, workers, indexer) run in a single JVM. No Kafka required.
Hybrid Mode
Runner in one process, workers/indexers in separate processes. Uses Kafka for communication.
Distributed Mode
All components run as separate processes, communicating via Kafka. Fully scalable.
Kafka configuration is only required for hybrid and distributed modes.
Basic Kafka Configuration
Comma-separated list of Kafka broker addresses
Consumer group ID that all Lucille workers belong to. Workers in the same group share the load of processing documents.
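These two settings can be sketched in HOCON as follows. Only `kafka.bootstrapServers` is named with its full path on this page; placing `consumerGroupId` alongside it in the `kafka` block is an assumption.

```hocon
kafka {
  # Comma-separated list of broker addresses
  bootstrapServers: "broker1:9092,broker2:9092"
  # Workers sharing this ID split the document-processing load
  # (nesting under the kafka block is an assumption)
  consumerGroupId: "lucille_workers"
}
```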
Kafka Topics
Lucille automatically creates and manages Kafka topics for document flow and event tracking.

Topic Types
- Source Topics
- Event Topics
Documents to be processed by a pipeline:

- Default naming: `{pipeline_name}_source`
- Custom naming: use `kafka.sourceTopic`
Source Topic Configuration
Custom name for the topic containing documents to be processed. If not set, defaults to `{pipeline_name}_source`.

Event Topic Configuration
Custom name for the event topic
Enable/disable sending document success/failure events to Kafka
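A sketch of topic configuration. `kafka.sourceTopic` is named on this page; the event-topic key names below are hypothetical placeholders for the two settings described above, so check the Lucille reference for the exact spellings.

```hocon
kafka {
  # Overrides the default {pipeline_name}_source naming
  sourceTopic: "my_pipeline_docs"
  # Hypothetical key names for the event-topic settings described above
  eventTopic: "my_pipeline_events"
  events: true
}
```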
Performance Settings
Maximum time (in seconds) allowed between Kafka polls before a consumer is evicted from the consumer group. Increase this if processing individual documents takes a long time:
Maximum size of Kafka requests in bytes (approximately 250 MB). Increase for large documents:
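Both performance settings in one sketch; the key names appear on this page, but their nesting under the `kafka` block and the values shown are illustrative assumptions.

```hocon
kafka {
  # Allow up to 10 minutes between polls for slow pipelines
  maxPollIntervalSecs: 600
  # Roughly 500 MB, for very large documents
  maxRequestSize: 500000000
}
```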
Security Configuration
Security protocol to use for Kafka connections. Common values:
- `PLAINTEXT`: No encryption (default)
- `SSL`: TLS encryption
- `SASL_PLAINTEXT`: SASL authentication without encryption
- `SASL_SSL`: SASL authentication with TLS
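For illustration, an authenticated TLS setup might look like the following. The page describes this setting without naming its key, so `securityProtocol` is a hypothetical key name.

```hocon
kafka {
  # Hypothetical key name for the security protocol setting
  securityProtocol: "SASL_SSL"
}
```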
Custom Serializers and Deserializers
Custom serializer class for documents. Defaults to `com.kmwllc.lucille.message.KafkaDocumentSerializer`. Only set if you need a custom serializer.

Custom deserializer class for documents. Defaults to `com.kmwllc.lucille.message.KafkaDocumentDeserializer`. Only set if you need a custom deserializer.

Property Files
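A sketch of overriding the serializers. The key names here are assumptions; the class names shown are the defaults stated above.

```hocon
kafka {
  # Hypothetical key names; the defaults shown are the classes named above
  documentSerializer: "com.kmwllc.lucille.message.KafkaDocumentSerializer"
  documentDeserializer: "com.kmwllc.lucille.message.KafkaDocumentDeserializer"
}
```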
For advanced Kafka configuration, you can provide external property files:

Path to Kafka consumer properties file
Path to Kafka producer properties file
Path to Kafka admin client properties file
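An illustrative sketch of the three property-file paths; the key names and file paths are assumptions.

```hocon
kafka {
  # Hypothetical key names for the three property-file paths above
  consumerPropertyFile: "conf/kafka-consumer.properties"
  producerPropertyFile: "conf/kafka-producer.properties"
  adminPropertyFile: "conf/kafka-admin.properties"
}
```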
ZooKeeper Configuration
Required when using `worker.maxRetries` to track document retry attempts:
Connection string for ZooKeeper ensemble
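A sketch combining the two settings. `worker.maxRetries` is named on this page; the ZooKeeper key name is an assumption.

```hocon
worker {
  # Retry attempts are tracked in ZooKeeper
  maxRetries: 3
}
zookeeper {
  # Hypothetical key name for the ensemble connection string
  connectString: "zk1:2181,zk2:2181,zk3:2181"
}
```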
Complete Examples
- Basic Distributed
- Production Cluster
- Streaming Mode
- High Throughput
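As one end-to-end illustration, a basic distributed setup could combine the settings covered above. The `connectors` and `pipelines` entries here are placeholders; only `kafka.bootstrapServers`, `kafka.sourceTopic`, `worker.pipeline`, and `consumerGroupId` are key names taken from this page.

```hocon
connectors: [
  # Placeholder entry; substitute a real connector class and its settings
  {name: "connector1", class: "...", pipeline: "pipeline1"}
]
pipelines: [
  {name: "pipeline1", stages: []}
]
worker {
  # Selects which configured pipeline the workers run
  pipeline: "pipeline1"
}
kafka {
  bootstrapServers: "broker1:9092,broker2:9092"
  consumerGroupId: "lucille_workers"
  sourceTopic: "pipeline1_source"
}
```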
Publisher Configuration
The Publisher manages document flow between connectors and workers:

Maximum queue capacity in local mode:
- Queue of published documents waiting to be processed
- Queue of completed documents waiting to be indexed
Maximum pending documents in hybrid/distributed mode. Causes publication to wait until pending docs fall below this max. This is a non-strict limit and can be exceeded by the number of threads calling publish().
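A sketch of the two limits. The page describes these settings without naming their keys, so both key names below are assumptions.

```hocon
publisher {
  # Hypothetical key names for the limits described above
  queueCapacity: 10000    # local mode: bound on the two internal queues
  maxPendingDocs: 5000    # hybrid/distributed: soft cap on pending documents
}
```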
Distributed Mode Architecture
Component Responsibilities
Runner
- Executes connectors to publish documents
- Publishes documents to Kafka source topic
- Tracks completion via Kafka event topic
- Outputs run summary when complete
Launched with: `java -jar lucille.jar run config.conf`
Workers
- Subscribe to Kafka source topic
- Process documents through pipeline stages
- Send processed documents to indexer
- Publish success/failure events
Runs the pipeline specified by `worker.pipeline` in the config. Launched with: `java -jar lucille.jar worker config.conf`
Indexer
- Receives processed documents from workers
- Batches and sends to search engine
- Publishes completion events to Kafka
Launched with: `java -jar lucille.jar indexer config.conf`

Troubleshooting
Consumer evicted from group
Error: Consumer is not subscribed to the topic or partition does not exist

Cause: `maxPollIntervalSecs` is too short for the document processing time.

Solution: Increase the timeout.
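For example (the nesting of `maxPollIntervalSecs` under the `kafka` block and the value shown are assumptions):

```hocon
kafka {
  # Give slow pipelines up to 15 minutes between polls
  maxPollIntervalSecs: 900
}
```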
Message too large
Error: The message is too large

Cause: A document exceeds `maxRequestSize`.

Solution: Increase the limit.
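For example (nesting under the `kafka` block and the value shown are assumptions):

```hocon
kafka {
  # Roughly 500 MB
  maxRequestSize: 500000000
}
```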
No workers processing
Symptoms: Documents published but not processed

Checklist:
- Verify workers are running: `ps aux | grep lucille`
- Check that `worker.pipeline` matches a configured pipeline
- Verify `kafka.bootstrapServers` is accessible
- Check the Kafka consumer group: `kafka-consumer-groups.sh --describe --group lucille_workers`
Duplicate processing
Cause: Workers are in different consumer groups.

Solution: Ensure all workers use the same `consumerGroupId`.

Performance Tuning
- Throughput
- Latency
- Reliability
Optimize for maximum document throughput:
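An illustrative throughput-oriented sketch using only the knobs covered on this page; the nesting and values are starting-point assumptions, not recommendations.

```hocon
kafka {
  # Tolerate long stretches of work between polls
  maxPollIntervalSecs: 600
  # Permit large requests so big documents don't bounce
  maxRequestSize: 250000000
}
```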
Next Steps
Running Lucille
Launch Lucille in different modes
Monitoring
Monitor distributed deployments