When Lucille runs in hybrid or distributed mode, components communicate through Apache Kafka topics. This enables horizontal scaling and fault tolerance.

Execution Modes

Lucille supports three execution modes:

Local Mode

All components (connectors, workers, indexer) run in a single JVM. No Kafka required.

Hybrid Mode

Runner in one process, workers/indexers in separate processes. Uses Kafka for communication.

Distributed Mode

All components run as separate processes, communicating via Kafka. Fully scalable.
Kafka configuration is only required for hybrid and distributed modes.

Basic Kafka Configuration

kafka {
  bootstrapServers: "localhost:9092"
  consumerGroupId: "lucille_workers"
}
kafka.bootstrapServers
string
required
Comma-separated list of Kafka broker addresses
kafka {
  bootstrapServers: "broker1:9092,broker2:9092,broker3:9092"
}
kafka.consumerGroupId
string
default:"lucille_workers"
Consumer group ID that all Lucille workers belong to. Workers in the same group share the load of processing documents.
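For example, to keep two Lucille deployments that share a Kafka cluster from competing for the same partitions, give each deployment its own group ID (the value below is illustrative):
kafka {
  consumerGroupId: "lucille_workers_staging"
}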

Kafka Topics

Lucille automatically creates and manages Kafka topics for document flow and event tracking.

Topic Types

Source topics contain documents to be processed by a pipeline:
  • Default naming: {pipeline_name}_source
  • Custom naming: Use kafka.sourceTopic
kafka {
  sourceTopic: "pipeline1_source"
}

Source Topic Configuration

kafka.sourceTopic
string
Custom name for the topic containing documents to be processed. If not set, defaults to {pipeline_name}_source.

Event Topic Configuration

kafka.eventTopic
string
Custom name for the event topic
USE WITH CAUTION: This property should be omitted from most Lucille configs. When absent, Lucille creates a distinct event topic for each pipeline/runId, which is necessary for proper workflow tracking in batch mode. Specifying this property causes a single event topic to be used regardless of pipeline or runId, which interferes with tracking the status of any particular run. This setting can be safely used in streaming mode, when a Worker/WorkerIndexer is reading directly from Kafka.
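For example, in a streaming setup where a Worker reads directly from Kafka, a fixed event topic can be named explicitly (the topic name below is illustrative):
kafka {
  eventTopic: "pipeline1_events"
}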
kafka.events
boolean
default:"true"
Enable/disable sending document success/failure events to Kafka
kafka {
  events: false  # Disable event tracking
}

Performance Settings

kafka.maxPollIntervalSecs
number
default:"600"
Maximum time (in seconds) allowed between Kafka polls before a consumer is evicted from the consumer group. Increase this if processing individual documents takes a long time:
kafka {
  maxPollIntervalSecs: 1200  # 20 minutes
}
kafka.maxRequestSize
number
default:"250000000"
Maximum size of Kafka requests in bytes (the default is approximately 250 MB). Increase for large documents:
kafka {
  maxRequestSize: 500000000  # 500 MB
}

Security Configuration

kafka.securityProtocol
string
Security protocol to use for Kafka connections. Common values:
  • PLAINTEXT - No encryption (default)
  • SSL - TLS encryption
  • SASL_PLAINTEXT - SASL authentication without encryption
  • SASL_SSL - SASL authentication with TLS
kafka {
  securityProtocol: "SSL"
}
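For authenticated clusters, the protocol setting is typically combined with the external property files described under Property Files, which carry the credential details. A sketch (the file paths are illustrative):
kafka {
  securityProtocol: "SASL_SSL"
  consumerPropertyFile: "/etc/lucille/kafka-consumer.properties"
  producerPropertyFile: "/etc/lucille/kafka-producer.properties"
}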

Custom Serializers and Deserializers

kafka.documentSerializer
string
Custom serializer class for documents. Defaults to com.kmwllc.lucille.message.KafkaDocumentSerializer. Only set this if you need a custom serializer.
kafka {
  documentSerializer: "com.example.CustomDocumentSerializer"
}
kafka.documentDeserializer
string
Custom deserializer class for documents. Defaults to com.kmwllc.lucille.message.KafkaDocumentDeserializer. Only set this if you need a custom deserializer.
kafka {
  documentDeserializer: "com.example.CustomDocumentDeserializer"
}

Property Files

For advanced Kafka configuration, you can provide external property files:
kafka.consumerPropertyFile
string
Path to Kafka consumer properties file
kafka {
  consumerPropertyFile: "/etc/lucille/kafka-consumer.properties"
}
kafka.producerPropertyFile
string
Path to Kafka producer properties file
kafka {
  producerPropertyFile: "/etc/lucille/kafka-producer.properties"
}
kafka.adminPropertyFile
string
Path to Kafka admin client properties file
kafka {
  adminPropertyFile: "/etc/lucille/kafka-admin.properties"
}
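As a sketch of what such a file might contain, here is a consumer properties file for a SASL_SSL cluster. It uses standard Kafka client property names; the paths and credentials are placeholders:
# /etc/lucille/kafka-consumer.properties
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="myuser" password="mypassword";
ssl.truststore.location=/etc/lucille/truststore.jks
ssl.truststore.password=changeit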

ZooKeeper Configuration

Required when using worker.maxRetries to track document retry attempts:
worker {
  maxRetries: 2
}

zookeeper {
  connectString: "localhost:2181"
}
zookeeper.connectString
string
Connection string for ZooKeeper ensemble
zookeeper {
  connectString: "zk1:2181,zk2:2181,zk3:2181"
}

Complete Examples

# Simple distributed setup with local Kafka
kafka {
  bootstrapServers: "localhost:9092"
  consumerGroupId: "lucille_workers"
  maxPollIntervalSecs: 600
}

worker {
  pipeline: "main-pipeline"
  threads: 4
}

Publisher Configuration

The Publisher manages document flow between connectors and workers:
publisher.queueCapacity
number
default:"10000"
Maximum queue capacity in local mode:
  • Queue of published documents waiting to be processed
  • Queue of completed documents waiting to be indexed
Each queue can contain this many documents. Affects memory footprint.
publisher {
  queueCapacity: 20000  # Increase for better throughput
}
publisher.maxPendingDocs
number
Maximum pending documents in hybrid/distributed mode
Only use when Runner is a separate process from Workers/Indexer. Do NOT use in local mode (where queueCapacity controls this).
Causes publication to wait until pending docs fall below this max. This is a non-strict limit and can be exceeded by the number of threads calling publish().
publisher {
  maxPendingDocs: 80000
}

Distributed Mode Architecture

┌─────────────┐
│   Runner    │ ──┐
│ (Connector) │   │
└─────────────┘   │
                  ├─► Kafka Topic: pipeline1_source
┌─────────────┐   │
│   Worker 1  │ ◄─┤
│ (Pipeline)  │   │
└─────────────┘   │

┌─────────────┐   │
│   Worker 2  │ ◄─┤
│ (Pipeline)  │   │
└─────────────┘   │

┌─────────────┐   │
│   Worker 3  │ ◄─┤
│ (Pipeline)  │   │
└─────────────┘   │

                  ├─► Kafka Topic: pipeline1_{runId}_events

┌─────────────┐   │
│   Indexer   │ ◄─┘
└─────────────┘

Component Responsibilities

Runner
  • Executes connectors to publish documents
  • Publishes documents to Kafka source topic
  • Tracks completion via Kafka event topic
  • Outputs run summary when complete
Launched with: java -jar lucille.jar run config.conf

Workers
  • Subscribe to Kafka source topic
  • Process documents through pipeline stages
  • Send processed documents to indexer
  • Publish success/failure events
Must specify worker.pipeline in config.
Launched with: java -jar lucille.jar worker config.conf

Indexer
  • Receives processed documents from workers
  • Batches and sends to search engine
  • Publishes completion events to Kafka
Launched with: java -jar lucille.jar indexer config.conf

Troubleshooting

Error: Consumer is not subscribed to the topic or partition does not exist
Cause: maxPollIntervalSecs is too short for document processing time
Solution: Increase the timeout:
kafka {
  maxPollIntervalSecs: 1200  # 20 minutes
}
Error: The message is too large
Cause: Document exceeds maxRequestSize
Solution: Increase the limit:
kafka {
  maxRequestSize: 500000000  # 500 MB
}
Symptoms: Documents published but not processed
Checklist:
  1. Verify workers are running: ps aux | grep lucille
  2. Check worker.pipeline matches a configured pipeline
  3. Verify kafka.bootstrapServers is accessible
  4. Check Kafka consumer group: kafka-consumer-groups.sh --describe --group lucille_workers
Cause: Workers in different consumer groups
Solution: Ensure all workers use the same consumerGroupId:
kafka {
  consumerGroupId: "lucille_workers"  # Same for all workers
}

Performance Tuning

Optimize for maximum document throughput:
kafka {
  maxRequestSize: 1000000000  # Large messages
  maxPollIntervalSecs: 1800   # Extended timeout
}

worker {
  threads: 16  # Many parallel workers
}

publisher {
  maxPendingDocs: 100000  # High capacity
}

indexer {
  batchSize: 500  # Large batches
  batchTimeout: 10000  # Wait for full batches
}

Next Steps

Running Lucille

Launch Lucille in different modes

Monitoring

Monitor distributed deployments