When Lucille runs in hybrid or distributed mode, components communicate through Apache Kafka topics. This enables horizontal scaling and fault tolerance.

Execution Modes

Lucille supports three execution modes:

Local Mode

All components (connectors, workers, indexer) run in a single JVM. No Kafka required.

Hybrid Mode

Runner in one process, workers/indexers in separate processes. Uses Kafka for communication.

Distributed Mode

All components run as separate processes, communicating via Kafka. Fully scalable.
Kafka configuration is only required for hybrid and distributed modes.

Basic Kafka Configuration

kafka {
  bootstrapServers: "localhost:9092"
  consumerGroupId: "lucille_workers"
}
kafka.bootstrapServers
string
required
Comma-separated list of Kafka broker addresses
kafka {
  bootstrapServers: "broker1:9092,broker2:9092,broker3:9092"
}
kafka.consumerGroupId
string
default:"lucille_workers"
Consumer group ID that all Lucille workers belong to. Workers in the same group share the load of processing documents.
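For example, to keep two Lucille deployments that share a Kafka cluster from competing for the same partitions, give each deployment its own group ID (the value below is illustrative):
kafka {
  consumerGroupId: "lucille_workers_staging"
}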

Kafka Topics

Lucille automatically creates and manages Kafka topics for document flow and event tracking.

Topic Types

Source topics contain documents to be processed by a pipeline:
  • Default naming: {pipeline_name}_source
  • Custom naming: Use kafka.sourceTopic
kafka {
  sourceTopic: "pipeline1_source"
}

Source Topic Configuration

kafka.sourceTopic
string
Custom name for the topic containing documents to be processed. If not set, defaults to {pipeline_name}_source.

Event Topic Configuration

kafka.eventTopic
string
Custom name for the event topic
USE WITH CAUTION: This property should be omitted from most Lucille configs. When absent, Lucille creates a distinct event topic for each pipeline/runId, which is necessary for proper workflow tracking in batch mode. Specifying this property causes a single event topic to be used regardless of pipeline or runId, which interferes with tracking the status of any particular run. This setting can be safely used in streaming mode, when a Worker/WorkerIndexer is reading directly from Kafka.
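For example, in a streaming setup where a Worker reads directly from Kafka, a fixed event topic can be named explicitly (the topic name below is illustrative):
kafka {
  eventTopic: "pipeline1_events"
}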
kafka.events
boolean
default:"true"
Enable/disable sending document success/failure events to Kafka
kafka {
  events: false  # Disable event tracking
}

Performance Settings

kafka.maxPollIntervalSecs
number
default:"600"
Maximum time (in seconds) allowed between Kafka polls before a consumer is evicted from the consumer group. Increase this if processing individual documents takes a long time:
kafka {
  maxPollIntervalSecs: 1200  # 20 minutes
}
kafka.maxRequestSize
number
default:"250000000"
Maximum size of Kafka requests in bytes (the default is approximately 250 MB). Increase for large documents:
kafka {
  maxRequestSize: 500000000  # 500 MB
}

Security Configuration

kafka.securityProtocol
string
Security protocol to use for Kafka connections. Common values:
  • PLAINTEXT - No encryption (default)
  • SSL - TLS encryption
  • SASL_PLAINTEXT - SASL authentication without encryption
  • SASL_SSL - SASL authentication with TLS
kafka {
  securityProtocol: "SSL"
}
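For authenticated clusters, the protocol setting is typically combined with the external property files described under Property Files, which carry the credential details. A sketch (the file paths are illustrative):
kafka {
  securityProtocol: "SASL_SSL"
  consumerPropertyFile: "/etc/lucille/kafka-consumer.properties"
  producerPropertyFile: "/etc/lucille/kafka-producer.properties"
}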

Custom Serializers and Deserializers

kafka.documentSerializer
string
Custom serializer class for documents. Defaults to com.kmwllc.lucille.message.KafkaDocumentSerializer. Only set this if you need a custom serializer.
kafka {
  documentSerializer: "com.example.CustomDocumentSerializer"
}
kafka.documentDeserializer
string
Custom deserializer class for documents. Defaults to com.kmwllc.lucille.message.KafkaDocumentDeserializer. Only set this if you need a custom deserializer.
kafka {
  documentDeserializer: "com.example.CustomDocumentDeserializer"
}

Property Files

For advanced Kafka configuration, you can provide external property files:
kafka.consumerPropertyFile
string
Path to Kafka consumer properties file
kafka {
  consumerPropertyFile: "/etc/lucille/kafka-consumer.properties"
}
kafka.producerPropertyFile
string
Path to Kafka producer properties file
kafka {
  producerPropertyFile: "/etc/lucille/kafka-producer.properties"
}
kafka.adminPropertyFile
string
Path to Kafka admin client properties file
kafka {
  adminPropertyFile: "/etc/lucille/kafka-admin.properties"
}
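As a sketch of what such a file might contain, here is a consumer properties file for a SASL_SSL cluster. It uses standard Kafka client property names; the paths and credentials are placeholders:
# /etc/lucille/kafka-consumer.properties
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="myuser" password="mypassword";
ssl.truststore.location=/etc/lucille/truststore.jks
ssl.truststore.password=changeit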

ZooKeeper Configuration

Required when using worker.maxRetries to track document retry attempts:
worker {
  maxRetries: 2
}

zookeeper {
  connectString: "localhost:2181"
}
zookeeper.connectString
string
Connection string for ZooKeeper ensemble
zookeeper {
  connectString: "zk1:2181,zk2:2181,zk3:2181"
}

Complete Examples

# Simple distributed setup with local Kafka
kafka {
  bootstrapServers: "localhost:9092"
  consumerGroupId: "lucille_workers"
  maxPollIntervalSecs: 600
}

worker {
  pipeline: "main-pipeline"
  threads: 4
}

Publisher Configuration

The Publisher manages document flow between connectors and workers:
publisher.queueCapacity
number
default:"10000"
Maximum queue capacity in local mode:
  • Queue of published documents waiting to be processed
  • Queue of completed documents waiting to be indexed
Each queue can contain this many documents. Affects memory footprint.
publisher {
  queueCapacity: 20000  # Increase for better throughput
}
publisher.maxPendingDocs
number
Maximum pending documents in hybrid/distributed mode
Only use when Runner is a separate process from Workers/Indexer. Do NOT use in local mode (where queueCapacity controls this).
Causes publication to wait until pending docs fall below this max. This is a non-strict limit and can be exceeded by the number of threads calling publish().
publisher {
  maxPendingDocs: 80000
}

Distributed Mode Architecture

┌─────────────┐
│   Runner    │ ──┐
│ (Connector) │   │
└─────────────┘   │
                  ├─► Kafka Topic: pipeline1_source
┌─────────────┐   │
│   Worker 1  │ ◄─┤
│ (Pipeline)  │   │
└─────────────┘   │

┌─────────────┐   │
│   Worker 2  │ ◄─┤
│ (Pipeline)  │   │
└─────────────┘   │

┌─────────────┐   │
│   Worker 3  │ ◄─┤
│ (Pipeline)  │   │
└─────────────┘   │

                  ├─► Kafka Topic: pipeline1_{runId}_events

┌─────────────┐   │
│   Indexer   │ ◄─┘
└─────────────┘

Component Responsibilities

Runner
  • Executes connectors to publish documents
  • Publishes documents to Kafka source topic
  • Tracks completion via Kafka event topic
  • Outputs run summary when complete
Launched with: java -jar lucille.jar run config.conf

Workers
  • Subscribe to Kafka source topic
  • Process documents through pipeline stages
  • Send processed documents to indexer
  • Publish success/failure events
Must specify worker.pipeline in config.
Launched with: java -jar lucille.jar worker config.conf

Indexer
  • Receives processed documents from workers
  • Batches and sends to search engine
  • Publishes completion events to Kafka
Launched with: java -jar lucille.jar indexer config.conf

Troubleshooting

Error: Consumer is not subscribed to the topic or partition does not exist
Cause: maxPollIntervalSecs is too short for document processing time
Solution: Increase the timeout:
kafka {
  maxPollIntervalSecs: 1200  # 20 minutes
}
Error: The message is too large
Cause: Document exceeds maxRequestSize
Solution: Increase the limit:
kafka {
  maxRequestSize: 500000000  # 500 MB
}
Symptoms: Documents published but not processed
Checklist:
  1. Verify workers are running: ps aux | grep lucille
  2. Check worker.pipeline matches a configured pipeline
  3. Verify kafka.bootstrapServers is accessible
  4. Check Kafka consumer group: kafka-consumer-groups.sh --describe --group lucille_workers
Cause: Workers in different consumer groups
Solution: Ensure all workers use the same consumerGroupId:
kafka {
  consumerGroupId: "lucille_workers"  # Same for all workers
}

Performance Tuning

Optimize for maximum document throughput:
kafka {
  maxRequestSize: 1000000000  # Large messages
  maxPollIntervalSecs: 1800   # Extended timeout
}

worker {
  threads: 16  # Many parallel workers
}

publisher {
  maxPendingDocs: 100000  # High capacity
}

indexer {
  batchSize: 500  # Large batches
  batchTimeout: 10000  # Wait for full batches
}

Next Steps

Running Lucille

Launch Lucille in different modes

Monitoring

Monitor distributed deployments