What is Lucille?

Lucille is a production-grade Search ETL solution that helps you extract data from source systems, transform it, and load it into search engines like Apache Solr, Elasticsearch, OpenSearch, and vector databases like Pinecone and Weaviate.

Search ETL: A Specialized Category

Search ETL is a category of ETL problem where data must be extracted from a source system, transformed, and loaded into a search engine. Unlike traditional ETL, a Search ETL solution must:
  • Represent data in the form of search-engine-ready Documents
  • Know how to enrich Documents to support common search use cases
  • Follow best practices for interacting with search engines, including batching, routing, and versioning

Why Lucille?

To be production-grade, a search ETL solution must be scalable, reliable, and easy to use. Lucille delivers:

Scalable Architecture

Parallel document processing and distributed execution for handling large data volumes

Observable & Reliable

Built-in metrics, logging, and error handling to monitor your data pipelines

Easy Configuration

Simple HOCON config files define your entire ETL workflow

Production-Hardened

Extensive test coverage and proven in challenging real-world deployments

Core Architecture

Lucille follows a three-stage architecture:

1. Connectors (Data Sources)

Connectors extract data from source systems and generate Documents. Lucille includes connectors for:
  • File systems: Local files, S3, GCP, Azure (CSV, JSON, XML, Parquet)
  • Databases: JDBC-compatible databases with SQL queries
  • APIs: RSS feeds, Kafka streams, custom REST endpoints
  • Search engines: Solr, OpenSearch (for re-indexing)
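
As one example of what a connector configuration looks like, here is a sketch of a JDBC database connector. The class name follows the pattern of the FileConnector shown later on this page, but the specific property names (driver, connectionString, sql, idField, and so on) are illustrative; consult the connector reference for the exact options:
# Sketch of a JDBC connector configuration
# (property names are illustrative; check the connector docs)
connectors: [
  {
    class: "com.kmwllc.lucille.connector.DatabaseConnector",
    name: "dbConnector",
    pipeline: "pipeline1",
    driver: "org.postgresql.Driver",    # JDBC driver class
    connectionString: "jdbc:postgresql://localhost:5432/products",
    jdbcUser: "etl",
    jdbcPassword: "secret",
    sql: "SELECT id, name, description FROM products",
    idField: "id"                       # column used as the Document ID
  }
]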

2. Pipelines (Transformation)

Pipelines consist of Stages that process and enrich Documents. Each stage performs a specific transformation:
  • Extract text from files (Tika, OCR)
  • Parse and normalize data
  • Generate embeddings (OpenAI, Ollama, Jlama)
  • Query external systems
  • Apply business logic
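
To illustrate the shape of a pipeline, here is a sketch with two stages. The stage class names and their parameters are hypothetical examples of the pattern, not verified stage signatures; see the stage reference for what actually ships with Lucille:
# Sketch of a pipeline with two transformation stages
# (stage class names and parameters are illustrative)
pipelines: [
  {
    name: "pipeline1",
    stages: [
      {
        class: "com.kmwllc.lucille.stage.CopyFields",  # copy values between fields
        source: ["name"],
        dest: ["title"]
      },
      {
        class: "com.kmwllc.lucille.stage.Concatenate", # build a display field
        dest: "display_name",
        format_string: "{name} ({id})"
      }
    ]
  }
]

Stages run in order, so each stage sees the Documents as modified by the stages before it.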

3. Indexers (Destinations)

Indexers send processed Documents to search engines and vector databases:
  • Lucene-based: Apache Solr, Elasticsearch, OpenSearch
  • Vector databases: Pinecone, Weaviate
  • File output: CSV, JSON for testing and validation
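
Switching destinations is a configuration change rather than a code change. The sketch below shows what an Elasticsearch indexer block might look like, by analogy with the Solr example later on this page; the property names inside the elasticsearch block are illustrative:
# Sketch of an Elasticsearch indexer configuration
# (property names are illustrative; see the indexer docs)
indexer {
  type: "Elasticsearch"
}

elasticsearch {
  url: "http://localhost:9200"
  index: "quickstart"
}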

Simple Example

Here’s a complete Lucille configuration that reads a CSV file and indexes it to Solr:
# Read CSV files
connectors: [
  {
    class: "com.kmwllc.lucille.connector.FileConnector",
    paths: ["conf/songs.csv"],
    name: "connector1",
    pipeline: "pipeline1",
    fileHandlers: {
      csv: { }
    }
  }
]

# Transform data (empty pipeline = no transformations)
pipelines: [
  {
    name: "pipeline1",
    stages: []
  }
]

# Send to Solr
indexer {
  type: "Solr"
}

solr {
  useCloudClient: true
  defaultCollection: "quickstart"
  url: ["http://localhost:8983/solr"]
}

Key Features

Multiple Search Engine Support

Lucille abstracts away the differences between search engines, letting you focus on your data:
  • Apache Solr (Cloud and Standalone)
  • Elasticsearch
  • OpenSearch
  • Pinecone
  • Weaviate

Document Model

Lucille represents data as Documents - the fundamental unit of search. Each Document:
  • Has a unique ID
  • Contains fields (single or multi-valued)
  • Supports common data types (String, Boolean, Integer, Double, Float, Long, Instant, byte[])
  • Can have child documents for nested data
  • Can be dropped from indexing based on conditions
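
Conceptually, a Document can be pictured as the JSON it serializes to. The example below is a hand-written sketch, not actual Lucille output; the field names and the exact representation of child documents may differ:
# Sketch of a Document's JSON form (hand-written, illustrative)
{
  "id": "song-42",                  # unique ID (required)
  "title": "Example Song",
  "genres": ["rock", "indie"],      # multi-valued field
  "duration_seconds": 215,
  "released": "2021-06-01T00:00:00Z"
}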

Flexible Deployment

Local Mode

All components run in a single JVM with in-memory queues - perfect for development and small datasets

Distributed Mode

Workers and Indexers run as separate processes communicating via Kafka - scales to handle massive data volumes
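
In distributed mode, the same config file is shared by the Worker and Indexer processes, with an added block pointing at the Kafka cluster. The property names below are an illustrative sketch; see the deployment documentation for the actual Kafka settings:
# Sketch of the extra configuration distributed mode needs
# (property names are illustrative)
kafka {
  bootstrapServers: "localhost:9092"
  consumerGroupId: "lucille-workers"
}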

Use Cases

Enterprise Search

Index documents from file systems, databases, and content management systems to power enterprise search applications.

E-commerce Search

Extract product data from databases, enrich it with images and embeddings, and index it to support faceted search and recommendations.

Log and Event Analytics

Stream log data through Kafka, parse and normalize, then index to Elasticsearch for real-time monitoring and analytics.

Vector Search & RAG

Generate embeddings from text documents and index to Pinecone or Weaviate to power semantic search and retrieval-augmented generation (RAG) applications.
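
A RAG-style job combines an embedding stage with a vector database indexer. Everything in the sketch below that is not shown elsewhere on this page — the stage class name, its parameters, and the pinecone block's properties — is hypothetical and used only to illustrate the shape of such a config:
# Sketch of an embedding pipeline feeding Pinecone
# (stage class and pinecone properties are hypothetical)
pipelines: [
  {
    name: "ragPipeline",
    stages: [
      {
        class: "com.kmwllc.lucille.stage.OpenAIEmbed",  # hypothetical stage name
        source: "text",
        dest: "embedding"
      }
    ]
  }
]

indexer {
  type: "Pinecone"
}

pinecone {
  apiKey: ${PINECONE_API_KEY}   # HOCON substitution from the environment
  index: "docs"
}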

Data Migration

Re-index and transform existing search indexes when upgrading search engines or changing schema.

What’s Next?

Quickstart

Get Lucille running in 5 minutes with a simple CSV to Solr example

Installation

Install Lucille and set up your development environment

Lucille is developed and maintained by KMW Technology, a search and machine learning consultancy with deep expertise in enterprise search solutions.