Indexers send processed documents to search engines or other destinations. Lucille currently supports one indexer per configuration, which is used for all pipelines.

Indexer Structure

The indexer configuration consists of two parts:
  1. General indexer settings - control batching, field handling, and deletion behavior
  2. Backend-specific settings - connection details for Solr, OpenSearch, Elasticsearch, etc.
indexer {
  type: "Solr"  # or "OpenSearch", "Elasticsearch", "CSV"
  batchSize: 100
  batchTimeout: 6000
  # ... other general settings
}

# Backend-specific configuration
solr {
  url: "http://localhost:8983/solr/collection1"
}

General Indexer Settings

indexer.type
string
Indexer type: Solr, OpenSearch, Elasticsearch, or CSV. Can be omitted if you provide indexer.class instead.
indexer.class
string
Fully qualified indexer implementation class. Use for plugins and custom implementations.
indexer {
  class: "com.kmwllc.lucille.pinecone.indexer.PineconeIndexer"
}
indexer.batchSize
number
default:"100"
Maximum number of documents in a batch before it is flushed to the destination
indexer.batchTimeout
number
default:"100"
Milliseconds since the previous add or flush before a batch is considered expired and flushed regardless of size
indexer.sendEnabled
boolean
default:"true"
Enable or disable indexing. Set to false for testing or when no indexer is required.
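For example, a minimal sketch of a test configuration that runs the pipeline without writing to any destination:
indexer {
  type: "Solr"
  sendEnabled: false  # documents flow through the pipeline but are never sent to Solr
}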

Field Handling

indexer.idOverrideField
string
Document field containing an ID to send to the index instead of the default document ID
indexer {
  idOverrideField: "custom_id"
}
indexer.indexOverrideField
string
Document field containing the destination index/collection name to use instead of the default
indexer {
  indexOverrideField: "target_index"
}
indexer.ignoreFields
list<string>
Fields that should never be sent to the destination
indexer {
  ignoreFields: ["temp_data", "internal_notes"]
}

Deletion Handling

Deletion features are supported in Solr, OpenSearch, and Pinecone indexers.
indexer.deletionMarkerField
string
Document field that indicates whether a document represents a deletion request
indexer {
  deletionMarkerField: "is_deleted"
  deletionMarkerFieldValue: "true"
}
indexer.deletionMarkerFieldValue
string
Value in deletionMarkerField that marks a document as a deletion request
indexer.deleteByFieldField
string
Document field containing the name of the field to use in a delete-by-query request. Only supported in Solr and OpenSearch indexers.
indexer.deleteByFieldValue
string
Document field containing the value to match in a delete-by-query request. Only supported in Solr and OpenSearch indexers.

Deletion Behavior

When a document has deletionMarkerField set to deletionMarkerFieldValue:
  1. Delete by ID (default): The document with the same ID is deleted from the index
  2. Delete by Query: If deleteByFieldField and deleteByFieldValue are also present, all documents matching that field/value are deleted
indexer {
  deletionMarkerField: "is_deleted"
  deletionMarkerFieldValue: "true"
}

# Document with id="doc123" and is_deleted="true"
# -> Deletes document with ID "doc123"
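The delete-by-query variant can be sketched as follows; the document field names and values shown (delete_field, delete_value, category, obsolete) are hypothetical:
indexer {
  deletionMarkerField: "is_deleted"
  deletionMarkerFieldValue: "true"
  deleteByFieldField: "delete_field"    # field holding the name of the field to query
  deleteByFieldValue: "delete_value"    # field holding the value to match
}

# Document with is_deleted="true", delete_field="category", delete_value="obsolete"
# -> Deletes all documents whose "category" field equals "obsolete"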

Solr Configuration

Lucille supports both basic Solr and SolrCloud configurations.

Basic Solr (HTTP2SolrClient)

indexer {
  type: "Solr"
}

solr {
  url: "http://localhost:8983/solr/collection1"
}
solr.url
string
required
Solr URL including the collection name (e.g., http://localhost:8983/solr/collection1)

SolrCloud with URL

indexer {
  type: "Solr"
}

solr {
  useCloudClient: true
  url: ["http://localhost:8983/solr"]  # URL should NOT include collection
  defaultCollection: "collection2"
}
solr.useCloudClient
boolean
default:"false"
Use CloudHTTP2SolrClient for SolrCloud deployments
solr.url
list<string>
required
One or more Solr base URLs. For SolrCloud, URLs should NOT include the collection name.
solr.defaultCollection
string
Default collection name for SolrCloud. Required when using useCloudClient.

SolrCloud with ZooKeeper

indexer {
  type: "Solr"
}

solr {
  useCloudClient: true
  zkHosts: ["zookeeper1:2181", "zookeeper2:2181", "zookeeper3:2181"]
  zkChroot: "/solr"  # Optional
  defaultCollection: "collection3"
}
solr.zkHosts
list<string>
ZooKeeper connection strings for SolrCloud. Alternative to url.
solr.zkChroot
string
ZooKeeper chroot path for Solr

Authentication and SSL

solr.userName
string
Username for HTTP basic authentication
solr.password
string
Password for HTTP basic authentication
solr.acceptInvalidCert
boolean
default:"false"
Allow invalid TLS certificates. Only enable for testing SSL/HTTPS against localhost.
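A sketch combining these settings; the host, username, and environment variable are placeholders, not values from the Lucille documentation:
solr {
  url: "https://solr.example.com:8983/solr/collection1"
  userName: "lucille"
  password: ${SOLR_PASSWORD}  # supply the password via an environment variable
  acceptInvalidCert: false
}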

OpenSearch Configuration

indexer {
  type: "OpenSearch"
}

opensearch {
  url: "https://admin:admin@localhost:9200"
  index: "my-index"
  acceptInvalidCert: false
}
opensearch.url
string
required
OpenSearch HTTP endpoint. Credentials can be included in the URL. Use environment variables for production:
opensearch {
  url: ${OPENSEARCH_URL}
}
opensearch.index
string
required
Target OpenSearch index name. Can be overridden per-document using indexer.indexOverrideField.
opensearch.update
boolean
default:"false"
Use partial update API instead of index/replace operation
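For example, a sketch that applies partial updates, merging incoming fields into existing documents rather than replacing them (endpoint and index name are placeholders):
opensearch {
  url: "https://localhost:9200"
  index: "my-index"
  update: true  # use the partial update API instead of index/replace
}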
opensearch.acceptInvalidCert
boolean
default:"false"
Allow invalid TLS certificates. Only enable for testing SSL/HTTPS against localhost.
indexer.routingField
string
Document field that supplies the routing key for OpenSearch
indexer.versionType
string
Versioning type when using external versions (e.g., EXTERNAL)
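A sketch combining both settings; tenant_id is a hypothetical document field assumed to hold the routing key:
indexer {
  type: "OpenSearch"
  routingField: "tenant_id"  # route documents to shards by this field's value
  versionType: "EXTERNAL"    # honor externally supplied document versions
}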

Environment Variable Example

indexer {
  type: "OpenSearch"
}

opensearch {
  url: "https://localhost:9200"
  url: ${?OPENSEARCH_URL}  # Override if env var is set
  index: "default-index"
  index: ${?OPENSEARCH_INDEX}
  acceptInvalidCert: false
}

Elasticsearch Configuration

indexer {
  type: "Elasticsearch"
}

elastic {
  url: "http://localhost:9200"
  index: "my-index"
  type: "lucille-type"
}
elastic.url
string
required
Elasticsearch HTTP endpoint
elastic.index
string
required
Target Elasticsearch index name
elastic.type
string
Document type (deprecated in newer Elasticsearch versions)
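Since document types are deprecated in newer Elasticsearch versions, a sketch for Elasticsearch 7+ simply omits type:
elastic {
  url: "http://localhost:9200"
  index: "my-index"
  # type omitted; document types are deprecated in Elasticsearch 7+
}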

CSV Indexer

Outputs documents to CSV files instead of a search engine. Useful for testing and data export.
indexer {
  type: "CSV"
}
CSV indexer configuration is minimal. Documents are written to CSV format based on their fields.

Plugin Indexers

Lucille supports plugin indexers for additional destinations:

Pinecone Example

indexer {
  class: "com.kmwllc.lucille.pinecone.indexer.PineconeIndexer"
  batchSize: 100
  batchTimeout: 5000
}

pinecone {
  apiKey: ${PINECONE_API_KEY}
  environment: "us-west1-gcp"
  indexName: "my-vectors"
}

Weaviate Example

indexer {
  class: "com.kmwllc.lucille.weaviate.indexer.WeaviateIndexer"
}

weaviate {
  host: "localhost:8080"
  scheme: "http"
  className: "Document"
}

Complete Example

indexer {
  type: "Solr"
  batchSize: 100
  batchTimeout: 6000
  
  # Don't send these fields
  ignoreFields: ["temp_field", "debug_info"]
  
  # Support deletion requests
  deletionMarkerField: "is_deleted"
  deletionMarkerFieldValue: "true"
}

solr {
  url: "http://localhost:8983/solr/documents"
}

Performance Tuning

Larger batch sizes reduce network overhead but increase memory usage:
  • Small batches (10-50): Lower latency, more network requests
  • Medium batches (100-250): Balanced for most use cases
  • Large batches (500-1000): Better throughput for bulk indexing
indexer {
  batchSize: 250  # Adjust based on document size and network
}
batchTimeout controls how long to wait before flushing a partial batch:
  • Short timeout (100-1000ms): Lower latency, more frequent flushes
  • Long timeout (5000-10000ms): Better batching, higher latency
indexer {
  batchTimeout: 3000  # 3 seconds
}
Remove unnecessary fields before indexing to reduce payload size:
indexer {
  ignoreFields: [
    "temp_*",      # Temporary processing fields
    "debug_*",     # Debug information
    "raw_content"  # Original content after extraction
  ]
}

Troubleshooting

If connection errors occur, verify the URL is correct and the search engine is accessible:
# Test Solr
curl http://localhost:8983/solr/

# Test OpenSearch
curl -k https://admin:admin@localhost:9200

# Test Elasticsearch
curl http://localhost:9200
Check acceptInvalidCert if using self-signed certificates.
If documents are not appearing in the index, common causes include:
  1. Indexing disabled: Check sendEnabled: true
  2. Wrong index: Verify collection/index name
  3. Field mapping issues: Check destination schema
  4. Deletion marker: Ensure documents don't have the deletion marker set
Enable debug logging to see indexing activity.
If indexing is slow:
  • Increase batchSize for better throughput
  • Reduce batchTimeout if documents appear slowly
  • Check network latency to search engine
  • Monitor search engine resource usage

Next Steps

Kafka Configuration

Enable distributed mode with Kafka

Running Lucille

Execute your configured pipeline