Indexers send processed documents to search engines or other destinations. Lucille currently supports one indexer per configuration, which is used for all pipelines.

Indexer Structure

The indexer configuration consists of two parts:
  1. General indexer settings - control batching, field handling, and deletion behavior
  2. Backend-specific settings - connection details for Solr, OpenSearch, Elasticsearch, etc.
indexer {
  type: "Solr"  # or "OpenSearch", "Elasticsearch", "CSV"
  batchSize: 100
  batchTimeout: 6000
  # ... other general settings
}

# Backend-specific configuration
solr {
  url: "http://localhost:8983/solr/collection1"
}

General Indexer Settings

indexer.type
string
Indexer type: Solr, OpenSearch, Elasticsearch, or CSV. Can be omitted if you provide indexer.class instead.
indexer.class
string
Fully qualified indexer implementation class. Use for plugins and custom implementations.
indexer {
  class: "com.kmwllc.lucille.pinecone.indexer.PineconeIndexer"
}
indexer.batchSize
number
default:"100"
Maximum number of documents in a batch before it is flushed to the destination
indexer.batchTimeout
number
default:"100"
Milliseconds since the previous add or flush before a batch is considered expired and flushed regardless of size
indexer.sendEnabled
boolean
default:"true"
Enable or disable indexing. Set to false for testing or when no indexer is required.
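For example, a minimal sketch of a test configuration that runs the pipeline without writing to any destination:
indexer {
  type: "Solr"
  sendEnabled: false  # documents flow through the pipeline but are never sent to Solr
}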

Field Handling

indexer.idOverrideField
string
Document field containing an ID to send to the index instead of the default document ID
indexer {
  idOverrideField: "custom_id"
}
indexer.indexOverrideField
string
Document field containing the destination index/collection name to use instead of the default
indexer {
  indexOverrideField: "target_index"
}
indexer.ignoreFields
list<string>
Fields that should never be sent to the destination
indexer {
  ignoreFields: ["temp_data", "internal_notes"]
}

Deletion Handling

Deletion features are supported in Solr, OpenSearch, and Pinecone indexers.
indexer.deletionMarkerField
string
Document field that indicates whether a document represents a deletion request
indexer {
  deletionMarkerField: "is_deleted"
  deletionMarkerFieldValue: "true"
}
indexer.deletionMarkerFieldValue
string
Value in deletionMarkerField that marks a document as a deletion request
indexer.deleteByFieldField
string
Document field containing the name of the field to use in a delete-by-query request. Only supported in Solr and OpenSearch indexers.
indexer.deleteByFieldValue
string
Document field containing the value to match in a delete-by-query request. Only supported in Solr and OpenSearch indexers.

Deletion Behavior

When a document has deletionMarkerField set to deletionMarkerFieldValue:
  1. Delete by ID (default): The document with the same ID is deleted from the index
  2. Delete by Query: If deleteByFieldField and deleteByFieldValue are also present, all documents matching that field/value are deleted
indexer {
  deletionMarkerField: "is_deleted"
  deletionMarkerFieldValue: "true"
}

# Document with id="doc123" and is_deleted="true"
# -> Deletes document with ID "doc123"
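The delete-by-query variant can be sketched as follows; the document field names and values shown (delete_field, delete_value, category, obsolete) are hypothetical:
indexer {
  deletionMarkerField: "is_deleted"
  deletionMarkerFieldValue: "true"
  deleteByFieldField: "delete_field"    # field holding the name of the field to query
  deleteByFieldValue: "delete_value"    # field holding the value to match
}

# Document with is_deleted="true", delete_field="category", delete_value="obsolete"
# -> Deletes all documents whose "category" field equals "obsolete"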

Solr Configuration

Lucille supports both basic Solr and SolrCloud configurations.

Basic Solr (HTTP2SolrClient)

indexer {
  type: "Solr"
}

solr {
  url: "http://localhost:8983/solr/collection1"
}
solr.url
string
required
Solr URL including the collection name (e.g., http://localhost:8983/solr/collection1)

SolrCloud with URL

indexer {
  type: "Solr"
}

solr {
  useCloudClient: true
  url: ["http://localhost:8983/solr"]  # URL should NOT include collection
  defaultCollection: "collection2"
}
solr.useCloudClient
boolean
default:"false"
Use CloudHTTP2SolrClient for SolrCloud deployments
solr.url
list<string>
required
One or more Solr base URLs. For SolrCloud, URLs should NOT include the collection name.
solr.defaultCollection
string
Default collection name for SolrCloud. Required when using useCloudClient.

SolrCloud with ZooKeeper

indexer {
  type: "Solr"
}

solr {
  useCloudClient: true
  zkHosts: ["zookeeper1:2181", "zookeeper2:2181", "zookeeper3:2181"]
  zkChroot: "/solr"  # Optional
  defaultCollection: "collection3"
}
solr.zkHosts
list<string>
ZooKeeper connection strings for SolrCloud. Alternative to url.
solr.zkChroot
string
ZooKeeper chroot path for Solr

Authentication and SSL

solr.userName
string
Username for HTTP basic authentication
solr.password
string
Password for HTTP basic authentication
solr.acceptInvalidCert
boolean
default:"false"
Allow invalid TLS certificates. Only enable for testing SSL/HTTPS against localhost.
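A sketch combining these settings; the host, username, and environment variable are placeholders, not values from the Lucille documentation:
solr {
  url: "https://solr.example.com:8983/solr/collection1"
  userName: "lucille"
  password: ${SOLR_PASSWORD}  # supply the password via an environment variable
  acceptInvalidCert: false
}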

OpenSearch Configuration

indexer {
  type: "OpenSearch"
}

opensearch {
  url: "https://admin:admin@localhost:9200"
  index: "my-index"
  acceptInvalidCert: false
}
opensearch.url
string
required
OpenSearch HTTP endpoint. Credentials can be included in the URL. Use environment variables for production:
opensearch {
  url: ${OPENSEARCH_URL}
}
opensearch.index
string
required
Target OpenSearch index name. Can be overridden per-document using indexer.indexOverrideField.
opensearch.update
boolean
default:"false"
Use partial update API instead of index/replace operation
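For example, a sketch that applies partial updates, merging incoming fields into existing documents rather than replacing them (endpoint and index name are placeholders):
opensearch {
  url: "https://localhost:9200"
  index: "my-index"
  update: true  # use the partial update API instead of index/replace
}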
opensearch.acceptInvalidCert
boolean
default:"false"
Allow invalid TLS certificates. Only enable for testing SSL/HTTPS against localhost.
indexer.routingField
string
Document field that supplies the routing key for OpenSearch
indexer.versionType
string
Versioning type when using external versions (e.g., EXTERNAL)
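A sketch combining both settings; tenant_id is a hypothetical document field assumed to hold the routing key:
indexer {
  type: "OpenSearch"
  routingField: "tenant_id"  # route documents to shards by this field's value
  versionType: "EXTERNAL"    # honor externally supplied document versions
}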

Environment Variable Example

indexer {
  type: "OpenSearch"
}

opensearch {
  url: "https://localhost:9200"
  url: ${?OPENSEARCH_URL}  # Override if env var is set
  index: "default-index"
  index: ${?OPENSEARCH_INDEX}
  acceptInvalidCert: false
}

Elasticsearch Configuration

indexer {
  type: "Elasticsearch"
}

elastic {
  url: "http://localhost:9200"
  index: "my-index"
  type: "lucille-type"
}
elastic.url
string
required
Elasticsearch HTTP endpoint
elastic.index
string
required
Target Elasticsearch index name
elastic.type
string
Document type (deprecated in newer Elasticsearch versions)
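Since document types are deprecated in newer Elasticsearch versions, a sketch for Elasticsearch 7+ simply omits type:
elastic {
  url: "http://localhost:9200"
  index: "my-index"
  # type omitted; document types are deprecated in Elasticsearch 7+
}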

CSV Indexer

Outputs documents to CSV files instead of a search engine. Useful for testing and data export.
indexer {
  type: "CSV"
}
CSV indexer configuration is minimal. Documents are written to CSV format based on their fields.

Plugin Indexers

Lucille supports plugin indexers for additional destinations:

Pinecone Example

indexer {
  class: "com.kmwllc.lucille.pinecone.indexer.PineconeIndexer"
  batchSize: 100
  batchTimeout: 5000
}

pinecone {
  apiKey: ${PINECONE_API_KEY}
  environment: "us-west1-gcp"
  indexName: "my-vectors"
}

Weaviate Example

indexer {
  class: "com.kmwllc.lucille.weaviate.indexer.WeaviateIndexer"
}

weaviate {
  host: "localhost:8080"
  scheme: "http"
  className: "Document"
}

Complete Example

indexer {
  type: "Solr"
  batchSize: 100
  batchTimeout: 6000
  
  # Don't send these fields
  ignoreFields: ["temp_field", "debug_info"]
  
  # Support deletion requests
  deletionMarkerField: "is_deleted"
  deletionMarkerFieldValue: "true"
}

solr {
  url: "http://localhost:8983/solr/documents"
}

Performance Tuning

Larger batch sizes reduce network overhead but increase memory usage:
  • Small batches (10-50): Lower latency, more network requests
  • Medium batches (100-250): Balanced for most use cases
  • Large batches (500-1000): Better throughput for bulk indexing
indexer {
  batchSize: 250  # Adjust based on document size and network
}
batchTimeout controls how long to wait before flushing a partial batch:
  • Short timeout (100-1000ms): Lower latency, more frequent flushes
  • Long timeout (5000-10000ms): Better batching, higher latency
indexer {
  batchTimeout: 3000  # 3 seconds
}
Remove unnecessary fields before indexing to reduce payload size:
indexer {
  ignoreFields: [
    "temp_*",      # Temporary processing fields
    "debug_*",     # Debug information
    "raw_content"  # Original content after extraction
  ]
}

Troubleshooting

If connection errors occur, verify the URL is correct and the search engine is accessible:
# Test Solr
curl http://localhost:8983/solr/

# Test OpenSearch
curl -k https://admin:admin@localhost:9200

# Test Elasticsearch
curl http://localhost:9200
Check acceptInvalidCert if using self-signed certificates.
If documents are not appearing in the index, common causes include:
  1. Indexing disabled: Check sendEnabled: true
  2. Wrong index: Verify collection/index name
  3. Field mapping issues: Check destination schema
  4. Deletion marker: Ensure documents don't have the deletion marker set
Enable debug logging to see indexing activity.
If indexing is slow:
  • Increase batchSize for better throughput
  • Reduce batchTimeout if documents appear slowly
  • Check network latency to search engine
  • Monitor search engine resource usage

Next Steps

Kafka Configuration

Enable distributed mode with Kafka

Running Lucille

Execute your configured pipeline