What are Indexers?
Indexers are responsible for sending processed documents from Lucille to their final destination, typically a search engine or vector database. After documents have been extracted from sources and transformed by stages, indexers handle the actual writing of data to external systems.
Core Concepts
Indexer Interface
All indexers extend the Indexer base class and implement:
- Connection validation: Verify connectivity to the target system
- Batch processing: Send documents in configurable batches for efficiency
- Error handling: Track failed documents and provide detailed error messages
- Connection management: Properly open and close connections
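The four responsibilities above can be pictured as a small abstract contract. The following is a minimal Python sketch of that shape, not Lucille's actual Java API: the method names `validate_connection`, `send_batch`, and `close`, and the `InMemoryIndexer` class, are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class Indexer(ABC):
    """Hypothetical base class; illustrative only, not Lucille's Java API."""

    def __init__(self, batch_size: int = 100):
        self.batch_size = batch_size

    @abstractmethod
    def validate_connection(self) -> bool:
        """Return True if the target system is reachable."""

    @abstractmethod
    def send_batch(self, docs: list) -> list:
        """Send one batch; return (doc_id, error_message) for each failure."""

    @abstractmethod
    def close(self) -> None:
        """Release any open connections."""


class InMemoryIndexer(Indexer):
    """Toy destination that stores documents in a dict keyed by ID."""

    def __init__(self, batch_size: int = 100):
        super().__init__(batch_size)
        self.store = {}
        self.is_open = True

    def validate_connection(self) -> bool:
        return self.is_open

    def send_batch(self, docs: list) -> list:
        failures = []
        for doc in docs:
            doc_id = doc.get("id")
            if doc_id is None:
                # Record the failure and keep going with the rest of the batch.
                failures.append(("<missing>", "document has no id"))
                continue
            self.store[doc_id] = doc
        return failures

    def close(self) -> None:
        self.is_open = False
```

A concrete indexer for a real system would replace the dict with calls to that system's client library.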
Common Configuration
All indexers support these common configuration parameters:
- Number of documents to send in each batch request
- List of document fields to exclude from indexing
- Document field to use as the ID instead of the default document ID
- Document field that specifies which index/collection to send the document to
- Field name that marks a document for deletion
- Value that indicates the document should be deleted
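To make the effect of these settings concrete, here is a rough Python sketch of how they might combine for a single document. Every attribute name in `IndexerSettings` and the `plan_operation` helper are hypothetical; they are not Lucille's configuration keys, and the real indexers may apply the settings differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IndexerSettings:
    # All attribute names are illustrative, not Lucille's config keys.
    batch_size: int = 100
    ignore_fields: list = field(default_factory=list)
    id_override_field: Optional[str] = None
    index_override_field: Optional[str] = None
    deletion_marker_field: Optional[str] = None
    deletion_marker_value: Optional[str] = None


def plan_operation(doc: dict, settings: IndexerSettings, default_index: str) -> dict:
    """Describe what would be sent to the target system for one document."""
    s = settings
    # Use the override field as the ID when it is configured and present.
    doc_id = doc["id"]
    if s.id_override_field and s.id_override_field in doc:
        doc_id = doc[s.id_override_field]
    # Route to a per-document index when the routing field is present.
    index = default_index
    if s.index_override_field and s.index_override_field in doc:
        index = doc[s.index_override_field]
    # A document is a delete when the marker field holds the marker value.
    is_delete = (
        s.deletion_marker_field is not None
        and s.deletion_marker_field in doc
        and doc[s.deletion_marker_field] == s.deletion_marker_value
    )
    if is_delete:
        return {"action": "delete", "index": index, "id": doc_id}
    # Drop excluded fields before indexing.
    body = {k: v for k, v in doc.items() if k not in s.ignore_fields}
    return {"action": "index", "index": index, "id": doc_id, "body": body}
```

For example, with `id_override_field="sku"`, a document `{"id": "1", "sku": "A-9"}` would be planned as an index operation with ID `A-9`.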
Available Indexers
Search Engine Indexers
Solr
Index to Apache Solr with support for SolrCloud
OpenSearch
Send documents to OpenSearch clusters
Elasticsearch
Index to Elasticsearch with join support
Vector Database Indexers
Pinecone
Vector database indexer for embeddings
Weaviate
Vector search with object-based schema
Utility Indexers
CSV Indexer
Export documents to CSV files for testing and data export
Deletion Handling
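Indexers support marking documents for deletion using marker fields: a document is treated as a delete when its deletion marker field holds the configured deletion value.
Batch Processing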
Indexers process documents in batches for efficiency:
- Documents accumulate until batch size is reached
- Batch is sent to the target system
- Failed documents are tracked and can be retried
- Successful documents are acknowledged
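The batching cycle described above can be sketched as a small accumulator. This is an illustrative Python sketch under stated assumptions: the `Batcher` class and its `send` callback (which returns the IDs that failed) are hypothetical and do not mirror Lucille's internals.

```python
from typing import Callable


class Batcher:
    """Accumulates documents and flushes them in fixed-size batches."""

    def __init__(self, batch_size: int, send: Callable[[list], list]):
        self.batch_size = batch_size
        self.send = send  # hypothetical callback: returns IDs that failed
        self.pending = []
        self.failed_ids = []
        self.succeeded_ids = []

    def add(self, doc: dict) -> None:
        self.pending.append(doc)
        # Send as soon as the batch is full.
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        failed = set(self.send(batch))
        # Track failures for a later retry; acknowledge everything else.
        for doc in batch:
            target = self.failed_ids if doc["id"] in failed else self.succeeded_ids
            target.append(doc["id"])
```

In this sketch the caller must invoke `flush()` once more at the end of a run so a partially filled batch is not left unsent.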
Error Handling
Indexers track failed documents and return detailed error information:
- Document ID
- Error message from the target system
- Ability to continue processing other documents
- Optional retry logic
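One way to picture this behavior is a loop that records each failure with its document ID and message, keeps going, and retries only errors judged transient. The Python sketch below is an assumption-laden illustration: `FailedDocument`, `TransientError`, `index_all`, and the `send_one` callback are all hypothetical names, not part of Lucille.

```python
import time
from dataclasses import dataclass


@dataclass
class FailedDocument:
    doc_id: str
    message: str


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying."""


def index_all(docs, send_one, max_retries=2, backoff_seconds=0.0):
    """Send each document, retrying transient errors and recording the rest."""
    failures = []
    for doc in docs:
        attempt = 0
        while True:
            try:
                send_one(doc)
                break
            except TransientError as exc:
                if attempt >= max_retries:
                    failures.append(FailedDocument(doc["id"], str(exc)))
                    break
                attempt += 1
                # Wait a little longer before each successive retry.
                time.sleep(backoff_seconds * attempt)
            except Exception as exc:
                # Non-transient: record it and move on to the next document.
                failures.append(FailedDocument(doc["id"], str(exc)))
                break
    return failures
```

The returned list gives a caller enough detail to log, alert on, or resubmit the failed documents.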
Connection Validation
Before processing begins, indexers validate connectivity:
- Ping the target system
- Verify authentication credentials
- Check cluster health (for distributed systems)
- Validate index/collection existence
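These checks can be thought of as an ordered list that stops at the first failure. The sketch below shows that idea in Python; the `validate_connection` helper and the check callables passed to it are hypothetical stand-ins for whatever the target system's client actually offers.

```python
def validate_connection(checks):
    """Run named checks in order; return (ok, message) for the first failure."""
    for name, check in checks:
        try:
            if not check():
                return False, f"{name} check failed"
        except Exception as exc:
            # A check that raises is treated the same as one that fails.
            return False, f"{name} check raised: {exc}"
    return True, "all checks passed"
```

A caller would pass pairs such as `("ping", client_ping)` and `("index exists", index_exists)`, where both callables are supplied by the caller and return a boolean.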
Best Practices
Choose appropriate batch sizes
- Larger batches (500-1000) for high-throughput scenarios
- Smaller batches (50-100) when documents are large or processing is complex
- Monitor memory usage and adjust accordingly
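As a rough starting point, a batch size can be derived from a memory budget and the average document size, then clamped to the ranges above. This Python sketch is only a heuristic under assumed inputs; the `suggest_batch_size` function and its parameters are hypothetical, and real tuning should rely on measurement.

```python
def suggest_batch_size(avg_doc_bytes, memory_budget_bytes, low=50, high=1000):
    """Fit as many documents as the budget allows, clamped to [low, high]."""
    if avg_doc_bytes <= 0:
        raise ValueError("avg_doc_bytes must be positive")
    fit = memory_budget_bytes // avg_doc_bytes
    return max(low, min(high, fit))
```

With a 10 MB budget, 50 KB documents would give a batch size of 200, while 2 KB documents would be capped at 1000.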
Handle failures gracefully
- Use deletion markers for removing documents
- Configure retry logic for transient failures
- Monitor failed document counts
- Set up alerts for indexing errors
Optimize field mappings
- Use ignoreFields to exclude unnecessary data
- Map document fields to index schema appropriately
- Consider field size limits in the target system
Connection management
- Keep connections open during pipeline execution
- Configure appropriate timeouts
- Use connection pooling when available
- Implement proper shutdown procedures
Next Steps
Solr Indexer
Configure Apache Solr indexing
OpenSearch Indexer
Set up OpenSearch indexing