Overview
CSVIndexer writes processed documents to CSV (Comma-Separated Values) files. It’s primarily used for testing pipelines, exporting data for analysis, and creating human-readable output of document processing results.
Location: com.kmwllc.lucille.indexer.CSVIndexer
Use Cases
- Testing: Verify pipeline transformations by inspecting CSV output
- Data Export: Extract processed data for external analysis tools
- Debugging: Human-readable output for troubleshooting pipelines
- Reporting: Generate reports from document processing results
- File-to-File ETL: Transform data from one file format to CSV
Configuration
CSV-Specific Parameters
- Output CSV file path. Can be absolute or relative.
- csv.columns: Ordered list of document field names to write as CSV columns. Only these fields will be included in the output.
- Whether to write a header row with column names when creating/opening the file.
- Whether to open the CSV file in append mode instead of overwriting. Useful for incremental exports.
Common Indexer Parameters
- indexer.batchSize: Number of documents to buffer before writing to disk. Higher values improve throughput.
- Fields to exclude from output (not commonly used with CSVIndexer, since columns are explicitly specified).
CSVIndexer does not support indexer.indexOverrideField. Each indexer instance writes to a single CSV file.
Examples
Basic CSV Export
Export specific fields to a CSV file named results.csv:
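The original example is not reproduced here, but a minimal config sketch might look like the following. csv.columns is the documented setting; the other keys (csv.path, csv.includeHeader, and the indexer type value) are assumed names and should be verified against the CSVIndexer source:

```
# Minimal sketch; csv.path, csv.includeHeader, and the "csv" type value
# are assumed names -- verify against your Lucille version.
indexer {
  type: "csv"
  batchSize: 100
}

csv {
  path: "results.csv"
  columns: ["id", "title", "score"]
  includeHeader: true
}
```

With a header row enabled, results.csv would then begin with the header line id,title,score.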
File-to-File Transformation
Transform data from one format to CSV:
Append Mode for Incremental Exports
Append new documents to an existing CSV file:
When using append mode, ensure the columns match the existing file’s structure. The header will not be written if the file already exists.
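A sketch of the append-mode configuration described above, assuming a csv.append flag (csv.columns is documented; the other keys are assumed names):

```
# Append-mode sketch; csv.path and csv.append are assumed parameter names.
csv {
  path: "daily-export.csv"
  columns: ["id", "timestamp", "status"]
  append: true
}
```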
Testing Pipeline Output
Export documents with transformation stages applied, for manual inspection.
Behavior
Field Handling
- Missing fields: If a document lacks a specified column field, an empty value is written
- Multi-valued fields: Arrays are written as comma-separated values within quoted strings
- Nested fields: Complex objects are serialized to JSON strings
- Field order: Columns appear in the CSV in the order specified in csv.columns
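Taken together, these rules mean a hypothetical document with a missing, a multi-valued, and a nested field (the field names below are invented for illustration) would be written roughly as:

```
Document:    { "id": "doc1", "tags": ["a", "b"], "meta": { "lang": "en" } }
csv.columns: ["id", "title", "tags", "meta"]
CSV row:     doc1,,"a,b","{\"lang\":\"en\"}"
```

The exact quoting of the nested JSON depends on the escape settings described under CSV Format.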
File Management
- File creation: The CSV file is created when the indexer connects
- Overwrite mode (default): Existing files are replaced
- Append mode: New rows are added to existing files without a new header
- Buffering: Documents are buffered according to batchSize before flushing to disk
CSV Format
CSVIndexer uses OpenCSV with these settings:
- Separator: Comma (,)
- Quote character: Double quote (")
- Escape character: Backslash (\)
- Line ending: System default (Windows: \r\n, Unix: \n)
Performance
Optimization Tips
- Increase batch size: Higher batchSize reduces disk I/O
- Limit columns: Only export fields you need
- Use SSD storage: CSV writing benefits from fast sequential write speeds
- Avoid nested objects: Flatten complex fields before indexing for better performance
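For example, raising the batch size is a one-line config change (sketch; the value shown is arbitrary, not a recommended default):

```
# Sketch: larger batches mean fewer, larger flushes to disk.
indexer {
  batchSize: 500
}
```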
Throughput
Typical performance characteristics:
- Simple documents (5-10 fields): 10,000-50,000 docs/sec
- Complex documents (50+ fields): 1,000-5,000 docs/sec
- Bottleneck is usually disk I/O, not CPU
Common Patterns
Data Quality Validation
Export data for manual inspection.
Reporting Pipeline Results
Generate summary reports from processed documents.
A/B Testing Pipeline Configurations
Compare pipeline results side-by-side.
Troubleshooting
File Permission Errors
If you encounter “Permission denied” errors:
- Ensure the output directory exists and is writable
- Check file permissions if appending to existing files
- Use absolute paths to avoid working directory issues
Memory Issues with Large Files
CSVIndexer buffers documents in memory:
- Reduce indexer.batchSize if running out of memory
- Process in smaller batches by splitting source data
- Monitor heap usage with JVM flags
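As an illustration, heap limits and GC logging can be set when launching the runner (the entry-point class shown is an assumption; adjust to how you actually start Lucille):

```
# Illustrative launch command; verify the main class for your Lucille version.
java -Xmx2g -Xlog:gc -cp lucille.jar com.kmwllc.lucille.core.Runner -config pipeline.conf
```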
Column Order Mismatches
When appending to existing files:
- Ensure csv.columns exactly matches the existing file structure
- Consider using separate output files for different schemas
- Validate existing CSV structure before running in append mode
Next Steps
- SequenceConnector: Generate test documents for CSV export
- Document Generation Guide: Create realistic test data with random stages
- Solr Indexer: Index to Apache Solr for production search