Overview

CSVIndexer writes processed documents to CSV (Comma-Separated Values) files. It’s primarily used for testing pipelines, exporting data for analysis, and creating human-readable output of document processing results.

Location: com.kmwllc.lucille.indexer.CSVIndexer

Use Cases

  • Testing: Verify pipeline transformations by inspecting CSV output
  • Data Export: Extract processed data for external analysis tools
  • Debugging: Human-readable output for troubleshooting pipelines
  • Reporting: Generate reports from document processing results
  • File-to-File ETL: Transform data from one file format to CSV

Configuration

CSV-Specific Parameters

csv.path (String, required)
Output CSV file path. Can be absolute or relative.

csv.columns (List<String>, required)
Ordered list of document field names to write as CSV columns. Only these fields will be included in the output.

csv.includeHeader (Boolean, default: true)
Whether to write a header row with column names when creating/opening the file.

csv.append (Boolean, default: false)
Open the CSV file in append mode instead of overwriting. Useful for incremental exports.

Common Indexer Parameters

indexer.batchSize (Integer, default: 100)
Number of documents to buffer before writing to disk. Higher values improve throughput.

indexer.ignoreFields (List<String>, optional)
Fields to exclude from output (not commonly used with CSVIndexer since columns are explicitly specified).

Note: CSVIndexer does not support indexer.indexOverrideField. Each indexer instance writes to a single CSV file.

Examples

Basic CSV Export

Export specific fields to a CSV file:
indexer {
  type: "CSVIndexer"
  batchSize: 1000
}

csv {
  path: "output/results.csv"
  columns: ["id", "title", "author", "publish_date"]
  includeHeader: true
}
Output results.csv:
id,title,author,publish_date
1,First Document,John Doe,2024-01-15
2,Second Document,Jane Smith,2024-01-16
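Because includeHeader is true, the output can be read back by column name with any standard CSV reader. A minimal sketch (Python standard library; the string below stands in for the results.csv shown above):

```python
import csv
import io

# Stand-in for the output/results.csv content shown above.
sample = """id,title,author,publish_date
1,First Document,John Doe,2024-01-15
2,Second Document,Jane Smith,2024-01-16
"""

# DictReader keys each row by the header written via includeHeader: true.
rows = list(csv.DictReader(io.StringIO(sample)))
print(len(rows))         # number of data rows
print(rows[0]["title"])  # field looked up by column name
```

In a real run you would open the configured csv.path instead of the in-memory sample.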

File-to-File Transformation

Transform data from one format to CSV:
connectors: [
  {
    name: "json-source"
    class: "com.kmwllc.lucille.connector.FileConnector"
    pipeline: "transform-pipeline"
    paths: ["input/data.json"]
    fileOptions {
      json {
        filenameField: "source_file"
      }
    }
  }
]

pipelines: [
  {
    name: "transform-pipeline"
    stages: [
      {
        class: "com.kmwllc.lucille.stage.CopyFields"
        fieldMapping: {
          raw_title: "title"
          raw_author: "author"
        }
      },
      {
        class: "com.kmwllc.lucille.stage.DeleteFields"
        fields: ["raw_title", "raw_author"]
      }
    ]
  }
]

indexer {
  type: "CSVIndexer"
}

csv {
  path: "output/transformed.csv"
  columns: ["title", "author", "source_file"]
}

Append Mode for Incremental Exports

Append new documents to an existing CSV file:
csv {
  path: "output/cumulative.csv"
  columns: ["timestamp", "event_type", "user_id", "action"]
  includeHeader: true
  append: true
}
When using append mode, ensure the columns match the existing file’s structure. The header will not be written if the file already exists.
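One way to guard against schema drift before an append run is to compare the configured csv.columns against the existing file’s header. A sketch of such a check (Python; header_matches is a hypothetical helper, not part of Lucille):

```python
import csv
import os
import tempfile

def header_matches(path, columns):
    """True if the file is absent (safe to create) or its header row
    exactly matches the configured column list."""
    if not os.path.exists(path):
        return True
    with open(path, newline="") as f:
        header = next(csv.reader(f), None)
    return header == list(columns)

# Demonstration against a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "cumulative.csv")
cols = ["timestamp", "event_type", "user_id", "action"]

ok_missing = header_matches(path, cols)        # no file yet

with open(path, "w", newline="") as f:
    csv.writer(f).writerow(cols)               # write a matching header

ok_same = header_matches(path, cols)           # header matches config
ok_diff = header_matches(path, ["timestamp", "event_type"])  # mismatch
```

Running such a check before starting the indexer avoids silently interleaving rows with two different schemas.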

Testing Pipeline Output

Export documents with transformation stages for inspection:
connectors: [
  {
    name: "test-data"
    class: "com.kmwllc.lucille.connector.SequenceConnector"
    pipeline: "test-pipeline"
    numDocs: 100
  }
]

pipelines: [
  {
    name: "test-pipeline"
    stages: [
      {
        class: "com.kmwllc.lucille.stage.AddRandomString"
        field_name: "product_name"
        length: 20
      },
      {
        class: "com.kmwllc.lucille.stage.AddRandomInt"
        field_name: "price"
        min: 10
        max: 1000
      },
      {
        class: "com.kmwllc.lucille.stage.AddRandomBoolean"
        field_name: "in_stock"
        percent_true: 75
      }
    ]
  }
]

indexer {
  type: "CSVIndexer"
  batchSize: 50
}

csv {
  path: "test-output.csv"
  columns: ["id", "product_name", "price", "in_stock"]
}

Behavior

Field Handling

  • Missing fields: If a document lacks a specified column field, an empty value is written
  • Multi-valued fields: Arrays are written as comma-separated values within quoted strings
  • Nested fields: Complex objects are serialized to JSON strings
  • Field order: Columns appear in the CSV in the order specified in csv.columns
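Given those serialization rules, a reader can recover multi-valued and nested fields. A sketch (Python; the sample row is illustrative and uses RFC 4180-style doubled quotes — actual escaping follows the OpenCSV settings listed under CSV Format):

```python
import csv
import io
import json

# One illustrative row: a multi-valued "tags" field quoted with embedded
# commas, and a nested "meta" object serialized as a JSON string.
sample = '''id,tags,meta
1,"red,green,blue","{""lang"": ""en"", ""score"": 3}"
'''

row = next(csv.DictReader(io.StringIO(sample)))
tags = row["tags"].split(",")   # quoted comma-separated values
meta = json.loads(row["meta"])  # JSON-serialized nested object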

File Management

  • File creation: The CSV file is created when the indexer connects
  • Overwrite mode (default): Existing files are replaced
  • Append mode: New rows are added to existing files without a new header
  • Buffering: Documents are buffered according to batchSize before flushing to disk

CSV Format

CSVIndexer uses OpenCSV with these settings:
  • Separator: Comma (,)
  • Quote character: Double quote (")
  • Escape character: Backslash (\)
  • Line ending: System default (Windows: \r\n, Unix: \n)
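Python’s csv module can be configured to mirror these settings when reading CSVIndexer output. A sketch (whether backslash escapes actually appear in a given file depends on the data):

```python
import csv
import io

# A line using the settings above: comma separator, double-quote
# quoting, backslash-escaped quotes inside a quoted field.
sample = '"He said \\"hi\\"",2\n'

reader = csv.reader(io.StringIO(sample),
                    delimiter=",", quotechar='"', escapechar="\\")
row = next(reader)
```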

Performance

Optimization Tips

  1. Increase batch size: Higher batchSize reduces disk I/O
    indexer.batchSize: 5000  # vs default 100
    
  2. Limit columns: Only export fields you need
    csv.columns: ["id", "title"]  # vs all fields
    
  3. Use SSD storage: CSV writing benefits from fast sequential write speeds
  4. Avoid nested objects: Flatten complex fields before indexing for better performance

Throughput

Typical performance characteristics:
  • Simple documents (5-10 fields): 10,000-50,000 docs/sec
  • Complex documents (50+ fields): 1,000-5,000 docs/sec
  • Bottleneck is usually disk I/O, not CPU

Common Patterns

Data Quality Validation

Export data for manual inspection:
csv {
  path: "validation/quality-check.csv"
  columns: [
    "id",
    "original_title",
    "cleaned_title",
    "validation_status"
  ]
}

Reporting Pipeline Results

Generate summary reports:
pipelines: [
  {
    stages: [
      {
        class: "com.kmwllc.lucille.stage.ComputeFieldSize"
        source: "content"
        dest: "content_length"
      },
      {
        class: "com.kmwllc.lucille.stage.DetectLanguage"
        source: "content"
        dest: "language"
      }
    ]
  }
]

csv {
  path: "reports/content-analysis.csv"
  columns: ["id", "content_length", "language", "source_file"]
}

A/B Testing Pipeline Configurations

Compare pipeline results side-by-side:
# Configuration A
csv {
  path: "comparison/pipeline-a.csv"
  columns: ["id", "result_field"]
}

# Configuration B
csv {
  path: "comparison/pipeline-b.csv"
  columns: ["id", "result_field"]
}

Troubleshooting

File Permission Errors

If you encounter “Permission denied” errors:
  • Ensure the output directory exists and is writable
  • Check file permissions if appending to existing files
  • Use absolute paths to avoid working directory issues

Memory Issues with Large Files

CSVIndexer buffers documents in memory:
  • Reduce indexer.batchSize if running out of memory
  • Process in smaller batches by splitting source data
  • Monitor heap usage with JVM flags

Column Order Mismatches

When appending to existing files:
  • Ensure csv.columns exactly matches the existing file structure
  • Consider using separate output files for different schemas
  • Validate existing CSV structure before running in append mode

Next Steps

  • SequenceConnector: Generate test documents for CSV export
  • Document Generation Guide: Create realistic test data with random stages
  • Solr Indexer: Index to Apache Solr for production search