Document Generation

This guide shows how to generate synthetic documents with realistic randomized data for testing your Lucille pipelines and search indices.

Overview

The document generation example demonstrates:

Creating synthetic documents with SequenceConnector
Generating random data (text, numbers, dates, booleans)
Building nested JSON structures
Testing pipelines without real data
Load testing search systems

Synthetic documents suit development, testing, and demonstration environments where you need realistic data at scale.

Use Cases

Pipeline Testing

Test your stage configurations with controlled data before processing real documents.

Load Testing

Generate millions of documents to test indexer performance and capacity.

Development

Work offline without access to production data sources.

Demos

Create realistic datasets for demonstrations and training.

Configuration

Complete Example

# Document generation example

connectors: [
  {
    # Generates N empty docs with only an ID
    class: "com.kmwllc.lucille.connector.SequenceConnector"
    name: "connector1"
    pipeline: "pipeline1"
    numDocs: 100
    numDocs: ${?NUM_DOCS}
  }
]

pipelines: [
  {
    name: "pipeline1"
    stages: [
      # Random text field
      {
        name: "test_text"
        class: "com.kmwllc.lucille.stage.AddRandomString"
        fieldName: "text_example"
        inputDataPath: "resources/example-dictionary.txt"
        minNumOfTerms: 10
        maxNumOfTerms: 200
        concatenate: true   # Join terms into single string
      },
      
      # Random numbers
      {
        name: "test_integer"
        class: "com.kmwllc.lucille.stage.AddRandomInt"
        fieldName: "integer_example"
        rangeStart: 0
        rangeEnd: 100000
      },
      {
        name: "test_double"
        class: "com.kmwllc.lucille.stage.AddRandomDouble"
        fieldName: "double_example"
        rangeStart: 10.3
        rangeEnd: 534.8
      },
      
      # Random boolean
      {
        name: "test_boolean"
        class: "com.kmwllc.lucille.stage.AddRandomBoolean"
        fieldName: "boolean_example"
        percentTrue: 60   # 60% true, 40% false
      },
      
      # Random date
      {
        name: "test_date"
        class: "com.kmwllc.lucille.stage.AddRandomDate"
        fieldName: "date_example"
        rangeStartDate: "2015-07-31"
        rangeEndDate: "2025-07-31"
      },
      
      # Random nested structure
      {
        name: "test_nested"
        class: "com.kmwllc.lucille.stage.AddRandomNestedField"
        targetField: "nested_example"
        minNumObjects: 1
        maxNumObjects: 20
        
        entries: {
          "field_a" = "stringGen"
          "field_b" = "stringGen"
          "field_c" = "stringGen"
        }
        
        generators: {
          stringGen = {
            class: "com.kmwllc.lucille.stage.AddRandomString"
            inputDataPath: "resources/example-dictionary.txt"
            minNumOfTerms: 1
            maxNumOfTerms: 5
            concatenate: true
          }
        }
      }
    ]
  }
]

indexer {
  type: "Elasticsearch"
  batchSize: 1000
  batchTimeout: 1000
  logRate: 1000
  sendEnabled: true
  sendEnabled: ${?ES_SEND_ENABLED}
}

elasticsearch {
  url: "http://localhost:9200"
  url: ${?ES_URL}
  index: "test_docs"
  index: ${?ES_INDEX}
  acceptInvalidCert: true
}
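The paired lines in this example, such as `numDocs: 100` followed by `numDocs: ${?NUM_DOCS}`, rely on HOCON's optional substitution: the second assignment takes effect only when the variable is set, so the first line acts as a default. The sketch below mimics that fallback in plain Java purely as an illustration. `OverrideSketch` and `resolveInt` are hypothetical names and are not part of Lucille or the Typesafe Config library.

```java
import java.util.Map;

final class OverrideSketch {
    // Returns the override when the variable is present, otherwise the default.
    // This imitates `key: default` followed by `key: ${?VAR}`; it is not HOCON itself.
    static int resolveInt(Map<String, String> env, String name, int defaultValue) {
        String raw = env.get(name);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        // With NUM_DOCS unset this prints 100; with `export NUM_DOCS=1000` it prints 1000.
        int numDocs = resolveInt(System.getenv(), "NUM_DOCS", 100);
        System.out.println("numDocs = " + numDocs);
    }
}
```

The same pattern applies to `ES_URL`, `ES_INDEX`, and `ES_SEND_ENABLED` in the configuration above.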

Generator Stages

AddRandomString

Generate random text from a dictionary file:

{
  name: "randomText"
  class: "com.kmwllc.lucille.stage.AddRandomString"
  fieldName: "description"
  inputDataPath: "resources/dictionary.txt"
  minNumOfTerms: 5
  maxNumOfTerms: 50
  concatenate: true  # Join with spaces
}

Parameters:

inputDataPath: Text file with one term per line
minNumOfTerms: Minimum terms to select
maxNumOfTerms: Maximum terms to select
concatenate: Join terms into single string (default: false)
rangeSize: Use only first N terms from file

Create a custom dictionary file with domain-specific terms for more realistic test data.
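To make the parameters above concrete, here is a minimal sketch of how a term count between `minNumOfTerms` and `maxNumOfTerms` could be drawn and the selected terms joined. It is not Lucille's implementation. It assumes both bounds are inclusive, that terms may repeat, and that `concatenate: true` joins with single spaces. `RandomTermsSketch` and `randomText` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class RandomTermsSketch {
    // Picks between minTerms and maxTerms entries (assumed inclusive) from the
    // dictionary, with repetition allowed, and joins them with single spaces.
    static String randomText(List<String> dictionary, int minTerms, int maxTerms, Random rng) {
        if (dictionary.isEmpty() || minTerms < 0 || maxTerms < minTerms) {
            throw new IllegalArgumentException("need a non-empty dictionary and 0 <= min <= max");
        }
        int count = minTerms + rng.nextInt(maxTerms - minTerms + 1);
        List<String> picked = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            picked.add(dictionary.get(rng.nextInt(dictionary.size())));
        }
        return String.join(" ", picked);
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("technology", "innovation", "software", "development");
        // A fixed seed keeps the illustration repeatable between runs.
        System.out.println(randomText(dictionary, 2, 5, new Random(42)));
    }
}
```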

AddRandomInt

Generate random integers:

{
  name: "randomInt"
  class: "com.kmwllc.lucille.stage.AddRandomInt"
  fieldName: "quantity"
  rangeStart: 1
  rangeEnd: 1000
}

AddRandomDouble

Generate random decimal numbers:

{
  name: "randomDouble"
  class: "com.kmwllc.lucille.stage.AddRandomDouble"
  fieldName: "price"
  rangeStart: 9.99
  rangeEnd: 999.99
}

AddRandomBoolean

Generate random boolean values:

{
  name: "randomBoolean"
  class: "com.kmwllc.lucille.stage.AddRandomBoolean"
  fieldName: "in_stock"
  percentTrue: 75  # 75% true, 25% false
}
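The `percentTrue` setting can be read as a probability expressed in percent. The sketch below shows one way such a draw could work, assuming an integer percentage between 0 and 100. `PercentTrueSketch` and `randomBoolean` are hypothetical names, and the stage's real implementation may differ.

```java
import java.util.Random;

final class PercentTrueSketch {
    // Returns true with probability percentTrue / 100.
    // nextInt(100) yields 0..99, so 0 never returns true and 100 always does.
    static boolean randomBoolean(int percentTrue, Random rng) {
        return rng.nextInt(100) < percentTrue;
    }

    public static void main(String[] args) {
        Random rng = new Random(7);
        int samples = 100_000;
        int trues = 0;
        for (int i = 0; i < samples; i++) {
            if (randomBoolean(75, rng)) {
                trues++;
            }
        }
        // Over many samples the observed rate should settle near 0.75.
        System.out.printf("observed true rate: %.3f%n", trues / (double) samples);
    }
}
```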

AddRandomDate

Generate random dates in a range:

{
  name: "randomDate"
  class: "com.kmwllc.lucille.stage.AddRandomDate"
  fieldName: "created_date"
  rangeStartDate: "2020-01-01"  # yyyy-MM-dd
  rangeEndDate: "2024-12-31"
}

Output format: ISO 8601 (e.g., 2023-06-15T10:30:45.123Z)
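As a rough illustration of how a `yyyy-MM-dd` range can yield an ISO 8601 timestamp like the one above, the sketch below picks a random instant between two dates using `java.time`. It assumes both bounds are interpreted as midnight UTC and that the end is exclusive; neither assumption is confirmed by this guide. `RandomDateSketch` and `randomInstant` are hypothetical names.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Random;

final class RandomDateSketch {
    // Picks a random instant in [start, end) at millisecond resolution,
    // treating both yyyy-MM-dd bounds as midnight UTC (an assumption).
    static Instant randomInstant(String rangeStartDate, String rangeEndDate, Random rng) {
        long startMs = LocalDate.parse(rangeStartDate)
                .atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        long endMs = LocalDate.parse(rangeEndDate)
                .atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        if (endMs <= startMs) {
            throw new IllegalArgumentException("rangeEndDate must be after rangeStartDate");
        }
        // nextDouble() is in [0, 1), so the offset stays below the span.
        long offset = (long) (rng.nextDouble() * (endMs - startMs));
        return Instant.ofEpochMilli(startMs + offset);
    }

    public static void main(String[] args) {
        Instant sample = randomInstant("2020-01-01", "2024-12-31", new Random(3));
        // Instant.toString() renders ISO 8601 in UTC with a trailing Z.
        System.out.println(sample);
    }
}
```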

AddRandomNestedField

Generate random nested JSON structures:

{
  name: "randomNested"
  class: "com.kmwllc.lucille.stage.AddRandomNestedField"
  targetField: "attributes"
  minNumObjects: 1
  maxNumObjects: 10
  
  # Map destination path to source or generator
  entries: {
    "name" = "stringGen"
    "value" = "numberGen"
  }
  
  # Generators referenced by name in entries
  generators: {
    stringGen = {
      class: "com.kmwllc.lucille.stage.AddRandomString"
      inputDataPath: "resources/attributes.txt"
      minNumOfTerms: 1
      maxNumOfTerms: 3
      concatenate: true
    }
    numberGen = {
      class: "com.kmwllc.lucille.stage.AddRandomInt"
      rangeStart: 1
      rangeEnd: 100
    }
  }
}

Output:

{
  "attributes": [
    {"name": "color", "value": 42},
    {"name": "size large", "value": 87},
    {"name": "weight", "value": 15}
  ]
}
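The relationship between `entries` and `generators` can be sketched as follows: each generated object gets one key per entry, and the value comes from the generator that the entry names. This is an illustrative model only, not the stage's code. `NestedFieldSketch`, `randomObjects`, and the lambda generators are hypothetical, and the inclusive bounds are an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

final class NestedFieldSketch {
    // Builds between minNumObjects and maxNumObjects maps (assumed inclusive).
    // Each map has one key per entry, filled by the generator the entry names.
    static List<Map<String, Object>> randomObjects(
            Map<String, String> entries,
            Map<String, Function<Random, Object>> generators,
            int minNumObjects, int maxNumObjects, Random rng) {
        int count = minNumObjects + rng.nextInt(maxNumObjects - minNumObjects + 1);
        List<Map<String, Object>> objects = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Map<String, Object> obj = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : entries.entrySet()) {
                Function<Random, Object> generator = generators.get(entry.getValue());
                if (generator == null) {
                    throw new IllegalArgumentException("no generator named " + entry.getValue());
                }
                obj.put(entry.getKey(), generator.apply(rng));
            }
            objects.add(obj);
        }
        return objects;
    }

    public static void main(String[] args) {
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("name", "stringGen");
        entries.put("value", "numberGen");

        List<String> words = List.of("color", "size", "weight");
        Map<String, Function<Random, Object>> generators = new LinkedHashMap<>();
        generators.put("stringGen", r -> words.get(r.nextInt(words.size())));
        generators.put("numberGen", r -> 1 + r.nextInt(100)); // 1..100, assumed inclusive

        System.out.println(randomObjects(entries, generators, 1, 10, new Random(5)));
    }
}
```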

Example Output Document

Generated document structure:

{
  "id": "2",
  "run_id": "4a274429-50c2-45f7-b2a7-a947bad8dd1a",
  "text_example": "exitance south-easterly subopaque psychopannychism moustached",
  "integer_example": 75947,
  "double_example": 342.04429969638477,
  "boolean_example": true,
  "date_example": "2021-01-30T18:34:43.983Z",
  "nested_example": [
    {
      "field_c": "pseudomucin Corum",
      "field_a": "fivebar acceptance's",
      "field_b": "microfibril"
    },
    {
      "field_c": "Tallbott",
      "field_a": "Campanulaceae clumsinesses",
      "field_b": "uncluttering Hartleyan"
    }
  ]
}

Running Document Generation

Choose Indexing Mode

With Indexer

Index generated docs to Elasticsearch:

export ES_URL="http://localhost:9200"
export ES_INDEX="test_docs"
export ES_SEND_ENABLED=true
export NUM_DOCS=1000

Dry Run

Generate without indexing:

export ES_SEND_ENABLED=false
export NUM_DOCS=100
Create Dictionary File

Create resources/example-dictionary.txt:

technology
innovation
software
development
engineering
architecture
performance
scalability
...
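If you prefer to create the dictionary programmatically, for example from a list of domain terms, a small helper like the one below writes one term per line. It is a sketch using only the Java standard library. `DictionaryFileSketch` and its methods are hypothetical names, and the term list is a subset of the sample above rather than a complete dictionary.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

final class DictionaryFileSketch {
    // Writes one term per line as UTF-8, creating parent directories if needed.
    // An existing file at the target path is overwritten.
    static Path writeDictionary(Path target, List<String> terms) {
        try {
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            return Files.write(target, terms, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the dictionary back, one term per line.
    static List<String> readDictionary(Path source) {
        try {
            return Files.readAllLines(source, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> terms = List.of("technology", "innovation", "software", "development");
        Path written = writeDictionary(Paths.get("resources", "example-dictionary.txt"), terms);
        System.out.println("wrote " + readDictionary(written).size() + " terms to " + written);
    }
}
```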

Build and Run

mvn clean package
./scripts/run_ingest.sh

Monitor progress:

INFO  SequenceConnector - Generating 1000 documents
INFO  Publisher - Published 1000 documents
INFO  Indexer - Indexed 1000 documents
INFO  Runner - Completed in 5 seconds

Advanced Patterns

Product Catalog

Generate realistic e-commerce data:

stages: [
  {
    name: "productName"
    class: "com.kmwllc.lucille.stage.AddRandomString"
    fieldName: "name"
    inputDataPath: "resources/products.txt"
    minNumOfTerms: 2
    maxNumOfTerms: 5
    concatenate: true
  },
  {
    name: "price"
    class: "com.kmwllc.lucille.stage.AddRandomDouble"
    fieldName: "price"
    rangeStart: 9.99
    rangeEnd: 999.99
  },
  {
    name: "inStock"
    class: "com.kmwllc.lucille.stage.AddRandomBoolean"
    fieldName: "in_stock"
    percentTrue: 80
  },
  {
    name: "category"
    class: "com.kmwllc.lucille.stage.SetField"
    fieldName: "category"
    values: ["Electronics", "Clothing", "Home", "Sports"]
    randomSelection: true
  }
]

User Profiles

Generate user data:

stages: [
  {
    name: "username"
    class: "com.kmwllc.lucille.stage.Concatenate"
    dest: "username"
    formatString: "user_{id}"
  },
  {
    name: "age"
    class: "com.kmwllc.lucille.stage.AddRandomInt"
    fieldName: "age"
    rangeStart: 18
    rangeEnd: 80
  },
  {
    name: "registrationDate"
    class: "com.kmwllc.lucille.stage.AddRandomDate"
    fieldName: "registered_at"
    rangeStartDate: "2020-01-01"
    rangeEndDate: "2024-12-31"
  },
  {
    name: "preferences"
    class: "com.kmwllc.lucille.stage.AddRandomNestedField"
    targetField: "preferences"
    minNumObjects: 3
    maxNumObjects: 10
    entries: {
      "key" = "prefKeyGen"
      "value" = "prefValueGen"
    }
    generators: {
      prefKeyGen = {
        class: "com.kmwllc.lucille.stage.AddRandomString"
        inputDataPath: "resources/preference-keys.txt"
        minNumOfTerms: 1
        maxNumOfTerms: 1
      }
      prefValueGen = {
        class: "com.kmwllc.lucille.stage.AddRandomString"
        inputDataPath: "resources/preference-values.txt"
        minNumOfTerms: 1
        maxNumOfTerms: 1
      }
    }
  }
]

Log Events

Generate time-series log data:

stages: [
  {
    name: "timestamp"
    class: "com.kmwllc.lucille.stage.AddRandomDate"
    fieldName: "timestamp"
    rangeStartDate: "2024-01-01"
    rangeEndDate: "2024-01-31"
  },
  {
    name: "level"
    class: "com.kmwllc.lucille.stage.SetField"
    fieldName: "level"
    values: ["INFO", "INFO", "INFO", "WARN", "ERROR"]  # Weighted: INFO is listed 3 of 5 times
    randomSelection: true
  },
  {
    name: "message"
    class: "com.kmwllc.lucille.stage.AddRandomString"
    fieldName: "message"
    inputDataPath: "resources/log-messages.txt"
    minNumOfTerms: 5
    maxNumOfTerms: 20
    concatenate: true
  },
  {
    name: "responseTime"
    class: "com.kmwllc.lucille.stage.AddRandomInt"
    fieldName: "response_time_ms"
    rangeStart: 10
    rangeEnd: 5000
  }
]
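The repeated `INFO` entries in the `level` stage weight the selection only if values are picked uniformly from the list, which is the assumption made here. Under that assumption each value's probability is its share of the list, as the sketch below shows. `WeightedPickSketch`, `pick`, and `probability` are hypothetical names.

```java
import java.util.List;
import java.util.Random;

final class WeightedPickSketch {
    // Uniform pick from the list; listing a value more than once raises its share.
    static String pick(List<String> values, Random rng) {
        return values.get(rng.nextInt(values.size()));
    }

    // Fraction of the list occupied by a value, i.e. its selection probability
    // under a uniform pick.
    static double probability(List<String> values, String value) {
        long matches = values.stream().filter(value::equals).count();
        return matches / (double) values.size();
    }

    public static void main(String[] args) {
        List<String> levels = List.of("INFO", "INFO", "INFO", "WARN", "ERROR");
        System.out.println("P(INFO)  = " + probability(levels, "INFO"));
        System.out.println("P(WARN)  = " + probability(levels, "WARN"));
        System.out.println("P(ERROR) = " + probability(levels, "ERROR"));
        System.out.println("sample   = " + pick(levels, new Random(9)));
    }
}
```

To change the mix, adjust how often each value appears, for example listing `INFO` eight times and `ERROR` twice for an 80/20 split.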

Performance Considerations

High-Volume Generation

For millions of documents:

connectors: [
  {
    class: "com.kmwllc.lucille.connector.SequenceConnector"
    numDocs: 10000000  # 10 million
    pipeline: "pipeline1"
  }
]

worker {
  threads: 8  # Parallel processing
}

publisher {
  queueCapacity: 5000
}

indexer {
  batchSize: 5000
  batchTimeout: 5000
}

Large dictionary files and complex nested structures can slow generation. Profile your pipeline to identify bottlenecks.
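When sizing a run, it can help to estimate how many indexer batches a configuration implies. The sketch below computes the minimum batch count, assuming `batchSize` is the maximum number of documents per batch. The real count can be higher if batches are flushed before they fill, for instance when a timeout fires. `BatchMathSketch` and `minBatches` are hypothetical names.

```java
final class BatchMathSketch {
    // Minimum number of batches needed: ceil(numDocs / batchSize), in integer math.
    static long minBatches(long numDocs, long batchSize) {
        if (numDocs < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("numDocs must be >= 0 and batchSize > 0");
        }
        return (numDocs + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        // Values taken from the high-volume configuration above.
        System.out.println(minBatches(10_000_000L, 5_000L)); // prints 2000
    }
}
```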

Integration with Testing

JUnit Test Example

@Test
public void testPipelineWithSyntheticData() throws Exception {
    Config config = ConfigFactory.parseString("""
        connectors: [{
            class: "com.kmwllc.lucille.connector.SequenceConnector"
            numDocs: 100
            pipeline: "test"
        }]
        pipelines: [{
            name: "test"
            stages: [
                {
                    class: "com.kmwllc.lucille.stage.AddRandomString"
                    fieldName: "text"
                    rangeSize: 100
                    minNumOfTerms: 10
                    maxNumOfTerms: 50
                    concatenate: true
                },
                {
                    class: "com.kmwllc.lucille.stage.YourCustomStage"
                    // ... your stage config
                }
            ]
        }]
        indexer { type: "Mock" }
    """);
    
    Runner runner = new Runner(config);
    runner.run();
    
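    // Assert your expectations. `mockIndexer` is a placeholder for however your
    // test obtains the indexed documents; it is not defined in this snippet.
    assertEquals(100, mockIndexer.getIndexedCount());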
}

Next Steps

Use generated data to test Custom Stages
Combine with S3 Ingestion for hybrid testing
Generate embeddings for Vector Search testing

Troubleshooting

Slow generation

Reduce complexity:

Use smaller dictionary files
Reduce nested object counts
Simplify generator configurations

Or increase parallelism:

worker {
  threads: 8
}

Memory issues

Reduce queue capacity:

publisher {
  queueCapacity: 100
}

Or process in smaller batches:

export NUM_DOCS=1000
./scripts/run_ingest.sh
# Run multiple times

Dictionary file not found

Ensure your dictionary file is in the correct location:

ls -la resources/example-dictionary.txt

Use absolute paths if needed:

inputDataPath: "/absolute/path/to/dictionary.txt"
