This guide shows how to generate synthetic documents with realistic randomized data for testing your Lucille pipelines and search indices.
Overview
The document generation example demonstrates:
- Creating synthetic documents with SequenceConnector
- Generating random data (text, numbers, dates, booleans)
- Building nested JSON structures
- Testing pipelines without real data
- Load testing search systems
This approach suits development, testing, and demonstration environments that need realistic data at scale.
Use Cases
Pipeline Testing
Test your stage configurations with controlled data before processing real documents.
Load Testing
Generate millions of documents to test indexer performance and capacity.
Development
Work offline without access to production data sources.
Demos
Create realistic datasets for demonstrations and training.
Configuration
Complete Example
# Document generation example
connectors: [
{
# Generates N empty docs with only an ID
class: "com.kmwllc.lucille.connector.SequenceConnector"
name: "connector1"
pipeline: "pipeline1"
numDocs: 100
numDocs: ${?NUM_DOCS}
}
]
pipelines: [
{
name: "pipeline1"
stages: [
# Random text field
{
name: "test_text"
class: "com.kmwllc.lucille.stage.AddRandomString"
fieldName: "text_example"
inputDataPath: "resources/example-dictionary.txt"
minNumOfTerms: 10
maxNumOfTerms: 200
concatenate: true # Join terms into single string
},
# Random numbers
{
name: "test_integer"
class: "com.kmwllc.lucille.stage.AddRandomInt"
fieldName: "integer_example"
rangeStart: 0
rangeEnd: 100000
},
{
name: "test_double"
class: "com.kmwllc.lucille.stage.AddRandomDouble"
fieldName: "double_example"
rangeStart: 10.3
rangeEnd: 534.8
},
# Random boolean
{
name: "test_boolean"
class: "com.kmwllc.lucille.stage.AddRandomBoolean"
fieldName: "boolean_example"
percentTrue: 60 # 60% true, 40% false
},
# Random date
{
name: "test_date"
class: "com.kmwllc.lucille.stage.AddRandomDate"
fieldName: "date_example"
rangeStartDate: "2015-07-31"
rangeEndDate: "2025-07-31"
},
# Random nested structure
{
name: "test_nested"
class: "com.kmwllc.lucille.stage.AddRandomNestedField"
targetField: "nested_example"
minNumObjects: 1
maxNumObjects: 20
entries: {
"field_a" = "stringGen"
"field_b" = "stringGen"
"field_c" = "stringGen"
}
generators: {
stringGen = {
class: "com.kmwllc.lucille.stage.AddRandomString"
inputDataPath: "resources/example-dictionary.txt"
minNumOfTerms: 1
maxNumOfTerms: 5
concatenate: true
}
}
}
]
}
]
indexer {
type: "Elasticsearch"
batchSize: 1000
batchTimeout: 1000
logRate: 1000
sendEnabled: true
sendEnabled: ${?ES_SEND_ENABLED}
}
elasticsearch {
url: "http://localhost:9200"
url: ${?ES_URL}
index: "test_docs"
index: ${?ES_INDEX}
acceptInvalidCert: true
}
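The paired assignments in this configuration, such as numDocs: 100 followed by numDocs: ${?NUM_DOCS}, use HOCON optional substitution: the second assignment takes effect only when the environment variable is defined, otherwise the first value stands. The plain-Java sketch below imitates that fallback rule for illustration only. The class OptionalOverride and its method are hypothetical and are not part of Lucille or the Typesafe Config library.

```java
import java.util.Map;

// Hypothetical helper that imitates the HOCON pattern
//   key: default
//   key: ${?VAR}
final class OptionalOverride {

  // Returns the environment value when the variable is defined, else the default.
  static String resolve(Map<String, String> env, String variable, String defaultValue) {
    String override = env.get(variable);
    return override != null ? override : defaultValue;
  }

  public static void main(String[] args) {
    // With NUM_DOCS unset this prints the default, 100.
    System.out.println(resolve(System.getenv(), "NUM_DOCS", "100"));
  }
}
```

The same reasoning applies to ES_URL, ES_INDEX, and ES_SEND_ENABLED in the configuration above.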
Generator Stages
AddRandomString
Generate random text from a dictionary file:
{
name: "randomText"
class: "com.kmwllc.lucille.stage.AddRandomString"
fieldName: "description"
inputDataPath: "resources/dictionary.txt"
minNumOfTerms: 5
maxNumOfTerms: 50
concatenate: true # Join with spaces
}
Parameters:
inputDataPath: Text file with one term per line
minNumOfTerms: Minimum terms to select
maxNumOfTerms: Maximum terms to select
concatenate: Join terms into single string (default: false)
rangeSize: Use only first N terms from file
Create a custom dictionary file with domain-specific terms for more realistic test data.
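To make the parameters above concrete, the sketch below mimics the selection behavior they describe in plain Java: keep only the first rangeSize terms, pick a random number of terms between minNumOfTerms and maxNumOfTerms, and join them with spaces when concatenate is true. It illustrates the documented semantics and is not Lucille's implementation. The class DictionarySampler, its method names, the inclusive bounds, and the treatment of a non-positive rangeSize as "use the whole file" are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of the AddRandomString parameters; not Lucille code.
final class DictionarySampler {

  // Picks between minTerms and maxTerms terms (inclusive bounds are an assumption)
  // from the first rangeSize entries of the dictionary.
  static List<String> sample(List<String> dictionary, int rangeSize,
                             int minTerms, int maxTerms, Random random) {
    if (dictionary.isEmpty() || minTerms < 0 || maxTerms < minTerms) {
      throw new IllegalArgumentException("invalid dictionary or term bounds");
    }
    // Assumption: a non-positive rangeSize means the whole dictionary is eligible.
    int limit = rangeSize > 0 ? Math.min(rangeSize, dictionary.size()) : dictionary.size();
    int count = minTerms + random.nextInt(maxTerms - minTerms + 1);
    List<String> picked = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      picked.add(dictionary.get(random.nextInt(limit)));
    }
    return picked;
  }

  // Mirrors concatenate: true by joining the picked terms with single spaces.
  static String sampleConcatenated(List<String> dictionary, int rangeSize,
                                   int minTerms, int maxTerms, Random random) {
    return String.join(" ", sample(dictionary, rangeSize, minTerms, maxTerms, random));
  }
}
```

Passing a seeded Random makes the output repeatable, which helps when a test needs the same synthetic text on every run.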
AddRandomInt
Generate random integers:
{
name: "randomInt"
class: "com.kmwllc.lucille.stage.AddRandomInt"
fieldName: "quantity"
rangeStart: 1
rangeEnd: 1000
}
AddRandomDouble
Generate random decimal numbers:
{
name: "randomDouble"
class: "com.kmwllc.lucille.stage.AddRandomDouble"
fieldName: "price"
rangeStart: 9.99
rangeEnd: 999.99
}
AddRandomBoolean
Generate random boolean values:
{
name: "randomBoolean"
class: "com.kmwllc.lucille.stage.AddRandomBoolean"
fieldName: "in_stock"
percentTrue: 75 # 75% true, 25% false
}
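The percentTrue setting can be read as a probability expressed in whole percent. The following sketch shows one way such a draw could work; it is an assumption about the semantics rather than Lucille's code, and the class name PercentTrueSketch is hypothetical.

```java
import java.util.Random;

// Hypothetical illustration of percentTrue; not Lucille code.
final class PercentTrueSketch {

  // Returns true with probability percentTrue / 100.
  static boolean next(int percentTrue, Random random) {
    if (percentTrue < 0 || percentTrue > 100) {
      throw new IllegalArgumentException("percentTrue must be between 0 and 100");
    }
    // nextInt(100) yields 0..99, so exactly percentTrue of those values pass the check.
    return random.nextInt(100) < percentTrue;
  }
}
```

With percentTrue set to 75, roughly three of every four generated documents would carry in_stock: true under this reading.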
AddRandomDate
Generate random dates in a range:
{
name: "randomDate"
class: "com.kmwllc.lucille.stage.AddRandomDate"
fieldName: "created_date"
rangeStartDate: "2020-01-01" # yyyy-MM-dd
rangeEndDate: "2024-12-31"
}
Output format: ISO 8601 (e.g., 2023-06-15T10:30:45.123Z)
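As a rough illustration of how a yyyy-MM-dd range turns into an ISO 8601 timestamp, the sketch below picks a random instant between two dates using only java.time. It is not Lucille's implementation: the class name RandomDateSketch is hypothetical, and treating both dates as UTC midnight with an exclusive end is an assumption.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Random;

// Hypothetical illustration of rangeStartDate / rangeEndDate; not Lucille code.
final class RandomDateSketch {

  // Returns a random instant in [startDate 00:00 UTC, endDate 00:00 UTC).
  static Instant randomInstant(String startDate, String endDate, Random random) {
    long startMs = LocalDate.parse(startDate).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    long endMs = LocalDate.parse(endDate).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    if (endMs <= startMs) {
      throw new IllegalArgumentException("end date must be after start date");
    }
    // nextDouble() is in [0, 1), so the offset stays strictly below the range length.
    long offset = (long) (random.nextDouble() * (endMs - startMs));
    return Instant.ofEpochMilli(startMs + offset);
  }

  public static void main(String[] args) {
    // Instant.toString() produces ISO 8601 text in UTC, ending in Z.
    System.out.println(randomInstant("2020-01-01", "2024-12-31", new Random()));
  }
}
```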
AddRandomNestedField
Generate random nested JSON structures:
{
name: "randomNested"
class: "com.kmwllc.lucille.stage.AddRandomNestedField"
targetField: "attributes"
minNumObjects: 1
maxNumObjects: 10
# Map destination path to source or generator
entries: {
"name" = "stringGen"
"value" = "numberGen"
}
# Generators for missing fields
generators: {
stringGen = {
class: "com.kmwllc.lucille.stage.AddRandomString"
inputDataPath: "resources/attributes.txt"
minNumOfTerms: 1
maxNumOfTerms: 3
concatenate: true
}
numberGen = {
class: "com.kmwllc.lucille.stage.AddRandomInt"
rangeStart: 1
rangeEnd: 100
}
}
}
Output:
{
"attributes": [
{"name": "color", "value": 42},
{"name": "size large", "value": 87},
{"name": "weight", "value": 15}
]
}
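The relationship between entries, generators, and the resulting array can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not Lucille's implementation: the class NestedFieldSketch is hypothetical, every entry is assumed to name a generator (mapping to an existing source field is not modeled), and the object count bounds are assumed to be inclusive.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical model of entries + generators producing a list of nested objects.
final class NestedFieldSketch {

  static List<Map<String, Object>> generate(Map<String, String> entries,
                                            Map<String, Supplier<Object>> generators,
                                            int minObjects, int maxObjects, Random random) {
    if (minObjects < 0 || maxObjects < minObjects) {
      throw new IllegalArgumentException("invalid object count bounds");
    }
    int count = minObjects + random.nextInt(maxObjects - minObjects + 1);
    List<Map<String, Object>> objects = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      Map<String, Object> object = new LinkedHashMap<>();
      // Each entry maps a destination key to the name of the generator that fills it.
      for (Map.Entry<String, String> entry : entries.entrySet()) {
        Supplier<Object> generator = generators.get(entry.getValue());
        if (generator == null) {
          throw new IllegalArgumentException("no generator named " + entry.getValue());
        }
        object.put(entry.getKey(), generator.get());
      }
      objects.add(object);
    }
    return objects;
  }
}
```

Each generated object draws fresh values from its generators, which is why the objects in the sample output above differ from one another.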
Example Output Document
Generated document structure:
{
"id": "2",
"run_id": "4a274429-50c2-45f7-b2a7-a947bad8dd1a",
"text_example": "exitance south-easterly subopaque psychopannychism moustached",
"integer_example": 75947,
"double_example": 342.04429969638477,
"boolean_example": true,
"date_example": "2021-01-30T18:34:43.983Z",
"nested_example": [
{
"field_c": "pseudomucin Corum",
"field_a": "fivebar acceptance's",
"field_b": "microfibril"
},
{
"field_c": "Tallbott",
"field_a": "Campanulaceae clumsinesses",
"field_b": "uncluttering Hartleyan"
}
]
}
Running Document Generation
Choose Indexing Mode
Index generated docs to Elasticsearch:
export ES_URL="http://localhost:9200"
export ES_INDEX="test_docs"
export ES_SEND_ENABLED=true
export NUM_DOCS=1000
Generate without indexing:
export ES_SEND_ENABLED=false
export NUM_DOCS=100
Create Dictionary File
Create resources/example-dictionary.txt:
technology
innovation
software
development
engineering
architecture
performance
scalability
...
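If you prefer to create the dictionary programmatically, for example from a domain-specific term list, a small helper along these lines writes one term per line. The class DictionaryFileWriter is a hypothetical utility and not part of Lucille; the three terms in main are taken from the listing above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical utility that writes a dictionary file with one term per line.
final class DictionaryFileWriter {

  static Path write(Path target, List<String> terms) {
    try {
      // Create the parent directory (for example resources/) if it does not exist yet.
      Path parent = target.toAbsolutePath().getParent();
      if (parent != null) {
        Files.createDirectories(parent);
      }
      return Files.write(target, terms, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    write(Paths.get("resources", "example-dictionary.txt"),
        List.of("technology", "innovation", "software"));
  }
}
```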
Build and Run
mvn clean package
./scripts/run_ingest.sh
Monitor progress:
INFO SequenceConnector - Generating 1000 documents
INFO Publisher - Published 1000 documents
INFO Indexer - Indexed 1000 documents
INFO Runner - Completed in 5 seconds
Advanced Patterns
Product Catalog
Generate realistic e-commerce data:
stages: [
{
name: "productName"
class: "com.kmwllc.lucille.stage.AddRandomString"
fieldName: "name"
inputDataPath: "resources/products.txt"
minNumOfTerms: 2
maxNumOfTerms: 5
concatenate: true
},
{
name: "price"
class: "com.kmwllc.lucille.stage.AddRandomDouble"
fieldName: "price"
rangeStart: 9.99
rangeEnd: 999.99
},
{
name: "inStock"
class: "com.kmwllc.lucille.stage.AddRandomBoolean"
fieldName: "in_stock"
percentTrue: 80
},
{
name: "category"
class: "com.kmwllc.lucille.stage.SetField"
fieldName: "category"
values: ["Electronics", "Clothing", "Home", "Sports"]
randomSelection: true
}
]
User Profiles
Generate user data:
stages: [
{
name: "username"
class: "com.kmwllc.lucille.stage.Concatenate"
dest: "username"
formatString: "user_{id}"
},
{
name: "age"
class: "com.kmwllc.lucille.stage.AddRandomInt"
fieldName: "age"
rangeStart: 18
rangeEnd: 80
},
{
name: "registrationDate"
class: "com.kmwllc.lucille.stage.AddRandomDate"
fieldName: "registered_at"
rangeStartDate: "2020-01-01"
rangeEndDate: "2024-12-31"
},
{
name: "preferences"
class: "com.kmwllc.lucille.stage.AddRandomNestedField"
targetField: "preferences"
minNumObjects: 3
maxNumObjects: 10
entries: {
"key" = "prefKeyGen"
"value" = "prefValueGen"
}
generators: {
prefKeyGen = {
class: "com.kmwllc.lucille.stage.AddRandomString"
inputDataPath: "resources/preference-keys.txt"
minNumOfTerms: 1
maxNumOfTerms: 1
}
prefValueGen = {
class: "com.kmwllc.lucille.stage.AddRandomString"
inputDataPath: "resources/preference-values.txt"
minNumOfTerms: 1
maxNumOfTerms: 1
}
}
}
]
Log Events
Generate time-series log data:
stages: [
{
name: "timestamp"
class: "com.kmwllc.lucille.stage.AddRandomDate"
fieldName: "timestamp"
rangeStartDate: "2024-01-01"
rangeEndDate: "2024-01-31"
},
{
name: "level"
class: "com.kmwllc.lucille.stage.SetField"
fieldName: "level"
values: ["INFO", "INFO", "INFO", "WARN", "ERROR"] # Weighted by repetition: INFO is 3 of 5 entries
randomSelection: true
},
{
name: "message"
class: "com.kmwllc.lucille.stage.AddRandomString"
fieldName: "message"
inputDataPath: "resources/log-messages.txt"
minNumOfTerms: 5
maxNumOfTerms: 20
concatenate: true
},
{
name: "responseTime"
class: "com.kmwllc.lucille.stage.AddRandomInt"
fieldName: "response_time_ms"
rangeStart: 10
rangeEnd: 5000
}
]
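The level stage above weights its values by listing INFO more than once. Assuming the selection is uniform over the list entries (an assumption, not something this guide confirms), the effective probability of each value is its share of the list. The hypothetical helper below computes those shares so you can check a weighting before running the pipeline.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: effective probabilities of a values list with repeats,
// assuming each list entry is equally likely to be picked.
final class RepetitionWeights {

  static Map<String, Double> probabilities(List<String> values) {
    if (values.isEmpty()) {
      throw new IllegalArgumentException("values must not be empty");
    }
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String value : values) {
      counts.merge(value, 1, Integer::sum);
    }
    Map<String, Double> result = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      result.put(entry.getKey(), entry.getValue() / (double) values.size());
    }
    return result;
  }

  public static void main(String[] args) {
    // INFO occupies 3 of 5 entries, WARN and ERROR one each.
    System.out.println(probabilities(List.of("INFO", "INFO", "INFO", "WARN", "ERROR")));
  }
}
```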
High-Volume Generation
For millions of documents:
connectors: [
{
class: "com.kmwllc.lucille.connector.SequenceConnector"
numDocs: 10000000 # 10 million
pipeline: "pipeline1"
}
]
worker {
threads: 8 # Parallel processing
}
publisher {
queueCapacity: 5000
}
indexer {
batchSize: 5000
batchTimeout: 5000
}
Large dictionary files and complex nested structures can slow generation. Profile your pipeline to identify bottlenecks.
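When sizing a high-volume run, some quick arithmetic helps set expectations. The sketch below computes how many indexer batches a run needs and how long it might take at a given throughput. It assumes every batch fills to batchSize rather than flushing early on batchTimeout, and docsPerSecond is a figure you would measure from your own smaller runs; the class BatchEstimate is hypothetical.

```java
// Hypothetical sizing helper for high-volume generation runs.
final class BatchEstimate {

  // Number of batches needed, rounding up for a final partial batch.
  static long batchCount(long numDocs, int batchSize) {
    if (numDocs < 0 || batchSize <= 0) {
      throw new IllegalArgumentException("numDocs must be >= 0 and batchSize > 0");
    }
    return (numDocs + batchSize - 1) / batchSize;
  }

  // Rough duration estimate from a throughput measured on your own hardware.
  static double estimatedSeconds(long numDocs, double docsPerSecond) {
    if (docsPerSecond <= 0) {
      throw new IllegalArgumentException("docsPerSecond must be positive");
    }
    return numDocs / docsPerSecond;
  }

  public static void main(String[] args) {
    // 10,000,000 documents at batchSize 5000 is 2000 batches.
    System.out.println(batchCount(10_000_000L, 5000));
  }
}
```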
Integration with Testing
JUnit Test Example
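// Imports assume JUnit 4 and the Typesafe Config library on the test classpath
import static org.junit.Assert.assertEquals;
import com.kmwllc.lucille.core.Runner;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import org.junit.Test;
@Test
public void testPipelineWithSyntheticData() throws Exception {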
Config config = ConfigFactory.parseString("""
connectors: [{
class: "com.kmwllc.lucille.connector.SequenceConnector"
numDocs: 100
pipeline: "test"
}]
pipelines: [{
name: "test"
stages: [
{
class: "com.kmwllc.lucille.stage.AddRandomString"
fieldName: "text"
rangeSize: 100
minNumOfTerms: 10
maxNumOfTerms: 50
concatenate: true
},
{
class: "com.kmwllc.lucille.stage.YourCustomStage"
// ... your stage config
}
]
}]
indexer { type: "Mock" }
""");
Runner runner = new Runner(config);
runner.run();
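// Assert your expectations. mockIndexer is a placeholder for however your test
// captures indexed documents; it is not defined in this snippet.
assertEquals(100, mockIndexer.getIndexedCount());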
}
Troubleshooting
If generation is slow, reduce complexity:
- Use smaller dictionary files
- Reduce nested object counts
- Simplify generator configurations
Or increase parallelism by raising the worker threads setting, as in the High-Volume Generation configuration.
To lower memory pressure, reduce queue capacity:
publisher {
queueCapacity: 100
}
Or process in smaller batches:
export NUM_DOCS=1000
./scripts/run_ingest.sh
# Run multiple times
Dictionary file not found
Ensure your dictionary file is in the correct location:
ls -la resources/example-dictionary.txt
Use absolute paths if needed:
inputDataPath: "/absolute/path/to/dictionary.txt"
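The same check can be done from Java before a run starts, which also shows the absolute path a relative setting resolves to from the current working directory. This is a hypothetical pre-flight helper, not a Lucille API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check for a configured dictionary path.
final class DictionaryPathCheck {

  // Resolves the path against the working directory and fails fast if it is unusable.
  static Path requireReadable(String configuredPath) {
    Path resolved = Paths.get(configuredPath).toAbsolutePath().normalize();
    if (!Files.isRegularFile(resolved) || !Files.isReadable(resolved)) {
      throw new IllegalStateException("Dictionary file not found or unreadable: " + resolved);
    }
    return resolved;
  }

  public static void main(String[] args) {
    // Prints the absolute path, or throws with the path that was actually checked.
    System.out.println(requireReadable("resources/example-dictionary.txt"));
  }
}
```

The exception message includes the resolved path, which makes a wrong working directory easy to spot.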