What are Stages?
Stages are the core processing units in Lucille’s ETL pipeline. Each stage performs a specific operation on documents as they flow through the pipeline. Stages can transform data, enrich content, extract information, apply ML models, and more.
Stage Architecture
All stages extend the base Stage class and implement the processDocument() method.
Key Concepts
Document Processing
Each stage processes a Document object in place and can optionally return child documents. The stage can:
- Modify existing fields
- Add new fields
- Remove fields
- Generate child documents
- Drop documents from the pipeline
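The operations listed above can be sketched in Java. This is a minimal, self-contained illustration and not Lucille's actual API: SketchDoc and SketchStage are hypothetical stand-ins for the real Document and Stage classes, the field names are invented, and the dropped flag is only an assumed way to model dropping a document.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for Lucille's Document: an id plus named fields.
class SketchDoc {
  final String id;
  final Map<String, Object> fields = new LinkedHashMap<>();
  boolean dropped = false; // assumed way to mark a document as dropped

  SketchDoc(String id) {
    this.id = id;
  }
}

// Hypothetical stand-in for the base Stage class.
abstract class SketchStage {
  // Mutates doc in place; returns child documents, or null when there are none.
  abstract Iterator<SketchDoc> processDocument(SketchDoc doc);
}

// Splits a comma-separated "tags" field into one child document per tag.
class SplitTagsStage extends SketchStage {
  @Override
  Iterator<SketchDoc> processDocument(SketchDoc doc) {
    Object raw = doc.fields.get("tags");
    if (raw == null) {
      doc.dropped = true; // drop documents that have nothing to split
      return null;
    }
    String[] tags = raw.toString().split(",");
    doc.fields.put("tags", raw.toString().toLowerCase(Locale.ROOT)); // modify a field
    doc.fields.put("tag_count", tags.length);                        // add a field
    doc.fields.remove("scratch");                                    // remove a field

    List<SketchDoc> children = new ArrayList<>();
    for (int i = 0; i < tags.length; i++) {
      SketchDoc child = new SketchDoc(doc.id + "-" + i);
      child.fields.put("tag", tags[i].trim());
      children.add(child);
    }
    return children.iterator(); // generate child documents
  }
}
```

Returning null for "no children" follows the convention this page describes for processDocument(); everything else about the types is simplified for illustration.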
Stage Lifecycle
Stages have a defined lifecycle:
- Construction - Stage is instantiated with a Config object
- Initialization - initialize() sets up metrics and naming
- Start - start() performs any required setup (loading resources, establishing connections)
- Processing - processDocument() is called for each document
- Stop - stop() releases resources and closes connections
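The order of these calls can be illustrated with a small driver. The sketch below is self-contained and assumes a simplified stand-in stage: LifecycleStage, LifecycleDemo, and the initialize(int position) signature are hypothetical, and a plain Map replaces the real Config object. Only the ordering of the lifecycle steps is taken from the list above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a stage; records each lifecycle call it receives.
class LifecycleStage {
  final List<String> calls = new ArrayList<>();
  private final Map<String, String> config;
  private String name;

  // Construction: the stage receives its configuration.
  LifecycleStage(Map<String, String> config) {
    this.config = config;
    calls.add("construct");
  }

  // Initialization: naming only in this sketch (metrics setup is omitted).
  void initialize(int position) {
    name = config.getOrDefault("name", "stage_" + position);
    calls.add("initialize:" + name);
  }

  // Start: a real stage would load resources or open connections here.
  void start() {
    calls.add("start");
  }

  // Processing: called once per document.
  void processDocument(String docId) {
    calls.add("process:" + docId);
  }

  // Stop: a real stage would release resources here.
  void stop() {
    calls.add("stop");
  }
}

class LifecycleDemo {
  // Drives one stage through its full lifecycle and returns the recorded calls.
  static List<String> run(List<String> docIds) {
    LifecycleStage stage = new LifecycleStage(Map.of());
    stage.initialize(1);
    stage.start();
    try {
      for (String id : docIds) {
        stage.processDocument(id);
      }
    } finally {
      stage.stop(); // runs even if processing throws
    }
    return stage.calls;
  }
}
```

Wrapping the processing loop in try/finally is a design choice of this sketch so that stop() still runs after a failure.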
Configuration
Every stage must declare a public static Spec SPEC that defines its configuration parameters. Common configuration patterns:
- source/dest - Field mapping
- updateMode - How to handle existing field values (overwrite, append, skip)
- conditions - Conditional execution based on document state
- Stage-specific parameters
Conditional Execution
Stages can be configured to execute conditionally based on document field values. A list of conditions determines whether the stage should process a document. Each condition can specify:
- fields - Fields to check for existence or values
- values - Specific values to match against
- operator - Either must or must_not
How to combine multiple conditions:
- all - All conditions must be satisfied (AND logic)
- any - At least one condition must be satisfied (OR logic)
Example: Conditional Stage Execution
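As a rough illustration of how must/must_not conditions and the all/any policy could combine, here is a self-contained Java sketch. It makes several assumptions that are not confirmed by this page: a condition matches when any listed field holds any listed value, an empty values list means an existence check, and must_not simply inverts the match. ConditionSketch, Condition, and shouldProcess are hypothetical names, and a Map stands in for a document.

```java
import java.util.List;
import java.util.Map;

class ConditionSketch {
  // One condition: fields to inspect, values to match, and an operator.
  static final class Condition {
    final List<String> fields;
    final List<String> values; // empty means "check existence only" (assumption)
    final String operator;     // "must" or "must_not"

    Condition(List<String> fields, List<String> values, String operator) {
      this.fields = fields;
      this.values = values;
      this.operator = operator;
    }

    // True when the document satisfies this condition.
    boolean test(Map<String, String> doc) {
      boolean matched = false;
      for (String field : fields) {
        String value = doc.get(field);
        if (value != null && (values.isEmpty() || values.contains(value))) {
          matched = true;
          break;
        }
      }
      return operator.equals("must") ? matched : !matched;
    }
  }

  // policy "all" requires every condition (AND); "any" requires at least one (OR).
  static boolean shouldProcess(Map<String, String> doc, List<Condition> conditions, String policy) {
    if (conditions.isEmpty()) {
      return true; // an unconditional stage always runs
    }
    if (policy.equals("any")) {
      return conditions.stream().anyMatch(c -> c.test(doc));
    }
    return conditions.stream().allMatch(c -> c.test(doc));
  }
}
```

For a document with lang=en and status=draft, a must condition on lang=en combined with a must_not condition on status=published passes under both policies in this sketch.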
Update Modes
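Many stages support an updateMode parameter that controls how destination fields are updated:
- overwrite - Replace any existing values in the destination field with new values.
- append - Add the new values to any existing values in the destination field.
- skip - Leave the destination field unchanged if it already has a value.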
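The three modes can be compared side by side in a short sketch. The code below is an assumption-laden illustration, not Lucille's implementation: UpdateModeSketch and update are hypothetical names, a Map of lists stands in for a multi-valued document, and the append and skip behavior is inferred from the mode names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class UpdateModeSketch {
  enum UpdateMode { OVERWRITE, APPEND, SKIP }

  // Writes newValues into doc[dest] according to the chosen mode.
  static void update(Map<String, List<String>> doc, String dest, UpdateMode mode, List<String> newValues) {
    List<String> existing = doc.get(dest);
    if (existing == null || mode == UpdateMode.OVERWRITE) {
      // No prior value, or overwrite: the new values replace whatever was there.
      doc.put(dest, new ArrayList<>(newValues));
    } else if (mode == UpdateMode.APPEND) {
      // Append: keep existing values and add the new ones after them.
      List<String> merged = new ArrayList<>(existing);
      merged.addAll(newValues);
      doc.put(dest, merged);
    }
    // SKIP with an existing value: leave the field untouched.
  }
}
```

In this sketch all three modes behave the same when the destination field is absent; they only differ once a value already exists.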
Common Configuration Parameters
- name - Unique identifier for the stage instance. Used in logging and metrics. If not specified, defaults to stage_N, where N is the position in the pipeline.
- class - Fully qualified class name of the stage implementation. Example: com.kmwllc.lucille.stage.ApplyRegex
Stage Categories
Lucille includes 60+ built-in stages organized into functional categories:
Text Processing
Stages for manipulating and transforming text content:
- Regular expression matching and extraction
- Text normalization and case conversion
- String concatenation and formatting
- Whitespace trimming
- Pattern replacement
Data Transformation
Stages for field operations and data manipulation:
- Copying and renaming fields
- Field deletion and value filtering
- Type conversion and parsing
- Static value assignment
- Field flattening and restructuring
Enrichment
Stages for enhancing documents with additional information:
- Language detection
- Dictionary and entity lookup
- Database queries
- External API calls
- Geolocation enrichment
AI & Machine Learning
Stages leveraging AI/ML models:
- OpenAI embeddings
- LLM prompting (Ollama)
- Text chunking for RAG
- Vector generation
- Entity extraction
Creating Custom Stages
To create a custom stage:- Extend the Stage class
- Define the Spec
SpecBuilder to declare required and optional parameters.
- Implement processDocument()
null if no child documents are generated.
- Handle Resources
start() and stop() for resource management to ensure proper cleanup.
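The steps above can be sketched end to end with simplified stand-ins. Nothing here is Lucille's real API: SketchSpec is a hypothetical replacement for Spec and SpecBuilder, UppercaseStage is an invented example stage, and a Map stands in for both Config and Document. The point is the shape: declare parameters once, validate them at construction, and return null from processDocument() when there are no children.

```java
import java.util.Iterator;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a Spec: the parameter names a stage accepts.
class SketchSpec {
  final Set<String> required;
  final Set<String> optional;

  SketchSpec(Set<String> required, Set<String> optional) {
    this.required = required;
    this.optional = optional;
  }

  // Fails fast on missing required parameters and on unknown parameters.
  void validate(Map<String, String> config) {
    for (String key : required) {
      if (!config.containsKey(key)) {
        throw new IllegalArgumentException("missing required parameter: " + key);
      }
    }
    for (String key : config.keySet()) {
      if (!required.contains(key) && !optional.contains(key)) {
        throw new IllegalArgumentException("unknown parameter: " + key);
      }
    }
  }
}

// Invented example stage: upper-cases the source field into the dest field.
class UppercaseStage {
  public static final SketchSpec SPEC =
      new SketchSpec(Set.of("source", "dest"), Set.of("name"));

  private final String source;
  private final String dest;

  UppercaseStage(Map<String, String> config) {
    SPEC.validate(config); // configuration errors surface before any document is seen
    this.source = config.get("source");
    this.dest = config.get("dest");
  }

  // Returns null because this stage never generates child documents.
  Iterator<Map<String, String>> processDocument(Map<String, String> doc) {
    if (doc.containsKey(source)) { // skip documents that lack the source field
      doc.put(dest, doc.get(source).toUpperCase(Locale.ROOT));
    }
    return null;
  }
}
```

This stage holds no external resources, so the sketch has no start() or stop(); a stage that opens files or connections would acquire them in start() and release them in stop().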
Best Practices
Validate Configuration
Use the Spec to validate all configuration at startup, not during document processing.
Handle Missing Fields
Always check that a field exists with doc.has(fieldName) before accessing it.
Use UpdateMode
Support the updateMode parameter to give users control over field updates.
Log Appropriately
Use appropriate log levels and include document IDs in error messages.
Performance Considerations
- Minimize I/O - Perform expensive operations (file access, network calls) in start() when possible
- Batch Operations - Process multiple fields or values together to reduce overhead
- Lazy Loading - Only load resources when first needed
- Thread Safety - Stages must be thread-safe as multiple workers may use the same instance
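Lazy loading and thread safety interact: if several workers share one stage instance, a resource loaded on first use must not be loaded twice. One common way to handle this in Java is double-checked locking on a volatile field, sketched below. LazyDictionaryStage and its hard-coded dictionary are hypothetical; the small Set stands in for an expensive file or network read.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stage that loads a dictionary on first use and shares it across threads.
class LazyDictionaryStage {
  final AtomicInteger loads = new AtomicInteger(); // counts how often the load ran
  private volatile Set<String> dictionary;         // volatile so all threads see the loaded value

  // Double-checked locking: load once, on first use, even under contention.
  private Set<String> dictionary() {
    Set<String> local = dictionary;
    if (local == null) {
      synchronized (this) {
        local = dictionary; // re-check: another thread may have loaded it meanwhile
        if (local == null) {
          loads.incrementAndGet();
          local = Set.of("lucille", "etl"); // stands in for an expensive read
          dictionary = local;
        }
      }
    }
    return local;
  }

  // Safe to call from many worker threads at once.
  boolean isKnown(String term) {
    return dictionary().contains(term);
  }
}
```

Loading in start() avoids this machinery entirely, which is why the first bullet above is usually the simpler choice when the resource is always needed.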
Error Handling
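Stages should throw StageException for processing errors. Depending on how the pipeline is configured, a failed document can then be handled in one of these ways: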
- Skip the document and continue
- Halt processing
- Route to an error handler
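The three outcomes can be sketched with a small runner. This is an illustration under stated assumptions rather than Lucille's actual behavior: SketchStageException is a hypothetical stand-in for StageException, and ErrorPolicySketch, OnError, and the errorQueue list are invented to show how a caller might skip, halt, or route on failure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for StageException.
class SketchStageException extends Exception {
  SketchStageException(String message) {
    super(message);
  }
}

class ErrorPolicySketch {
  enum OnError { SKIP, HALT, ROUTE }

  final List<String> processed = new ArrayList<>();
  final List<String> errorQueue = new ArrayList<>(); // ids of documents sent to the error handler

  // A stage body that rejects blank input and names the document in the message.
  static String parse(String docId, String value) throws SketchStageException {
    if (value.trim().isEmpty()) {
      throw new SketchStageException("doc " + docId + ": empty value");
    }
    return value.trim();
  }

  // Each entry in docs is a two-element array: {id, value}.
  void run(List<String[]> docs, OnError policy) {
    for (String[] doc : docs) {
      try {
        processed.add(parse(doc[0], doc[1]));
      } catch (SketchStageException e) {
        if (policy == OnError.HALT) {
          throw new IllegalStateException(e); // stop the whole run
        }
        if (policy == OnError.ROUTE) {
          errorQueue.add(doc[0]); // hand the document to an error handler
        }
        // SKIP: ignore this document and continue with the next one
      }
    }
  }
}
```

Including the document id in the exception message follows the logging advice under Best Practices and makes the failing document easy to find.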
Metrics
Each stage automatically tracks:
- processDocumentTime - Time spent processing documents
- errors - Count of errors encountered
- children - Count of child documents generated