Overview

Enrichment stages add context and metadata to your documents by pulling information from external sources such as dictionaries, databases, and APIs, or by running specialized detection algorithms.

DetectLanguage

Detects the language of text content using the langdetect library, which supports 53 languages.
source
List<String>
required
List of source fields containing text to analyze for language detection.
languageField
String
required
Field to store the detected language code (e.g., “en”, “es”, “fr”).
languageConfidenceField
String
default:"languageConfidence"
Field to store the confidence score of the language detection.
minLength
Integer
default:50
Minimum length of text to consider for language detection. Shorter strings are ignored.
maxLength
Integer
default:10000
Maximum length of text to consider. Longer strings are truncated.
minProbability
Double
Minimum confidence threshold for accepting a detection result. Results below this threshold are ignored.
updateMode
String
default:"overwrite"
How to handle existing field values: overwrite, append, or skip.
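
The three update modes can be pictured with a small sketch. This is not the stage's implementation: the function name `apply_update` and the exact semantics of each mode are assumptions inferred from the mode names (overwrite replaces, append adds to what is there, skip leaves a populated field alone).

```python
def apply_update(doc, field, values, update_mode="overwrite"):
    """Write `values` to `doc[field]` according to an update mode.

    Assumed semantics (inferred from the mode names, not from the stage source):
      overwrite: replace whatever is in the field
      append:    keep existing values and add the new ones
      skip:      leave the field alone if it already has a value
    """
    if update_mode not in ("overwrite", "append", "skip"):
        raise ValueError(f"unknown update mode: {update_mode}")
    existing = doc.get(field)
    if existing is None or update_mode == "overwrite":
        doc[field] = list(values)
    elif update_mode == "append":
        doc[field] = list(existing) + list(values)
    # "skip": the field is already populated, so it is left untouched
    return doc
```

Under these assumptions, a document that already has `language: ["en"]` keeps that value with skip, becomes `["en", "fr"]` with append, and becomes `["fr"]` with overwrite.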

Supported Languages

The 53 supported languages include English (en), Spanish (es), French (fr), German (de), Chinese (zh-cn, zh-tw), Japanese (ja), Korean (ko), Arabic (ar), Russian (ru), Portuguese (pt), Italian (it), and Dutch (nl).

Example: Basic Language Detection

stages:
  - class: com.kmwllc.lucille.stage.DetectLanguage
    name: detect_content_language
    source: ["content", "description"]
    languageField: language
    languageConfidenceField: language_confidence
    minProbability: 0.90

Example: Multi-field Language Detection

stages:
  - class: com.kmwllc.lucille.stage.DetectLanguage
    name: detect_language
    source:
      - title
      - body
      - abstract
    languageField: detected_language
    minLength: 100
    maxLength: 5000
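
The minLength, maxLength, and minProbability settings can be read as a gate around the detector. The sketch below illustrates that reading only. The `detector` argument is a hypothetical callable standing in for the langdetect call, and the function name is made up for this example.

```python
def detect_with_limits(text, detector, min_length=50, max_length=10000,
                       min_probability=None):
    """Return (language, confidence), or None when the text or result is rejected.

    `detector` is a hypothetical callable returning (language_code, probability);
    it stands in for the langdetect call made by the real stage.
    """
    if len(text) < min_length:
        return None                  # too short: ignored
    text = text[:max_length]         # too long: truncated, not rejected
    language, probability = detector(text)
    if min_probability is not None and probability < min_probability:
        return None                  # below the confidence threshold
    return language, probability
```

Note the asymmetry in this reading: short text is dropped entirely, while long text is still analyzed after truncation.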

DictionaryLookup

Finds exact matches in a dictionary file and extracts associated payloads. Can also function as a set membership test.
source
List<String>
required
List of source field names to check against the dictionary.
dest
List<String>
required
List of destination field names. Can be:
  • Same length as source for 1-to-1 mapping
  • Single field to collect results from all source fields
dictPath
String
required
Path to the dictionary file. The file holds one term per line, optionally followed by a comma and a payload (term, payload).
  • If path starts with classpath:, searches the classpath
  • Otherwise, searches the local filesystem
  • Supports S3, Azure, and GCP paths with appropriate configuration
usePayloads
Boolean
default:true
Whether to use payloads from the dictionary. If false, outputs the matched term instead of its payload.
ignoreCase
Boolean
default:false
Whether to perform case-insensitive matching.
setOnly
Boolean
default:false
Use as a set membership test. When true:
  • Destination field is set to boolean true/false
  • updateMode must be overwrite
  • Combine with useAnyMatch and ignoreMissingSource to control how matches are evaluated
useAnyMatch
Boolean
default:false
Only valid with setOnly=true. If true, returns true if ANY value matches. If false, returns true only if ALL values match.
ignoreMissingSource
Boolean
default:false
Only valid with setOnly=true. If true, treats missing source fields as a match.
updateMode
String
default:"overwrite"
How to handle existing destination field values. Must be left at overwrite when setOnly=true.

Example: Category Tagging

stages:
  - class: com.kmwllc.lucille.stage.DictionaryLookup
    name: tag_categories
    source: ["keywords"]
    dest: ["categories"]
    dictPath: "/data/dictionaries/category_mapping.txt"
    usePayloads: true
    ignoreCase: true

Dictionary file (category_mapping.txt):

machine learning, AI
artificial intelligence, AI
deep learning, AI
python, Programming
java, Programming
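
To make the file format and the usePayloads and ignoreCase options concrete, here is a minimal sketch of an exact-match lookup. It is not the stage's code. Splitting on the first comma, mapping a term without a payload to itself, and returning the dictionary's spelling of the term when usePayloads is false are all assumptions, and the function names are hypothetical.

```python
def load_dictionary(lines, ignore_case=False):
    """Parse 'term' or 'term, payload' lines into {key: (term, payload)}.

    Assumptions: the line is split on its first comma, and a term without
    a payload uses the term itself as its payload.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        term, _, payload = line.partition(",")
        term, payload = term.strip(), payload.strip()
        key = term.lower() if ignore_case else term
        table[key] = (term, payload or term)
    return table


def lookup(values, table, use_payloads=True, ignore_case=False):
    """Return payloads (or matched terms) for values that match exactly."""
    results = []
    for value in values:
        key = value.lower() if ignore_case else value
        if key in table:
            term, payload = table[key]
            results.append(payload if use_payloads else term)
    return results
```

With the dictionary above and case-insensitive matching, a keywords field holding "Python" and "Machine Learning" would yield "Programming" and "AI".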

Example: Set Membership Check

stages:
  - class: com.kmwllc.lucille.stage.DictionaryLookup
    name: check_valid_status
    source: ["status"]
    dest: ["is_valid_status"]
    dictPath: "/data/valid_statuses.txt"
    setOnly: true
    useAnyMatch: false  # All values must be in dictionary
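
The interaction between setOnly, useAnyMatch, and ignoreMissingSource can be sketched as follows. This is an illustration of the documented behavior under stated assumptions, not the stage's implementation: document fields are modeled as lists of strings, a missing field contributes a single match or non-match, and the function name is hypothetical.

```python
def set_membership(doc, source_fields, terms, use_any_match=False,
                   ignore_missing_source=False):
    """Return the boolean that setOnly mode would store in the destination."""
    results = []
    for field in source_fields:
        if field not in doc:
            # A missing field counts as a match only when asked to.
            results.append(ignore_missing_source)
            continue
        results.extend(value in terms for value in doc[field])
    # useAnyMatch=true needs one match; useAnyMatch=false needs all of them.
    return any(results) if use_any_match else all(results)
```

For a hypothetical dictionary containing `active` and `pending`, a status of `["active", "deleted"]` gives false with the configuration above and true once useAnyMatch is enabled.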

Example: S3 Dictionary

stages:
  - class: com.kmwllc.lucille.stage.DictionaryLookup
    name: lookup_from_s3
    source: ["product_code"]
    dest: ["product_name"]
    dictPath: "s3://my-bucket/dictionaries/products.txt"
    s3:
      region: us-east-1
      accessKey: ${AWS_ACCESS_KEY}
      secretKey: ${AWS_SECRET_KEY}

QueryDatabase

Executes SQL queries against a database to enrich documents.
jdbcUrl
String
required
JDBC connection URL for the database.
query
String
required
SQL query to execute. Can use ? placeholders for parameterized queries.
parameters
List<String>
List of field names to use as query parameters, corresponding to ? placeholders.
resultMapping
Map<String, String>
Mapping of column names from query results to document field names.
username
String
Database username.
password
String
Database password.

Example: User Lookup

stages:
  - class: com.kmwllc.lucille.stage.QueryDatabase
    name: enrich_user_data
    jdbcUrl: "jdbc:postgresql://localhost:5432/userdb"
    username: ${DB_USER}
    password: ${DB_PASS}
    query: "SELECT name, email, department FROM users WHERE user_id = ?"
    parameters: ["user_id"]
    resultMapping:
      name: user_name
      email: user_email
      department: user_department
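
The way parameters fill the ? placeholders and resultMapping renames columns can be tried out with an in-memory database. The sketch below uses Python's sqlite3 module in place of JDBC and PostgreSQL, uses made-up sample data, and assumes that only the first result row is applied. None of this is taken from the stage's source.

```python
import sqlite3


def enrich_from_query(doc, conn, query, parameters, result_mapping):
    """Run a parameterized query and copy mapped columns onto the document."""
    args = [doc[name] for name in parameters]   # one value per ? placeholder
    cursor = conn.execute(query, args)
    row = cursor.fetchone()                     # assumption: use the first row only
    if row is None:
        return doc                              # no match: document unchanged
    columns = [col[0] for col in cursor.description]
    for column, value in zip(columns, row):
        if column in result_mapping:
            doc[result_mapping[column]] = value
    return doc


# In-memory stand-in for the users table queried in the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, name TEXT, email TEXT, department TEXT)")
conn.execute("INSERT INTO users VALUES ('u1', 'Ada', 'ada@example.com', 'Research')")

doc = enrich_from_query(
    {"user_id": "u1"},
    conn,
    "SELECT name, email, department FROM users WHERE user_id = ?",
    ["user_id"],
    {"name": "user_name", "email": "user_email", "department": "user_department"},
)
print(doc)
```

Binding values through placeholders rather than string concatenation keeps document content from being interpreted as SQL.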

ElasticsearchLookup

Performs lookups against an Elasticsearch index to enrich documents.
host
String
required
Elasticsearch host URL.
index
String
required
Index name to query.
queryField
String
required
Document field to use as the query term.
searchField
String
required
Elasticsearch field to search against.
resultMapping
Map<String, String>
Mapping of Elasticsearch fields to document fields.

Example: Product Enrichment

stages:
  - class: com.kmwllc.lucille.stage.ElasticsearchLookup
    name: enrich_product_info
    host: "http://localhost:9200"
    index: products
    queryField: product_id
    searchField: id
    resultMapping:
      name: product_name
      price: product_price
      category: product_category

QueryOpensearch

Performs lookups against an OpenSearch index to enrich documents.
endpoint
String
required
OpenSearch endpoint URL.
index
String
required
Index name to query.
queryField
String
required
Document field to use for the query.
searchField
String
required
OpenSearch field to match against.
destFields
Map<String, String>
Mapping of OpenSearch response fields to document fields.
region
String
AWS region (for AWS OpenSearch).

Example: Document Enrichment from OpenSearch

stages:
  - class: com.kmwllc.lucille.stage.QueryOpensearch
    name: opensearch_lookup
    endpoint: "https://search-domain.us-east-1.es.amazonaws.com"
    region: us-east-1
    index: metadata
    queryField: doc_id
    searchField: id
    destFields:
      title: enriched_title
      summary: enriched_summary
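
As a rough picture of what such a lookup involves, the sketch below builds a search body and applies destFields to the first hit. The use of a `term` query, the `size` of 1, and both function names are assumptions made for illustration. Only the `hits.hits[]._source` response shape is standard OpenSearch.

```python
def build_search_body(search_field, value):
    """Build a search body that matches `value` exactly in `search_field`.

    Assumption: an exact-match `term` query limited to one hit.
    """
    return {"size": 1, "query": {"term": {search_field: value}}}


def apply_first_hit(doc, response, dest_fields):
    """Copy mapped _source fields from the first hit onto the document."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return doc                      # nothing found: document unchanged
    source = hits[0].get("_source", {})
    for response_field, doc_field in dest_fields.items():
        if response_field in source:
            doc[doc_field] = source[response_field]
    return doc
```

Here `dest_fields` follows the direction shown in the example: keys are response fields (title, summary) and values are the document fields that receive them.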

ExtractEntities

Extracts named entities (person names, locations, organizations) using OpenNLP.
source
String
required
Source field containing text to extract entities from.
personField
String
Destination field for person names.
locationField
String
Destination field for location names.
organizationField
String
Destination field for organization names.
updateMode
String
default:"overwrite"
How to handle existing field values.

Example: Extract Named Entities

stages:
  - class: com.kmwllc.lucille.stage.ExtractEntities
    name: extract_entities
    source: content
    personField: people
    locationField: locations
    organizationField: organizations

ExtractEntitiesFST

Extracts entities using a finite state transducer (FST) for fast dictionary-based entity recognition.
source
String
required
Source field containing text.
dest
String
required
Destination field for extracted entities.
dictPath
String
required
Path to the FST dictionary file.
updateMode
String
default:"overwrite"
How to handle existing field values.

Example: Fast Entity Extraction

stages:
  - class: com.kmwllc.lucille.stage.ExtractEntitiesFST
    name: extract_medical_terms
    source: medical_notes
    dest: medical_entities
    dictPath: "/data/fst/medical_terms.fst"
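
Dictionary-based recognition means scanning the text for phrases that appear in a fixed list. The sketch below shows the idea with a plain set of token tuples and a longest-match scan. It deliberately does not use an FST, so it shows the matching behavior but none of the speed or memory benefits. Whitespace tokenization, lowercasing, the sample phrases, and the function name are all assumptions.

```python
def extract_entities(text, phrases):
    """Return dictionary phrases found in `text`, preferring the longest match."""
    entries = {tuple(p.lower().split()) for p in phrases}
    longest = max((len(e) for e in entries), default=0)
    tokens = text.lower().split()       # naive tokenization, no punctuation handling
    found, i = [], 0
    while i < len(tokens):
        # Try the longest window first so "heart failure" beats "heart".
        for size in range(min(longest, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in entries:
                found.append(" ".join(candidate))
                i += size
                break
        else:
            i += 1                      # no phrase starts at this token
    return found
```

An FST encodes the same phrase list as a compact automaton, so matching advances one transition per token instead of retrying windows of decreasing size.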

FetchUri

Fetches content from URLs specified in document fields.
source
String
required
Field containing the URL to fetch.
dest
String
required
Destination field for fetched content.
timeout
Integer
Request timeout in milliseconds.
headers
Map<String, String>
HTTP headers to include in the request.

Example: Fetch Web Content

stages:
  - class: com.kmwllc.lucille.stage.FetchUri
    name: fetch_webpage
    source: url
    dest: html_content
    timeout: 30000
    headers:
      User-Agent: "Lucille ETL Bot 1.0"
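
For orientation, a fetch with custom headers and a millisecond timeout might look like the sketch below, written with Python's standard urllib. It is not the stage's implementation. The function names are hypothetical, and storing the raw response bytes in the destination field is an assumption.

```python
import urllib.request


def build_request(doc, source, headers=None):
    """Create the HTTP request for the URL stored in doc[source]."""
    return urllib.request.Request(doc[source], headers=headers or {})


def fetch_into(doc, source, dest, timeout_ms=30000, headers=None):
    """Fetch the URL and store the response body (bytes) in doc[dest]."""
    request = build_request(doc, source, headers)
    # urlopen expects seconds, while the stage's timeout is in milliseconds.
    with urllib.request.urlopen(request, timeout=timeout_ms / 1000) as response:
        doc[dest] = response.read()
    return doc
```

A call such as `fetch_into(doc, "url", "html_content", headers={"User-Agent": "Lucille ETL Bot 1.0"})` mirrors the configuration above.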

FetchFileContent

Fetches file content from local filesystem or cloud storage.
source
String
required
Field containing the file path.
dest
String
required
Destination field for file content.
encoding
String
Character encoding for text files.

Example: Load File Content

stages:
  - class: com.kmwllc.lucille.stage.FetchFileContent
    name: load_documents
    source: file_path
    dest: content
    encoding: UTF-8

ApplyFileHandlers

Applies file handlers to extract content and metadata from files (PDF, Word, images, etc.) using Apache Tika.
pathField
String
required
Field containing the file path or URL.
contentField
String
default:"content"
Destination field for extracted text content.
metadataPrefix
String
Prefix for metadata fields extracted from the file.

Example: Extract PDF Content

stages:
  - class: com.kmwllc.lucille.stage.ApplyFileHandlers
    name: extract_file_content
    pathField: file_path
    contentField: extracted_text
    metadataPrefix: file_meta_

CreateStaticTeaser

Creates a teaser (snippet/summary) from text content.
source
String
required
Source field containing full text.
dest
String
required
Destination field for the teaser.
length
Integer
required
Maximum length of the teaser in characters.
breakOnWord
Boolean
default:true
Whether to break on word boundaries to avoid cutting words in half.

Example: Generate Summary

stages:
  - class: com.kmwllc.lucille.stage.CreateStaticTeaser
    name: create_summary
    source: content
    dest: summary
    length: 300
    breakOnWord: true
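
The effect of breakOnWord is easiest to see on a short string. The following sketch truncates to `length` characters and, when the cut falls inside a word, backs up to the previous space. It is an illustration under assumptions, not the stage's code: a single space is treated as the word boundary, and the function name is hypothetical.

```python
def create_teaser(text, length, break_on_word=True):
    """Return at most `length` characters of `text`."""
    if len(text) <= length:
        return text                     # already short enough
    teaser = text[:length]
    if break_on_word and not text[length].isspace():
        # The cut landed inside a word: back up to the last space, if any.
        last_space = teaser.rfind(" ")
        if last_space > 0:
            teaser = teaser[:last_space]
    return teaser
```

With a length of 14, "Enrichment stages add context" becomes "Enrichment" when breaking on words and "Enrichment sta" otherwise.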

MatchQuery

Matches field values against a Lucene query string.
query
String
required
Lucene query string to match against.
resultField
String
default:"query_match"
Field to store the boolean match result.

Example: Content Classification

stages:
  - class: com.kmwllc.lucille.stage.MatchQuery
    name: classify_technical
    query: "content:(kubernetes OR docker OR microservices)"
    resultField: is_technical_content

Contains

Checks if field values contain specific substrings.
source
List<String>
required
List of source fields to check.
dest
String
required
Destination field for boolean result.
values
List<String>
required
List of values to search for.
ignoreCase
Boolean
default:false
Whether to perform case-insensitive matching.

Example: Check for Keywords

stages:
  - class: com.kmwllc.lucille.stage.Contains
    name: check_urgent
    source: ["subject", "body"]
    dest: is_urgent
    values: ["urgent", "asap", "emergency"]
    ignoreCase: true
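
The check this stage performs can be sketched in a few lines. The version below assumes that the result is true as soon as any listed value occurs as a substring of any value in any source field, and it models document fields as lists of strings. The function name is hypothetical.

```python
def contains_any(doc, source_fields, values, ignore_case=False):
    """Return True if any value in any source field contains one of `values`."""
    needles = [v.lower() for v in values] if ignore_case else list(values)
    for field in source_fields:
        for text in doc.get(field, []):
            haystack = text.lower() if ignore_case else text
            if any(needle in haystack for needle in needles):
                return True
    return False
```

For a subject of "URGENT: server down", the configuration above would set is_urgent to true. With ignoreCase left at its default of false, the same document would not match.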