Enrichment Stages

Overview

Enrichment stages add information to documents from external sources such as dictionaries, databases, APIs, and specialized detection algorithms. Use them to attach context and metadata that the original content does not carry.

DetectLanguage

Detects the language of text content using the langdetect library. Supports 53 languages.

source

List<String>

required

List of source fields containing text to analyze for language detection.

languageField

String

required

Field to store the detected language code (e.g., “en”, “es”, “fr”).

languageConfidenceField

String

default:"languageConfidence"

Field to store the confidence score of the language detection.

minLength

Integer

default:50

Minimum length of text to consider for language detection. Shorter strings are ignored.

maxLength

Integer

default:10000

Maximum length of text to consider. Longer strings are truncated.

minProbability

Double

Minimum confidence threshold for accepting a detection result. Results below this threshold are ignored.

updateMode

String

default:"overwrite"

How to handle existing field values: overwrite, append, or skip.
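
The interaction of minLength, maxLength, and minProbability can be pictured with a small Python sketch. It is not the stage's implementation: the detector is a hypothetical stand-in function, and the default of 0.0 for min_probability is an assumption, since no default is documented for minProbability.

```python
from typing import Callable, Optional, Tuple

# Hypothetical detector signature: takes text, returns (language_code, probability).
Detector = Callable[[str], Tuple[str, float]]


def detect_with_thresholds(
    text: str,
    detector: Detector,
    min_length: int = 50,
    max_length: int = 10000,
    min_probability: float = 0.0,  # assumed default; the docs list none
) -> Optional[Tuple[str, float]]:
    """Return (language, confidence), or None when the text or result is rejected."""
    if len(text) < min_length:
        return None  # too short: ignored
    candidate = text[:max_length]  # too long: truncated, not rejected
    language, probability = detector(candidate)
    if probability < min_probability:
        return None  # below the confidence threshold: ignored
    return language, probability


def fake_detector(text: str) -> Tuple[str, float]:
    # Stand-in for a real detection library; always answers "en" at 0.95.
    return "en", 0.95
```

With this sketch, a 10-character string is skipped under the default minLength of 50, and a result with probability 0.95 is dropped when minProbability is set to 0.99.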

Supported Languages

Supports 53 languages including: English (en), Spanish (es), French (fr), German (de), Chinese (zh-cn, zh-tw), Japanese (ja), Korean (ko), Arabic (ar), Russian (ru), Portuguese (pt), Italian (it), Dutch (nl), and many more.

Example: Basic Language Detection

stages:
  - class: com.kmwllc.lucille.stage.DetectLanguage
    name: detect_content_language
    source: ["content", "description"]
    languageField: language
    languageConfidenceField: language_confidence
    minProbability: 0.90

Example: Multi-field Language Detection

stages:
  - class: com.kmwllc.lucille.stage.DetectLanguage
    name: detect_language
    source:
      - title
      - body
      - abstract
    languageField: detected_language
    minLength: 100
    maxLength: 5000

DictionaryLookup

Finds exact matches in a dictionary file and extracts associated payloads. Can also function as a set membership test.

source

List<String>

required

List of source field names to check against the dictionary.

dest

List<String>

required

List of destination field names. Can be:

Same length as source for 1-to-1 mapping
Single field to collect results from all source fields

dictPath

String

required

Path to the dictionary file. The file holds one term per line, optionally followed by a comma and a payload (term, payload).

If path starts with classpath:, searches the classpath
Otherwise, searches the local filesystem
Supports S3, Azure, and GCP paths with appropriate configuration

usePayloads

Boolean

default:true

Whether to use payloads from the dictionary. If false, outputs the matched term instead of its payload.

ignoreCase

Boolean

default:false

Whether to perform case-insensitive matching.

setOnly

Boolean

default:false

Use as a set membership test. When true:

Destination field is set to boolean true/false
updateMode must be overwrite
Use with useAnyMatch and ignoreMissingSource for fine control

useAnyMatch

Boolean

default:false

Only valid with setOnly=true. If true, returns true if ANY value matches. If false, returns true only if ALL values match.

ignoreMissingSource

Boolean

default:false

Only valid with setOnly=true. If true, treats missing source fields as a match.
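
The combination of setOnly, useAnyMatch, and ignoreMissingSource described above can be sketched in Python as follows. The function name and signature are hypothetical, and the handling of an empty value list is an assumption the documentation does not cover.

```python
from typing import Iterable, Optional, Set


def set_membership(
    values: Optional[Iterable[str]],
    dictionary: Set[str],
    use_any_match: bool = False,
    ignore_missing_source: bool = False,
) -> bool:
    """Compute the boolean written to the destination field when setOnly is true."""
    if values is None:
        # Source field is absent: counts as a match only if ignoreMissingSource is set.
        return ignore_missing_source
    items = list(values)
    if use_any_match:
        return any(v in dictionary for v in items)
    # Note: all() of an empty list is True; the docs do not specify this edge case.
    return all(v in dictionary for v in items)
```

For a dictionary of {"active", "pending"}, the values ["active", "deleted"] yield false with the default useAnyMatch=false and true with useAnyMatch=true.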

updateMode

String

default:"overwrite"

How to handle existing destination field values. When setOnly=true, only overwrite is allowed.

Example: Category Tagging

stages:
  - class: com.kmwllc.lucille.stage.DictionaryLookup
    name: tag_categories
    source: ["keywords"]
    dest: ["categories"]
    dictPath: "/data/dictionaries/category_mapping.txt"
    usePayloads: true
    ignoreCase: true

Dictionary file (category_mapping.txt):

machine learning, AI
artificial intelligence, AI
deep learning, AI
python, Programming
java, Programming
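
To make the file format and the usePayloads and ignoreCase options concrete, here is a minimal Python sketch of an exact-match lookup over such a file. It is an illustration, not the stage's code: load_dictionary and lookup are hypothetical names, and splitting on the first comma is a simplification of whatever parsing the stage really performs.

```python
from typing import Dict, Iterable, List, Tuple


def load_dictionary(lines: Iterable[str], ignore_case: bool = False) -> Dict[str, Tuple[str, str]]:
    """Parse 'term' or 'term, payload' lines into key -> (term, payload)."""
    table: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        term, sep, payload = line.partition(",")
        term = term.strip()
        key = term.lower() if ignore_case else term
        # A line without a payload maps the term to itself.
        table[key] = (term, payload.strip() if sep else term)
    return table


def lookup(
    values: Iterable[str],
    table: Dict[str, Tuple[str, str]],
    use_payloads: bool = True,
    ignore_case: bool = False,
) -> List[str]:
    """Return the payload (or the dictionary term) for every value that matches exactly."""
    results: List[str] = []
    for value in values:
        key = value.lower() if ignore_case else value
        if key in table:
            term, payload = table[key]
            results.append(payload if use_payloads else term)
    return results
```

Loaded with ignore_case=True, the file above maps a keywords value of "Python" to "Programming"; with ignoreCase left at false, "Python" would not match the entry "python".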

Example: Set Membership Check

stages:
  - class: com.kmwllc.lucille.stage.DictionaryLookup
    name: check_valid_status
    source: ["status"]
    dest: ["is_valid_status"]
    dictPath: "/data/valid_statuses.txt"
    setOnly: true
    useAnyMatch: false  # All values must be in dictionary

Example: S3 Dictionary

stages:
  - class: com.kmwllc.lucille.stage.DictionaryLookup
    name: lookup_from_s3
    source: ["product_code"]
    dest: ["product_name"]
    dictPath: "s3://my-bucket/dictionaries/products.txt"
    s3:
      region: us-east-1
      accessKey: ${AWS_ACCESS_KEY}
      secretKey: ${AWS_SECRET_KEY}

QueryDatabase

Executes SQL queries against a database to enrich documents.

jdbcUrl

String

required

JDBC connection URL for the database.

query

String

required

SQL query to execute. Can use ? placeholders for parameterized queries.

parameters

List<String>

List of field names to use as query parameters, corresponding to ? placeholders.

resultMapping

Map<String, String>

Mapping of column names from query results to document field names.

username

String

Database username.

password

String

Database password.

Example: User Lookup

stages:
  - class: com.kmwllc.lucille.stage.QueryDatabase
    name: enrich_user_data
    jdbcUrl: "jdbc:postgresql://localhost:5432/userdb"
    username: ${DB_USER}
    password: ${DB_PASS}
    query: "SELECT name, email, department FROM users WHERE user_id = ?"
    parameters: ["user_id"]
    resultMapping:
      name: user_name
      email: user_email
      department: user_department
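
The way parameters bind to ? placeholders and resultMapping renames result columns can be sketched with Python's built-in sqlite3 module, which uses the same ? placeholder style. The enrich function, the in-memory table, and the sample row are hypothetical, and leaving the document unchanged when no row matches is an assumption.

```python
import sqlite3
from typing import Dict, List


def enrich(
    doc: Dict[str, object],
    conn: sqlite3.Connection,
    query: str,
    parameters: List[str],
    result_mapping: Dict[str, str],
) -> Dict[str, object]:
    # Bind document field values to the ? placeholders, in order.
    args = [doc[name] for name in parameters]
    cursor = conn.execute(query, args)
    row = cursor.fetchone()
    if row is None:
        return doc  # no match: leave the document unchanged (assumption)
    columns = [description[0] for description in cursor.description]
    for column, value in zip(columns, row):
        if column in result_mapping:
            # Copy the column value into the mapped document field.
            doc[result_mapping[column]] = value
    return doc


# Hypothetical sample data standing in for the users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, name TEXT, email TEXT, department TEXT)")
conn.execute("INSERT INTO users VALUES ('u1', 'Ada', 'ada@example.com', 'Research')")

doc = enrich(
    {"user_id": "u1"},
    conn,
    "SELECT name, email, department FROM users WHERE user_id = ?",
    ["user_id"],
    {"name": "user_name", "email": "user_email", "department": "user_department"},
)
```

Passing values as bound parameters, rather than formatting them into the SQL string, keeps document content from being interpreted as SQL.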

ElasticsearchLookup

Performs lookups against an Elasticsearch index to enrich documents.

host

String

required

Elasticsearch host URL.

index

String

required

Index name to query.

queryField

String

required

Document field to use as the query term.

searchField

String

required

Elasticsearch field to search against.

resultMapping

Map<String, String>

Mapping of Elasticsearch fields to document fields.

Example: Product Enrichment

stages:
  - class: com.kmwllc.lucille.stage.ElasticsearchLookup
    name: enrich_product_info
    host: "http://localhost:9200"
    index: products
    queryField: product_id
    searchField: id
    resultMapping:
      name: product_name
      price: product_price
      category: product_category

QueryOpensearch

Performs lookups against an OpenSearch index to enrich documents.

endpoint

String

required

OpenSearch endpoint URL.

index

String

required

Index name to query.

queryField

String

required

Document field to use for the query.

searchField

String

required

OpenSearch field to match against.

destFields

Map<String, String>

Mapping of OpenSearch response fields to document fields.

region

String

AWS region (for AWS OpenSearch).

Example: Document Enrichment from OpenSearch

stages:
  - class: com.kmwllc.lucille.stage.QueryOpensearch
    name: opensearch_lookup
    endpoint: "https://search-domain.us-east-1.es.amazonaws.com"
    region: us-east-1
    index: metadata
    queryField: doc_id
    searchField: id
    destFields:
      title: enriched_title
      summary: enriched_summary
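
As a rough mental model of this lookup, the Python sketch below builds a query body from searchField and the document's queryField value, then copies fields from the first hit according to destFields. The use of a term query, the size of 1, and both function names are assumptions made for illustration; the stage may build its request differently. The response shape (hits.hits[]._source) follows the standard OpenSearch search response.

```python
from typing import Any, Dict


def build_query(search_field: str, value: Any) -> Dict[str, Any]:
    # Assumed query shape: exact match on one field, first hit only.
    return {"size": 1, "query": {"term": {search_field: value}}}


def apply_dest_fields(
    doc: Dict[str, Any],
    response: Dict[str, Any],
    dest_fields: Dict[str, str],
) -> Dict[str, Any]:
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return doc  # nothing found: document is left as-is (assumption)
    source = hits[0].get("_source", {})
    for response_field, doc_field in dest_fields.items():
        if response_field in source:
            doc[doc_field] = source[response_field]
    return doc


# Hypothetical document and response, mirroring the configuration above.
doc = {"doc_id": "42"}
body = build_query("id", doc["doc_id"])
response = {"hits": {"hits": [{"_source": {"id": "42", "title": "Intro", "summary": "Short"}}]}}
doc = apply_dest_fields(doc, response, {"title": "enriched_title", "summary": "enriched_summary"})
```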

ExtractEntities

Extracts named entities (person names, locations, organizations) using OpenNLP.

source

String

required

Source field containing text to extract entities from.

personField

String

Destination field for person names.

locationField

String

Destination field for location names.

organizationField

String

Destination field for organization names.

updateMode

String

default:"overwrite"

How to handle existing field values.

Example: Extract Named Entities

stages:
  - class: com.kmwllc.lucille.stage.ExtractEntities
    name: extract_entities
    source: content
    personField: people
    locationField: locations
    organizationField: organizations

ExtractEntitiesFST

Extracts entities using Finite State Transducer (FST) for fast dictionary-based entity recognition.

source

String

required

Source field containing text.

dest

String

required

Destination field for extracted entities.

dictPath

String

required

Path to the FST dictionary file.

updateMode

String

default:"overwrite"

How to handle existing field values.

Example: Fast Entity Extraction

stages:
  - class: com.kmwllc.lucille.stage.ExtractEntitiesFST
    name: extract_medical_terms
    source: medical_notes
    dest: medical_entities
    dictPath: "/data/fst/medical_terms.fst"
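
Dictionary-based recognition of this kind can be illustrated with a token trie and longest-match scanning. The Python sketch below uses a plain trie as a simplified stand-in for a real FST, splits on whitespace only, and ignores punctuation; the function names and sample phrases are hypothetical.

```python
from typing import Dict, List

END = None  # sentinel key marking the end of an entry; never equals a string token


def build_trie(phrases: List[str]) -> Dict:
    """Index each phrase token by token in a nested dict."""
    root: Dict = {}
    for phrase in phrases:
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[END] = phrase  # remember the original phrase at its final token
    return root


def extract(text: str, trie: Dict) -> List[str]:
    """Scan left to right, emitting the longest dictionary entry at each position."""
    tokens = text.lower().split()
    found: List[str] = []
    i = 0
    while i < len(tokens):
        node, match, match_end = trie, None, i
        j = i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                match, match_end = node[END], j  # longest match so far
        if match is not None:
            found.append(match)
            i = match_end  # continue after the matched phrase
        else:
            i += 1
    return found


trie = build_trie(["heart attack", "heart", "type 2 diabetes"])
entities = extract("Patient had a heart attack and type 2 diabetes", trie)
```

Because each token is visited a bounded number of times regardless of dictionary size, this style of matching stays fast even with large term lists, which is the appeal of the FST approach.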

FetchUri

Fetches content from URLs specified in document fields.

source

String

required

Field containing the URL to fetch.

dest

String

required

Destination field for fetched content.

timeout

Integer

Request timeout in milliseconds.

headers

Map<String, String>

HTTP headers to include in the request.

Example: Fetch Web Content

stages:
  - class: com.kmwllc.lucille.stage.FetchUri
    name: fetch_webpage
    source: url
    dest: html_content
    timeout: 30000
    headers:
      User-Agent: "Lucille ETL Bot 1.0"
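
For comparison, the same fetch expressed with Python's standard urllib looks roughly like the sketch below. It only illustrates how a millisecond timeout and custom headers apply to a request; build_request and fetch are hypothetical helpers, not part of the stage.

```python
import urllib.request
from typing import Dict, Optional


def build_request(url: str, headers: Optional[Dict[str, str]] = None) -> urllib.request.Request:
    """Attach the configured headers to a GET request."""
    return urllib.request.Request(url, headers=headers or {})


def fetch(url: str, timeout_ms: int = 30000, headers: Optional[Dict[str, str]] = None) -> bytes:
    request = build_request(url, headers)
    # urlopen takes its timeout in seconds, so convert from milliseconds.
    with urllib.request.urlopen(request, timeout=timeout_ms / 1000) as response:
        return response.read()


request = build_request("http://example.com/page", {"User-Agent": "Lucille ETL Bot 1.0"})
```

Calling fetch("http://example.com/page", timeout_ms=30000) would perform the network request; the example above stops at building the request so it runs offline.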

FetchFileContent

Fetches file content from local filesystem or cloud storage.

source

String

required

Field containing the file path.

dest

String

required

Destination field for file content.

encoding

String

Character encoding for text files.

Example: Load File Content

stages:
  - class: com.kmwllc.lucille.stage.FetchFileContent
    name: load_documents
    source: file_path
    dest: content
    encoding: UTF-8

ApplyFileHandlers

Applies file handlers to extract content and metadata from files (PDF, Word, images, etc.) using Apache Tika.

pathField

String

required

Field containing the file path or URL.

contentField

String

default:"content"

Destination field for extracted text content.

metadataPrefix

String

Prefix for metadata fields extracted from the file.

Example: Extract PDF Content

stages:
  - class: com.kmwllc.lucille.stage.ApplyFileHandlers
    name: extract_file_content
    pathField: file_path
    contentField: extracted_text
    metadataPrefix: file_meta_

CreateStaticTeaser

Creates a teaser (snippet/summary) from text content.

source

String

required

Source field containing full text.

dest

String

required

Destination field for the teaser.

length

Integer

required

Maximum length of the teaser in characters.

breakOnWord

Boolean

default:true

Whether to break on word boundaries to avoid cutting words in half.

Example: Generate Summary

stages:
  - class: com.kmwllc.lucille.stage.CreateStaticTeaser
    name: create_summary
    source: content
    dest: summary
    length: 300
    breakOnWord: true
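
The effect of length and breakOnWord can be sketched in a few lines of Python. This is an illustration under one assumption: when the cut lands inside a word, the teaser backs up to the previous space instead of extending past the limit. The function name is hypothetical.

```python
def create_teaser(text: str, length: int, break_on_word: bool = True) -> str:
    """Truncate text to at most `length` characters, optionally on a word boundary."""
    if len(text) <= length:
        return text  # already short enough
    teaser = text[:length]
    if break_on_word and not text[length].isspace():
        # The cut landed inside a word: back up to the last space.
        cut = teaser.rfind(" ")
        if cut > 0:
            teaser = teaser[:cut]
    return teaser.rstrip()


text = "The quick brown fox jumps"
word_safe = create_teaser(text, 12)
hard_cut = create_teaser(text, 12, break_on_word=False)
```

Here word_safe is "The quick" and hard_cut is "The quick br", which shows why breakOnWord defaults to true.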

MatchQuery

Matches field values against a Lucene query string.

query

String

required

Lucene query string to match against.

resultField

String

default:"query_match"

Field to store the boolean match result.

Example: Content Classification

stages:
  - class: com.kmwllc.lucille.stage.MatchQuery
    name: classify_technical
    query: "content:(kubernetes OR docker OR microservices)"
    resultField: is_technical_content

Contains

Checks if field values contain specific substrings.

source

List<String>

required

List of source fields to check.

dest

String

required

Destination field for boolean result.

values

List<String>

required

List of values to search for.

ignoreCase

Boolean

default:false

Whether to perform case-insensitive matching.

Example: Check for Keywords

stages:
  - class: com.kmwllc.lucille.stage.Contains
    name: check_urgent
    source: ["subject", "body"]
    dest: is_urgent
    values: ["urgent", "asap", "emergency"]
    ignoreCase: true
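
The check performed by this configuration amounts to a substring search over every value of every source field. A minimal Python sketch, with a hypothetical function name and a document modeled as a dict of string lists:

```python
from typing import Dict, Iterable, List


def contains_any(
    doc: Dict[str, List[str]],
    source: Iterable[str],
    values: Iterable[str],
    ignore_case: bool = False,
) -> bool:
    """Return True if any source field value contains any of the given substrings."""
    needles = [v.lower() if ignore_case else v for v in values]
    for field in source:
        for text in doc.get(field, []):  # missing fields contribute nothing
            haystack = text.lower() if ignore_case else text
            if any(needle in haystack for needle in needles):
                return True
    return False


doc = {"subject": ["URGENT: server down"], "body": ["Please take a look."]}
is_urgent = contains_any(doc, ["subject", "body"], ["urgent", "asap", "emergency"], ignore_case=True)
```

Without ignore_case, the same call returns False, because "URGENT" and "urgent" differ in case.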
