Overview

Text processing stages manipulate string content in document fields. These stages handle common text operations like pattern matching, normalization, concatenation, and formatting.

ApplyRegex

Extracts text based on regular expressions. Supports capturing groups and multiple regex flags.
source
List<String>
required
List of source field names to apply regex extraction to.
dest
List<String>
required
List of destination field names. Can be:
  • Same length as source for 1-to-1 mapping
  • Single field to collect results from all source fields
regex
String
required
Regular expression to find matches. If the regex includes capturing groups, the value of the first group will be used.
updateMode
String
default:"overwrite"
How to handle existing destination field values: overwrite, append, or skip.
ignoreCase
Boolean
default:false
Whether the regex matcher should ignore case.
multiline
Boolean
default:false
Enables MULTILINE functionality for the regex matcher (^ and $ match at line boundaries, not only at the start and end of the input).
dotall
Boolean
default:false
Enables DOTALL functionality for the regex matcher (. matches newlines).
literal
Boolean
default:false
Treats the regex expression as a literal string rather than a pattern.
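
The flags and the two dest shapes can be combined in a single stage. The following is a sketch rather than a tested configuration: the stage name and the field names (subject, body, subject_ticket, body_ticket) are hypothetical, and the skip behavior noted in the comments is assumed from the updateMode description above.

```yaml
stages:
  - class: com.kmwllc.lucille.stage.ApplyRegex
    name: extract_ticket_ids                 # hypothetical stage name
    source: ["subject", "body"]              # hypothetical source fields
    dest: ["subject_ticket", "body_ticket"]  # same length as source: 1-to-1 mapping
    regex: "ticket[- ]?(\\d+)"               # first capturing group (the digits) is the extracted value
    ignoreCase: true                         # "Ticket-42" and "TICKET 42" should match as well
    updateMode: skip                         # assumed: destinations that already hold a value are left alone
```

Because dest has the same length as source, each source field writes to its own destination instead of being pooled into one field.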

Example: Extract Email Addresses

stages:
  - class: com.kmwllc.lucille.stage.ApplyRegex
    name: extract_emails
    source: ["content", "description"]
    dest: ["emails"]
    regex: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
    updateMode: append

Example: Extract with Capturing Group

stages:
  - class: com.kmwllc.lucille.stage.ApplyRegex
    name: extract_phone_area_code
    source: ["phone"]
    dest: ["area_code"]
    regex: "\\((\\d{3})\\)"
    # Captures the digits inside parentheses
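
When the text to find contains regex metacharacters, the literal flag avoids escaping them by hand. A minimal sketch, assuming a hypothetical title field that may contain the marker "(draft)":

```yaml
stages:
  - class: com.kmwllc.lucille.stage.ApplyRegex
    name: find_draft_marker   # hypothetical stage name
    source: ["title"]         # hypothetical source field
    dest: ["draft_marker"]    # hypothetical destination field
    regex: "(draft)"
    literal: true             # parentheses match themselves instead of forming a capturing group
    # Without literal: true, the same match would need the escaped form "\\(draft\\)".
```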

Concatenate

Replaces wildcards in a format string with field values to create concatenated output.
dest
String
required
Destination field for the concatenated result.
formatString
String
required
Format string with field placeholders wrapped in {}. Example: "{city}, {state}, {country}" → "Boston, MA, USA"
defaultInputs
Map<String, String>
Mapping of field names to default values. If a field is missing or empty, the default value will be used. If no default is provided, the placeholder remains in the output.
updateMode
String
default:"overwrite"
How to handle existing destination field values: overwrite, append, or skip.
If a field is multivalued, only the first value will be used in the concatenation.

Example: Create Display Name

stages:
  - class: com.kmwllc.lucille.stage.Concatenate
    name: create_display_name
    dest: display_name
    formatString: "{first_name} {middle_initial}. {last_name}"
    defaultInputs:
      middle_initial: "X"

Example: Format Address

stages:
  - class: com.kmwllc.lucille.stage.Concatenate
    name: format_address
    dest: full_address
    formatString: "{street}, {city}, {state} {zip}"
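
The interaction between defaultInputs and missing fields is easiest to see with a concrete format string. This sketch uses hypothetical field names (author, year, title, citation); the outcomes described in the comments follow from the parameter descriptions above, not from a test run.

```yaml
stages:
  - class: com.kmwllc.lucille.stage.Concatenate
    name: build_citation          # hypothetical stage name
    dest: citation                # hypothetical destination field
    formatString: "{author} ({year}). {title}"
    defaultInputs:
      author: "Unknown author"    # used when author is missing or empty
      year: "n.d."                # used when year is missing or empty
    # title has no default, so a document without a title would keep the
    # literal "{title}" placeholder in the output.
    # A document with only title "Field Notes" should yield
    # "Unknown author (n.d.). Field Notes".
```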

NormalizeText

Normalizes text case using four different modes: lowercase, uppercase, title case, and sentence case.
source
List<String>
required
List of source field names.
dest
List<String>
required
List of destination field names. Can be:
  • Same length as source for 1-to-1 mapping
  • Single field to collect results from all source fields
mode
String
required
Normalization mode:
  • lowercase - Convert all text to lowercase
  • uppercase - Convert all text to UPPERCASE
  • title_case - Convert To Title Case (first letter of each word capitalized)
  • sentence_case - Convert to sentence case (first letter of each sentence capitalized)
updateMode
String
default:"overwrite"
How to handle existing destination field values: overwrite, append, or skip.
This stage will not preserve original capitalization. Proper nouns, abbreviations, and acronyms may not be correctly capitalized after normalization.

Example: Lowercase Normalization

stages:
  - class: com.kmwllc.lucille.stage.NormalizeText
    name: lowercase_content
    source: ["title", "description"]
    dest: ["title_normalized", "description_normalized"]
    mode: lowercase

Example: Title Case Formatting

stages:
  - class: com.kmwllc.lucille.stage.NormalizeText
    name: format_title
    source: ["product_name"]
    dest: ["product_name_display"]
    mode: title_case
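
Both examples above use a 1-to-1 mapping. The single-destination form described for dest might look like the following sketch, in which the field names (country_code, region_code, codes_upper) are hypothetical:

```yaml
stages:
  - class: com.kmwllc.lucille.stage.NormalizeText
    name: collect_uppercase_codes            # hypothetical stage name
    source: ["country_code", "region_code"]  # hypothetical source fields
    dest: ["codes_upper"]                    # single field collects results from both sources
    mode: uppercase
    updateMode: append                       # add to any values already in codes_upper
```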

TrimWhitespace

Removes leading and trailing whitespace from field values.
fields
List<String>
required
List of fields to trim whitespace from. At least one field must be specified.

Example: Clean Input Fields

stages:
  - class: com.kmwllc.lucille.stage.TrimWhitespace
    name: clean_fields
    fields:
      - name
      - email
      - address
      - description
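
TrimWhitespace only touches the ends of a value, so it pairs naturally with ReplacePatterns when interior whitespace also needs cleaning. The sketch below assumes that stages run in the order listed and uses a hypothetical address field; the sample transformation in the final comment is what the two descriptions imply, not a verified result.

```yaml
stages:
  # Step 1: collapse each run of whitespace into a single space
  - class: com.kmwllc.lucille.stage.ReplacePatterns
    name: collapse_whitespace   # hypothetical stage name
    source: ["address"]         # hypothetical field, rewritten in place
    dest: ["address"]
    patterns:
      - "\\s+"
    replacements:
      - " "
  # Step 2: remove the single leading or trailing space that step 1 can leave behind
  - class: com.kmwllc.lucille.stage.TrimWhitespace
    name: trim_address          # hypothetical stage name
    fields:
      - address
  # "  12  Main   St  " should end up as "12 Main St"
```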

ReplacePatterns

Replaces text patterns using regular expressions with specified replacement strings.
source
List<String>
required
List of source field names.
dest
List<String>
required
List of destination field names. Must be the same length as source, or a single field.
patterns
List<String>
required
List of regex patterns to search for.
replacements
List<String>
required
List of replacement strings corresponding to patterns. Must be same length as patterns.
updateMode
String
default:"overwrite"
How to handle existing destination field values.

Example: Clean HTML

stages:
  - class: com.kmwllc.lucille.stage.ReplacePatterns
    name: remove_html_tags
    source: ["content"]
    dest: ["content_clean"]
    patterns:
      - "<[^>]+>"  # HTML tags
      - "&nbsp;"   # Non-breaking space entity
    replacements:
      - ""         # Tags are removed
      - " "        # Entity becomes a regular space

Example: Normalize Whitespace

stages:
  - class: com.kmwllc.lucille.stage.ReplacePatterns
    name: normalize_whitespace
    source: ["text"]
    dest: ["text"]
    patterns:
      - "\\s+"  # Multiple whitespace
    replacements:
      - " "    # Single space
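
Since patterns and replacements are paired by position, the same stage can also mask sensitive values. This is a sketch with hypothetical field names (notes, notes_redacted) and placeholder tokens; that the patterns are applied in list order is an assumption, although these two patterns do not overlap, so the order should not matter here.

```yaml
stages:
  - class: com.kmwllc.lucille.stage.ReplacePatterns
    name: redact_identifiers    # hypothetical stage name
    source: ["notes"]           # hypothetical source field
    dest: ["notes_redacted"]    # hypothetical destination; the original notes field is kept
    patterns:
      - "\\b\\d{3}-\\d{2}-\\d{4}\\b"                        # digits in 3-2-4 groups, such as 123-45-6789
      - "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"   # email addresses
    replacements:
      - "[ID]"      # pairs with the first pattern
      - "[EMAIL]"   # pairs with the second pattern
```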

RemoveDiacritics

Removes diacritical marks (accents) from text, converting characters to their base forms.
fields
List<String>
required
List of fields to remove diacritics from.

Example: Normalize International Text

stages:
  - class: com.kmwllc.lucille.stage.RemoveDiacritics
    name: remove_accents
    fields: ["name", "description"]
    # "café" → "cafe"
    # "naïve" → "naive"
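
RemoveDiacritics takes only a fields list, so it appears to rewrite values in place. To keep the accented original for display, one option is to write a normalized copy first and strip accents from the copy. The sketch assumes that stages run in the order listed; name_key is a hypothetical field.

```yaml
stages:
  # Step 1: write a lowercase copy of name into a separate field
  - class: com.kmwllc.lucille.stage.NormalizeText
    name: lowercase_name_key    # hypothetical stage name
    source: ["name"]
    dest: ["name_key"]          # hypothetical field used for matching or sorting
    mode: lowercase
  # Step 2: strip accents from the copy only
  - class: com.kmwllc.lucille.stage.RemoveDiacritics
    name: strip_accents_from_key
    fields: ["name_key"]
  # name "Émile Zola" should stay unchanged while name_key becomes "emile zola"
```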

SplitFieldValues

Splits field values into multiple values using a delimiter or regex pattern.
fields
List<String>
required
List of fields to split.
delimiter
String
String delimiter to split on. Either delimiter or regex must be specified.
regex
String
Regular expression pattern to split on. Either delimiter or regex must be specified.

Example: Split Tags

stages:
  - class: com.kmwllc.lucille.stage.SplitFieldValues
    name: split_tags
    fields: ["tags"]
    delimiter: ","
    # "tag1,tag2,tag3" → ["tag1", "tag2", "tag3"]

Example: Split by Whitespace

stages:
  - class: com.kmwllc.lucille.stage.SplitFieldValues
    name: tokenize_text
    fields: ["keywords"]
    regex: "\\s+"
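
A regex is also useful when the delimiter itself varies. The sketch below splits on commas or semicolons and absorbs the whitespace around them, so the resulting values need no separate trimming. The field name colors is hypothetical, and the sample result assumes conventional regex split semantics.

```yaml
stages:
  - class: com.kmwllc.lucille.stage.SplitFieldValues
    name: split_mixed_delimiters   # hypothetical stage name
    fields: ["colors"]             # hypothetical field
    regex: "\\s*[,;]\\s*"          # a comma or semicolon plus any surrounding whitespace
    # "red, green;blue" should become ["red", "green", "blue"]
```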

TruncateField

Truncates field values to a maximum length.
fields
List<String>
required
List of fields to truncate.
maxLength
Integer
required
Maximum length for field values. Values longer than this will be truncated.

Example: Limit Description Length

stages:
  - class: com.kmwllc.lucille.stage.TruncateField
    name: truncate_summary
    fields: ["summary", "description"]
    maxLength: 500
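
Since maxLength is set once per stage and applies to every field in fields, different limits call for separate stage entries. A sketch with hypothetical field names and limits:

```yaml
stages:
  - class: com.kmwllc.lucille.stage.TruncateField
    name: truncate_title     # hypothetical stage name
    fields: ["title"]        # hypothetical field with a short limit
    maxLength: 120
  - class: com.kmwllc.lucille.stage.TruncateField
    name: truncate_body      # hypothetical stage name
    fields: ["body"]         # hypothetical field with a longer limit
    maxLength: 2000
```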

ExtractFirstCharacter

Extracts the first character from field values.
source
List<String>
required
List of source field names.
dest
List<String>
required
List of destination field names.
updateMode
String
default:"overwrite"
How to handle existing destination field values.

Example: Create Alphabetical Index

stages:
  - class: com.kmwllc.lucille.stage.ExtractFirstCharacter
    name: create_alpha_index
    source: ["last_name"]
    dest: ["last_name_initial"]
    # "Smith" → "S"
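
The extracted character presumably keeps the case of the source value, so "smith" and "Smith" could land in different index buckets. One way to avoid that is to uppercase the initial afterwards. This sketch assumes that stages run in the order listed and reuses the field names from the example above.

```yaml
stages:
  - class: com.kmwllc.lucille.stage.ExtractFirstCharacter
    name: create_alpha_index
    source: ["last_name"]
    dest: ["last_name_initial"]
  # Uppercase the initial in place so "smith" and "Smith" share the bucket "S"
  - class: com.kmwllc.lucille.stage.NormalizeText
    name: uppercase_initial          # hypothetical stage name
    source: ["last_name_initial"]
    dest: ["last_name_initial"]
    mode: uppercase
```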

ApplyJSoup

Parses HTML content using JSoup and extracts elements based on CSS selectors.
source
String
required
Source field containing HTML content.
dest
String
required
Destination field for extracted content.
selector
String
required
CSS selector to identify elements to extract.
attribute
String
HTML attribute to extract. If not specified, extracts text content.
updateMode
String
default:"overwrite"
How to handle existing destination field values.

Example: Extract Links

stages:
  - class: com.kmwllc.lucille.stage.ApplyJSoup
    name: extract_links
    source: html_content
    dest: links
    selector: "a"
    attribute: "href"

Example: Extract Article Text

stages:
  - class: com.kmwllc.lucille.stage.ApplyJSoup
    name: extract_article
    source: html
    dest: article_text
    selector: "article.main-content"
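
Attribute selectors combine well with the attribute parameter for pulling metadata out of the page head. A sketch, assuming the HTML contains a standard meta description tag; the field names are hypothetical.

```yaml
stages:
  - class: com.kmwllc.lucille.stage.ApplyJSoup
    name: extract_meta_description        # hypothetical stage name
    source: html                          # hypothetical field holding the raw HTML
    dest: meta_description                # hypothetical destination field
    selector: "meta[name=description]"    # CSS attribute selector
    attribute: "content"                  # read the content attribute instead of the element text
    updateMode: skip                      # assumed: keep a description that is already present
```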

XPathExtractor

Extracts data from XML using XPath expressions.
source
String
required
Source field containing XML content.
xpathMapping
Map<String, String>
required
Mapping of XPath expressions to destination field names.
updateMode
String
default:"overwrite"
How to handle existing destination field values.

Example: Extract XML Metadata

stages:
  - class: com.kmwllc.lucille.stage.XPathExtractor
    name: extract_metadata
    source: xml_content
    xpathMapping:
      "/book/title": book_title
      "/book/author": book_author
      "/book/isbn": isbn
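
XPath expressions can also address attributes and descendants at any depth. The sketch below assumes an XML layout with an id attribute on the root book element and nested chapter elements; the destination names are hypothetical, and how multiple matches are stored is not specified above.

```yaml
stages:
  - class: com.kmwllc.lucille.stage.XPathExtractor
    name: extract_book_details       # hypothetical stage name
    source: xml_content
    xpathMapping:
      "/book/@id": book_id                 # attribute on the root element
      "//chapter/title": chapter_titles    # title of every chapter, at any depth
    updateMode: append
```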