Text Processing Stages

Overview

Text processing stages manipulate string content in document fields. These stages handle common text operations like pattern matching, normalization, concatenation, and formatting.

ApplyRegex

Extracts text based on regular expressions. Supports capturing groups and multiple regex flags.

source

List<String>

required

List of source field names to apply regex extraction to.

dest

List<String>

required

List of destination field names. Either the same length as source (1-to-1 mapping) or a single field that collects results from all source fields.

regex

String

required

Regular expression to find matches. If the regex includes capturing groups, the value of the first group will be used.

updateMode

String

default:"overwrite"

How to handle existing destination field values: overwrite, append, or skip.

ignoreCase

Boolean

default:false

Whether the regex matcher should ignore case.

multiline

Boolean

default:false

Whether the regex matcher should allow matches across multiple lines.

dotall

Boolean

default:false

Enables DOTALL mode for the regex matcher, in which . also matches newline characters.

literal

Boolean

default:false

Treats the regex as a literal string rather than a pattern.
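The four Boolean flags are easiest to compare side by side. The sketch below uses Python's `re` module as a stand-in for the stage's regex engine and assumes each flag behaves like the conventional regex option of the same name, which this page does not confirm. `build_pattern` is a hypothetical helper for illustration, not part of Lucille.

```python
import re


def build_pattern(regex, ignore_case=False, multiline=False, dotall=False, literal=False):
    """Compile `regex` with flags that mirror the ApplyRegex options (assumed semantics)."""
    flags = 0
    if ignore_case:
        flags |= re.IGNORECASE    # "error" also matches "Error"
    if multiline:
        flags |= re.MULTILINE     # ^ and $ match at every line boundary
    if dotall:
        flags |= re.DOTALL        # . also matches newline characters
    if literal:
        regex = re.escape(regex)  # metacharacters lose their special meaning
    return re.compile(regex, flags)


text = "status: ok\nError: disk full"

print(build_pattern("error").search(text))                            # None (case differs)
print(build_pattern("error", ignore_case=True).search(text).group())  # Error
print(build_pattern("^Error").search(text))                           # None (^ is start of text)
print(build_pattern("^Error", multiline=True).search(text).group())   # Error
print(build_pattern("ok.Error").search(text))                         # None (. stops at newline)
print(repr(build_pattern("ok.Error", dotall=True).search(text).group()))  # 'ok\nError'
print(build_pattern("a.c", literal=True).search("abc"))               # None (. is a plain dot)
```

Under these assumptions, `literal` is the flag to reach for when the search text comes from user input and may contain characters such as `.` or `(`.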

Example: Extract Email Addresses

stages:
  - class: com.kmwllc.lucille.stage.ApplyRegex
    name: extract_emails
    source: ["content", "description"]
    dest: ["emails"]
    regex: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
    updateMode: append

Example: Extract with Capturing Group

stages:
  - class: com.kmwllc.lucille.stage.ApplyRegex
    name: extract_phone_area_code
    source: ["phone"]
    dest: ["area_code"]
    regex: "\\((\\d{3})\\)"
    # Captures the digits inside parentheses
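Two rules from the parameter descriptions can be sketched in a few lines of Python: how source fields pair up with destination fields, and how the first capturing group takes precedence over the whole match. `map_fields` and `extract_matches` are hypothetical names. Whether the stage keeps every match or only the first one per value is not stated on this page; the sketch keeps all of them.

```python
import re


def map_fields(source, dest):
    """Pair each source field with its destination field."""
    if len(dest) == len(source):
        return list(zip(source, dest))           # 1-to-1 mapping
    if len(dest) == 1:
        return [(s, dest[0]) for s in source]    # every source feeds one field
    raise ValueError("dest must match source in length or name a single field")


def extract_matches(regex, text):
    """Return all matches; if the regex has capturing groups, keep only group 1."""
    pattern = re.compile(regex)
    group = 1 if pattern.groups > 0 else 0
    return [m.group(group) for m in pattern.finditer(text)]


EMAIL = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

print(map_fields(["content", "description"], ["emails"]))
# [('content', 'emails'), ('description', 'emails')]
print(extract_matches(r"\((\d{3})\)", "(617) 555-0100"))
# ['617']
print(extract_matches(EMAIL, "Mail ann@example.com or bob@example.org."))
# ['ann@example.com', 'bob@example.org']
```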

Concatenate

Replaces placeholders in a format string with field values to create concatenated output.

dest

String

required

Destination field for the concatenated result.

formatString

String

required

Format string with field placeholders wrapped in {}. Example: "{city}, {state}, {country}" → "Boston, MA, USA"

defaultInputs

Map<String, String>

Mapping of field names to default values. If a field is missing or empty, the default value will be used. If no default is provided, the placeholder remains in the output.

updateMode

String

default:"overwrite"

How to handle existing destination field values: overwrite, append, or skip.

If a field is multivalued, only the first value will be used in the concatenation.
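The substitution rules above (first value only, defaults for missing or empty fields, untouched placeholders otherwise) can be traced with a small Python sketch. It is a reading of the documented behavior, not Lucille's implementation; `concatenate` and the sample document are made up for illustration, and field values are modeled as lists of strings.

```python
import re

PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def concatenate(doc, format_string, default_inputs=None):
    """Fill {field} placeholders from `doc`, whose values are lists of strings."""
    defaults = default_inputs or {}

    def resolve(match):
        name = match.group(1)
        values = doc.get(name) or []
        if values and values[0] != "":
            return values[0]          # multivalued: only the first value is used
        if name in defaults:
            return defaults[name]     # missing or empty: fall back to the default
        return match.group(0)         # no default: the placeholder stays as-is

    return PLACEHOLDER.sub(resolve, format_string)


doc = {"first_name": ["Ada"], "last_name": ["Lovelace", "King"]}
fmt = "{first_name} {middle_initial}. {last_name}"

print(concatenate(doc, fmt, {"middle_initial": "X"}))  # Ada X. Lovelace
print(concatenate(doc, fmt))                           # Ada {middle_initial}. Lovelace
```

The second call shows why a default is worth supplying for optional fields: without one, the literal `{middle_initial}` ends up in the output.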

Example: Create Display Name

stages:
  - class: com.kmwllc.lucille.stage.Concatenate
    name: create_display_name
    dest: display_name
    formatString: "{first_name} {middle_initial}. {last_name}"
    defaultInputs:
      middle_initial: "X"

Example: Format Address

stages:
  - class: com.kmwllc.lucille.stage.Concatenate
    name: format_address
    dest: full_address
    formatString: "{street}, {city}, {state} {zip}"

NormalizeText

Normalizes text case using one of four modes: lowercase, uppercase, title case, or sentence case.

source

List<String>

required

List of source field names.

dest

List<String>

required

List of destination field names. Either the same length as source (1-to-1 mapping) or a single field that collects results from all source fields.

mode

String

required

Normalization mode:

lowercase - Convert all text to lowercase
uppercase - Convert all text to UPPERCASE
title_case - Convert To Title Case (first letter of each word capitalized)
sentence_case - Convert to sentence case (first letter of each sentence capitalized)

updateMode

String

default:"overwrite"

How to handle existing destination field values: overwrite, append, or skip.

This stage will not preserve original capitalization. Proper nouns, abbreviations, and acronyms may not be correctly capitalized after normalization.
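To make the four modes and the capitalization caveat concrete, here is a minimal Python sketch. The word and sentence boundaries it uses (whitespace-separated words, sentences ending in `.`, `!`, or `?`) are assumptions; the stage may draw those boundaries differently.

```python
import re


def title_case(text):
    """Lowercase everything, then capitalize the first letter of each word."""
    return re.sub(r"\S+", lambda m: m.group(0).capitalize(), text.lower())


def sentence_case(text):
    """Lowercase everything, then capitalize the first letter of each sentence."""
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text.lower(),
    )


MODES = {
    "lowercase": str.lower,
    "uppercase": str.upper,
    "title_case": title_case,
    "sentence_case": sentence_case,
}

sample = "the NASA probe landed. it sent DATA back."
for mode, normalize in MODES.items():
    print(mode, "->", normalize(sample))
# lowercase -> the nasa probe landed. it sent data back.
# uppercase -> THE NASA PROBE LANDED. IT SENT DATA BACK.
# title_case -> The Nasa Probe Landed. It Sent Data Back.
# sentence_case -> The nasa probe landed. It sent data back.
```

Note how the acronym "NASA" comes out as "Nasa" or "nasa" in the last two modes, which is the loss of original capitalization described above.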

Example: Normalize for Search

stages:
  - class: com.kmwllc.lucille.stage.NormalizeText
    name: lowercase_content
    source: ["title", "description"]
    dest: ["title_normalized", "description_normalized"]
    mode: lowercase

Example: Title Case Formatting

stages:
  - class: com.kmwllc.lucille.stage.NormalizeText
    name: format_title
    source: ["product_name"]
    dest: ["product_name_display"]
    mode: title_case

TrimWhitespace

Removes leading and trailing whitespace from field values.

fields

List<String>

required

List of fields to trim whitespace from. At least one field must be specified.

Example: Clean Input Fields

stages:
  - class: com.kmwllc.lucille.stage.TrimWhitespace
    name: clean_fields
    fields:
      - name
      - email
      - address
      - description

ReplacePatterns

Replaces text patterns using regular expressions with specified replacement strings.

source

List<String>

required

List of source field names.

dest

List<String>

required

List of destination field names. Must be the same length as source, or a single field.

patterns

List<String>

required

List of regex patterns to search for.

replacements

List<String>

required

List of replacement strings corresponding to patterns. Must be same length as patterns.

updateMode

String

default:"overwrite"

How to handle existing destination field values: overwrite, append, or skip.
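Because patterns and replacements are parallel lists, it helps to see them applied as pairs. The Python sketch below assumes the pairs run one after another in list order, each on the output of the previous one; that ordering is an assumption, and `replace_patterns` is a hypothetical helper.

```python
import re


def replace_patterns(text, patterns, replacements):
    """Apply each regex/replacement pair to `text`, one after another, in list order."""
    if len(patterns) != len(replacements):
        raise ValueError("patterns and replacements must have the same length")
    for pattern, replacement in zip(patterns, replacements):
        text = re.sub(pattern, replacement, text)
    return text


html = "<p>Hello&nbsp;<b>world</b></p>"
print(replace_patterns(html, ["<[^>]+>", "&nbsp;"], ["", " "]))  # Hello world
print(replace_patterns("a  b \t\n c", [r"\s+"], [" "]))          # a b c
```

If the pairs are indeed applied sequentially, a later pattern can match text produced by an earlier replacement, so the order of the list matters.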

Example: Clean HTML

stages:
  - class: com.kmwllc.lucille.stage.ReplacePatterns
    name: remove_html_tags
    source: ["content"]
    dest: ["content_clean"]
    patterns:
      - "<[^>]+>"  # Remove HTML tags
      - "&nbsp;"   # Replace non-breaking-space entities with a space
    replacements:
      - ""
      - " "

Example: Normalize Whitespace

stages:
  - class: com.kmwllc.lucille.stage.ReplacePatterns
    name: normalize_whitespace
    source: ["text"]
    dest: ["text"]
    patterns:
      - "\\s+"  # Multiple whitespace
    replacements:
      - " "    # Single space

RemoveDiacritics

Removes diacritical marks (accents) from text, converting characters to their base forms.

fields

List<String>

required

List of fields to remove diacritics from.

Example: Normalize International Text

stages:
  - class: com.kmwllc.lucille.stage.RemoveDiacritics
    name: remove_accents
    fields: ["name", "description"]
    # "café" → "cafe"
    # "naïve" → "naive"
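The conversions in the example comments can be reproduced with Unicode normalization. This Python sketch shows one common technique (decompose to NFD, then drop combining marks); whether the stage uses this exact approach is an assumption.

```python
import unicodedata


def remove_diacritics(text):
    """Decompose characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_diacritics("café"))      # cafe
print(remove_diacritics("naïve"))     # naive
print(remove_diacritics("Ångström"))  # Angstrom
```

With this technique, letters that have no Unicode decomposition, such as "ø" or "ß", pass through unchanged, so it is worth testing the stage on representative data if those characters matter.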

SplitFieldValues

Splits field values into multiple values using a delimiter or regex pattern.

fields

List<String>

required

List of fields to split.

delimiter

String

String delimiter to split on. Either delimiter or regex must be specified.

regex

String

Regular expression pattern to split on. Either delimiter or regex must be specified.
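The difference between the two options is that a delimiter is matched literally while a regex is matched as a pattern. The Python sketch below illustrates that distinction. It reads "either delimiter or regex" as "exactly one of them", which is an assumption, and `split_values` is a hypothetical helper.

```python
import re


def split_values(values, delimiter=None, regex=None):
    """Split every value on a literal delimiter or a regex; exactly one must be given."""
    if (delimiter is None) == (regex is None):
        raise ValueError("specify exactly one of delimiter or regex")
    result = []
    for value in values:
        if delimiter is not None:
            result.extend(value.split(delimiter))    # literal match
        else:
            result.extend(re.split(regex, value))    # pattern match
    return result


print(split_values(["tag1,tag2,tag3"], delimiter=","))    # ['tag1', 'tag2', 'tag3']
print(split_values(["red  green\tblue"], regex=r"\s+"))   # ['red', 'green', 'blue']
print(split_values(["a, b"], delimiter=","))              # ['a', ' b']
```

The last call shows that a plain split keeps the whitespace around each piece; if the stage behaves the same way, following it with TrimWhitespace would clean that up.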

Example: Split Tags

stages:
  - class: com.kmwllc.lucille.stage.SplitFieldValues
    name: split_tags
    fields: ["tags"]
    delimiter: ","
    # "tag1,tag2,tag3" → ["tag1", "tag2", "tag3"]

Example: Split by Whitespace

stages:
  - class: com.kmwllc.lucille.stage.SplitFieldValues
    name: tokenize_text
    fields: ["keywords"]
    regex: "\\s+"

TruncateField

Truncates field values to a maximum length.

fields

List<String>

required

List of fields to truncate.

maxLength

Integer

required

Maximum length for field values. Values longer than this will be truncated.
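A short Python sketch of the truncation rule, assuming the length is counted in characters (the unit is not stated on this page) and that shorter values are left unchanged:

```python
def truncate(values, max_length):
    """Cut each value to at most `max_length` characters; shorter values pass through."""
    if max_length < 0:
        raise ValueError("max_length must not be negative")
    return [value[:max_length] for value in values]


print(truncate(["short", "a much longer summary"], 10))
# ['short', 'a much lon']
```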

Example: Limit Description Length

stages:
  - class: com.kmwllc.lucille.stage.TruncateField
    name: truncate_summary
    fields: ["summary", "description"]
    maxLength: 500

ExtractFirstCharacter

Extracts the first character from field values.

source

List<String>

required

List of source field names.

dest

List<String>

required

List of destination field names.

updateMode

String

default:"overwrite"

How to handle existing destination field values: overwrite, append, or skip.
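Since updateMode appears on most stages in this section, a sketch of how the three modes could differ may help. The Python below assumes the usual reading of the names: overwrite replaces existing values, append adds to them, and skip leaves an existing field alone. Documents are modeled as dictionaries of lists, and both helpers are hypothetical.

```python
def update_field(doc, field, values, update_mode="overwrite"):
    """Write `values` to `field` according to the update mode (assumed semantics)."""
    if update_mode not in ("overwrite", "append", "skip"):
        raise ValueError(f"unknown updateMode: {update_mode}")
    if field not in doc or update_mode == "overwrite":
        doc[field] = list(values)
    elif update_mode == "append":
        doc[field].extend(values)
    # "skip": the field already exists, so it is left untouched


def first_characters(values):
    """Return the first character of every non-empty value."""
    return [value[0] for value in values if value]


for mode in ("overwrite", "append", "skip"):
    doc = {"last_name": ["Smith"], "last_name_initial": ["?"]}
    update_field(doc, "last_name_initial", first_characters(doc["last_name"]), mode)
    print(mode, doc["last_name_initial"])
# overwrite ['S']
# append ['?', 'S']
# skip ['?']
```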

Example: Create Alphabetical Index

stages:
  - class: com.kmwllc.lucille.stage.ExtractFirstCharacter
    name: create_alpha_index
    source: ["last_name"]
    dest: ["last_name_initial"]
    # "Smith" → "S"

ApplyJSoup

Parses HTML content using JSoup and extracts elements based on CSS selectors.

source

String

required

Source field containing HTML content.

dest

String

required

Destination field for extracted content.

selector

String

required

CSS selector to identify elements to extract.

attribute

String

HTML attribute to extract. If not specified, extracts text content.

updateMode

String

default:"overwrite"

How to handle existing destination field values: overwrite, append, or skip.
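The interplay of selector and attribute can be illustrated without JSoup. The Python sketch below uses only the standard library's `html.parser` and supports nothing beyond a bare tag-name selector such as `a`, so it is far narrower than real CSS selectors. It only shows the rule "return the attribute if one is named, otherwise the text content". `TagExtractor` and `select` are hypothetical names.

```python
from html.parser import HTMLParser


class TagExtractor(HTMLParser):
    """Collect an attribute, or the text content, of every element with a given tag name."""

    def __init__(self, tag, attribute=None):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.results = []
        self._depth = 0    # greater than 0 while inside a matching element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.attribute is not None:
            value = dict(attrs).get(self.attribute)
            if value is not None:
                self.results.append(value)
        else:
            self._depth += 1

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.results.append("".join(self._text).strip())
                self._text = []


def select(html, tag, attribute=None):
    """Return attribute values if `attribute` is set, otherwise text content."""
    parser = TagExtractor(tag, attribute)
    parser.feed(html)
    parser.close()
    return parser.results


page = '<article><a href="/docs">Docs</a> and <a href="/api">API <b>reference</b></a></article>'
print(select(page, "a", "href"))  # ['/docs', '/api']
print(select(page, "a"))          # ['Docs', 'API reference']
```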

Example: Extract Links

stages:
  - class: com.kmwllc.lucille.stage.ApplyJSoup
    name: extract_links
    source: html_content
    dest: links
    selector: "a"
    attribute: "href"

Example: Extract Article Text

stages:
  - class: com.kmwllc.lucille.stage.ApplyJSoup
    name: extract_article
    source: html
    dest: article_text
    selector: "article.main-content"

XPathExtractor

Extracts data from XML using XPath expressions.

source

String

required

Source field containing XML content.

xpathMapping

Map<String, String>

required

Mapping of XPath expressions to destination field names.

updateMode

String

default:"overwrite"

How to handle existing destination field values: overwrite, append, or skip.
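The direction of xpathMapping (XPath expression as key, destination field as value) is easy to get backwards, so here is a Python sketch of the lookup. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, and handles just simple absolute child paths such as `/book/title`. Leaving a destination field unset when nothing matches is an assumption, and `extract_xpaths` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def extract_xpaths(xml_text, xpath_mapping):
    """Evaluate simple absolute paths and collect the matched text per destination field."""
    root = ET.fromstring(xml_text)
    fields = {}
    for xpath, dest in xpath_mapping.items():
        steps = xpath.strip("/").split("/")
        if steps[0] != root.tag:
            continue    # the path does not start at this document's root element
        nodes = [root] if len(steps) == 1 else root.findall("/".join(steps[1:]))
        values = [node.text for node in nodes if node.text]
        if values:
            fields[dest] = values
    return fields


xml = "<book><title>Dune</title><author>Frank Herbert</author></book>"
mapping = {
    "/book/title": "book_title",
    "/book/author": "book_author",
    "/book/isbn": "isbn",
}
print(extract_xpaths(xml, mapping))
# {'book_title': ['Dune'], 'book_author': ['Frank Herbert']}
```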

Example: Extract XML Metadata

stages:
  - class: com.kmwllc.lucille.stage.XPathExtractor
    name: extract_metadata
    source: xml_content
    xpathMapping:
      "/book/title": book_title
      "/book/author": book_author
      "/book/isbn": isbn
