Overview
Text processing stages manipulate string content in document fields. These stages handle common text operations like pattern matching, normalization, concatenation, and formatting.
ApplyRegex
Extracts text based on regular expressions. Supports capturing groups and multiple regex flags.
List of source field names to apply regex extraction to.
List of destination field names. Can be:
- Same length as source for 1-to-1 mapping
- Single field to collect results from all source fields
Regular expression to find matches. If the regex includes capturing groups, the value of the first group will be used.
How to handle existing destination field values: overwrite, append, or skip.
Whether the regex matcher should ignore case.
Whether the regex matcher should allow matches across multiple lines.
Enables DOTALL functionality for the regex matcher (. matches newlines).
Treats the regex expression as a literal string rather than a pattern.
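The destination mapping and the three update modes described above recur in several stages. The following Python sketch is one plausible reading of them, not the stages' actual implementation: the helper names `pair_fields` and `write_values` are hypothetical, and the assumption that `skip` only leaves a destination untouched when it already holds a value is ours.

```python
def pair_fields(sources, dests):
    """Pair each source field with a destination field.

    Either one destination per source (1-to-1), or a single
    destination that collects results from every source.
    """
    if len(dests) == len(sources):
        return list(zip(sources, dests))
    if len(dests) == 1:
        return [(source, dests[0]) for source in sources]
    raise ValueError("dest must match source in length or be a single field")


def write_values(doc, field, new_values, mode="overwrite"):
    """Write new_values into doc[field] according to the update mode."""
    if mode not in ("overwrite", "append", "skip"):
        raise ValueError(f"unknown update mode: {mode}")
    existing = doc.get(field, [])
    if mode == "overwrite" or not existing:
        # Nothing to preserve, or we were told to replace it.
        doc[field] = list(new_values)
    elif mode == "append":
        doc[field] = existing + list(new_values)
    # mode == "skip" with an existing value: leave the field as it is (assumed).
    return doc
```

For instance, `pair_fields(["title", "body"], ["matches"])` sends the results from both sources to the single `matches` field.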
Example: Extract Email Addresses
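As a rough illustration of what this kind of extraction does, here is a minimal Python sketch using the standard `re` module. It is not the stage's configuration syntax. The function name `extract_matches`, its keyword arguments, and the deliberately simple email pattern are all assumptions made for the example.

```python
import re

# Deliberately simple pattern for illustration; it does not cover every valid address.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def extract_matches(text, pattern, ignore_case=False, multiline=False,
                    dotall=False, literal=False):
    """Return every match of pattern in text, honoring the four flags."""
    flags = 0
    if ignore_case:
        flags |= re.IGNORECASE
    if multiline:
        flags |= re.MULTILINE
    if dotall:
        flags |= re.DOTALL
    if literal:
        # Treat the expression as plain text rather than a pattern.
        pattern = re.escape(pattern)
    return [m.group(0) for m in re.finditer(pattern, text, flags)]


emails = extract_matches("Contact alice@example.com or bob@test.org.", EMAIL_PATTERN)
print(emails)  # ['alice@example.com', 'bob@test.org']
```

With `literal=True`, a pattern such as `a.b` matches only the exact text `a.b`, not `a-b`.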
Example: Extract with Capturing Group
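The first-group rule described for the regex parameter can be sketched in Python as follows. This is an illustration under assumptions, not the stage itself: `extract_first_group` and the sample text are hypothetical.

```python
import re


def extract_first_group(text, pattern, flags=0):
    """Return the first capturing group of each match, or the whole match
    when the pattern defines no groups."""
    compiled = re.compile(pattern, flags)
    results = []
    for match in compiled.finditer(text):
        if compiled.groups >= 1:
            results.append(match.group(1))
        else:
            results.append(match.group(0))
    return results


order_ids = extract_first_group(
    "Order ID: 12345, order id: 67890",
    r"order id:\s*(\d+)",
    re.IGNORECASE,
)
print(order_ids)  # ['12345', '67890']
```

Only the digits are kept because they sit inside the first pair of parentheses; the surrounding label is matched but discarded.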
Concatenate
Replaces wildcards in a format string with field values to create concatenated output.
Destination field for the concatenated result.
Format string with field placeholders wrapped in {}. Example: "{city}, {state}, {country}" → "Boston, MA, USA"
Mapping of field names to default values. If a field is missing or empty, the default value will be used. If no default is provided, the placeholder remains in the output.
How to handle existing destination field values: overwrite, append, or skip.
If a field is multivalued, only the first value will be used in the concatenation.
Example: Create Display Name
Example: Format Address
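The placeholder rules above (defaults for missing or empty fields, untouched placeholders when no default exists, first value only for multivalued fields) can be sketched in Python like this. The function `concatenate`, its `default_inputs` argument, and the field names are hypothetical and chosen only for illustration.

```python
import re


def concatenate(doc, format_string, default_inputs=None):
    """Fill {field} placeholders in format_string from doc."""
    defaults = default_inputs or {}

    def fill(match):
        name = match.group(1)
        value = doc.get(name)
        if isinstance(value, list):
            # Multivalued field: only the first value is used.
            value = value[0] if value else None
        if value is None or value == "":
            # Fall back to the default, or keep the placeholder as written.
            return defaults.get(name, match.group(0))
        return str(value)

    return re.sub(r"\{([^{}]+)\}", fill, format_string)


address = concatenate(
    {"city": ["Boston", "Cambridge"], "state": ""},
    "{city}, {state}, {country}",
    default_inputs={"state": "MA"},
)
print(address)  # Boston, MA, {country}
```

The same function covers a display name: `concatenate({"first_name": "Ada", "last_name": "Lovelace"}, "{first_name} {last_name}")` returns `"Ada Lovelace"`.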
NormalizeText
Normalizes text case using four different modes: lowercase, uppercase, title case, and sentence case.
List of source field names.
List of destination field names. Can be:
- Same length as source for 1-to-1 mapping
- Single field to collect results from all source fields
Normalization mode:
- lowercase: Convert all text to lowercase
- uppercase: Convert all text to UPPERCASE
- title_case: Convert To Title Case (first letter of each word capitalized)
- sentence_case: Convert to sentence case (first letter of each sentence capitalized)
How to handle existing destination field values: overwrite, append, or skip.
Example: Normalize for Search
Example: Title Case Formatting
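The four modes can be approximated in a few lines of Python. This sketch assumes that title and sentence case lowercase the remaining letters, and that a sentence ends at `.`, `!`, or `?` followed by whitespace; the stage may define these boundaries differently. `normalize_text` is a hypothetical helper.

```python
import re


def normalize_text(text, mode):
    """Apply one of the four case normalization modes to text."""
    if mode == "lowercase":
        return text.lower()
    if mode == "uppercase":
        return text.upper()
    if mode == "title_case":
        # Capitalize the first character of every whitespace-separated word.
        return re.sub(
            r"\S+",
            lambda m: m.group(0)[:1].upper() + m.group(0)[1:].lower(),
            text,
        )
    if mode == "sentence_case":
        # Lowercase everything, then capitalize the first word character
        # at the start of the text and after sentence-ending punctuation.
        return re.sub(
            r"(^|[.!?]\s+)(\w)",
            lambda m: m.group(1) + m.group(2).upper(),
            text.lower(),
        )
    raise ValueError(f"unknown mode: {mode}")


print(normalize_text("the QUICK brown fox", "title_case"))           # The Quick Brown Fox
print(normalize_text("hello WORLD. how are YOU?", "sentence_case"))  # Hello world. How are you?
```

Lowercasing both the indexed text and the query is the usual reason to pick `lowercase` for search fields.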
TrimWhitespace
Removes leading and trailing whitespace from field values.
List of fields to trim whitespace from. At least one field must be specified.
Example: Clean Input Fields
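A minimal Python sketch of the trimming behavior, treating a document as a plain dictionary. The helper `trim_fields` and the handling of multivalued fields as lists are assumptions for illustration.

```python
def trim_fields(doc, fields):
    """Strip leading and trailing whitespace from the named fields in place."""
    if not fields:
        raise ValueError("at least one field must be specified")
    for field in fields:
        if field not in doc:
            continue  # nothing to trim
        value = doc[field]
        if isinstance(value, list):
            doc[field] = [item.strip() for item in value]
        else:
            doc[field] = value.strip()
    return doc


cleaned = trim_fields({"name": "  Ada Lovelace \n", "tags": [" a", "b "]}, ["name", "tags"])
print(cleaned)  # {'name': 'Ada Lovelace', 'tags': ['a', 'b']}
```

Whitespace inside a value is left alone; only the two ends are trimmed.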
ReplacePatterns
Replaces text patterns using regular expressions with specified replacement strings.
List of source field names.
List of destination field names. Must be same length as source or single field.
List of regex patterns to search for.
List of replacement strings corresponding to patterns. Must be same length as patterns.
How to handle existing destination field values.
Example: Clean HTML
Example: Normalize Whitespace
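Both examples can be combined in one Python sketch: strip tags, then collapse runs of whitespace. It assumes the patterns are applied in list order, which the text above does not state. `replace_patterns` is a hypothetical helper, and a regex is only a rough way to remove HTML tags.

```python
import re


def replace_patterns(text, patterns, replacements):
    """Apply each (pattern, replacement) pair to text, in list order (assumed)."""
    if len(patterns) != len(replacements):
        raise ValueError("patterns and replacements must be the same length")
    for pattern, replacement in zip(patterns, replacements):
        text = re.sub(pattern, replacement, text)
    return text


cleaned = replace_patterns(
    "<p>Hello   <b>world</b></p>",
    [r"<[^>]+>", r"\s+"],  # 1) drop tags, 2) collapse whitespace
    ["", " "],
)
print(cleaned)  # Hello world
```

Order matters here: removing tags first can leave extra spaces behind, which the second pattern then collapses.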
RemoveDiacritics
Removes diacritical marks (accents) from text, converting characters to their base forms.
List of fields to remove diacritics from.
Example: Normalize International Text
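One common way to remove diacritics, shown here as a Python sketch, is to decompose each character with Unicode NFD normalization and drop the combining marks. Whether the stage uses this exact approach is an assumption; `remove_diacritics` is a hypothetical helper.

```python
import unicodedata


def remove_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_diacritics("Crème Brûlée"))  # Creme Brulee
print(remove_diacritics("São Paulo"))     # Sao Paulo
```

Letters that have no decomposition, such as "ø" or "ß", pass through this approach unchanged.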
SplitFieldValues
Splits field values into multiple values using a delimiter or regex pattern.
List of fields to split.
String delimiter to split on. Either delimiter or regex must be specified.
Regular expression pattern to split on. Either delimiter or regex must be specified.
Example: Split Tags
Example: Split by Whitespace
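The two splitting options can be sketched in Python as below. Two choices here are assumptions rather than documented behavior: the sketch requires exactly one of `delimiter` or `regex`, and it drops empty pieces. `split_values` is a hypothetical helper.

```python
import re


def split_values(values, delimiter=None, regex=None):
    """Split every value on a literal delimiter or a regex, flattening the result."""
    if (delimiter is None) == (regex is None):
        raise ValueError("specify exactly one of delimiter or regex")
    result = []
    for value in values:
        if delimiter is not None:
            parts = value.split(delimiter)
        else:
            parts = re.split(regex, value)
        # Dropping empty pieces is an assumption of this sketch.
        result.extend(part for part in parts if part != "")
    return result


print(split_values(["python,java,go"], delimiter=","))     # ['python', 'java', 'go']
print(split_values(["alpha  beta\tgamma"], regex=r"\s+"))  # ['alpha', 'beta', 'gamma']
```

A literal delimiter suits tag lists with a fixed separator; a regex such as `\s+` handles irregular whitespace.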
TruncateField
Truncates field values to a maximum length.
List of fields to truncate.
Maximum length for field values. Values longer than this will be truncated.
Example: Limit Description Length
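Truncation amounts to slicing each value, as in this Python sketch. It assumes length is counted in characters and that no ellipsis is appended; `truncate_fields` is a hypothetical helper.

```python
def truncate_fields(doc, fields, max_length):
    """Cut the named fields down to at most max_length characters, in place."""
    if max_length < 0:
        raise ValueError("max_length must not be negative")
    for field in fields:
        value = doc.get(field)
        if isinstance(value, str):
            doc[field] = value[:max_length]
        elif isinstance(value, list):
            doc[field] = [item[:max_length] for item in value]
    return doc


doc = truncate_fields({"description": "A long product description"}, ["description"], 6)
print(doc)  # {'description': 'A long'}
```

Values at or under the limit are returned unchanged because slicing past the end of a string is harmless.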
ExtractFirstCharacter
Extracts the first character from field values.
List of source field names.
List of destination field names.
How to handle existing destination field values.
Example: Create Alphabetical Index
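An alphabetical index needs only the first character of a field such as a title. The Python sketch below assumes a 1-to-1 source and destination pairing and single-valued string fields; `extract_first_character` and the field names are hypothetical.

```python
def extract_first_character(doc, sources, dests):
    """Copy the first character of each source field into its destination."""
    if len(sources) != len(dests):
        raise ValueError("this sketch assumes one destination per source")
    for source, dest in zip(sources, dests):
        value = doc.get(source)
        if isinstance(value, str) and value:
            doc[dest] = value[:1]
    return doc


doc = extract_first_character({"title": "Zebra crossing"}, ["title"], ["index_letter"])
print(doc)  # {'title': 'Zebra crossing', 'index_letter': 'Z'}
```

Empty or missing source values produce no destination field in this sketch.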
ApplyJSoup
Parses HTML content using JSoup and extracts elements based on CSS selectors.
Source field containing HTML content.
Destination field for extracted content.
CSS selector to identify elements to extract.
HTML attribute to extract. If not specified, extracts text content.
How to handle existing destination field values.
Example: Extract Links
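JSoup is a Java library, so the following is only a loose Python analogue built on the standard `html.parser` module. It supports nothing beyond a bare tag name (the equivalent of the CSS selector `a`) plus an attribute to read, whereas JSoup accepts full CSS selectors. `AttributeCollector` and `extract_attribute` are hypothetical names.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect one attribute from every start tag with a given name."""

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        for name, value in attrs:
            if name == self.attribute and value is not None:
                self.values.append(value)


def extract_attribute(html, tag, attribute):
    """Return the attribute value of every matching element, in document order."""
    collector = AttributeCollector(tag, attribute)
    collector.feed(html)
    collector.close()
    return collector.values


links = extract_attribute(
    '<p><a href="https://example.com">One</a> <a href="/docs">Two</a></p>',
    "a",
    "href",
)
print(links)  # ['https://example.com', '/docs']
```

When no attribute is specified, the stage extracts text content instead, which this sketch does not attempt to reproduce.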
Example: Extract Article Text
XPathExtractor
Extracts data from XML using XPath expressions.
Source field containing XML content.
Mapping of XPath expressions to destination field names.
How to handle existing destination field values.
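The XPath-to-field mapping can be illustrated in Python with `xml.etree.ElementTree`, which supports only a limited XPath subset evaluated relative to the root element. The stage's XPath engine is likely more complete, so treat this as a sketch under that assumption. `extract_xpaths` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_xpaths(xml_text, xpath_to_field):
    """Evaluate each XPath and collect element text under its destination field."""
    root = ET.fromstring(xml_text)
    extracted = {}
    for xpath, field in xpath_to_field.items():
        values = [el.text for el in root.findall(xpath) if el.text is not None]
        if values:
            extracted.setdefault(field, []).extend(values)
    return extracted


xml_text = "<order><id>42</id><item>pen</item><item>ink</item></order>"
fields = extract_xpaths(xml_text, {"./id": "order_id", "./item": "items"})
print(fields)  # {'order_id': ['42'], 'items': ['pen', 'ink']}
```

An expression that matches several elements yields a multivalued destination field, as with `items` here.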