Overview
Enrichment stages enhance documents by adding information from external sources such as dictionaries, databases, APIs, and specialized detection algorithms. These stages are crucial for adding context and metadata to your documents.
DetectLanguage
Detects the language of text content using the langdetect library. Supports 53 languages.
List of source fields containing text to analyze for language detection.
Field to store the detected language code (e.g., “en”, “es”, “fr”).
Field to store the confidence score of the language detection.
Minimum length of text to consider for language detection. Shorter strings are ignored.
Maximum length of text to consider. Longer strings are truncated.
Minimum confidence threshold for accepting a detection result. Results below this threshold are ignored.
How to handle existing field values: overwrite, append, or skip.
Supported Languages
Supports 53 languages including: English (en), Spanish (es), French (fr), German (de), Chinese (zh-cn, zh-tw), Japanese (ja), Korean (ko), Arabic (ar), Russian (ru), Portuguese (pt), Italian (it), Dutch (nl), and many more.
Example: Basic Language Detection
Example: Multi-field Language Detection
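The length and confidence parameters described above can be pictured with a small Python sketch. It is not the stage's implementation: the function name `detect_with_limits`, its argument names, and the stand-in `fake_detector` are all hypothetical, and a real pipeline would call an actual language detector instead.

```python
from typing import Callable, Optional, Tuple

# Hypothetical detector signature: text in, (language code, confidence) out.
Detector = Callable[[str], Tuple[str, float]]


def detect_with_limits(
    text: str,
    detector: Detector,
    min_length: int,
    max_length: int,
    min_probability: float,
) -> Optional[Tuple[str, float]]:
    """Apply the length and confidence limits around a language detector."""
    if len(text) < min_length:
        return None  # shorter strings are ignored
    truncated = text[:max_length]  # longer strings are truncated
    language, confidence = detector(truncated)
    if confidence < min_probability:
        return None  # results below the threshold are ignored
    return language, confidence


def fake_detector(text: str) -> Tuple[str, float]:
    """Stand-in for demonstration only; this is not a real language model."""
    return ("es", 0.99) if "hola" in text.lower() else ("en", 0.60)
```

A `None` result here corresponds to leaving the language and confidence fields unset on the document.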
DictionaryLookup
Finds exact matches in a dictionary file and extracts associated payloads. Can also function as a set membership test.
List of source field names to check against the dictionary.
List of destination field names. Can be:
- Same length as source for 1-to-1 mapping
- Single field to collect results from all source fields
Path to the dictionary file. Dictionary format: one term per line, optionally with payload: term, payload
- If path starts with classpath:, searches the classpath
- Otherwise, searches the local filesystem
- Supports S3, Azure, and GCP paths with appropriate configuration
Whether to use payloads from the dictionary. If false, outputs the matched term instead of its payload.
Whether to perform case-insensitive matching.
Use as a set membership test. When true:
- Destination field is set to boolean true/false
- updateMode must be overwrite
- Use with useAnyMatch and ignoreMissingSource for fine control
Only valid with setOnly=true. If true, returns true if ANY value matches. If false, returns true only if ALL values match.
Only valid with setOnly=true. If true, treats missing source fields as a match.
How to handle existing destination field values. Cannot be used with setOnly=true.
Example: Category Tagging
Dictionary file (category_mapping.txt):
Example: Set Membership Check
Example: S3 Dictionary
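The lookup and set-membership behaviour described for DictionaryLookup can be sketched in Python as follows. This is an illustration under assumptions, not the stage's code: the function names and keyword arguments are hypothetical (they loosely mirror setOnly, useAnyMatch and ignoreMissingSource), payloads are assumed not to contain commas, and "matched term" is assumed to mean the term as written in the dictionary.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def load_dictionary(
    lines: Iterable[str], ignore_case: bool = False
) -> Dict[str, Tuple[str, str]]:
    """Parse 'term, payload' lines into key -> (term, payload).

    A line without a payload maps the term to itself.
    """
    table: Dict[str, Tuple[str, str]] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        term, sep, payload = line.partition(",")
        term = term.strip()
        key = term.lower() if ignore_case else term
        table[key] = (term, payload.strip() if sep else term)
    return table


def lookup(
    values: List[str],
    table: Dict[str, Tuple[str, str]],
    use_payloads: bool = True,
    ignore_case: bool = False,
) -> List[str]:
    """Return the payload (or the matched term) for each exact match."""
    results = []
    for value in values:
        key = value.lower() if ignore_case else value
        if key in table:
            term, payload = table[key]
            results.append(payload if use_payloads else term)
    return results


def set_membership(
    values: Optional[List[str]],
    table: Dict[str, Tuple[str, str]],
    use_any_match: bool = False,
    ignore_missing_source: bool = False,
) -> bool:
    """Boolean test: ANY value matches, or ALL values match."""
    if not values:
        # A missing source field counts as a match only when requested.
        return ignore_missing_source
    hits = [value in table for value in values]
    return any(hits) if use_any_match else all(hits)
```

In set-membership mode the boolean result replaces whatever was in the destination field, which is consistent with the requirement that the update mode be overwrite.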
QueryDatabase
Executes SQL queries against a database to enrich documents.
JDBC connection URL for the database.
SQL query to execute. Can use ? placeholders for parameterized queries.
List of field names to use as query parameters, corresponding to ? placeholders.
Mapping of column names from query results to document field names.
Database username.
Database password.
Example: User Lookup
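The way ? placeholders, key fields, and the column-to-field mapping fit together can be shown with a short Python sketch. It uses the standard-library sqlite3 module in place of a JDBC connection, and the function name `enrich_from_query`, its arguments, and the first-row-only behaviour are assumptions made for illustration.

```python
import sqlite3
from typing import Any, Dict, List


def enrich_from_query(
    conn: sqlite3.Connection,
    sql: str,
    key_fields: List[str],
    field_mapping: Dict[str, str],
    doc: Dict[str, Any],
) -> Dict[str, Any]:
    """Run a parameterized query and copy mapped columns onto a document copy."""
    # Each key field supplies the value for the ? placeholder at the same position.
    params = [doc[name] for name in key_fields]
    cursor = conn.execute(sql, params)
    columns = [description[0] for description in cursor.description]
    enriched = dict(doc)
    row = cursor.fetchone()  # assumption: only the first row is used
    if row is not None:
        for column, value in zip(columns, row):
            if column in field_mapping:
                enriched[field_mapping[column]] = value
    return enriched
```

Binding values through placeholders rather than string concatenation keeps document content from being interpreted as SQL.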
ElasticsearchLookup
Performs lookups against an Elasticsearch index to enrich documents.
Elasticsearch host URL.
Index name to query.
Document field to use as the query term.
Elasticsearch field to search against.
Mapping of Elasticsearch fields to document fields.
Example: Product Enrichment
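As a rough picture of what such a lookup involves, the Python sketch below builds a search body and copies mapped fields from the first hit. It makes no network call. The helper names are hypothetical, and the use of a term query is an assumption; the stage may issue a different query type.

```python
from typing import Any, Dict


def build_term_query(es_field: str, value: Any) -> Dict[str, Any]:
    """Build a search body matching one value exactly (assumed: a term query)."""
    return {"query": {"term": {es_field: value}}, "size": 1}


def apply_hit(
    doc: Dict[str, Any],
    response: Dict[str, Any],
    field_mapping: Dict[str, str],
) -> Dict[str, Any]:
    """Copy mapped fields from the first hit's _source onto a document copy."""
    enriched = dict(doc)
    hits = response.get("hits", {}).get("hits", [])
    if hits:
        source = hits[0].get("_source", {})
        for es_field, doc_field in field_mapping.items():
            if es_field in source:
                enriched[doc_field] = source[es_field]
    return enriched
```

Because only mapped fields are copied, anything else in the hit is left out of the enriched document.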
QueryOpensearch
Performs lookups against an OpenSearch index to enrich documents.
OpenSearch endpoint URL.
Index name to query.
Document field to use for the query.
OpenSearch field to match against.
Mapping of OpenSearch response fields to document fields.
AWS region (for AWS OpenSearch).
Example: Document Enrichment from OpenSearch
ExtractEntities
Extracts named entities (person names, locations, organizations) using OpenNLP.
Source field containing text to extract entities from.
Destination field for person names.
Destination field for location names.
Destination field for organization names.
How to handle existing field values.
Example: Extract Named Entities
ExtractEntitiesFST
Extracts entities using a Finite State Transducer (FST) for fast dictionary-based entity recognition.
Source field containing text.
Destination field for extracted entities.
Path to the FST dictionary file.
How to handle existing field values.
Example: Fast Entity Extraction
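The idea behind dictionary-based entity recognition can be illustrated with a token trie and longest-match scanning. Note that this Python sketch swaps in a plain trie for the FST, which is a simpler structure with similar lookup behaviour. The function names are hypothetical, and matching is assumed to be case-insensitive on word tokens.

```python
import re
from typing import Any, Dict, Iterable, List

_END = None  # sentinel key marking the end of a phrase; never equal to a token


def build_trie(phrases: Iterable[str]) -> Dict[Any, Any]:
    """Build a token-level trie from dictionary phrases."""
    root: Dict[Any, Any] = {}
    for phrase in phrases:
        node = root
        for token in re.findall(r"\w+", phrase.lower()):
            node = node.setdefault(token, {})
        node[_END] = phrase  # remember the original spelling
    return root


def extract_entities(text: str, trie: Dict[Any, Any]) -> List[str]:
    """Scan left to right, preferring the longest phrase at each position."""
    tokens = re.findall(r"\w+", text.lower())
    found: List[str] = []
    i = 0
    while i < len(tokens):
        node = trie
        match, match_end = None, i
        j = i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if _END in node:
                match, match_end = node[_END], j
        if match is not None:
            found.append(match)
            i = match_end  # continue after the matched phrase
        else:
            i += 1
    return found
```

Each token is followed through the structure only as far as the dictionary allows, which is why this style of lookup stays fast even with large dictionaries.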
FetchUri
Fetches content from URLs specified in document fields.
Field containing the URL to fetch.
Destination field for fetched content.
Request timeout in milliseconds.
HTTP headers to include in the request.
Example: Fetch Web Content
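A fetch with a millisecond timeout and custom headers might look like the following Python sketch, written with the standard-library urllib. The function name `fetch_uri` and its parameter names are hypothetical, and error handling is left out for brevity.

```python
import urllib.request
from typing import Dict, Optional


def fetch_uri(
    url: str,
    timeout_ms: int = 5000,
    headers: Optional[Dict[str, str]] = None,
) -> bytes:
    """Fetch the raw bytes at a URL, sending the given HTTP headers."""
    request = urllib.request.Request(url, headers=headers or {})
    # urlopen expects seconds, so convert from milliseconds.
    with urllib.request.urlopen(request, timeout=timeout_ms / 1000.0) as response:
        return response.read()
```

A call such as `fetch_uri(doc["url"], timeout_ms=3000, headers={"User-Agent": "my-pipeline"})` would return the body to store in the destination field; the field name and header value here are placeholders.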
FetchFileContent
Fetches file content from the local filesystem or cloud storage.
Field containing the file path.
Destination field for file content.
Character encoding for text files.
Example: Load File Content
ApplyFileHandlers
Applies file handlers to extract content and metadata from files (PDF, Word, images, etc.) using Apache Tika.
Field containing the file path or URL.
Destination field for extracted text content.
Prefix for metadata fields extracted from the file.
Example: Extract PDF Content
CreateStaticTeaser
Creates a teaser (snippet/summary) from text content.
Source field containing full text.
Destination field for the teaser.
Maximum length of the teaser in characters.
Whether to break on word boundaries to avoid cutting words in half.
Example: Generate Summary
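Truncating to a maximum length while respecting word boundaries can be sketched in Python as below. The function name `make_teaser` and its parameters are hypothetical, and only the space character is treated as a word separator in this simplified version.

```python
def make_teaser(text: str, max_length: int, break_on_word: bool = True) -> str:
    """Return at most max_length characters, optionally backing up to a word break."""
    if len(text) <= max_length:
        return text
    cut = text[:max_length]
    # The cut is mid-word when neither side of the cut point is whitespace.
    mid_word = not text[max_length].isspace() and not cut[-1:].isspace()
    if break_on_word and mid_word:
        head, sep, _ = cut.rpartition(" ")
        if sep:  # keep the hard cut when there is no earlier space
            cut = head
    return cut.rstrip()
```

With word breaking on, `make_teaser("The quick brown fox jumps", 12)` backs up from "The quick br" to "The quick"; with it off, the hard cut is returned unchanged.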
MatchQuery
Matches field values against a Lucene query string.
Lucene query string to match against.
Field to store the boolean match result.
Example: Content Classification
Contains
Checks if field values contain specific substrings.
List of source fields to check.
Destination field for boolean result.
List of values to search for.
Whether to perform case-insensitive matching.
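The substring check described above amounts to something like this Python sketch. The function name `contains_any` and its arguments are hypothetical, and a document is modelled as a plain dictionary whose fields hold either one string or a list of strings.

```python
from typing import Any, Dict, List


def contains_any(
    doc: Dict[str, Any],
    source_fields: List[str],
    values: List[str],
    ignore_case: bool = True,
) -> bool:
    """Return True if any source field value contains any of the search values."""
    needles = [value.lower() if ignore_case else value for value in values]
    for field in source_fields:
        field_values = doc.get(field, [])
        if isinstance(field_values, str):
            field_values = [field_values]  # treat a single value as a list of one
        for text in field_values:
            haystack = text.lower() if ignore_case else text
            if any(needle in haystack for needle in needles):
                return True
    return False
```

The boolean returned here is what would be written to the destination field.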