Overview
The Tika plugin provides the TextExtractor stage, which uses Apache Tika to extract text content and metadata from a wide variety of document formats, including PDF, Word, Excel, PowerPoint, HTML, and many others.
Maven Module: lucille-tika
Java Class: com.kmwllc.lucille.tika.stage.TextExtractor
Source: TextExtractor.java
Installation
Add the lucille-tika module as a dependency in your pom.xml.
Configuration
Basic Configuration
Extract from Byte Array
Parameters
- Destination field for extracted text content.
- filePathField: Document field containing the file path to extract from. Mutually exclusive with byteArrayField. Example: "file_path", "document_url"
- byteArrayField: Document field containing byte array data to extract from. Mutually exclusive with filePathField. Example: "file_data", "content_bytes"
- Path to a custom Tika configuration XML file. If not provided, the stage uses Tika’s default AutoDetectParser. Example: "/etc/tika/tika-config.xml"
- Prefix added to extracted metadata field names. Set to an empty string ("") for no prefix.
- textContentLimit: Maximum number of characters to extract. Useful for preventing memory issues with very large documents. Example: 1000000 (1 million characters)
- parseTimeout: Timeout for parsing in milliseconds. If parsing exceeds this time, it is cancelled. Example: 30000 (30 seconds)
- metadataWhitelist: List of metadata field names to include in the document. If specified, only these metadata fields are extracted. Mutually exclusive with metadataBlacklist.
- metadataBlacklist: List of metadata field names to exclude from the document. Mutually exclusive with metadataWhitelist.
Cloud Storage Support
The plugin supports reading files from cloud storage:
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
Features
Automatic Format Detection
Tika automatically detects document types based on:
- File extension
- MIME type
- Content analysis (magic bytes)
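To make the "magic bytes" idea concrete, here is a minimal sketch of content-based detection in plain Java. It is not Tika's detector and not code from this plugin. The class name MagicByteSniffer and the three signatures it checks are illustrative assumptions, and Tika's own detection covers far more formats and also weighs file names and declared MIME types.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper, for illustration only: guesses a MIME type from leading bytes. */
class MagicByteSniffer {
    // Well-known file signatures (the first bytes of each format).
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 0x50, 0x4E, 0x47};
    // ZIP local file header; DOCX/XLSX/PPTX are ZIP containers, so they also start with these bytes.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    static String sniff(byte[] data) {
        if (startsWith(data, PDF)) return "application/pdf";
        if (startsWith(data, PNG)) return "image/png";
        if (startsWith(data, ZIP)) return "application/zip";
        return "application/octet-stream"; // unknown content
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7 ...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(sample));
    }
}
```

Because Office Open XML files share the ZIP signature, a detector that stops at the first bytes cannot tell a DOCX from a plain archive. This is one reason content analysis is combined with the file extension and MIME type.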
Supported formats include:
- Documents: PDF, Word (DOC/DOCX), RTF, OpenDocument
- Spreadsheets: Excel (XLS/XLSX), CSV
- Presentations: PowerPoint (PPT/PPTX)
- Web: HTML, XML
- Archives: ZIP, TAR, GZIP
- Images: JPEG, PNG (with metadata)
- And many more…
Metadata Extraction
Tika extracts rich metadata from documents:
- tika_content_type: MIME type
- tika_title: Document title
- tika_author: Author name
- tika_creation_date: When document was created
- tika_page_count: Number of pages
- And many more format-specific fields
Metadata Filtering
Control which metadata fields are extracted with metadataWhitelist or metadataBlacklist.
Content Limiting
Prevent memory issues with large documents by setting textContentLimit.
Parse Timeout
Prevent hanging on problematic documents by setting parseTimeout.
Field Name Cleaning
Metadata field names are automatically cleaned:
- Converted to lowercase
- Spaces replaced with underscores
- Hyphens replaced with underscores
- Colons replaced with underscores
Examples:
- Content-Type → content_type
- Creation Date → creation_date
- dc:creator → dc_creator
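The cleaning rules above can be sketched in a few lines of Java. This is an illustration of the described behavior, not the plugin's source. The class name FieldNameCleaner is hypothetical, and the stage itself may handle further characters that this page does not mention.

```java
import java.util.Locale;

/** Hypothetical helper mirroring the documented cleaning rules. */
class FieldNameCleaner {
    static String clean(String rawName) {
        // Lowercase first, then map spaces, hyphens, and colons to underscores.
        return rawName.toLowerCase(Locale.ROOT).replaceAll("[ \\-:]", "_");
    }

    public static void main(String[] args) {
        System.out.println(clean("Content-Type"));  // content_type
        System.out.println(clean("Creation Date")); // creation_date
        System.out.println(clean("dc:creator"));    // dc_creator
    }
}
```

The stage then adds the configured metadata prefix to the cleaned name, which is how Content-Type ends up as tika_content_type in the metadata list shown earlier.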
Example Configurations
Extract from local files
Extract from S3 with metadata filtering
Extract from byte arrays with custom Tika config
Extract PDFs with timeout
Lifecycle Methods
start()
Initializes resources:
- Creates storage clients (S3, Azure, GCP) if file path field is used
- Initializes Tika parser with config or default AutoDetectParser
- Creates executor service if parse timeout is configured
stop()
Cleans up resources:
- Shuts down storage clients
- Terminates executor service
- Releases parser resources
Best Practices
Use parse timeouts for production
Some documents may cause Tika to hang. Always set a reasonable parseTimeout.
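For readers curious how an executor-based timeout of this kind typically works, the following is a minimal sketch using only java.util.concurrent. It is not the plugin's implementation. The class TimedParse, the method parseWithTimeout, and the stand-in parse task are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: run a parse task on a worker thread and give up after a deadline. */
class TimedParse {
    static String parseWithTimeout(ExecutorService executor, Callable<String> parseTask,
                                   long timeoutMillis) throws TimeoutException {
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it can stop parsing
            throw e;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Parsing failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for parse", e);
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for a real parse call; 30000 ms matches the example value above.
            System.out.println(parseWithTimeout(executor, () -> "extracted text", 30000));
        } catch (TimeoutException e) {
            System.out.println("Parse timed out");
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that cancel(true) only interrupts the worker thread. A parser that ignores interrupts can keep running in the background, so a timeout bounds how long the pipeline waits rather than guaranteeing that the work stops.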
Limit text extraction for large documents
Prevent memory issues by limiting extracted text with textContentLimit.
Filter metadata appropriately
Extract only the metadata you need to reduce document size, using metadataWhitelist or metadataBlacklist.
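As a rough illustration of whitelist and blacklist semantics, the sketch below filters a metadata map in plain Java. It is not the stage's code. The class MetadataFilter is hypothetical, and it assumes that list entries are matched against cleaned field names before the prefix is added, which this page does not state explicitly.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of mutually exclusive whitelist/blacklist filtering. */
class MetadataFilter {
    static Map<String, String> filter(Map<String, String> metadata, Set<String> whitelist,
                                      Set<String> blacklist, String prefix) {
        if (whitelist != null && blacklist != null) {
            throw new IllegalArgumentException("whitelist and blacklist are mutually exclusive");
        }
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String name = entry.getKey();
            if (whitelist != null && !whitelist.contains(name)) continue; // not explicitly allowed
            if (blacklist != null && blacklist.contains(name)) continue;  // explicitly excluded
            kept.put(prefix + name, entry.getValue()); // assumption: prefix applied after filtering
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("content_type", "application/pdf");
        raw.put("author", "Ada");
        raw.put("producer", "SomeTool");
        System.out.println(filter(raw, Set.of("content_type", "author"), null, "tika_"));
    }
}
```

A whitelist is usually the safer choice for keeping document size predictable, since format-specific fields that appear in new file types are dropped by default.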
Handle errors gracefully
Tika logs warnings for problematic files but continues processing. Monitor logs for:
- TikaException: Parsing errors
- IOException: File access errors
- TimeoutException: Parse timeout exceeded
Choose file path vs byte array
- File path: Better for large files (streaming)
- Byte array: Better when files are already in memory
- Never use both on the same stage
Troubleshooting
OutOfMemoryError
Reduce memory usage:
- Set textContentLimit to a reasonable value
- Increase JVM heap size: -Xmx4g
- Process large files in a separate pipeline with higher limits
Parsing hangs or times out
- Set parseTimeout to prevent hanging
- Some PDFs with complex structures may be slow
- Consider pre-processing problematic documents
Missing metadata
- Check that metadata exists in source document
- Verify metadataWhitelist includes the field
- Check metadataBlacklist doesn’t exclude it
- Remember field names are cleaned (lowercase, underscores)
File not found errors
For cloud storage:
- Verify credentials are correct
- Check file path format (e.g., s3://bucket/key)
- Ensure storage client is properly configured
- Test connectivity to cloud service
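When checking the path format, it can help to split a value such as s3://bucket/key into its parts and confirm each one is what you expect. The sketch below does this with java.net.URI. The class StoragePath is a hypothetical debugging aid, not part of the plugin, and it assumes paths contain no spaces or other characters that URI.create rejects.

```java
import java.net.URI;

/** Hypothetical helper: splits a storage path into scheme, bucket/container, and key. */
class StoragePath {
    final String scheme; // null for plain local paths
    final String bucket; // null for plain local paths
    final String key;

    private StoragePath(String scheme, String bucket, String key) {
        this.scheme = scheme;
        this.bucket = bucket;
        this.key = key;
    }

    static StoragePath parse(String path) {
        URI uri = URI.create(path);
        if (uri.getScheme() == null) {
            return new StoragePath(null, null, path); // treat as a local file path
        }
        String rawPath = uri.getPath() == null ? "" : uri.getPath();
        // Drop the leading slash so "s3://bucket/key" yields key "key".
        String key = rawPath.startsWith("/") ? rawPath.substring(1) : rawPath;
        return new StoragePath(uri.getScheme(), uri.getAuthority(), key);
    }

    public static void main(String[] args) {
        StoragePath p = parse("s3://bucket/key");
        System.out.println(p.scheme + " | " + p.bucket + " | " + p.key);
    }
}
```

An empty key or a missing bucket from a check like this usually points to a malformed path rather than a credentials or connectivity problem.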
Supported File Formats
Tika supports hundreds of formats. Common ones include:
- Documents: PDF, DOC, DOCX, ODT, RTF, TXT
- Spreadsheets: XLS, XLSX, ODS, CSV
- Presentations: PPT, PPTX, ODP
- Web: HTML, XHTML, XML
- Email: EML, MSG, PST
- Archives: ZIP, TAR, GZIP, RAR
- Images: JPEG, PNG, TIFF, GIF (metadata only)
- Audio/Video: MP3, MP4, AVI (metadata only)
Custom Tika Configuration
Create a custom Tika configuration XML file for advanced use cases and pass its path to the stage.
See Also
- OCR Plugin - Extract text from images
- Plugins Overview
- Apache Tika Documentation