Overview
The OCR plugin provides the ApplyOCR stage, which uses Tesseract OCR to extract text from images and PDFs. It supports full-page text extraction as well as template-based form extraction for structured documents.
Maven Module: lucille-ocr
Java Class: com.kmwllc.lucille.ocr.stage.ApplyOCR
Source: ApplyOCR.java
Installation
Maven Dependency
Tesseract Installation
Tesseract must be installed on the system:
- Ubuntu/Debian
- macOS
- Windows
Language Data
Place Tesseract language data in a TesseractOcr directory:
Configuration
Basic Full-Page Extraction
Template-Based Form Extraction
Parameters
Tesseract language code. Must match installed language data. Examples: "eng", "fra", "deu", "spa"
Document field containing the path to the image or PDF file. If the document lacks this field, it is skipped.
If specified, OCR is applied to the entire image and results are stored in this field. For PDFs, OCR is applied to each page separately, creating a multi-valued field.
List of form templates defining regions to extract. Each template contains:
- name (string): Template identifier
- regions (Rectangle[]): List of regions to extract
  - x (number): X coordinate in pixels
  - y (number): Y coordinate in pixels
  - width (number): Region width in pixels
  - height (number): Region height in pixels
  - dest (string): Destination field for extracted text
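A region that extends past the edge of the page image is an easy mistake to make when entering coordinates by hand. The following is a minimal sketch of a bounds check that could be run against template definitions before deploying them. The `Region` class and `fits_on_page` function are hypothetical helpers written for illustration and are not part of the plugin; the page size assumes a US Letter page rendered at 300 DPI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One template region, mirroring the fields listed above (hypothetical helper)."""
    x: int
    y: int
    width: int
    height: int
    dest: str


def fits_on_page(region: Region, page_width: int, page_height: int) -> bool:
    """Return True if the region lies entirely inside the page image."""
    return (
        region.x >= 0
        and region.y >= 0
        and region.width > 0
        and region.height > 0
        and region.x + region.width <= page_width
        and region.y + region.height <= page_height
    )


# Assumption: a US Letter page (8.5 x 11 in) rendered at 300 DPI is 2550 x 3300 pixels.
PAGE_W, PAGE_H = 2550, 3300

wages = Region(x=1500, y=400, width=600, height=80, dest="wages")
off_page = Region(x=2400, y=400, width=600, height=80, dest="wages")

print(fits_on_page(wages, PAGE_W, PAGE_H))     # True
print(fits_on_page(off_page, PAGE_W, PAGE_H))  # False
```

The second region fails because its right edge (2400 + 600 = 3000) lies beyond the 2550-pixel page width.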
Static mapping from page numbers to template names. Page 0 represents the only page for non-PDF files. Example:
{"0": "w2_form", "1": "w2_form", "2": "summary_page"}
Document field containing a JSON string that dynamically maps page numbers to template names. Allows different documents to specify their own page-template mappings. Dynamic mappings override static ones. Example JSON:
{"0": "invoice", "1": "terms"}
Features
Full-Page Text Extraction
Extract all text from an image or PDF.
Template-Based Form Extraction
Extract specific regions from structured forms.
PDF Support
PDFs are automatically detected by file extension:
- Each page is rendered to an image at 300 DPI
- OCR is applied to each page
- Results are combined into multi-valued fields
Static and Dynamic Template Assignment
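- Static Assignment: define the page-to-template mapping in configuration.
- Dynamic Assignment: supply the mapping per document, in a field containing a JSON string.
- Combined: use both; dynamic mappings override static ones.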
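The rule that dynamic mappings override static ones can be illustrated with a short sketch. This is not the stage's actual code: `resolve_page_templates` is a hypothetical function, and it assumes the override is applied page by page, so that static entries for pages the document does not mention are kept. The sample mappings are the ones from the Parameters section.

```python
import json
from typing import Dict, Optional


def resolve_page_templates(static_pages: Dict[str, str],
                           dynamic_json: Optional[str]) -> Dict[str, str]:
    """Merge static and dynamic page-to-template mappings.

    Assumption: dynamic entries win when both mappings name the same page,
    and static entries for other pages are kept.
    """
    merged = dict(static_pages)
    if dynamic_json:
        merged.update(json.loads(dynamic_json))
    return merged


static_pages = {"0": "w2_form", "1": "w2_form", "2": "summary_page"}
dynamic_json = '{"0": "invoice", "1": "terms"}'

print(resolve_page_templates(static_pages, dynamic_json))
# {'0': 'invoice', '1': 'terms', '2': 'summary_page'}
```

A document without the dynamic field (passing `None`) simply gets the static mapping unchanged.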
Example Configurations
Simple image text extraction
W-2 tax form extraction
Multi-page invoice with dynamic templates
Multi-language extraction
Coordinate System
Template regions use pixel coordinates:
- Origin: Top-left corner (0, 0)
- X axis: Increases to the right
- Y axis: Increases downward
- Resolution: 300 DPI for PDFs
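When a form is measured on paper rather than in an image viewer, the measurements have to be converted to pixels at the rendering resolution. The sketch below assumes measurements in inches taken from the top-left corner of the page, and the 300 DPI figure stated above for PDFs. `inches_to_pixels` and `region_from_inches` are hypothetical helper names, and the output dictionary only mirrors the region fields described in Parameters.

```python
def inches_to_pixels(inches: float, dpi: int = 300) -> int:
    """Convert a length in inches to whole pixels at the given resolution."""
    return round(inches * dpi)


def region_from_inches(left: float, top: float, width: float, height: float,
                       dest: str, dpi: int = 300) -> dict:
    """Build a region (x, y, width, height, dest) from inch measurements.

    Measurements are taken from the top-left corner of the page, which
    matches the coordinate system used for template regions.
    """
    return {
        "x": inches_to_pixels(left, dpi),
        "y": inches_to_pixels(top, dpi),
        "width": inches_to_pixels(width, dpi),
        "height": inches_to_pixels(height, dpi),
        "dest": dest,
    }


# A box 5 in from the left, 1.25 in from the top, 2 in wide and 0.25 in tall.
box = region_from_inches(left=5.0, top=1.25, width=2.0, height=0.25, dest="wages")
print(box)
# {'x': 1500, 'y': 375, 'width': 600, 'height': 75, 'dest': 'wages'}
```

For image files that are not rendered from a PDF, the pixel coordinates depend on the image's own resolution, so the `dpi` argument would need to match the scan.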
Best Practices
Use high-quality images
OCR accuracy depends on image quality:
- Minimum 300 DPI resolution
- Clear, high-contrast text
- Minimal skew or rotation
- Good lighting (for scanned documents)
Test template coordinates carefully
- Use an image viewer to determine exact pixel coordinates
- Account for page margins and headers
- Test with multiple sample documents
- Handle slight variations in form layout
Choose the right language model
- Install language-specific trained data
- Use eng for English, fra for French, etc.
- For mixed languages, consider running multiple OCR stages
Handle PDF multi-page results
When using extractAllDest with PDFs:
- Field becomes multi-valued (one per page)
- Use SetField or Join stages to combine if needed
- Or process pages individually with templates
Clean up temp files
The stage creates temporary files in lucille-ocr-temp/:
- Files are cleaned up automatically
- Monitor disk space for high-volume processing
- Consider separate temp directory for production
Troubleshooting
Tesseract not found
- Verify Tesseract is installed: tesseract --version
- Check language data exists in TesseractOcr/ directory
- Download from: https://github.com/tesseract-ocr/tessdata
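The two checks above can be scripted for use in a deployment or health-check step. The sketch below is an illustration only: `check_ocr_setup` is a hypothetical function, it assumes the `tesseract` executable should be on the PATH, and it searches the data directory recursively because the exact layout expected inside TesseractOcr/ is not stated here. Tesseract language files follow the `<lang>.traineddata` naming convention.

```python
import shutil
from pathlib import Path
from typing import List


def check_ocr_setup(lang: str, data_dir: str = "TesseractOcr") -> List[str]:
    """Return a list of problems found; an empty list means nothing obvious is wrong."""
    problems = []

    # Equivalent to confirming that `tesseract --version` can be run at all.
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")

    # Look for <lang>.traineddata anywhere under the data directory.
    root = Path(data_dir)
    if not root.is_dir():
        problems.append(f"language data directory not found: {data_dir}")
    elif not any(root.rglob(f"{lang}.traineddata")):
        problems.append(f"{lang}.traineddata not found under {data_dir}")

    return problems


for problem in check_ocr_setup("eng"):
    print(problem)
```

An empty result does not guarantee that OCR will work, only that the most common installation problems are absent.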
Poor OCR accuracy
Causes:
- Low image resolution (< 300 DPI)
- Poor contrast or lighting
- Skewed or rotated text
- Wrong language model
Solutions:
- Pre-process images to improve quality
- Use higher resolution scans
- Apply deskewing before OCR
- Verify correct language is selected
Empty extraction results
Causes:
- Incorrect template coordinates
- Page number mismatch
- Template not assigned to page
Solutions:
- Verify coordinates with image viewer
- Check page numbering (starts at 0)
- Ensure template name matches configuration
- Test with extractAllDest first to confirm OCR works
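Page-number mismatches and misspelled template names can be caught before running the pipeline. The sketch below checks a page-to-template mapping against the list of defined template names and the number of pages in the document. `find_mapping_problems` is a hypothetical helper, not part of the plugin, and it assumes that every page is expected to have a template, which may not hold for every document.

```python
from typing import Dict, Iterable, List


def find_mapping_problems(page_templates: Dict[str, str],
                          template_names: Iterable[str],
                          page_count: int) -> List[str]:
    """Report out-of-range pages, undefined templates, and unassigned pages."""
    known = set(template_names)
    problems = []

    # Check each mapping entry, in page order.
    for page, name in sorted(page_templates.items(), key=lambda kv: int(kv[0])):
        if not 0 <= int(page) < page_count:
            problems.append(f"page {page} is out of range (valid: 0 to {page_count - 1})")
        if name not in known:
            problems.append(f"page {page} uses undefined template '{name}'")

    # Check that every page of the document has some template.
    for page in range(page_count):
        if str(page) not in page_templates:
            problems.append(f"page {page} has no template assigned")

    return problems


problems = find_mapping_problems(
    {"1": "w2_form", "2": "sumary_page"},  # 1-based numbering and a misspelled name
    template_names=["w2_form", "summary_page"],
    page_count=2,
)
for problem in problems:
    print(problem)
```

In this example the mapping was written with 1-based page numbers, so page 0 is left without a template and page 2 falls outside a two-page document.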
PDF rendering issues
- Verify PDF is not corrupted
- Check file permissions
- Ensure sufficient disk space for rendering
- Try converting PDF to images first
Performance Considerations
- CPU intensive: OCR is computationally expensive
- Temp files: Creates temporary PNG files during processing
- Memory: PDF rendering requires significant memory
- Parallelism: Stage can run in parallel across workers
Recommendations:
- Process small batches of documents
- Use multiple workers for parallel processing
- Monitor CPU and disk I/O
- Consider pre-processing images to optimal size
See Also
- Tika Plugin - Extract text from documents
- Plugins Overview
- Tesseract OCR Documentation