OCR Plugin

Overview

The OCR plugin provides the ApplyOCR stage, which uses Tesseract OCR to extract text from images and PDFs. It supports full-page text extraction as well as template-based form extraction for structured documents.

Maven Module: lucille-ocr
Java Class: com.kmwllc.lucille.ocr.stage.ApplyOCR
Source: ApplyOCR.java

Installation

Maven Dependency

<dependency>
  <groupId>com.kmwllc</groupId>
  <artifactId>lucille-ocr</artifactId>
  <version>${lucille.version}</version>
</dependency>

Tesseract Installation

Tesseract must be installed on the system:

Ubuntu/Debian

sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-eng  # English

macOS

brew install tesseract
brew install tesseract-lang  # Additional languages

Windows

Language Data

Place Tesseract language data in a TesseractOcr directory:

mkdir -p TesseractOcr
cd TesseractOcr
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata

Configuration

Basic Full-Page Extraction

stage {
  class: "com.kmwllc.lucille.ocr.stage.ApplyOCR"
  lang: "eng"
  pathField: "image_path"
  extractAllDest: "ocr_text"
}

Template-Based Form Extraction

stage {
  class: "com.kmwllc.lucille.ocr.stage.ApplyOCR"
  lang: "eng"
  pathField: "form_path"
  
  extractionTemplates: [
    {
      name: "w2_form"
      regions: [
        {
          x: 100
          y: 200
          width: 300
          height: 50
          dest: "employee_name"
        },
        {
          x: 100
          y: 300
          width: 200
          height: 40
          dest: "ssn"
        }
      ]
    }
  ]
  
  pages: {
    "0": "w2_form"
  }
}

Parameters

lang

string

required

Tesseract language code. Must match installed language data. Examples: "eng", "fra", "deu", "spa"

pathField

string

required

Document field containing the path to the image or PDF file. Documents that lack this field are skipped.

extractAllDest

string

If specified, OCR is applied to the entire image and results are stored in this field. For PDFs, OCR is applied to each page separately, creating a multi-valued field.

extractionTemplates

FormTemplate[]

List of form templates defining regions to extract. Each template contains:

name (string): Template identifier
regions (Rectangle[]): List of regions to extract
- x (number): X coordinate in pixels
- y (number): Y coordinate in pixels
- width (number): Region width in pixels
- height (number): Region height in pixels
- dest (string): Destination field for extracted text

pages

Map<Integer, String>

Static mapping from page numbers to template names. Page 0 represents the only page for non-PDF files. Example: {"0": "w2_form", "1": "w2_form", "2": "summary_page"}

pagesField

string

Document field containing a JSON string that dynamically maps page numbers to template names. Allows different documents to specify their own page-template mappings. Dynamic mappings override static ones. Example JSON: {"0": "invoice", "1": "terms"}
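
Because the `pagesField` value is a JSON string rather than a nested object, whatever produces the document has to serialize the mapping first. Here is a minimal Python sketch of that step, independent of Lucille. The field names `template_mapping` and `file_path` and the file path itself are illustrative assumptions.

```python
import json

# Page numbers are zero-based; JSON object keys must be strings.
page_templates = {0: "invoice", 1: "terms"}

doc = {
    "id": "doc1",
    "file_path": "/data/example.pdf",  # hypothetical path
    # Serialize the mapping so the field holds a JSON *string*.
    "template_mapping": json.dumps({str(k): v for k, v in page_templates.items()}),
}

print(doc["template_mapping"])  # {"0": "invoice", "1": "terms"}
```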

Features

Full-Page Text Extraction

Extract all text from an image or PDF:

extractAllDest: "full_text"

For images: Single-valued field with extracted text
For PDFs: Multi-valued field with one entry per page

Template-Based Form Extraction

Extract specific regions from structured forms:

extractionTemplates: [
  {
    name: "invoice"
    regions: [
      {x: 50, y: 100, width: 200, height: 30, dest: "invoice_number"},
      {x: 50, y: 150, width: 200, height: 30, dest: "invoice_date"},
      {x: 50, y: 500, width: 150, height: 40, dest: "total_amount"}
    ]
  }
]

Regions are extracted separately and stored in their respective destination fields.
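
A region that extends past the edge of the page image is a likely source of empty or failed extractions, so it can be worth checking templates before deploying them. The sketch below is a standalone Python illustration, not Lucille code. The `Region` class and the `out_of_bounds` helper are hypothetical names, and the page size is an assumed US Letter page rendered at 300 DPI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    x: int
    y: int
    width: int
    height: int
    dest: str


def out_of_bounds(regions, page_width, page_height):
    """Return the dest names of regions that do not fit inside the page."""
    bad = []
    for r in regions:
        fits = (
            r.x >= 0 and r.y >= 0
            and r.width > 0 and r.height > 0
            and r.x + r.width <= page_width
            and r.y + r.height <= page_height
        )
        if not fits:
            bad.append(r.dest)
    return bad


invoice = [
    Region(50, 100, 200, 30, "invoice_number"),
    Region(50, 150, 200, 30, "invoice_date"),
    Region(50, 500, 150, 40, "total_amount"),
]

# 2550 x 3300 px is a US Letter page (8.5 in x 11 in) at 300 DPI.
print(out_of_bounds(invoice, 2550, 3300))  # []
```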

PDF Support

PDFs are detected automatically by file extension and processed as follows:

Each page is rendered to an image at 300 DPI
OCR is applied to each page
Results are combined into multi-valued fields

Static and Dynamic Template Assignment

Static Assignment

Define the mapping in the stage configuration:

pages: {
  "0": "form_template"
  "1": "form_template"
  "2": "signature_page"
}

Dynamic Assignment

Name the document field that holds the mapping:

pagesField: "template_mapping"

The document then carries the mapping as a JSON string:

{
  "id": "doc1",
  "template_mapping": "{\"0\": \"w2\", \"1\": \"w2\"}"
}

Combined

Dynamic mappings override static ones:

pages: {"0": "default_template"}
pagesField: "custom_mapping"

If a document has a custom_mapping field, it takes precedence.
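
The precedence rule can be pictured with a short Python sketch. This is an illustration only, not the ApplyOCR implementation. The function name `effective_pages` is hypothetical, and the sketch assumes the override happens page by page, so static entries for pages the document does not mention are kept. The stage may instead replace the static mapping wholesale, so check ApplyOCR.java if the distinction matters for your documents.

```python
import json


def effective_pages(static_pages, doc, pages_field):
    """Combine static and per-document page-to-template mappings."""
    merged = {int(page): name for page, name in static_pages.items()}
    raw = doc.get(pages_field)
    if raw:
        dynamic = json.loads(raw)
        # Assumption: dynamic entries override static ones page by page.
        merged.update({int(page): name for page, name in dynamic.items()})
    return merged


static = {"0": "default_template", "1": "default_template"}
doc = {"id": "doc1", "custom_mapping": "{\"0\": \"w2\"}"}

print(effective_pages(static, doc, "custom_mapping"))
# {0: 'w2', 1: 'default_template'}
```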

Example Configurations

Simple image text extraction

pipeline {
  stages: [
    {
      class: "com.kmwllc.lucille.ocr.stage.ApplyOCR"
      lang: "eng"
      pathField: "file_path"
      extractAllDest: "ocr_content"
    }
  ]
}

W-2 tax form extraction

pipeline {
  stages: [
    {
      class: "com.kmwllc.lucille.ocr.stage.ApplyOCR"
      lang: "eng"
      pathField: "pdf_path"
      
      extractionTemplates: [
        {
          name: "w2"
          regions: [
            {x: 200, y: 150, width: 400, height: 50, dest: "employer_name"},
            {x: 200, y: 250, width: 300, height: 40, dest: "employee_ssn"},
            {x: 650, y: 350, width: 150, height: 40, dest: "wages"},
            {x: 650, y: 400, width: 150, height: 40, dest: "federal_tax"}
          ]
        }
      ]
      
      pages: {
        "0": "w2"
      }
    }
  ]
}

Multi-page invoice with dynamic templates

pipeline {
  stages: [
    {
      class: "com.kmwllc.lucille.ocr.stage.ApplyOCR"
      lang: "eng"
      pathField: "invoice_pdf"
      pagesField: "page_templates"
      extractAllDest: "full_invoice_text"
      
      extractionTemplates: [
        {
          name: "invoice_header"
          regions: [
            {x: 100, y: 100, width: 300, height: 40, dest: "invoice_number"},
            {x: 500, y: 100, width: 200, height: 40, dest: "invoice_date"}
          ]
        },
        {
          name: "invoice_summary"
          regions: [
            {x: 600, y: 700, width: 150, height: 50, dest: "total"}
          ]
        }
      ]
    }
  ]
}

Document includes:

{
  "invoice_pdf": "/data/invoice-1234.pdf",
  "page_templates": "{\"0\": \"invoice_header\", \"1\": \"invoice_summary\"}"
}

Multi-language extraction

pipeline {
  stages: [
    {
      class: "com.kmwllc.lucille.ocr.stage.ApplyOCR"
      lang: "fra"  # French
      pathField: "document_path"
      extractAllDest: "text_french"
    }
  ]
}

Coordinate System

Template regions use pixel coordinates:

Origin: Top-left corner (0, 0)
X axis: Increases to the right
Y axis: Increases downward
Resolution: 300 DPI for PDFs

(0,0) -------- X ------>
  |
  |
  Y
  |
  v
  
  Region: {x: 100, y: 200, width: 300, height: 50}
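
Form layouts for PDFs are often measured in points (72 per inch) with the origin at the bottom-left of the page, which is the default PDF coordinate space. The following Python sketch converts such a rectangle to the top-left pixel coordinates described above. It is a hedged illustration: `pdf_rect_to_region` is a hypothetical helper, and it assumes the 300 DPI page image covers the whole page with no cropping or extra scaling.

```python
DPI = 300
POINTS_PER_INCH = 72


def pdf_rect_to_region(x_pt, y_pt, width_pt, height_pt, page_height_pt, dest):
    """Convert a PDF-space rectangle (origin bottom-left, in points)
    to a pixel region (origin top-left) at 300 DPI."""
    top_pt = page_height_pt - (y_pt + height_pt)  # flip the Y axis
    return {
        "x": round(x_pt * DPI / POINTS_PER_INCH),
        "y": round(top_pt * DPI / POINTS_PER_INCH),
        "width": round(width_pt * DPI / POINTS_PER_INCH),
        "height": round(height_pt * DPI / POINTS_PER_INCH),
        "dest": dest,
    }


# US Letter is 612 x 792 pt. This box sits 1 inch from the left and 1 inch
# from the top, and is 2 inches wide and half an inch tall.
print(pdf_rect_to_region(72, 792 - 72 - 36, 144, 36, 792, "employee_name"))
# {'x': 300, 'y': 300, 'width': 600, 'height': 150, 'dest': 'employee_name'}
```

One inch corresponds to 300 pixels, so a full US Letter page is 2550 x 3300 pixels at this resolution.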

Best Practices

Use high-quality images

OCR accuracy depends on image quality:

Minimum 300 DPI resolution
Clear, high-contrast text
Minimal skew or rotation
Good lighting (for scanned documents)
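
When a scan carries no reliable resolution metadata, its effective DPI can be estimated from the pixel width and the physical width of the original page. Below is a small Python sketch of that arithmetic. The helper names are hypothetical, and the 300 DPI threshold is the minimum suggested in the list above.

```python
MIN_DPI = 300


def effective_dpi(pixel_width, physical_width_inches):
    """Estimate scan resolution from image width and paper width."""
    return pixel_width / physical_width_inches


def meets_min_dpi(pixel_width, physical_width_inches, min_dpi=MIN_DPI):
    """True when the estimated resolution reaches the suggested minimum."""
    return effective_dpi(pixel_width, physical_width_inches) >= min_dpi


# An 8.5-inch-wide page scanned to 1700 px is only 200 DPI.
print(effective_dpi(1700, 8.5), meets_min_dpi(1700, 8.5))  # 200.0 False
# The same page scanned to 2550 px reaches 300 DPI.
print(effective_dpi(2550, 8.5), meets_min_dpi(2550, 8.5))  # 300.0 True
```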

Test template coordinates carefully

Use an image viewer to determine exact pixel coordinates
Account for page margins and headers
Test with multiple sample documents
Handle slight variations in form layout

Choose the right language model

Install language-specific trained data
Use eng for English, fra for French, etc.
For mixed languages, consider running multiple OCR stages

Handle PDF multi-page results

When using extractAllDest with PDFs:

Field becomes multi-valued (one per page)
Use SetField or Join stages to combine if needed
Or process pages individually with templates
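
If the combining happens outside the pipeline, for example in a downstream consumer of the indexed documents, the per-page values can be joined with a few lines of code. This Python sketch shows the idea. `join_pages` is a hypothetical helper, and the field name `ocr_text` is taken from the basic configuration example.

```python
def join_pages(page_texts, separator="\n\n"):
    """Join per-page OCR text, dropping pages that produced no text."""
    return separator.join(t.strip() for t in page_texts if t and t.strip())


# A multi-valued field as it might look after OCR on a three-page PDF
# whose second page was blank.
doc = {"ocr_text": ["Page one text\n", "", "Page three text"]}

print(repr(join_pages(doc["ocr_text"])))
# 'Page one text\n\nPage three text'
```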

Clean up temp files

The stage creates temporary files in lucille-ocr-temp/:

Files are cleaned up automatically
Monitor disk space for high-volume processing
Consider separate temp directory for production

Troubleshooting

Tesseract not found

Unable to load tesseract model: eng

Solution:

Verify Tesseract is installed: tesseract --version
Check language data exists in TesseractOcr/ directory
Download from: https://github.com/tesseract-ocr/tessdata

Poor OCR accuracy

Causes:

Low image resolution (< 300 DPI)
Poor contrast or lighting
Skewed or rotated text
Wrong language model

Solutions:

Pre-process images to improve quality
Use higher resolution scans
Apply deskewing before OCR
Verify correct language is selected

Empty extraction results

Causes:

Incorrect template coordinates
Page number mismatch
Template not assigned to page

Solutions:

Verify coordinates with image viewer
Check page numbering (starts at 0)
Ensure template name matches configuration
Test with extractAllDest first to confirm OCR works
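
Several of these causes can be caught by checking the page mapping against the configured templates before running OCR. The Python sketch below is a standalone diagnostic, not part of Lucille. `check_page_mapping` is a hypothetical helper, and it assumes you already know how many pages the file has.

```python
def check_page_mapping(template_names, pages, page_count):
    """Return human-readable problems found in a page-to-template mapping."""
    problems = []
    known = set(template_names)
    for page, name in sorted(pages.items(), key=lambda kv: int(kv[0])):
        if not 0 <= int(page) < page_count:
            problems.append(f"page {page} is outside 0..{page_count - 1}")
        if name not in known:
            problems.append(f"page {page} refers to unknown template '{name}'")
    return problems


# A single-page file with a one-based page number and a misspelled template.
for problem in check_page_mapping(["w2"], {"1": "w2", "0": "W2"}, page_count=1):
    print(problem)
# page 0 refers to unknown template 'W2'
# page 1 is outside 0..0
```

Template names are compared exactly, so a difference in letter case counts as a mismatch in this sketch.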

PDF rendering issues

Error while loading file: IOException

Solutions:

Verify PDF is not corrupted
Check file permissions
Ensure sufficient disk space for rendering
Try converting PDF to images first

Performance Considerations

CPU intensive: OCR is computationally expensive
Temp files: Creates temporary PNG files during processing
Memory: PDF rendering requires significant memory
Parallelism: Stage can run in parallel across workers

Optimization tips:

Process small batches of documents
Use multiple workers for parallel processing
Monitor CPU and disk I/O
Consider pre-processing images to optimal size
