This guide demonstrates how to ingest files from Amazon S3, extract text using Apache Tika, chunk the content for RAG applications, and index to OpenSearch.

Overview

This example shows a complete RAG (Retrieval-Augmented Generation) ingestion pipeline:
  • Read files from S3 buckets
  • Extract text from various file formats (PDF, Word, HTML, etc.)
  • Chunk text into smaller segments
  • Index to OpenSearch for vector search

Prerequisites

1. AWS Credentials

Ensure you have AWS credentials configured:
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
2. OpenSearch Setup

Start OpenSearch locally using Docker:
docker-compose up -d
Or follow the OpenSearch Getting Started Guide.
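The docker-compose.yml itself is not included in this guide; a minimal single-node sketch might look like the following (image tag and password are illustrative — recent OpenSearch images require an initial admin password):

```yaml
services:
  opensearch:
    image: opensearchproject/opensearch:latest
    environment:
      # run as a single node, no cluster discovery
      - discovery.type=single-node
      # required by the demo security config in OpenSearch 2.12+
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=ChangeMe_123!
    ports:
      - "9200:9200"
```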
3. S3 Bucket

Create an S3 bucket and upload documents:
aws s3 mb s3://my-test-bucket
aws s3 cp documents/ s3://my-test-bucket/folder1/ --recursive

Configuration

Full Pipeline Configuration

connectors: [
  {
    name: "fileConnector",
    class: "com.kmwllc.lucille.connector.FileConnector",
    pipeline: "pipeline1",
    
    # S3 path to your files
    paths: [${?PATH_TO_STORAGE}]
    
    fileOptions: {
      getFileContent: true
    }
    
    # S3 configuration
    s3: {
      region: ${?AWS_REGION}
      accessKeyId: ${?AWS_ACCESS_KEY_ID}
      secretAccessKey: ${?AWS_SECRET_ACCESS_KEY}
    }
  }
]

pipelines: [
  {
    name: "pipeline1",
    stages: [
      {
        name: "TextExtractor"
        class: "com.kmwllc.lucille.tika.stage.TextExtractor"
        # FileConnector places content in this field
        byteArrayField: "file_content"
        metadataPrefix: ""
        metadataBlacklist: []
        tikaConfigPath: "conf/tika-config.xml"
      },
      {
        name: "ChunkText"
        class: "com.kmwllc.lucille.stage.ChunkText"
        source: "text"
        dest: "text"
        chunkingMethod: "paragraph"
      },
      {
        name: "EmitNestedChildren"
        class: "com.kmwllc.lucille.stage.EmitNestedChildren"
        dropParent: false
      },
      {
        name: "DeleteFields"
        class: "com.kmwllc.lucille.stage.DeleteFields"
        fields: ["file_content", "_version"]
      }   
    ]
  }
]

indexer {
  type: "OpenSearch"
  batchTimeout: 1000
  batchSize: 100
  sendEnabled: true
}

opensearch {
  url: ${?INGEST_URL}
  index: ${?INGEST_INDEX}
  acceptInvalidCert: true
}
The ${?VAR} syntax is HOCON's optional substitution: the setting takes the environment variable's value when the variable is set, and is left unset (or keeps an earlier default) otherwise.
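For example, a hard-coded default can precede the optional substitution; HOCON keeps the default unless the variable is set (illustrative snippet — the config above does not define defaults):

```
opensearch {
  # default used for local testing
  index: "s3-docs"
  # overrides the default only when INGEST_INDEX is set
  index: ${?INGEST_INDEX}
}
```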

Pipeline Stages Explained

1. TextExtractor Stage

Extracts text from binary file content using Apache Tika:
{
  name: "TextExtractor"
  class: "com.kmwllc.lucille.tika.stage.TextExtractor"
  byteArrayField: "file_content"    # Input field with binary data
  metadataPrefix: ""                # Prefix for metadata fields
  tikaConfigPath: "conf/tika-config.xml"
}
Supported formats: PDF, Word, Excel, PowerPoint, HTML, plain text, images (with OCR), and 1000+ more.

2. ChunkText Stage

Splits text into smaller chunks for RAG applications:
{
  name: "ChunkText"
  class: "com.kmwllc.lucille.stage.ChunkText"
  source: "text"              # Field containing text to chunk
  dest: "text"                # Output field name in child docs
  chunkingMethod: "paragraph" # Options: paragraph, sentence, fixed, custom
}
For RAG applications, use paragraph chunking with a minimum chunk size:
chunkingMethod: "paragraph"
preMergeMinChunkLen: 100
characterLimit: 2000

3. EmitNestedChildren Stage

Converts attached child documents into separate documents:
{
  name: "EmitNestedChildren"
  class: "com.kmwllc.lucille.stage.EmitNestedChildren"
  dropParent: false  # Set to true to only index chunks, not parent
}
Each chunk becomes a separate document with metadata:
  • parent_id: Original document ID
  • chunk_number: Sequence number
  • total_chunks: Total number of chunks
  • offset: Character offset in original text
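Putting those fields together, an emitted chunk document might look like the following (values and ID scheme are illustrative, not exact Lucille output):

```json
{
  "id": "s3://my-test-bucket/folder1/report.pdf-chunk-2",
  "parent_id": "s3://my-test-bucket/folder1/report.pdf",
  "chunk_number": 2,
  "total_chunks": 5,
  "offset": 2048,
  "text": "Second paragraph of the extracted document..."
}
```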

4. DeleteFields Stage

Cleans up unnecessary fields before indexing:
{
  name: "DeleteFields"
  class: "com.kmwllc.lucille.stage.DeleteFields"
  fields: ["file_content", "_version"]
}

Environment Variables

Create a .env file or export variables:
# S3 Configuration
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export PATH_TO_STORAGE="s3://my-test-bucket/folder1"

# OpenSearch Configuration
export INGEST_URL="https://username:password@localhost:9200/"
export INGEST_INDEX="s3-docs"

Running the Ingestion

1. Build the Project

mvn clean package
2. Run the Script

./scripts/run_ingest.sh
The script runs:
java -Dconfig.file=conf/s3-opensearch.conf \
     -cp 'target/lib/*' \
     com.kmwllc.lucille.core.Runner
3. Monitor Progress

Watch the logs for progress:
INFO  FileConnector - Processing s3://bucket/file.pdf
INFO  TextExtractor - Extracted 2500 characters
INFO  ChunkText - Created 5 chunks
INFO  Indexer - Indexed 5 documents

Advanced Configurations

Sentence-Based Chunking with Overlap

For more granular control:
{
  name: "ChunkText"
  class: "com.kmwllc.lucille.stage.ChunkText"
  source: "text"
  dest: "text"
  chunkingMethod: "sentence"
  chunksToMerge: 5           # Merge 5 sentences per chunk
  chunksToOverlap: 1         # Overlap 1 sentence between chunks
  characterLimit: 2000       # Max characters per chunk
  cleanChunks: true          # Remove extra whitespace
}

Custom Tika Configuration

Create conf/tika-config.xml for OCR and advanced parsing:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
</properties>

Filtering Files by Type

Process only specific file types:
fileOptions: {
  getFileContent: true
  includePatterns: [".*\\.pdf$", ".*\\.docx?$"]
  excludePatterns: [".*\\.tmp$"]
}
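The include/exclude patterns are regular expressions tested against the file path. A quick way to sanity-check a pattern before an ingest run is to replay it against sample paths — here in Python, whose regex syntax is close enough for patterns like these (matching against the full path is an assumption about FileConnector's behavior):

```python
import re

# same patterns as the includePatterns example above
include = [r".*\.pdf$", r".*\.docx?$"]

paths = [
    "s3://bucket/a/report.pdf",
    "s3://bucket/b/notes.txt",
    "s3://bucket/c/spec.docx",
]

# keep paths that match at least one include pattern
matched = [p for p in paths if any(re.match(pat, p) for pat in include)]
# report.pdf and spec.docx match; notes.txt does not
```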

Adding Metadata Enrichment

Extract and transform metadata:
stages: [
  {
    name: "TextExtractor"
    class: "com.kmwllc.lucille.tika.stage.TextExtractor"
    byteArrayField: "file_content"
    metadataPrefix: "meta_"
  },
  {
    name: "CopyFields"
    class: "com.kmwllc.lucille.stage.CopyFields"
    fieldMapping: {
      "meta_author": "author"
      "meta_title": "title"
      "file_path": "source_path"
    }
  },
  // ... rest of pipeline
]

Chunking Strategies

Paragraph Chunking

Best for: Documents with clear paragraph structure
chunkingMethod: "paragraph"
preMergeMinChunkLen: 100

Sentence Chunking

Best for: Fine-grained control, QA systems
chunkingMethod: "sentence"
chunksToMerge: 5
chunksToOverlap: 1

Fixed Chunking

Best for: Consistent chunk sizes
chunkingMethod: "fixed"
lengthToSplit: 1000
overlapPercentage: 10
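To make the fixed-chunking arithmetic concrete, here is an illustrative Python sketch of a sliding window with a 10% overlap — a conceptual model of the parameters above, not Lucille's actual implementation:

```python
def fixed_chunks(text, length_to_split=1000, overlap_percentage=10):
    """Split text into fixed-size windows where consecutive windows
    share overlap_percentage percent of length_to_split characters."""
    overlap = length_to_split * overlap_percentage // 100
    step = length_to_split - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + length_to_split])
        if start + length_to_split >= len(text):
            break
    return chunks

chunks = fixed_chunks("a" * 2500, length_to_split=1000, overlap_percentage=10)
# each window advances 900 characters, so adjacent chunks share 100
```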

Custom Chunking

Best for: Domain-specific splitting
chunkingMethod: "custom"
regex: "\n\n+"

Indexing to OpenSearch

Create an Embedding Index

For vector search, create an index with embeddings:
PUT /s3-docs
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 768
      },
      "parent_id": { "type": "keyword" },
      "chunk_number": { "type": "integer" }
    }
  }
}
For production use, consider adding an embedding stage (like OpenAI or HuggingFace) before indexing.
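Once the embedding field is populated, chunks can be retrieved with a k-NN query against this mapping (the vector is truncated for brevity — a real query must supply all 768 dimensions):

```json
POST /s3-docs/_search
{
  "size": 3,
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.12, -0.07, 0.33],
        "k": 3
      }
    }
  }
}
```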

Troubleshooting

S3 access denied

Check your AWS credentials and bucket permissions:
aws s3 ls s3://your-bucket/
Ensure your IAM user/role has s3:GetObject and s3:ListBucket permissions.

Out of memory

Reduce the batch size or give the JVM more memory:
java -Xmx4g -Dconfig.file=conf/s3-opensearch.conf ...
Or process fewer files at once by using more specific paths.

OpenSearch connection failures

Verify OpenSearch is accessible:
curl -k https://username:password@localhost:9200/
Set acceptInvalidCert: true for self-signed certificates.