This guide demonstrates how to ingest files from Amazon S3, extract text using Apache Tika, chunk the content for RAG applications, and index to OpenSearch.

Overview

This example shows a complete RAG (Retrieval-Augmented Generation) ingestion pipeline:
  • Read files from S3 buckets
  • Extract text from various file formats (PDF, Word, HTML, etc.)
  • Chunk text into smaller segments
  • Index to OpenSearch for vector search

Prerequisites

1. AWS Credentials

Ensure you have AWS credentials configured:
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
2. OpenSearch Setup

Start OpenSearch locally using Docker:
docker-compose up -d
Or follow the OpenSearch Getting Started Guide.
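The docker-compose.yml itself is not included in this guide; a minimal single-node sketch might look like the following (image tag and password are illustrative — recent OpenSearch images require an initial admin password):

```yaml
services:
  opensearch:
    image: opensearchproject/opensearch:latest
    environment:
      # run as a single node, no cluster discovery
      - discovery.type=single-node
      # required by the demo security config in OpenSearch 2.12+
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=ChangeMe_123!
    ports:
      - "9200:9200"
```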
3. S3 Bucket

Create an S3 bucket and upload documents:
aws s3 mb s3://my-test-bucket
aws s3 cp documents/ s3://my-test-bucket/folder1/ --recursive

Configuration

Full Pipeline Configuration

connectors: [
  {
    name: "fileConnector",
    class: "com.kmwllc.lucille.connector.FileConnector",
    pipeline: "pipeline1",
    
    # S3 path to your files
    paths: [${?PATH_TO_STORAGE}]
    
    fileOptions: {
      getFileContent: true
    }
    
    # S3 configuration
    s3: {
      region: ${?AWS_REGION}
      accessKeyId: ${?AWS_ACCESS_KEY_ID}
      secretAccessKey: ${?AWS_SECRET_ACCESS_KEY}
    }
  }
]

pipelines: [
  {
    name: "pipeline1",
    stages: [
      {
        name: "TextExtractor"
        class: "com.kmwllc.lucille.tika.stage.TextExtractor"
        # FileConnector places content in this field
        byteArrayField: "file_content"
        metadataPrefix: ""
        metadataBlacklist: []
        tikaConfigPath: "conf/tika-config.xml"
      },
      {
        name: "ChunkText"
        class: "com.kmwllc.lucille.stage.ChunkText"
        source: "text"
        dest: "text"
        chunkingMethod: "paragraph"
      },
      {
        name: "EmitNestedChildren"
        class: "com.kmwllc.lucille.stage.EmitNestedChildren"
        dropParent: false
      },
      {
        name: "DeleteFields"
        class: "com.kmwllc.lucille.stage.DeleteFields"
        fields: ["file_content", "_version"]
      }   
    ]
  }
]

indexer {
  type: "OpenSearch"
  batchTimeout: 1000
  batchSize: 100
  sendEnabled: true
}

opensearch {
  url: ${?INGEST_URL}
  index: ${?INGEST_INDEX}
  acceptInvalidCert: true
}
The ${?VAR} syntax is HOCON's optional substitution: the setting takes the environment variable's value when the variable is set, and is left unset (or keeps an earlier default) otherwise.
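For example, a hard-coded default can precede the optional substitution; HOCON keeps the default unless the variable is set (illustrative snippet — the config above does not define defaults):

```
opensearch {
  # default used for local testing
  index: "s3-docs"
  # overrides the default only when INGEST_INDEX is set
  index: ${?INGEST_INDEX}
}
```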

Pipeline Stages Explained

1. TextExtractor Stage

Extracts text from binary file content using Apache Tika:
{
  name: "TextExtractor"
  class: "com.kmwllc.lucille.tika.stage.TextExtractor"
  byteArrayField: "file_content"    # Input field with binary data
  metadataPrefix: ""                # Prefix for metadata fields
  tikaConfigPath: "conf/tika-config.xml"
}
Supported formats: PDF, Word, Excel, PowerPoint, HTML, plain text, images (with OCR), and 1000+ more.

2. ChunkText Stage

Splits text into smaller chunks for RAG applications:
{
  name: "ChunkText"
  class: "com.kmwllc.lucille.stage.ChunkText"
  source: "text"              # Field containing text to chunk
  dest: "text"                # Output field name in child docs
  chunkingMethod: "paragraph" # Options: paragraph, sentence, fixed, custom
}
For RAG applications, use paragraph chunking with a minimum chunk size:
chunkingMethod: "paragraph"
preMergeMinChunkLen: 100
characterLimit: 2000

3. EmitNestedChildren Stage

Converts attached child documents into separate documents:
{
  name: "EmitNestedChildren"
  class: "com.kmwllc.lucille.stage.EmitNestedChildren"
  dropParent: false  # Set to true to only index chunks, not parent
}
Each chunk becomes a separate document with metadata:
  • parent_id: Original document ID
  • chunk_number: Sequence number
  • total_chunks: Total number of chunks
  • offset: Character offset in original text
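Putting those fields together, an emitted chunk document might look like the following (values and ID scheme are illustrative, not exact Lucille output):

```json
{
  "id": "s3://my-test-bucket/folder1/report.pdf-chunk-2",
  "parent_id": "s3://my-test-bucket/folder1/report.pdf",
  "chunk_number": 2,
  "total_chunks": 5,
  "offset": 2048,
  "text": "Second paragraph of the extracted document..."
}
```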

4. DeleteFields Stage

Cleans up unnecessary fields before indexing:
{
  name: "DeleteFields"
  class: "com.kmwllc.lucille.stage.DeleteFields"
  fields: ["file_content", "_version"]
}

Environment Variables

Create a .env file or export variables:
# S3 Configuration
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export PATH_TO_STORAGE="s3://my-test-bucket/folder1"

# OpenSearch Configuration
export INGEST_URL="https://username:password@localhost:9200/"
export INGEST_INDEX="s3-docs"

Running the Ingestion

1. Build the Project

mvn clean package
2. Run the Script

./scripts/run_ingest.sh
The script runs:
java -Dconfig.file=conf/s3-opensearch.conf \
     -cp 'target/lib/*' \
     com.kmwllc.lucille.core.Runner
3. Monitor Progress

Watch the logs for progress:
INFO  FileConnector - Processing s3://bucket/file.pdf
INFO  TextExtractor - Extracted 2500 characters
INFO  ChunkText - Created 5 chunks
INFO  Indexer - Indexed 5 documents

Advanced Configurations

Sentence-Based Chunking with Overlap

For more granular control:
{
  name: "ChunkText"
  class: "com.kmwllc.lucille.stage.ChunkText"
  source: "text"
  dest: "text"
  chunkingMethod: "sentence"
  chunksToMerge: 5           # Merge 5 sentences per chunk
  chunksToOverlap: 1         # Overlap 1 sentence between chunks
  characterLimit: 2000       # Max characters per chunk
  cleanChunks: true          # Remove extra whitespace
}

Custom Tika Configuration

Create conf/tika-config.xml for OCR and advanced parsing:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
</properties>

Filtering Files by Type

Process only specific file types:
fileOptions: {
  getFileContent: true
  includePatterns: [".*\\.pdf$", ".*\\.docx?$"]
  excludePatterns: [".*\\.tmp$"]
}
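The include/exclude patterns are regular expressions tested against the file path. A quick way to sanity-check a pattern before an ingest run is to replay it against sample paths — here in Python, whose regex syntax is close enough for patterns like these (matching against the full path is an assumption about FileConnector's behavior):

```python
import re

# same patterns as the includePatterns example above
include = [r".*\.pdf$", r".*\.docx?$"]

paths = [
    "s3://bucket/a/report.pdf",
    "s3://bucket/b/notes.txt",
    "s3://bucket/c/spec.docx",
]

# keep paths that match at least one include pattern
matched = [p for p in paths if any(re.match(pat, p) for pat in include)]
# report.pdf and spec.docx match; notes.txt does not
```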

Adding Metadata Enrichment

Extract and transform metadata:
stages: [
  {
    name: "TextExtractor"
    class: "com.kmwllc.lucille.tika.stage.TextExtractor"
    byteArrayField: "file_content"
    metadataPrefix: "meta_"
  },
  {
    name: "CopyFields"
    class: "com.kmwllc.lucille.stage.CopyFields"
    fieldMapping: {
      "meta_author": "author"
      "meta_title": "title"
      "file_path": "source_path"
    }
  },
  // ... rest of pipeline
]

Chunking Strategies

Paragraph Chunking

Best for: Documents with clear paragraph structure
chunkingMethod: "paragraph"
preMergeMinChunkLen: 100

Sentence Chunking

Best for: Fine-grained control, QA systems
chunkingMethod: "sentence"
chunksToMerge: 5
chunksToOverlap: 1

Fixed Chunking

Best for: Consistent chunk sizes
chunkingMethod: "fixed"
lengthToSplit: 1000
overlapPercentage: 10
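To make the fixed-chunking arithmetic concrete, here is an illustrative Python sketch of a sliding window with a 10% overlap — a conceptual model of the parameters above, not Lucille's actual implementation:

```python
def fixed_chunks(text, length_to_split=1000, overlap_percentage=10):
    """Split text into fixed-size windows where consecutive windows
    share overlap_percentage percent of length_to_split characters."""
    overlap = length_to_split * overlap_percentage // 100
    step = length_to_split - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + length_to_split])
        if start + length_to_split >= len(text):
            break
    return chunks

chunks = fixed_chunks("a" * 2500, length_to_split=1000, overlap_percentage=10)
# each window advances 900 characters, so adjacent chunks share 100
```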

Custom Chunking

Best for: Domain-specific splitting
chunkingMethod: "custom"
regex: "\n\n+"

Indexing to OpenSearch

Create an Embedding Index

For vector search, create an index with embeddings:
PUT /s3-docs
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 768
      },
      "parent_id": { "type": "keyword" },
      "chunk_number": { "type": "integer" }
    }
  }
}
For production use, consider adding an embedding stage (like OpenAI or HuggingFace) before indexing.
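Once the embedding field is populated, chunks can be retrieved with a k-NN query against this mapping (the vector is truncated for brevity — a real query must supply all 768 dimensions):

```json
POST /s3-docs/_search
{
  "size": 3,
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.12, -0.07, 0.33],
        "k": 3
      }
    }
  }
}
```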

Troubleshooting

S3 access denied

Check your AWS credentials and bucket permissions:
aws s3 ls s3://your-bucket/
Ensure your IAM user/role has s3:GetObject and s3:ListBucket permissions.

Out of memory

Reduce the batch size or give the JVM more memory:
java -Xmx4g -Dconfig.file=conf/s3-opensearch.conf ...
Or process fewer files at once by using more specific paths.

OpenSearch connection failures

Verify OpenSearch is accessible:
curl -k https://username:password@localhost:9200/
Set acceptInvalidCert: true for self-signed certificates.