Overview

AI and Machine Learning stages integrate powerful models for embeddings, text generation, chunking, and intelligent document processing. These stages enable advanced use cases like semantic search, RAG (Retrieval-Augmented Generation), and LLM-powered enrichment.

OpenAIEmbed

Generates vector embeddings using OpenAI’s embedding models. Supports both document-level and child document embeddings.
source
String
required
Field containing the text to embed.
dest
String
default:"embeddings"
Field to store the embedding vectors.
apiKey
String
required
OpenAI API key. Should be stored securely (e.g., environment variable).
embedDocument
Boolean
required
Whether to embed the document’s source field.
embedChildren
Boolean
required
Whether to embed the source field of each of the document’s children.
modelName
String
default:"text-embedding-3-small"
OpenAI embedding model to use:
  • text-embedding-3-small - Fast, cost-effective (default)
  • text-embedding-3-large - Higher quality, larger dimensions
  • text-embedding-ada-002 - Legacy model
dimensions
Integer
Number of dimensions for the embedding vector. Only supported in text-embedding-3-* models. If not specified, uses the model’s default dimensions.
The stage automatically truncates text to OpenAI’s token limit (8191 tokens) to prevent API errors.

Example: Basic Document Embedding

stages:
  - class: com.kmwllc.lucille.stage.OpenAIEmbed
    name: embed_content
    source: content
    dest: content_vector
    apiKey: ${OPENAI_API_KEY}
    embedDocument: true
    embedChildren: false
    modelName: text-embedding-3-small

Example: Embed Document and Children

stages:
  - class: com.kmwllc.lucille.stage.ChunkText
    name: chunk_content
    source: content
    dest: text
    chunkingMethod: sentence
    chunksToMerge: 5
    characterLimit: 2000
  
  - class: com.kmwllc.lucille.stage.OpenAIEmbed
    name: embed_chunks
    source: text
    dest: embeddings
    apiKey: ${OPENAI_API_KEY}
    embedDocument: false
    embedChildren: true  # Embed the chunks created by ChunkText
    modelName: text-embedding-3-large
    dimensions: 1536

Example: Semantic Search Pipeline

stages:
  - class: com.kmwllc.lucille.stage.OpenAIEmbed
    name: create_embeddings
    source: combined_text
    dest: search_vector
    apiKey: ${OPENAI_API_KEY}
    embedDocument: true
    embedChildren: false
    modelName: text-embedding-3-small
    dimensions: 512  # Smaller dimensions for faster search

ChunkText

Splits text into smaller chunks for processing by LLMs or embedding models. Produces child documents containing the chunks.
source
String
required
Field containing the text to chunk.
dest
String
default:"text"
Field name for chunk content in child documents.
chunkingMethod
String
default:"sentence"
How to split the text:
  • sentence - Use OpenNLP sentence detection (default)
  • paragraph - Split on consecutive line breaks
  • fixed - Split by character count
  • custom - Use custom regex pattern
regex
String
Custom regex pattern for splitting. Required when chunkingMethod=custom.
lengthToSplit
Integer
Character length for each chunk. Required when chunkingMethod=fixed.
cleanChunks
Boolean
default:false
Whether to remove newlines and trim whitespace from chunks.
preMergeMinChunkLen
Integer
Minimum chunk length in characters. Smaller chunks are merged with neighboring chunks.
preMergeMaxChunkLen
Integer
Maximum chunk length in characters before merging. Longer chunks are truncated to this length.
chunksToMerge
Integer
default:1
Number of initial chunks to merge into final chunks. Example: chunksToMerge=2 merges [chunk1, chunk2], [chunk3, chunk4], etc.
chunksToOverlap
Integer
Number of chunks to overlap when merging, creating a sliding window. Example: chunksToMerge=3, chunksToOverlap=1 creates [c1,c2,c3], [c3,c4,c5], [c5,c6,c7]. Cannot be used with overlapPercentage.
overlapPercentage
Integer
default:0
Percentage of neighboring chunks to include as overlap. Cannot be used with chunksToOverlap.
characterLimit
Integer
Hard limit on final chunk size. Applied after all other processing.
This stage creates attached child documents. Use EmitNestedChildren stage afterward to emit children as separate documents for indexing.

Child Document Fields

Each chunk creates a child document with:
  • id - Format: parent_id-chunk_number
  • parent_id - ID of parent document
  • offset - Character offset from start of source text
  • length - Number of characters in chunk
  • chunk_number - Chunk sequence number (1-indexed)
  • total_chunks - Total number of chunks created
  • {dest} - The chunk content (default field name: text)
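To illustrate the field list above, a child document produced by chunking a parent with id doc1 (using the default dest of text) might look like the following sketch; the offsets and content shown are hypothetical:

```json
{
  "id": "doc1-2",
  "parent_id": "doc1",
  "offset": 412,
  "length": 387,
  "chunk_number": 2,
  "total_chunks": 6,
  "text": "The second chunk of the source text..."
}
```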

Example: Sentence-Based Chunking for RAG

stages:
  - class: com.kmwllc.lucille.stage.ChunkText
    name: chunk_for_rag
    source: content
    dest: text
    chunkingMethod: sentence
    chunksToMerge: 5        # Combine 5 sentences per chunk
    chunksToOverlap: 1      # Overlap by 1 sentence
    cleanChunks: true
    characterLimit: 2000    # Hard limit for LLM context
  
  - class: com.kmwllc.lucille.stage.EmitNestedChildren
    drop_parent: true       # Only index the chunks

Example: Paragraph Chunking

stages:
  - class: com.kmwllc.lucille.stage.ChunkText
    name: chunk_paragraphs
    source: article_text
    chunkingMethod: paragraph
    preMergeMinChunkLen: 100  # Merge small paragraphs
    cleanChunks: true

Example: Fixed-Size Chunks

stages:
  - class: com.kmwllc.lucille.stage.ChunkText
    name: fixed_chunks
    source: log_data
    chunkingMethod: fixed
    lengthToSplit: 500
    overlapPercentage: 10   # 10% overlap between chunks

Example: Custom Regex Chunking

stages:
  - class: com.kmwllc.lucille.stage.ChunkText
    name: chunk_by_section
    source: document_text
    chunkingMethod: custom
    regex: "(?m)^#{1,3}\\s+"  # Split on markdown headers
    cleanChunks: true

PromptOllama

Sends documents to local LLMs via Ollama for enrichment, extraction, or generation tasks.
hostURL
String
required
URL to your Ollama server (e.g., http://localhost:11434).
modelName
String
required
Name of the Ollama model to use (e.g., llama2, mistral, phi3). See the Ollama library for available models.
systemPrompt
String
System prompt instructing the LLM what to do with the document. If not specified, uses the model’s default system prompt (if any).
fields
List<String>
Specific fields to send to the LLM. If empty or not specified, sends the entire document.
requireJSON
Boolean
default:false
Whether to require JSON-formatted responses:
  • true - The stage throws an exception on non-JSON responses and sets format: "json" in the request
  • false - Non-JSON responses are placed in the ollamaResponse field
timeout
Long
Request timeout in seconds. Defaults to Ollama’s default (10 seconds). Increase for complex prompts or when running multiple workers.
updateMode
String
default:"overwrite"
How to handle fields extracted from JSON responses: overwrite, append, or skip.
It’s highly recommended to instruct the LLM to output JSON for better reliability and automatic field extraction.
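As a hypothetical illustration of that extraction: with requireJSON: true and a system prompt like the entity-extraction example below, a model reply such as the following would have each top-level key extracted onto the document as a field, subject to the updateMode setting. The values here are invented:

```json
{
  "people": ["Ada Lovelace"],
  "organizations": ["Acme Corp"],
  "locations": ["Boston"]
}
```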

Example: Extract Entities

stages:
  - class: com.kmwllc.lucille.stage.PromptOllama
    name: extract_entities
    hostURL: http://localhost:11434
    modelName: llama2
    systemPrompt: |
      Extract all person names, organizations, and locations from the text.
      Return your response as JSON with keys: "people", "organizations", "locations".
      Each key should have an array of strings.
    fields: ["content"]
    requireJSON: true

Example: Generate Summaries

stages:
  - class: com.kmwllc.lucille.stage.PromptOllama
    name: generate_summary
    hostURL: http://localhost:11434
    modelName: mistral
    systemPrompt: |
      Provide a concise 2-3 sentence summary of the document.
      Return JSON with key "summary".
    requireJSON: true
    timeout: 30

Example: Content Classification

stages:
  - class: com.kmwllc.lucille.stage.PromptOllama
    name: classify_content
    hostURL: http://localhost:11434
    modelName: phi3
    systemPrompt: |
      Classify this document into one of these categories:
      "technical", "business", "legal", "marketing", "other".
      Return JSON with keys "category" and "confidence" (0-1).
    fields: ["title", "description"]
    requireJSON: true

Example: Sentiment Analysis

stages:
  - class: com.kmwllc.lucille.stage.PromptOllama
    name: analyze_sentiment
    hostURL: http://localhost:11434
    modelName: llama2
    systemPrompt: |
      Analyze the sentiment of this text.
      Return JSON with "sentiment" (positive/negative/neutral) and "score" (-1 to 1).
    fields: ["review_text"]
    requireJSON: true
    updateMode: skip  # Don't overwrite existing sentiment

RandomVector

Generates random vectors for testing embedding pipelines and vector search.
dest
String
required
Destination field for the random vector.
dimensions
Integer
required
Number of dimensions in the vector.

Example: Generate Test Vectors

stages:
  - class: com.kmwllc.lucille.stage.RandomVector
    name: create_test_vectors
    dest: test_embedding
    dimensions: 768

EmbeddedPython

Executes Python code within the JVM using Jython for custom transformations.
script
String
required
Python code to execute. Has access to the doc object.
scriptPath
String
Path to external Python script file. Alternative to inline script.
Uses Jython, which supports Python 2.7 syntax. For Python 3, use ExternalPython instead.

Example: Custom Field Logic

stages:
  - class: com.kmwllc.lucille.stage.EmbeddedPython
    name: custom_scoring
    script: |
      if doc.has("views") and doc.has("likes"):
          views = doc.getInt("views")
          likes = doc.getInt("likes")
          # Jython is Python 2.7, so cast to float to avoid integer division
          score = (float(likes) / views) * 100 if views > 0 else 0
          doc.setField("engagement_score", score)

ExternalPython

Executes external Python 3 scripts for advanced processing.
scriptPath
String
required
Path to Python script to execute.
pythonExecutable
String
Path to Python executable. Defaults to system python3.

Example: ML Model Inference

stages:
  - class: com.kmwllc.lucille.stage.ExternalPython
    name: run_ml_model
    scriptPath: /scripts/classify_document.py
    pythonExecutable: /usr/bin/python3

ApplyJavascript

Executes JavaScript code for document transformation using GraalVM.
script
String
Inline JavaScript code to execute.
scriptPath
String
Path to external JavaScript file.
returnField
String
Field to store script return value.

Example: Custom Transformation

stages:
  - class: com.kmwllc.lucille.stage.ApplyJavascript
    name: calculate_score
    script: |
      var views = doc.getInt('views');
      var clicks = doc.getInt('clicks');
      var ctr = views > 0 ? (clicks / views) * 100 : 0;
      doc.setField('click_through_rate', ctr);

ApplyJSONata

Applies JSONata transformations to JSON data.
source
String
required
Source field containing JSON data.
dest
String
required
Destination field for transformation result.
expression
String
required
JSONata expression to apply.

Example: Transform JSON

stages:
  - class: com.kmwllc.lucille.stage.ApplyJSONata
    name: transform_metadata
    source: api_response
    dest: processed_data
    expression: |
      {
        "user": user.name,
        "email": user.contact.email,
        "tags": items.tag
      }

Best Practices for AI/ML Stages

Rate Limiting

For API-based stages (OpenAI), implement rate limiting or use multiple API keys to avoid throttling.

Chunk Sizing

For RAG applications, chunks of 500-2000 characters with 10-20% overlap typically work well.
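That guidance can be expressed with the ChunkText options documented above; the values in this sketch are illustrative starting points, not prescriptive:

```yaml
stages:
  - class: com.kmwllc.lucille.stage.ChunkText
    name: chunk_for_retrieval
    source: content
    chunkingMethod: fixed
    lengthToSplit: 1000       # within the 500-2000 character range
    overlapPercentage: 15     # within the 10-20% overlap range
    cleanChunks: true
    characterLimit: 2000      # hard cap on final chunk size
```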

Error Handling

AI/ML stages can fail due to API limits, timeouts, or model errors. Configure appropriate error handling and retries.

Cost Management

Monitor API usage for OpenAI stages. Use cheaper models (text-embedding-3-small) for development and testing.

Common Patterns

RAG (Retrieval-Augmented Generation) Pipeline

stages:
  # 1. Chunk the document
  - class: com.kmwllc.lucille.stage.ChunkText
    source: content
    chunkingMethod: sentence
    chunksToMerge: 5
    chunksToOverlap: 1
    characterLimit: 1500
  
  # 2. Generate embeddings for chunks
  - class: com.kmwllc.lucille.stage.OpenAIEmbed
    source: text
    apiKey: ${OPENAI_API_KEY}
    embedDocument: false
    embedChildren: true
    modelName: text-embedding-3-small
  
  # 3. Emit children as separate documents
  - class: com.kmwllc.lucille.stage.EmitNestedChildren
    drop_parent: true

Document Enrichment with LLM

stages:
  # 1. Extract key information
  - class: com.kmwllc.lucille.stage.PromptOllama
    hostURL: http://localhost:11434
    modelName: llama2
    systemPrompt: "Extract key topics, entities, and summary as JSON"
    requireJSON: true
  
  # 2. Generate embeddings
  - class: com.kmwllc.lucille.stage.OpenAIEmbed
    source: summary
    dest: summary_vector
    apiKey: ${OPENAI_API_KEY}
    embedDocument: true
    embedChildren: false