Overview
AI and Machine Learning stages integrate powerful models for embeddings, text generation, chunking, and intelligent document processing. These stages enable advanced use cases such as semantic search, retrieval-augmented generation (RAG), and LLM-powered enrichment.

OpenAIEmbed
Generates vector embeddings using OpenAI’s embedding models. Supports both document-level and child-document embeddings.

Parameters:
- Field containing the text to embed.
- Field to store the embedding vectors.
- OpenAI API key. Should be stored securely (e.g., in an environment variable).
- Whether to embed the document’s source field.
- Whether to embed the source fields of the document’s children.
- OpenAI embedding model to use:
  - text-embedding-3-small – fast, cost-effective (default)
  - text-embedding-3-large – higher quality, larger dimensions
  - text-embedding-ada-002 – legacy model
- Number of dimensions for the embedding vector. Supported only by the text-embedding-3-* models; if not specified, the model’s default dimensionality is used.

The stage automatically truncates text to OpenAI’s token limit (8191 tokens) to prevent API errors.
Example: Basic Document Embedding
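A minimal config sketch for embedding a single field. The class name follows the usual stage package convention, and parameter keys such as `source`, `dest`, `api_key`, and `model_name` are assumptions — confirm them against the stage’s parameter reference:

```hocon
{
  name: "embedBody"
  class: "com.kmwllc.lucille.stage.OpenAIEmbed"
  source: "body"                      # field containing the text to embed
  dest: "body_embedding"              # field to store the embedding vectors
  api_key: ${?OPENAI_API_KEY}         # HOCON env-variable substitution; avoids hardcoding the key
  model_name: "text-embedding-3-small"
}
```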
Example: Embed Document and Children
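A sketch that embeds both the parent document and its attached children (the boolean flag names here are illustrative, not confirmed):

```hocon
{
  name: "embedAll"
  class: "com.kmwllc.lucille.stage.OpenAIEmbed"
  source: "text"
  dest: "embedding"
  api_key: ${?OPENAI_API_KEY}
  model_name: "text-embedding-3-small"
  embed_document: true     # embed the parent's source field
  embed_children: true     # also embed each child's source field
}
```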
Example: Semantic Search Pipeline
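A sketch of a full semantic-search pipeline: chunk the text into attached children, embed each chunk, then emit the children as standalone documents for indexing. All parameter keys and the exact class names are assumptions to verify against each stage’s reference:

```hocon
pipelines: [
  {
    name: "semantic-search"
    stages: [
      # 1. Split long text into sentence-based chunks (attached child documents)
      {
        class: "com.kmwllc.lucille.stage.ChunkText"
        source: "body"
        chunkingMethod: "sentence"
        chunksToMerge: 3
      }
      # 2. Embed each child's chunk text
      {
        class: "com.kmwllc.lucille.stage.OpenAIEmbed"
        source: "text"
        dest: "embedding"
        api_key: ${?OPENAI_API_KEY}
        model_name: "text-embedding-3-small"
        embed_children: true
      }
      # 3. Emit children as separate documents so the indexer sees one doc per chunk
      { class: "com.kmwllc.lucille.stage.EmitNestedChildren" }
    ]
  }
]
```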
ChunkText
Splits text into smaller chunks for processing by LLMs or embedding models. Produces child documents containing the chunks.

Parameters:
- Field containing the text to chunk.
- Field name for chunk content in child documents.
- How to split the text (chunkingMethod):
  - sentence – use OpenNLP sentence detection (default)
  - paragraph – split on consecutive line breaks
  - fixed – split by character count
  - custom – use a custom regex pattern
- Custom regex pattern for splitting. Required when chunkingMethod=custom.
- Character length for each chunk. Required when chunkingMethod=fixed.
- Whether to remove newlines and trim whitespace from chunks.
- Minimum chunk length in characters. Smaller chunks are merged with neighboring chunks.
- Maximum chunk length before merging. Chunks are truncated to this length.
- Number of initial chunks to merge into each final chunk. Example: chunksToMerge=2 merges [chunk1, chunk2], [chunk3, chunk4], etc.
- Number of chunks to overlap when merging, creating a sliding window. Example: chunksToMerge=3 with chunksToOverlap=1 creates [c1,c2,c3], [c3,c4,c5], [c5,c6,c7]. Cannot be used with overlapPercentage.
- Percentage of neighboring chunks to include as overlap. Cannot be used with chunksToOverlap.
- Hard limit on final chunk size. Applied after all other processing.

This stage creates attached child documents. Use the EmitNestedChildren stage afterward to emit the children as separate documents for indexing.

Child Document Fields

Each chunk creates a child document with:
- id – format: parent_id-chunk_number
- parent_id – ID of the parent document
- offset – character offset from the start of the source text
- length – number of characters in the chunk
- chunk_number – chunk sequence number (1-indexed)
- total_chunks – total number of chunks created
- {dest} – the chunk content (default field name: text)
Example: Sentence-Based Chunking for RAG
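A sketch of sentence chunking with a sliding-window overlap, which suits RAG retrieval (the `source` and `dest` keys are assumed names; chunkingMethod, chunksToMerge, and chunksToOverlap come from the parameter list above):

```hocon
{
  name: "chunkForRag"
  class: "com.kmwllc.lucille.stage.ChunkText"
  source: "body"
  dest: "text"
  chunkingMethod: "sentence"
  chunksToMerge: 3        # combine 3 sentences per final chunk
  chunksToOverlap: 1      # each chunk shares 1 sentence with its neighbor
}
```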
Example: Paragraph Chunking
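A sketch splitting on consecutive line breaks, useful when the source text has meaningful paragraph structure (field-name keys are illustrative):

```hocon
{
  name: "chunkParagraphs"
  class: "com.kmwllc.lucille.stage.ChunkText"
  source: "body"
  chunkingMethod: "paragraph"
}
```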
Example: Fixed-Size Chunks
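A sketch of fixed-size chunking. The key carrying the per-chunk character count (`lengthToSplit` below) is an assumed name — the docs above only say a character length is required when chunkingMethod=fixed:

```hocon
{
  name: "chunkFixed"
  class: "com.kmwllc.lucille.stage.ChunkText"
  source: "body"
  chunkingMethod: "fixed"
  lengthToSplit: 1000     # assumed parameter name; character count per chunk
}
```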
Example: Custom Regex Chunking
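A sketch splitting on a custom delimiter, e.g., a horizontal-rule separator. The `regex` key is an assumed name for the required custom pattern parameter:

```hocon
{
  name: "chunkCustom"
  class: "com.kmwllc.lucille.stage.ChunkText"
  source: "body"
  chunkingMethod: "custom"
  regex: "\\n---\\n"      # assumed parameter name; split wherever a '---' separator line appears
}
```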
PromptOllama
Sends documents to local LLMs via Ollama for enrichment, extraction, or generation tasks.

Parameters:
- URL of your Ollama server (e.g., http://localhost:11434).
- Name of the Ollama model to use (e.g., llama2, mistral, phi3). See the Ollama library for available models.
- System prompt instructing the LLM what to do with the document. If not specified, the model’s default system prompt (if any) is used.
- Specific fields to send to the LLM. If empty or not specified, the entire document is sent.
- Whether to require JSON-formatted responses:
  - true – the stage throws an exception on non-JSON responses and sets format: "json" in the request
  - false – non-JSON responses are placed in the ollamaResponse field
- Request timeout in seconds. Defaults to Ollama’s default (10 seconds). Increase it for complex prompts or when running multiple workers.
- How to handle fields extracted from JSON responses: overwrite, append, or skip.

It is highly recommended to instruct the LLM to output JSON, for better reliability and automatic field extraction.
Example: Extract Entities
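A sketch of entity extraction with a JSON-constrained response. The keys `hostURL`, `modelName`, `systemPrompt`, and `requireJSON` are assumed names for the parameters described above:

```hocon
{
  name: "extractEntities"
  class: "com.kmwllc.lucille.stage.PromptOllama"
  hostURL: "http://localhost:11434"
  modelName: "mistral"
  systemPrompt: """Extract entities from the document. Respond only with JSON:
{"people": [], "organizations": [], "locations": []}"""
  requireJSON: true    # non-JSON responses cause an exception; sets format: "json" in the request
}
```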
Example: Generate Summaries
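A sketch of summary generation restricted to a subset of fields, with a longer timeout for a slower generative task (all parameter keys are assumed names):

```hocon
{
  name: "summarize"
  class: "com.kmwllc.lucille.stage.PromptOllama"
  hostURL: "http://localhost:11434"
  modelName: "llama2"
  fields: ["title", "body"]      # send only these fields, not the whole document
  systemPrompt: """Summarize the document in 2-3 sentences. Respond only with JSON: {"summary": "..."}"""
  requireJSON: true
  timeout: 60                    # seconds; raised from the 10-second default for generation
}
```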
Example: Content Classification
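A sketch of single-label classification into a fixed category set (parameter keys assumed, as above):

```hocon
{
  name: "classify"
  class: "com.kmwllc.lucille.stage.PromptOllama"
  hostURL: "http://localhost:11434"
  modelName: "phi3"
  systemPrompt: """Classify the document into exactly one of: news, opinion, tutorial, reference.
Respond only with JSON: {"category": "..."}"""
  requireJSON: true
}
```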
Example: Sentiment Analysis
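A sketch of sentiment analysis returning a label plus a confidence score, which the stage can extract into document fields from the JSON response (parameter keys assumed):

```hocon
{
  name: "sentiment"
  class: "com.kmwllc.lucille.stage.PromptOllama"
  hostURL: "http://localhost:11434"
  modelName: "mistral"
  fields: ["review_text"]
  systemPrompt: """Rate the sentiment of the text. Respond only with JSON:
{"sentiment": "positive|negative|neutral", "confidence": 0.0}"""
  requireJSON: true
}
```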
RandomVector
Generates random vectors for testing embedding pipelines and vector search.

Parameters:
- Destination field for the random vector.
- Number of dimensions in the vector.
Example: Generate Test Vectors
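A sketch for populating a vector field without calling a paid embedding API — useful for load-testing vector search. The `dest` and `dimensions` keys are assumed names:

```hocon
{
  name: "testVectors"
  class: "com.kmwllc.lucille.stage.RandomVector"
  dest: "embedding"
  dimensions: 1536   # match the output size of the real embedding model you plan to use
}
```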
EmbeddedPython
Executes Python code within the JVM using Jython for custom transformations.

Parameters:
- Python code to execute. Has access to the doc object.
- Path to an external Python script file. Alternative to an inline script.

Example: Custom Field Logic
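A sketch of inline field logic. The `script` key and the method names on the `doc` object (`has`, `getString`, `setField`) are assumptions about the document API — verify them before use:

```hocon
{
  name: "priceTier"
  class: "com.kmwllc.lucille.stage.EmbeddedPython"
  script: """
# 'doc' is the current document (assumed accessor method names)
if doc.has("price") and float(doc.getString("price")) > 100:
    doc.setField("price_tier", "premium")
else:
    doc.setField("price_tier", "standard")
"""
}
```

Note that Jython implements Python 2 semantics; for Python 3 code, use the ExternalPython stage instead.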
ExternalPython
Executes external Python 3 scripts for advanced processing.

Parameters:
- Path to the Python script to execute.
- Path to the Python executable. Defaults to the system python3.

Example: ML Model Inference
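A sketch that delegates to an out-of-process Python 3 script, e.g., to run a local ML model against each document. Both parameter keys and the script path are illustrative:

```hocon
{
  name: "mlInference"
  class: "com.kmwllc.lucille.stage.ExternalPython"
  scriptPath: "scripts/classify.py"       # assumed key; your Python 3 script
  pythonPath: "/opt/venv/bin/python3"     # assumed key; defaults to the system python3
}
```

Running an external interpreter keeps heavyweight ML dependencies (PyTorch, transformers, etc.) out of the JVM process.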
ApplyJavascript
Executes JavaScript code for document transformation using GraalVM.

Parameters:
- Inline JavaScript code to execute.
- Path to an external JavaScript file.
- Field to store the script’s return value.
Example: Custom Transformation
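A sketch of an inline transformation. As with the Python stages, the `script` key and the `doc` accessor method names are assumptions:

```hocon
{
  name: "normalizeTitle"
  class: "com.kmwllc.lucille.stage.ApplyJavascript"
  script: """
// 'doc' is the current document (assumed accessor method names)
var title = doc.getString("title");
doc.setField("title_normalized", title ? title.trim().toLowerCase() : "");
"""
}
```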
ApplyJSONata
Applies JSONata transformations to JSON data.

Parameters:
- Source field containing the JSON data.
- Destination field for the transformation result.
- JSONata expression to apply.
Example: Transform JSON
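A sketch that projects two properties out of a nested JSON field. The `source`/`destination`/`expression` key names mirror the parameter list above but are not confirmed; the expression itself is standard JSONata string concatenation:

```hocon
{
  name: "flattenAddress"
  class: "com.kmwllc.lucille.stage.ApplyJSONata"
  source: "address_json"           # e.g. {"city": "Boston", "state": "MA", "zip": "02110"}
  destination: "city_state"
  expression: "city & ', ' & state"
}
```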
Best Practices for AI/ML Stages
Rate Limiting
For API-based stages (OpenAI), implement rate limiting or use multiple API keys to avoid throttling.
Chunk Sizing
For RAG applications, chunks of 500-2000 characters with 10-20% overlap typically work well.
Error Handling
AI/ML stages can fail due to API limits, timeouts, or model errors. Configure appropriate error handling and retries.
Cost Management
Monitor API usage for OpenAI stages. Use cheaper models (text-embedding-3-small) for development and testing.