Pinecone Plugin

Overview

The Pinecone plugin provides the PineconeIndexer, which sends vector embeddings from documents to Pinecone’s vector database. It supports upsert and update operations, namespaces, and metadata storage.

Maven Module: lucille-pinecone
Java Class: com.kmwllc.lucille.pinecone.indexer.PineconeIndexer
Source: PineconeIndexer.java

Installation

Add the plugin dependency to your pom.xml:

<dependency>
  <groupId>com.kmwllc</groupId>
  <artifactId>lucille-pinecone</artifactId>
  <version>${lucille.version}</version>
</dependency>

Configuration

Basic Configuration

indexer {
  type: "pinecone"
  batchSize: 1000  # Maximum for Pinecone
  
  pinecone {
    apiKey: "${PINECONE_API_KEY}"
    index: "my-index"
    defaultEmbeddingField: "embedding"
  }
}

With Namespaces

indexer {
  type: "pinecone"
  batchSize: 1000
  
  pinecone {
    apiKey: "${PINECONE_API_KEY}"
    index: "vectors"
    
    namespaces {
      "products": "product_embedding"
      "users": "user_embedding"
    }
    
    metadataFields: ["title", "category", "price"]
  }
}

Parameters

apiKey

string

required

Pinecone API key for authentication. Best practice: Use environment variable ${PINECONE_API_KEY}

index

string

required

Name of the Pinecone index to write vectors to. Example: "embeddings", "product-vectors"

defaultEmbeddingField

string

Document field containing the vector embeddings when not using namespaces. Required if namespaces is not set. Example: "embedding", "vector"

namespaces

Map<String, String>

Mapping of namespace names to embedding field names. Allows indexing different vector types to different namespaces. Example:

namespaces {
  "products": "product_embedding"
  "categories": "category_embedding"
}

Mutually exclusive with defaultEmbeddingField (one or the other must be set).

metadataFields

string[]

List of document fields to store as metadata with each vector. Metadata can be used for filtering during queries. Example: ["title", "category", "timestamp"]

mode

string

default: "upsert"

Operation mode:

upsert: Insert or replace vectors (recommended)
update: Only update existing vectors

Note: Update mode only modifies embeddings, not metadata.

Features

Upsert Operations

Insert new vectors or replace existing ones:

mode: "upsert"

If vector ID exists, it’s replaced
If vector ID is new, it’s inserted
Includes metadata fields

Update Operations

Modify only the embeddings:

mode: "update"

Updates only the vector values
Does not modify metadata
No error if vector doesn’t exist (Pinecone returns 200 OK)
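The difference between the two modes can be illustrated with an in-memory stand-in for a single namespace. This is a minimal sketch, not the Pinecone client or the indexer's own code: the class UpsertVsUpdateSketch, its nested StoredVector type, and its methods are hypothetical and only model the behavior listed above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of one namespace; it does not call Pinecone.
class UpsertVsUpdateSketch {

  // A stored vector: its values plus string metadata.
  static final class StoredVector {
    final List<Float> values;
    final Map<String, String> metadata;

    StoredVector(List<Float> values, Map<String, String> metadata) {
      this.values = values;
      this.metadata = metadata;
    }
  }

  private final Map<String, StoredVector> store = new HashMap<>();

  // Upsert: insert a new ID, or replace an existing one including its metadata.
  void upsert(String id, List<Float> values, Map<String, String> metadata) {
    store.put(id, new StoredVector(values, metadata));
  }

  // Update: replace only the values of an existing ID and keep its metadata.
  // An unknown ID is silently ignored, mirroring the "no error" behavior above.
  void update(String id, List<Float> values) {
    store.computeIfPresent(id, (key, old) -> new StoredVector(values, old.metadata));
  }

  StoredVector get(String id) {
    return store.get(id);
  }

  public static void main(String[] args) {
    UpsertVsUpdateSketch index = new UpsertVsUpdateSketch();
    index.upsert("doc1", List.of(0.1f, 0.2f), Map.of("title", "First"));
    index.update("doc1", List.of(0.9f, 0.8f)); // values change, metadata stays
    index.update("doc2", List.of(0.5f, 0.5f)); // no such ID, nothing happens
    System.out.println(index.get("doc1").values + " " + index.get("doc1").metadata);
    System.out.println(index.get("doc2"));
  }
}
```

In this model, a document sent in update mode for an ID that was never upserted leaves the store unchanged, which is why update mode suits refreshing embeddings for vectors that are already indexed.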

Namespace Support

Organize vectors into namespaces:

namespaces {
  "products": "product_vector"
  "users": "user_vector"
}

Benefits:

Isolate different vector types
Query specific namespaces
Delete namespace contents independently
Same document can have vectors in multiple namespaces
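To make the namespace-to-field mapping concrete, the sketch below routes one document's embedding fields to their namespaces. It is an illustration under assumptions rather than the indexer's implementation: NamespaceRoutingSketch and route are hypothetical names, the document is modeled as a plain map, and throwing IllegalStateException for a missing field is an assumed stand-in for whatever error the real indexer raises.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps each configured namespace to the embedding
// found in the document field that the namespace is configured to read.
class NamespaceRoutingSketch {

  static Map<String, List<Float>> route(Map<String, String> namespaces,
                                        Map<String, List<Float>> docFields) {
    Map<String, List<Float>> byNamespace = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : namespaces.entrySet()) {
      String namespace = entry.getKey();
      String field = entry.getValue();
      List<Float> embedding = docFields.get(field);
      if (embedding == null) {
        // Assumption: a missing embedding is treated as an error.
        throw new IllegalStateException("Missing embedding field: " + field);
      }
      byNamespace.put(namespace, embedding);
    }
    return byNamespace;
  }

  public static void main(String[] args) {
    Map<String, String> namespaces = new LinkedHashMap<>();
    namespaces.put("products", "product_vector");
    namespaces.put("users", "user_vector");

    Map<String, List<Float>> doc = Map.of(
        "product_vector", List.of(0.1f, 0.2f),
        "user_vector", List.of(0.7f, 0.3f));

    // One document yields one vector per configured namespace.
    System.out.println(route(namespaces, doc));
  }
}
```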

Metadata Storage

Store filterable metadata with vectors:

metadataFields: ["category", "brand", "price", "in_stock"]

Metadata is converted to strings and stored with each vector. Use for:

Filtering query results
Post-retrieval processing
Displaying context

Deletion Support

Delete vectors using marker fields:

indexer {
  deletionMarkerField: "deleted"
  deletionMarkerFieldValue: "true"
}

Important: Pinecone serverless indexes do not support delete-by-metadata. Only delete-by-ID is supported. Deletion happens in all configured namespaces.
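The marker-based flow can be sketched as follows, assuming documents are modeled as string maps with an "id" key. DeletionMarkerSketch and both of its methods are hypothetical names used for illustration; the sketch only shows how marked documents become delete-by-ID requests repeated for every configured namespace, not how the indexer issues them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of marker-based deletion; no Pinecone calls are made.
class DeletionMarkerSketch {

  // Collects the IDs of documents whose marker field equals the configured value.
  static List<String> idsToDelete(List<Map<String, String>> docs,
                                  String markerField, String markerValue) {
    List<String> ids = new ArrayList<>();
    for (Map<String, String> doc : docs) {
      if (markerValue.equals(doc.get(markerField))) {
        ids.add(doc.get("id")); // "id" is an assumed key for the document ID
      }
    }
    return ids;
  }

  // Repeats the same ID list for each namespace, since deletion applies to all of them.
  static Map<String, List<String>> deletesByNamespace(List<String> ids,
                                                      List<String> namespaces) {
    Map<String, List<String>> plan = new LinkedHashMap<>();
    for (String namespace : namespaces) {
      plan.put(namespace, List.copyOf(ids));
    }
    return plan;
  }

  public static void main(String[] args) {
    List<Map<String, String>> docs = List.of(
        Map.of("id", "p1", "deleted", "true"),
        Map.of("id", "p2", "deleted", "false"),
        Map.of("id", "p3"));
    List<String> ids = idsToDelete(docs, "deleted", "true");
    System.out.println(deletesByNamespace(ids, List.of("products", "users")));
  }
}
```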

Batch Size Limits

Pinecone has a maximum batch size of 1000 vectors or 2MB, whichever is reached first. The indexer enforces a maximum batchSize of 1000. Higher-dimensional vectors reach the 2MB limit with fewer vectors per batch.

indexer {
  batchSize: 1000  # Maximum allowed
}
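A rough way to pick a batchSize for a given embedding dimension is to estimate how many vectors fit under the 2MB limit. The sketch below is a back-of-the-envelope estimate, not a Pinecone or Lucille API. It assumes 4 bytes per float and treats 2MB as 2 * 1024 * 1024 bytes. It also ignores IDs, metadata, and request encoding overhead, so real limits will be somewhat lower. BatchSizeEstimator and maxVectorsPerBatch are hypothetical names.

```java
// Hypothetical estimator for how many vectors fit in one request.
class BatchSizeEstimator {
  static final int MAX_VECTORS = 1000;                // vector-count limit from the text above
  static final long MAX_BYTES = 2L * 1024 * 1024;     // assumption: 2MB = 2 MiB
  static final int BYTES_PER_FLOAT = 4;               // assumption: 32-bit floats

  // Returns the smaller of the vector-count limit and the byte-size limit.
  static int maxVectorsPerBatch(int dimension) {
    if (dimension <= 0) {
      throw new IllegalArgumentException("dimension must be positive");
    }
    long bytesPerVector = (long) dimension * BYTES_PER_FLOAT;
    long byBytes = MAX_BYTES / bytesPerVector;
    // Never return less than 1 so that a batch can always hold one vector.
    return (int) Math.max(1, Math.min(MAX_VECTORS, byBytes));
  }

  public static void main(String[] args) {
    for (int dimension : new int[] {384, 1536, 3072}) {
      System.out.println(dimension + " dims -> " + maxVectorsPerBatch(dimension));
    }
  }
}
```

Because the estimate leaves out metadata and encoding overhead, it is safer to set batchSize somewhat below the computed value when documents carry many metadata fields.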

Connection Validation

The indexer validates connectivity during startup:

boolean isStable = PineconeUtils.isClientStable(client, indexName);

Checks:

API key is valid
Index exists
Connection is stable

If validation fails, the pipeline will not start.

Example Configurations

Simple vector indexing

indexer {
  type: "pinecone"
  batchSize: 500
  
  pinecone {
    apiKey: "${PINECONE_API_KEY}"
    index: "documents"
    defaultEmbeddingField: "embedding"
    metadataFields: ["title", "url", "timestamp"]
  }
}

Multi-namespace with different embeddings

indexer {
  type: "pinecone"
  batchSize: 1000
  
  pinecone {
    apiKey: "${PINECONE_API_KEY}"
    index: "hybrid-search"
    
    namespaces {
      "text": "text_embedding"
      "image": "image_embedding"
      "multimodal": "combined_embedding"
    }
    
    metadataFields: ["doc_type", "source", "created_at"]
  }
}

Update-only mode

indexer {
  type: "pinecone"
  batchSize: 1000
  
  pinecone {
    apiKey: "${PINECONE_API_KEY}"
    index: "vectors"
    defaultEmbeddingField: "updated_embedding"
    mode: "update"  # Only update existing vectors
  }
}

With deletion support

indexer {
  type: "pinecone"
  deletionMarkerField: "status"
  deletionMarkerFieldValue: "deleted"
  batchSize: 1000
  
  pinecone {
    apiKey: "${PINECONE_API_KEY}"
    index: "products"
    
    namespaces {
      "active": "embedding"
      "archived": "embedding"
    }
  }
}

Document Requirements

Documents must have embeddings in the specified field. Documents without embeddings will cause errors. Use a Drop stage with conditions to filter out documents without embeddings before indexing.

Example: Filter Documents Without Embeddings

pipeline {
  stages: [
    # Generate embeddings first
    {
      class: "com.kmwllc.lucille.stage.GenerateEmbeddings"
      textField: "content"
      embeddingField: "embedding"
    },
    
    # Drop documents without embeddings
    {
      class: "com.kmwllc.lucille.stage.Drop"
      conditions: [
        {
          operator: "is_null"
          field: "embedding"
        }
      ]
    }
  ]
}

indexer {
  type: "pinecone"
  pinecone {
    apiKey: "${PINECONE_API_KEY}"  # required
    index: "my-index"              # required
    defaultEmbeddingField: "embedding"
  }
}

Metadata Format

Metadata fields are stored as strings:

Struct.newBuilder()
  .putAllFields(fields.stream()
    .filter(entry -> metadataFields.contains(entry.getKey()))
    .collect(Collectors.toUnmodifiableMap(
      entry -> entry.getKey(),
      entry -> Value.newBuilder()
        .setStringValue(entry.getValue().toString())
        .build()
    ))
  )
  .build()

All metadata values are converted to strings using toString().
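The effect of the toString() conversion is easiest to see without the protobuf types. The following sketch uses plain Java maps in place of Struct and Value, so MetadataStringSketch and toStringMetadata are hypothetical names. It shows only that numbers and booleans end up as strings, and that fields outside metadataFields are left out.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java analogue of the metadata conversion shown above.
class MetadataStringSketch {

  // Keeps only the configured fields and converts each value with toString().
  static Map<String, String> toStringMetadata(Map<String, Object> fields,
                                              List<String> metadataFields) {
    Map<String, String> metadata = new TreeMap<>(); // sorted for stable output
    for (String name : metadataFields) {
      Object value = fields.get(name);
      if (value != null) {
        metadata.put(name, value.toString());
      }
    }
    return metadata;
  }

  public static void main(String[] args) {
    Map<String, Object> fields = Map.of(
        "title", "Widget",
        "price", 19.99,
        "in_stock", true,
        "body", "Long text that is not configured as metadata");
    System.out.println(
        toStringMetadata(fields, List.of("title", "price", "in_stock")));
  }
}
```

Since every value is stored as a string, a numeric field such as price is written as the text "19.99", which is worth keeping in mind when writing query filters against it.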

Best Practices

Filter out documents without embeddings

Always use a Drop stage to exclude documents missing embeddings:

{
  class: "com.kmwllc.lucille.stage.Drop"
  conditions: [{operator: "is_null", field: "embedding"}]
}

Use namespaces for organization

Organize vectors by type, tenant, or use case:

Product vectors vs. user vectors
Different embedding models
Multi-tenant isolation
Test vs. production data

Limit metadata fields

Only include metadata that will be used for filtering:

Reduces storage costs
Faster queries
Smaller payloads

Use maximum batch size

Set batchSize: 1000 for best performance:

Fewer API requests
Better throughput
Lower latency

Lower it only for very high-dimensional vectors, where a batch of 1000 would exceed the 2MB request limit.

Handle deletion carefully

Only delete-by-ID is supported (no metadata queries)
Deletions happen across all namespaces
Consider soft deletes (a metadata flag) as an alternative

Troubleshooting

Embedding field is null error

Error while upserting vectors

Cause: Document is missing the embedding field.

Solution:

Add Drop stage to filter out documents without embeddings
Verify embedding generation stage runs successfully
Check field name matches configuration

Batch size exceeds 1000

Maximum batch size for Pinecone is 1000

Solution: Reduce batchSize to 1000 or less:

indexer {
  batchSize: 1000
}

API key authentication failed

Solutions:

Verify API key is correct
Check environment variable is set
Ensure key has write permissions to the index
Confirm Pinecone account is active

Index not found

Solutions:

Create the index in Pinecone console first
Verify index name matches exactly (case-sensitive)
Check dimension matches embedding size
Ensure metric (cosine, euclidean, dotproduct) is appropriate

Upserted count mismatch

Number of upserted vectors does not match requested

Causes:

Network issues
Rate limiting
Pinecone service issues

Solutions:

Check Pinecone status page
Reduce batch size
Implement retry logic
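One common form of retry logic is exponential backoff around the failing operation. The sketch below is generic Java and makes no claim about how the indexer or the Pinecone client handles retries: RetrySketch and withRetry are hypothetical names, and where such a wrapper would be applied in a pipeline is left to the reader.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper with exponential backoff.
class RetrySketch {

  // Runs the action up to maxAttempts times, doubling the delay after each failure.
  // Rethrows the last failure if every attempt fails.
  static <T> T withRetry(Callable<T> action, int maxAttempts, long initialDelayMs)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    long delayMs = initialDelayMs;
    Exception lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        lastFailure = e;
        if (attempt < maxAttempts) {
          Thread.sleep(delayMs); // wait before the next attempt
          delayMs *= 2;
        }
      }
    }
    throw lastFailure;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated operation that fails twice before succeeding.
    String result = withRetry(() -> {
      if (++calls[0] < 3) {
        throw new IllegalStateException("transient failure");
      }
      return "upserted";
    }, 5, 100L);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

Pairing a wrapper like this with a smaller batch size limits how much work is repeated when a request fails.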

Vector ID Strategy

The indexer uses the document ID as the Pinecone vector ID:

String vectorId = doc.getId();

Considerations:

IDs must be unique within each namespace
Same ID in different namespaces = different vectors
Use idOverrideField if you need different IDs

Performance Characteristics

Batch size: Up to 1000 vectors or 2MB per request
Throughput: Depends on Pinecone plan and index type
Latency: Typically 50-200ms per batch
Parallel workers: Safe to run multiple workers
