Overview
ParquetConnector reads Parquet columnar data files from local filesystems, S3, or other Hadoop-compatible storage systems. It’s optimized for processing large datasets efficiently, making it ideal for data lakes, analytics pipelines, and vector search use cases.
Location: com.kmwllc.lucille.parquet.connector.ParquetConnector
ParquetConnector is a plugin connector. Add the lucille-parquet dependency to your project to use it.
Maven Dependency
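For example, a dependency declaration might look like the following sketch (the groupId and version shown are illustrative; match them to your Lucille installation):

```xml
<dependency>
  <!-- groupId and version are placeholders; use the coordinates for your Lucille release -->
  <groupId>com.kmwllc</groupId>
  <artifactId>lucille-parquet</artifactId>
  <version>${lucille.version}</version>
</dependency>
```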
Configuration Parameters
pathToStorage: The path to traverse for .parquet files. Can be a local filesystem path or an S3 URI (e.g., s3://bucket/path/).
idField: Name of a field in the Parquet schema to use as the Document ID. Must exist in all files processed, or an exception will be thrown.
Filesystem URI: URI for the filesystem to use. Examples:
- Local: file:///
- S3: s3a://
S3 access key: AWS S3 access key ID. Required only when accessing S3.
S3 secret key: AWS S3 secret access key. Required only when accessing S3.
limit: Maximum number of Documents to publish. Defaults to no limit (process all rows).
start: Number of rows to skip from the beginning of each Parquet file.
name: The name of the connector instance.
class: Must be com.kmwllc.lucille.parquet.connector.ParquetConnector.
pipeline: The name of the pipeline to send Documents to.
Examples
Local Parquet Files
Read Parquet files from a local directory:
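A minimal local-filesystem configuration might look like the following sketch (the fsUri key name is an assumption; check the parameter reference for your Lucille version):

```hocon
connectors: [
  {
    name: "parquet-local"
    class: "com.kmwllc.lucille.parquet.connector.ParquetConnector"
    pipeline: "pipeline1"
    # Directory traversed recursively for .parquet files
    pathToStorage: "/data/warehouse/events"
    # Column used as the Document ID; must exist in every file
    idField: "id"
    # Assumed key name for the filesystem URI
    fsUri: "file:///"
  }
]
```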
S3 Parquet Files
Read Parquet files from S3:
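An S3 configuration might look like this sketch (the fsUri, s3Key, and s3Secret key names are assumptions; bucket and credential values are illustrative):

```hocon
connectors: [
  {
    name: "parquet-s3"
    class: "com.kmwllc.lucille.parquet.connector.ParquetConnector"
    pipeline: "pipeline1"
    # S3A URI pointing at the bucket prefix to traverse
    pathToStorage: "s3a://my-bucket/data/"
    idField: "id"
    # Assumed key names for the filesystem URI and S3 credentials
    fsUri: "s3a://my-bucket"
    s3Key: ${?AWS_ACCESS_KEY_ID}
    s3Secret: ${?AWS_SECRET_ACCESS_KEY}
  }
]
```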
With Pagination
Process a subset of rows using the start and limit parameters:
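A pagination sketch, using start and limit as described in the parameter reference (path and values are illustrative):

```hocon
connectors: [
  {
    name: "parquet-subset"
    class: "com.kmwllc.lucille.parquet.connector.ParquetConnector"
    pipeline: "pipeline1"
    pathToStorage: "/data/warehouse/events"
    idField: "id"
    # Skip the first 500 rows of each file...
    start: 500
    # ...and publish at most 1000 Documents overall
    limit: 1000
  }
]
```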
Vector Search Ingestion
See the Vector Search Guide for a complete ingestion example.
Behavior
File Discovery
ParquetConnector recursively traverses the pathToStorage directory and processes all .parquet files found.
Schema Mapping
Each column in the Parquet file becomes a field in the Lucille Document:
- Column names map directly to field names
- Parquet data types are converted to appropriate Java types
- Nested structures are preserved
ID Field Requirement
The idField parameter must reference a column that exists in all Parquet files being processed. If a file lacks this column, the connector will throw an exception.
Performance
ParquetConnector leverages Parquet’s columnar format for efficient data access:
- Columnar storage: Only the columns needed by your pipeline are read
- Compression: Parquet’s built-in compression reduces I/O
- Predicate pushdown: Filter rows efficiently at the storage layer
S3 Configuration
When accessing S3, the connector uses Hadoop’s S3A filesystem:
- Path-style access
- INT96 timestamp handling for Avro compatibility
Common Patterns
Processing Large Datasets
For datasets too large to process in one run, use multiple connectors with pagination:
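One way to sketch this in config (connector names, path, and batch sizes are illustrative; start and limit behave as described in the parameter reference):

```hocon
connectors: [
  # First run: publish up to 1,000,000 Documents from the start of each file
  {
    name: "parquet-batch-1"
    class: "com.kmwllc.lucille.parquet.connector.ParquetConnector"
    pipeline: "pipeline1"
    pathToStorage: "/data/warehouse/events"
    idField: "id"
    limit: 1000000
  },
  # Second run: skip the first 1,000,000 rows of each file, then publish the next batch
  {
    name: "parquet-batch-2"
    class: "com.kmwllc.lucille.parquet.connector.ParquetConnector"
    pipeline: "pipeline1"
    pathToStorage: "/data/warehouse/events"
    idField: "id"
    start: 1000000
    limit: 1000000
  }
]
```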
Data Lake Ingestion
To ingest a data lake partition, point pathToStorage at the partition directory; all Parquet files it contains will be processed.
Troubleshooting
Schema Evolution
If Parquet files have different schemas, ensure the idField exists in all files. Consider these schema evolution best practices:
- Keep ID field consistent across all files
- Add new columns with default values
- Avoid removing or renaming the ID column
Memory Usage
Large Parquet files can consume significant memory. Monitor heap usage and adjust:
- JVM heap size (-Xmx)
- Use limit to process in batches
- Enable Parquet page-level filtering if supported
Next Steps
Vector Search Guide
Complete tutorial using ParquetConnector for vector embeddings
FileConnector
Process other file formats with FileConnector