Overview
ParquetConnector reads Parquet columnar data files from local filesystems, S3, or other Hadoop-compatible storage systems. It’s optimized for processing large datasets efficiently, making it ideal for data lakes, analytics pipelines, and vector search use cases.
Location: com.kmwllc.lucille.parquet.connector.ParquetConnector
ParquetConnector is a plugin connector. Add the lucille-parquet dependency to your project to use it.
Maven Dependency
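For example, a dependency declaration might look like the following sketch (the groupId and version shown are illustrative; match them to your Lucille installation):

```xml
<dependency>
  <!-- groupId and version are placeholders; use the coordinates for your Lucille release -->
  <groupId>com.kmwllc</groupId>
  <artifactId>lucille-parquet</artifactId>
  <version>${lucille.version}</version>
</dependency>
```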
Configuration Parameters
pathToStorage: The path to traverse for .parquet files. Can be a local filesystem path or an S3 URI (e.g., s3://bucket/path/).
idField: Name of a field in the Parquet schema to use as the Document ID. Must exist in all files processed, or an exception will be thrown.
Filesystem URI: URI for the filesystem to use. Examples:
- Local: file:///
- S3: s3a://
S3 access key: AWS S3 access key ID. Required only when accessing S3.
S3 secret key: AWS S3 secret access key. Required only when accessing S3.
limit: Maximum number of Documents to publish. Defaults to no limit (process all rows).
start: Number of rows to skip from the beginning of each Parquet file.
name: The name of the connector instance.
class: Must be com.kmwllc.lucille.parquet.connector.ParquetConnector.
pipeline: The name of the pipeline to send Documents to.
Examples
Local Parquet Files
Read Parquet files from a local directory:
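A minimal local-filesystem configuration might look like the following sketch (the fsUri key name is an assumption; check the parameter reference for your Lucille version):

```hocon
connectors: [
  {
    name: "parquet-local"
    class: "com.kmwllc.lucille.parquet.connector.ParquetConnector"
    pipeline: "pipeline1"
    # Directory traversed recursively for .parquet files
    pathToStorage: "/data/warehouse/events"
    # Column used as the Document ID; must exist in every file
    idField: "id"
    # Assumed key name for the filesystem URI
    fsUri: "file:///"
  }
]
```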
S3 Parquet Files
Read Parquet files from S3:
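An S3 configuration might look like this sketch (the fsUri, s3Key, and s3Secret key names are assumptions; bucket and credential values are illustrative):

```hocon
connectors: [
  {
    name: "parquet-s3"
    class: "com.kmwllc.lucille.parquet.connector.ParquetConnector"
    pipeline: "pipeline1"
    # S3A URI pointing at the bucket prefix to traverse
    pathToStorage: "s3a://my-bucket/data/"
    idField: "id"
    # Assumed key names for the filesystem URI and S3 credentials
    fsUri: "s3a://my-bucket"
    s3Key: ${?AWS_ACCESS_KEY_ID}
    s3Secret: ${?AWS_SECRET_ACCESS_KEY}
  }
]
```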
With Pagination
Process a subset of rows using the start and limit parameters:
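A pagination sketch, using start and limit as described in the parameter reference (path and values are illustrative):

```hocon
connectors: [
  {
    name: "parquet-subset"
    class: "com.kmwllc.lucille.parquet.connector.ParquetConnector"
    pipeline: "pipeline1"
    pathToStorage: "/data/warehouse/events"
    idField: "id"
    # Skip the first 500 rows of each file...
    start: 500
    # ...and publish at most 1000 Documents overall
    limit: 1000
  }
]
```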
Vector Search Ingestion
See the Vector Search Guide for a complete ingestion example.
Behavior
File Discovery
ParquetConnector recursively traverses the pathToStorage directory and processes all .parquet files found.
Schema Mapping
Each column in the Parquet file becomes a field in the Lucille Document:
- Column names map directly to field names
- Parquet data types are converted to appropriate Java types
- Nested structures are preserved
ID Field Requirement
The idField parameter must reference a column that exists in all Parquet files being processed. If a file lacks this column, the connector will throw an exception.
Performance
ParquetConnector leverages Parquet’s columnar format for efficient data access:
- Columnar storage: Only the columns needed by your pipeline are read
- Compression: Parquet’s built-in compression reduces I/O
- Predicate pushdown: Filter rows efficiently at the storage layer
S3 Configuration
When accessing S3, the connector uses Hadoop’s S3A filesystem:
- Path-style access
- INT96 timestamp handling for Avro compatibility
Common Patterns
Processing Large Datasets
For datasets too large to process in one run, use multiple connectors with pagination:
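One way to sketch this in config (connector names, path, and batch sizes are illustrative; start and limit behave as described in the parameter reference):

```hocon
connectors: [
  # First run: publish up to 1,000,000 Documents from the start of each file
  {
    name: "parquet-batch-1"
    class: "com.kmwllc.lucille.parquet.connector.ParquetConnector"
    pipeline: "pipeline1"
    pathToStorage: "/data/warehouse/events"
    idField: "id"
    limit: 1000000
  },
  # Second run: skip the first 1,000,000 rows of each file, then publish the next batch
  {
    name: "parquet-batch-2"
    class: "com.kmwllc.lucille.parquet.connector.ParquetConnector"
    pipeline: "pipeline1"
    pathToStorage: "/data/warehouse/events"
    idField: "id"
    start: 1000000
    limit: 1000000
  }
]
```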
Data Lake Ingestion
To ingest a data lake partition, point pathToStorage at the partition directory; all Parquet files it contains will be processed.
Troubleshooting
Schema Evolution
If Parquet files have different schemas, ensure the idField exists in all files. Consider these schema evolution best practices:
- Keep ID field consistent across all files
- Add new columns with default values
- Avoid removing or renaming the ID column
Memory Usage
Large Parquet files can consume significant memory. Monitor heap usage and adjust:
- JVM heap size (-Xmx)
- Use limit to process in batches
- Enable Parquet page-level filtering if supported
Next Steps
Vector Search Guide
Complete tutorial using ParquetConnector for vector embeddings
FileConnector
Process other file formats with FileConnector