Overview
The FileConnector traverses local and cloud storage from one or more roots and publishes a Document for each file encountered. It supports include/exclude regex filters, recency cutoffs, optional content fetching, archive/compressed-file handling, file moves after processing, and optional JDBC-backed state to avoid republishing recently handled files.
Class: com.kmwllc.lucille.connector.FileConnector
Extends: AbstractConnector
Key Features
- Traverse local file systems, AWS S3, Google Cloud Storage, and Azure Blob Storage
- Regex-based file filtering (include/exclude patterns)
- Recency filters based on modification time and last published time
- Archive file handling (ZIP, TAR, etc.)
- Compressed file handling (GZIP, BZIP2, etc.)
- Move files after processing or on error
- JDBC-backed state management to prevent duplicate processing
- Per-file-type handlers (CSV, JSON, XML)
Class Signature
public class FileConnector extends AbstractConnector
Configuration Parameters
Required Parameters
Paths or URIs to traverse. Supports local paths and cloud storage URIs.
Cloud URI Formats:
- S3: s3://bucket-name/path/to/folder/ (must be percent-encoded for spaces/special characters)
- GCP: gs://bucket-name/path/to/folder/
- Azure: https://account.blob.core.windows.net/container/path/
- Local: /path/to/local/folder or file:///path/to/local/folder
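As a sketch, the traversal roots might be configured like this. The list parameter name pathsToTraverse is an assumption (the section above does not name it); check the Lucille reference for the exact key:

```hocon
connectors: [
  {
    class: "com.kmwllc.lucille.connector.FileConnector"
    name: "files"
    # Mix of local and cloud roots; cloud URIs must be percent-encoded
    pathsToTraverse: ["/data/inbox", "s3://my-bucket/reports/"]
  }
]
```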
Filter Options
- includes: List of regex patterns for files to include. Only files matching at least one pattern are processed. Example: [".*\\.pdf$", ".*\\.docx$"] processes only PDF and DOCX files.
- excludes: List of regex patterns for files to exclude. Files matching any pattern are skipped. Example: [".*\\.tmp$", ".*/temp/.*"] skips temporary files and files in temp folders.
- lastModifiedCutoff: Duration string; only files modified within this period before the current time are processed. Uses HOCON duration format. Examples: "1h", "2d", "30m", "7d"
- lastPublishedCutoff: Duration string; only files not published within this period are processed. Examples: "1h", "2d", "30m". This setting has no effect unless state is configured; it prevents republishing files that were recently processed.
File Handling Options
Whether to fetch file content during traversal. Set to false for metadata-only indexing.
Whether to process archive files (ZIP, TAR, JAR, etc.) and extract their contents.When enabled, modification/publish cutoffs apply to both the container and its entries.
Whether to process compressed files (GZIP, BZIP2, XZ, etc.) and decompress their contents.
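A sketch of these toggles as they might appear in a fileOptions block; getFileContent and handleArchivedFiles appear elsewhere on this page, while handleCompressedFiles and the block name itself are assumptions:

```hocon
fileOptions: {
  getFileContent: true        # set false for metadata-only indexing
  handleArchivedFiles: true   # expand ZIP/TAR/JAR entries into separate Documents
  handleCompressedFiles: true # decompress GZIP/BZIP2/XZ (key name assumed)
}
```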
- URI to move files to after successful processing. Only works with a single input path. Example: "s3://processed-bucket/completed/"
- URI to move files to if processing fails. Only works with a single input path. Example: "/path/to/error/folder/"
State Management
State tracking allows the connector to remember which files have been published and when, preventing duplicate processing across runs.
- JDBC driver class for state storage. Common values: org.h2.Driver, com.mysql.cj.jdbc.Driver, org.postgresql.Driver
- JDBC connection string. If omitted, an embedded H2 database is created at ./state/{CONNECTOR_NAME}. Examples:
  - H2: jdbc:h2:./state/my-connector
  - MySQL: jdbc:mysql://localhost:3306/lucille
  - PostgreSQL: jdbc:postgresql://localhost:5432/lucille
- Database username for state storage.
- Database password for state storage.
- Table name for storing state. Defaults to the connector name.
- performDeletions: Whether to delete rows for files that have been removed from storage.
- Maximum length for stored file paths when Lucille creates the state table.
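A sketch of a state block; the driver class, connection string format, and performDeletions come from the descriptions above, but the other key names (connectionString, jdbcUser, jdbcPassword, tableName) are assumptions to be checked against the Lucille reference:

```hocon
state: {
  driver: "org.postgresql.Driver"
  connectionString: "jdbc:postgresql://localhost:5432/lucille"
  jdbcUser: "lucille"
  jdbcPassword: "changeme"
  tableName: "file_connector_state"   # defaults to the connector name if omitted
  performDeletions: true              # drop rows for files deleted from storage
}
```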
Cloud Storage Options
Options are grouped per provider (AWS S3, Google Cloud, Azure).
AWS S3:
- accessKeyId: AWS access key ID. Omit to use the default credential provider chain.
- secretAccessKey: AWS secret access key. Omit to use the default credential provider chain. Both accessKeyId and secretAccessKey must be specified together or omitted together.
- AWS region for S3. Example: "us-east-1"
- maxNumOfPages: Maximum number of file references to hold in memory at once.
File Handlers
Per-file-type FileHandler configuration for CSV, JSON, and XML files. Supply a class to override the default handler; otherwise, built-in handlers are used.
Document Fields
Each published Document includes these fields:
- Full path to the file (URI format)
- Last modification timestamp of the file
- Creation timestamp of the file (may be null for some storage types)
- Size of the file in bytes
- Raw file content (only if fileOptions.getFileContent is true)
Configuration Examples
Basic Local File Traversal
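A minimal sketch, assuming the pathsToTraverse key and block nesting shown (verify names against your Lucille version):

```hocon
connectors: [
  {
    class: "com.kmwllc.lucille.connector.FileConnector"
    name: "local-files"
    pathsToTraverse: ["/data/docs"]
    filterOptions: { includes: [".*\\.pdf$"] }  # PDFs only
    fileOptions: { getFileContent: true }
  }
]
```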
S3 with State Management
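A sketch combining an S3 root with JDBC-backed state so recently published files are skipped; key names not cited elsewhere on this page (connectionString, the s3 block, region) are assumptions:

```hocon
connectors: [
  {
    class: "com.kmwllc.lucille.connector.FileConnector"
    name: "s3-files"
    pathsToTraverse: ["s3://my-bucket/reports/"]
    s3: { region: "us-east-1" }   # credentials via the default provider chain
    filterOptions: { lastPublishedCutoff: "1d" }
    state: {
      driver: "org.h2.Driver"
      connectionString: "jdbc:h2:./state/s3-files"
    }
  }
]
```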
Multiple Storage Sources
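Multiple roots across providers can be listed together; a sketch (pathsToTraverse is an assumed key name):

```hocon
pathsToTraverse: [
  "/data/local-docs",
  "s3://my-bucket/reports/",
  "gs://my-gcs-bucket/archive/",
  "https://myaccount.blob.core.windows.net/container/docs/"
]
```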
Archive Processing with File Moves
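A sketch enabling archive expansion plus post-processing moves; note the page states that move options only work with a single input path. The moveToAfterProcessing and moveToErrorFolder key names are assumptions:

```hocon
connectors: [
  {
    class: "com.kmwllc.lucille.connector.FileConnector"
    name: "archive-files"
    pathsToTraverse: ["/data/drop"]   # single root: required for move options
    fileOptions: {
      handleArchivedFiles: true
      moveToAfterProcessing: "/data/processed/"  # key name assumed
      moveToErrorFolder: "/data/errors/"         # key name assumed
    }
  }
]
```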
Storage Client Architecture
The FileConnector uses pluggable StorageClient implementations:
StorageClient Interface
Built-in Storage Clients
LocalStorageClient
Traverses local file systems using Files.walkFileTree(). URI Scheme: file (or no scheme). No configuration required.
S3StorageClient
Accesses AWS S3 using the AWS SDK v2. URI Scheme: s3. Configuration: see the S3 options above.
GoogleStorageClient
Accesses Google Cloud Storage using the GCP client library. URI Scheme: gs. Configuration: see the GCP options above.
AzureStorageClient
Accesses Azure Blob Storage using the Azure SDK. URI Scheme: https (with a blob.core.windows.net authority). Configuration: see the Azure options above.
Archive File Handling
When handleArchivedFiles is enabled, the connector:
- Detects archive files (ZIP, TAR, JAR, etc.)
- Extracts entries from the archive
- Publishes a separate Document for each entry
- Uses the separator ! in file paths: archive.zip!internal/file.txt
Modification and publish cutoffs apply to both the archive container and individual entries.
State Management Details
When state is configured:
- The connector creates a table (default name: the connector name) with columns for file path and last published timestamp
- Before publishing, it checks whether the file was recently published (based on lastPublishedCutoff)
- After publishing, it updates the timestamp for that file path
- Files that are moved or renamed are always republished regardless of state
- If performDeletions is true, rows for deleted files are removed from the state table
Performance Considerations
- Use maxNumOfPages to control memory usage when traversing large cloud storage buckets
- Set getFileContent: false if you only need file metadata
- Use filterOptions.includes and filterOptions.excludes to reduce the number of files processed
- Consider using lastModifiedCutoff for incremental ingestion
- State management adds overhead but prevents duplicate processing
Next Steps
- DatabaseConnector: Query relational databases via JDBC
- Connectors Overview: Learn about the Connector lifecycle and interface