Connectors read data from various sources and emit Documents to be processed by pipelines. Each connector is executed in sequence as part of a Lucille run.
Connector Structure
Connectors are defined as a list in the configuration file:
connectors: [
{
name: "connector1"
class: "com.kmwllc.lucille.connector.FileConnector"
pipeline: "pipeline1"
# connector-specific settings...
},
{
name: "connector2"
class: "com.kmwllc.lucille.connector.CSVConnector"
pipeline: "pipeline2"
# connector-specific settings...
}
]
Common Parameters
All connectors support these base parameters:
name
Name to assign to this connector for logging and console output
class
Fully qualified class name of the connector implementation
pipeline
Name of the pipeline that will process this connector's output
FileConnector
Traverses local and cloud storage (S3, GCP, Azure) and publishes a Document for each file encountered.
Basic Configuration
connectors: [
{
name: "file-ingest"
class: "com.kmwllc.lucille.connector.FileConnector"
pipeline: "file-processing"
paths: ["/path/to/files", "s3://my-bucket/prefix"]
}
]
Parameters
paths
Paths or URIs to traverse. Supports local paths and cloud storage URIs. Note: S3 URIs must be percent-encoded; use s3://test/folder%20with%20spaces for paths with special characters.
Filter Options
Control which files are processed:
filterOptions.includes
Regex patterns to include files:
filterOptions: {
includes: [".*\\.pdf$", ".*\\.docx$"]
}
filterOptions.excludes
Regex patterns to exclude files:
filterOptions: {
excludes: [".*\\.DS_Store$", ".*\\.tmp$"]
}
filterOptions.lastModifiedCutoff
Include only files modified within this period:
filterOptions: {
lastModifiedCutoff: "3d" # Last 3 days
}
filterOptions.lastPublishedCutoff
Include only files not published within this period. Requires state configuration.
filterOptions: {
lastPublishedCutoff: "6h" # Not published in last 6 hours
}
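The filter options above can be combined in a single filterOptions block; a minimal sketch, assuming the options compose as shown (patterns and cutoff are illustrative):

```
filterOptions: {
  includes: [".*\\.pdf$"]
  excludes: [".*\\.tmp$"]
  lastModifiedCutoff: "7d"
}
```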
File Options
fileOptions.getFileContent
Fetch file content during traversal
fileOptions.handleArchivedFiles
Process files inside archive containers (zip, tar, etc.)
fileOptions.handleCompressedFiles
Process compressed files (gzip, etc.)
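The three boolean options above can appear together; a minimal sketch (the values shown are illustrative, not defaults):

```
fileOptions: {
  getFileContent: true
  handleArchivedFiles: true
  handleCompressedFiles: true
}
```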
fileOptions.moveToAfterProcessing
URI to move files to after successful processing. Only works with a single input path.
fileOptions: {
moveToAfterProcessing: "/path/to/processed/"
}
fileOptions.moveToErrorFolder
URI to move files to if processing fails. Only works with a single input path.
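A sketch mirroring the moveToAfterProcessing example above; the path is illustrative:

```
fileOptions: {
  moveToErrorFolder: "/path/to/errors/"
}
```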
State Management
Track file publish times to avoid reprocessing:
Embedded Database
Derby Database
External Database
state {
# Uses H2 embedded database by default
driver: "org.h2.Driver"
connectionString: "jdbc:h2:./state/my-connector"
jdbcUser: ""
jdbcPassword: ""
tableName: "file_state"
performDeletions: true
pathLength: 200
}
state {
driver: "org.apache.derby.iapi.jdbc.AutoloadedDriver"
connectionString: "jdbc:derby:lucille_state;"
jdbcUser: ""
jdbcPassword: ""
# Disable automatic deletion tracking
performDeletions: false
}
state {
driver: "org.postgresql.Driver"
connectionString: ${DATABASE_URL}
jdbcUser: ${DB_USER}
jdbcPassword: ${DB_PASSWORD}
tableName: "lucille_file_state"
}
state.driver
JDBC driver class. Defaults to "org.h2.Driver".
state.connectionString
JDBC connection string. Defaults to jdbc:h2:./state/{CONNECTOR_NAME} if omitted.
state.tableName
Table name for state tracking. Defaults to the connector name.
state.performDeletions
Delete rows for files removed from storage.
state.pathLength
Maximum length for stored file paths when Lucille creates the table.
Cloud Storage Configuration
AWS S3
Azure Blob
Google Cloud
s3 {
# Use default credentials if not specified
accessKeyId: ${?AWS_ACCESS_KEY_ID}
secretAccessKey: ${?AWS_SECRET_ACCESS_KEY}
region: ${?AWS_DEFAULT_REGION}
# Max file references in memory
maxNumOfPages: 100
}
s3.accessKeyId
AWS access key ID. Omit to use default credentials.
s3.secretAccessKey
AWS secret access key. Omit to use default credentials.
s3.maxNumOfPages
Maximum number of file references to hold in memory
# Option 1: Connection String
azure {
connectionString: ${?AZURE_CONNECTION_STRING}
maxNumOfPages: 100
}
# Option 2: Account Name and Key
azure {
accountName: ${?AZURE_ACCOUNT_NAME}
accountKey: ${?AZURE_ACCOUNT_KEY}
maxNumOfPages: 100
}
azure.connectionString
Azure connection string (alternative to accountName/accountKey)
azure.maxNumOfPages
Maximum number of file references to hold in memory
gcp {
pathToServiceKey: ${?GCP_SERVICE_KEY_PATH}
maxNumOfPages: 100
}
gcp.pathToServiceKey
Path to the Google Cloud service key JSON file
gcp.maxNumOfPages
Maximum number of file references to hold in memory
File Handlers
Configure custom handling for specific file types:
fileOptions: {
csv {
docIdPrefix: "csvHandled-"
filenameField: "file_name"
}
}
Map of file type to handler configuration. Supports csv, json, and xml. Supply a class property to override the default handler.
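As a sketch, overriding the default JSON handler might look like the following, where com.example.CustomJsonFileHandler is a hypothetical class name:

```
fileOptions: {
  json {
    class: "com.example.CustomJsonFileHandler"
  }
}
```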
Complete Example
Basic File Connector
S3 with State Tracking
CSV File Handler
connectors: [
{
name: "local-files"
class: "com.kmwllc.lucille.connector.FileConnector"
pipeline: "file-pipeline"
paths: ["/data/documents"]
filterOptions: {
excludes: [".*\\.DS_Store$"]
lastModifiedCutoff: "7d"
}
}
]
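The remaining tabs can be sketched as follows; bucket names, paths, cutoffs, and pipeline names are illustrative:

```
# S3 with State Tracking
connectors: [
  {
    name: "s3-files"
    class: "com.kmwllc.lucille.connector.FileConnector"
    pipeline: "file-pipeline"
    paths: ["s3://my-bucket/data/"]
    s3 {
      accessKeyId: ${?AWS_ACCESS_KEY_ID}
      secretAccessKey: ${?AWS_SECRET_ACCESS_KEY}
      region: ${?AWS_DEFAULT_REGION}
    }
    filterOptions: {
      lastPublishedCutoff: "6h"
    }
    state {
      connectionString: "jdbc:h2:./state/s3-files"
    }
  }
]
```

```
# CSV File Handler
connectors: [
  {
    name: "csv-files"
    class: "com.kmwllc.lucille.connector.FileConnector"
    pipeline: "csv-pipeline"
    paths: ["/data/csv"]
    filterOptions: {
      includes: [".*\\.csv$"]
    }
    fileOptions: {
      csv {
        docIdPrefix: "csv-"
        filenameField: "file_name"
      }
    }
  }
]
```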
CSVConnector (Deprecated)
CSVConnector is deprecated. Use FileConnector with CSVFileHandler instead.
Produces documents from a CSV file.
connectors: [
{
name: "csv-legacy"
class: "com.kmwllc.lucille.connector.CSVConnector"
pipeline: "csv-pipeline"
path: "/path/to/file.csv"
# Optional parameters
lineNumberField: "row_number"
filenameField: "filename"
filePathField: "filepath"
idField: "doc_id"
separatorChar: ","
useTabs: false
interpretQuotes: true
lowercaseFields: true
# Move after processing
moveToAfterProcessing: "/processed/"
moveToErrorFolder: "/errors/"
}
]
SequenceConnector
Generates a sequence of numbered documents for testing:
connectors: [
{
name: "test-sequence"
class: "com.kmwllc.lucille.connector.SequenceConnector"
pipeline: "test-pipeline"
numDocs: 1000
}
]
JSONConnector
Reads documents from JSON files:
connectors: [
{
name: "json-ingest"
class: "com.kmwllc.lucille.connector.JSONConnector"
pipeline: "json-pipeline"
path: "/path/to/data.json"
}
]
SolrConnector
Queries Solr and creates documents from results:
connectors: [
{
name: "solr-source"
class: "com.kmwllc.lucille.connector.SolrConnector"
pipeline: "migration-pipeline"
solrUrl: "http://localhost:8983/solr/source-collection"
query: "*:*"
rows: 100
}
]
KafkaConnector
Consumes documents from a Kafka topic:
connectors: [
{
name: "kafka-source"
class: "com.kmwllc.lucille.connector.KafkaConnector"
pipeline: "streaming-pipeline"
topic: "input-topic"
}
]
Next Steps
Build Pipelines Create processing stages for documents
File Handlers Learn about specialized file processing