Lucille uses Typesafe Config to interpret configuration files written in HOCON (Human-Optimized Config Object Notation), a superset of JSON that provides a more readable and flexible syntax.

HOCON Format

HOCON provides several advantages over plain JSON:
  • Comments: Use # or // for comments
  • No quotes required: Keys and most string values don’t need quotes
  • Trailing commas: Allowed in lists and objects
  • Environment variable substitution: Reference environment variables with ${?ENV_VAR}
  • Value concatenation: Extend or merge configuration blocks
  • Include directives: Reference other configuration files

Basic Syntax Example

# This is a comment
connectors: [
  {
    name: "connector1"
    class: "com.kmwllc.lucille.connector.FileConnector"
    pipeline: "pipeline1"
    paths: ["/path/to/files"]
  }
]
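The same connector can be written in a few equivalent ways, which helps when reading configs from different authors. The fragment below is a sketch of standard HOCON syntax rather than a Lucille-specific recommendation; the worker.maxRetries key is borrowed from the Best Practices section purely for illustration.

```hocon
# '=' and ':' are interchangeable key/value separators
connectors = [
  {
    name = connector1            # unquoted string value
    class = "com.kmwllc.lucille.connector.FileConnector"
    pipeline = pipeline1
    paths = ["/path/to/files",]  # a single trailing comma is allowed
  },
]

# A dotted key is shorthand for a nested object, so this line
# is equivalent to: worker { maxRetries = 2 }
worker.maxRetries = 2
```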

Configuration Structure

A Lucille configuration file consists of four main sections:
  • Connectors: Define data sources and how documents are ingested
  • Pipelines: Configure processing stages for document transformation
  • Indexers: Set up search engine destinations and indexing behavior
  • Kafka: Configure distributed mode with Kafka messaging

Environment Variable Substitution

Use ${?ENV_VAR} to reference an environment variable that may or may not be set:
s3 {
  accessKeyId: ${?AWS_ACCESS_KEY_ID}
  secretAccessKey: ${?AWS_SECRET_ACCESS_KEY}
  defaultRegion: ${?AWS_DEFAULT_REGION}
}

opensearch {
  url: "https://localhost:9200"
  url: ${?OPENSEARCH_URL}  # Override with env var if present
  index: "my-index"
  index: ${?OPENSEARCH_INDEX}
}
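The ? makes the substitution optional: if the environment variable isn’t set, the assignment is ignored, so any value set earlier for the same key (such as the url and index defaults above) is kept. If there is no earlier value, as in the s3 block, the key is simply left undefined.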
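To make the difference between the substitution forms concrete, here is a small sketch using standard HOCON behavior. OPENSEARCH_HOST and the exampleUrl key are hypothetical names used only for illustration and are not part of Lucille's configuration.

```hocon
opensearch {
  # Required: resolving the config fails if OPENSEARCH_URL is not set
  url: ${OPENSEARCH_URL}

  # Optional with a default: the second line is skipped when the variable is unset
  index: "my-index"
  index: ${?OPENSEARCH_INDEX}
}

# Substitutions are not expanded inside quoted strings, so "${OPENSEARCH_HOST}"
# would be taken literally. Build strings by placing quoted parts next to the
# substitution instead (hypothetical key and variable):
exampleUrl: "https://"${OPENSEARCH_HOST}":9200"
```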

Configuration Merging

HOCON allows you to declare configuration blocks and extend them later:
# Declare initial pipelines
pipelines: [
  {name: "pipeline1", stages: [{class: "com.kmwllc.lucille.stage.CopyFields"}]}
]

# Add more pipelines by appending to the existing list
pipelines: ${pipelines} [
  {name: "pipeline2", stages: [{class: "com.kmwllc.lucille.stage.CreateChildrenStage"}]}
]
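Two related HOCON behaviors are worth knowing alongside the self-referential form above. This is a sketch of standard HOCON semantics; the pipeline3 name is hypothetical, and the worker keys are borrowed from the Best Practices section.

```hocon
# '+=' appends a single element to an existing list
pipelines += {name: "pipeline3", stages: []}

# Repeated object keys are merged field by field rather than replaced
worker { maxRetries: 2 }
worker { maxProcessingSecs: 1200 }
# Result: worker { maxRetries: 2, maxProcessingSecs: 1200 }
```

Lists are never merged element by element: assigning a plain list to a key a second time replaces the first list, which is why the append forms are needed.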

Including External Files

You can reference other configuration files to organize complex configurations:
include "base-config.conf"
include "connectors/file-connector.conf"
include "pipelines/main-pipeline.conf"
See file-to-file-example.conf in the source for more examples.
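Order matters when combining includes with local settings. The sketch below shows standard HOCON include behavior, reusing the file names above; the worker.maxRetries override is an illustrative assumption.

```hocon
# Keys from the included file are merged in at this point
include "base-config.conf"

# Assignments that come after the include override its values
worker.maxRetries: 5

# A plain include is silently skipped if the file is missing;
# wrapping it in required() turns a missing file into an error
include required("pipelines/main-pipeline.conf")
```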

Duration Formats

Many configuration options accept duration strings in HOCON format:
  • 3s - 3 seconds
  • 5m - 5 minutes
  • 1h - 1 hour
  • 2d - 2 days
  • 6h - 6 hours
filterOptions: {
  # Only files modified in last 3 days
  lastModifiedCutoff: "3d"
  
  # Files not published in last 6 hours
  lastPublishedCutoff: "6h"
}
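HOCON's duration syntax also accepts longer unit names. Assuming these options are parsed with Typesafe Config's standard duration handling, the following sketch should be equivalent to the example above:

```hocon
filterOptions: {
  lastModifiedCutoff: "3 days"     # same duration as "3d"
  lastPublishedCutoff: "6 hours"   # same duration as "6h"
}
# Short units:  ns, us, ms, s, m, h, d
# Long units:   nanoseconds, microseconds, milliseconds, seconds, minutes, hours, days
```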

Configuration Validation

Lucille validates all configuration using the Spec system. Each component (connector, stage, indexer) declares a public static Spec SPEC that defines:
  • Required and optional parameters
  • Parameter types (string, boolean, number, list, nested config)
  • Validation rules
Validation errors reference the component’s name property for easy debugging.

Validation Example

The validation-example.conf file shows intentional configuration errors:
pipelines: [
  {
    name: "pipeline1",
    stages: [
      {
        # Invalid class - will fail validation
        class: "com.kmwllc.lucille.stage.InvalidStage"
      },
      {
        class: "com.kmwllc.lucille.stage.Length",
        fieldMapping {
          a: "a_length"
        }
        # Invalid property not in the Spec
        invalidProperty: true
      }
    ]
  }
]
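For comparison, a version of the same pipeline with the two flagged problems removed might look like the sketch below. It assumes those are the only errors in the file; the stage name is hypothetical and is included so that any validation message can point at a recognizable component.

```hocon
pipelines: [
  {
    name: "pipeline1",
    stages: [
      {
        name: "compute-length"   # hypothetical stage name
        class: "com.kmwllc.lucille.stage.Length",
        fieldMapping {
          a: "a_length"
        }
      }
    ]
  }
]
```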

ConfigUtils Utility

Lucille provides ConfigUtils for working with configurations:
getOrDefault (method)
Get a configuration value with a fallback default:
String value = ConfigUtils.getOrDefault(config, "setting", "defaultValue");
createHeaderArray (method)
Create HTTP headers from a configuration block:
Header[] headers = ConfigUtils.createHeaderArray(config, "headers");

Best Practices

Always provide descriptive name properties for connectors, pipelines, and stages. These names appear in logs and error messages.
connectors: [
  {
    name: "sharepoint-docs-connector"  # Good
    # name: "connector1"  # Avoid generic names
  }
]
Never hardcode credentials. Use environment variables:
# Bad - hardcoded credentials
opensearch {
  url: "https://admin:password@localhost:9200"
}

# Good - environment variables
opensearch {
  # No "?": resolution fails if OPENSEARCH_URL is unset, so a missing value is caught early
  url: ${OPENSEARCH_URL}
}
Split large configurations into logical files:
include "connectors.conf"
include "pipelines.conf"
include "indexers.conf"
Document why specific settings are used:
worker {
  # Increased timeout for large document processing
  maxProcessingSecs: 1200
  
  # Limit retries to prevent infinite loops on malformed data
  maxRetries: 2
}
When building complex pipelines, test with minimal configurations first, then add complexity:
  1. Start with one connector and an empty pipeline
  2. Add stages one at a time
  3. Enable indexing last
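A starting point for step 1 could look like the sketch below. It reuses the FileConnector settings from the Basic Syntax Example; the connector and pipeline names are hypothetical, and whether Lucille needs further sections for a given run mode is not covered here.

```hocon
connectors: [
  {
    name: "file-connector-smoke-test"   # hypothetical name
    class: "com.kmwllc.lucille.connector.FileConnector"
    pipeline: "smoke-test-pipeline"
    paths: ["/path/to/files"]
  }
]

pipelines: [
  # Step 1: no stages yet; add them one at a time in later runs
  {name: "smoke-test-pipeline", stages: []}
]
```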

IDE Support

For better HOCON editing in IntelliJ IDEA, install the HOCON plugin for syntax highlighting and formatting.

Next Steps

  • Configure Connectors: Set up data sources to ingest documents
  • Build Pipelines: Create processing stages for document transformation
  • Set Up Indexers: Configure search engine destinations
  • Enable Kafka: Run Lucille in distributed mode