
Overview

The SolrIndexer sends processed documents to Apache Solr using the SolrJ client library. It supports both standalone Solr instances and SolrCloud deployments coordinated through ZooKeeper.
Java Class: com.kmwllc.lucille.indexer.SolrIndexer
Source: SolrIndexer.java

Configuration

Basic Configuration

indexer {
  type: "solr"
  
  solr {
    url: ["http://localhost:8983/solr"]
    defaultCollection: "my_collection"
  }
}

SolrCloud Configuration

indexer {
  type: "solr"
  
  solr {
    useCloudClient: true
    zkHosts: ["zk1:2181", "zk2:2181", "zk3:2181"]
    zkChroot: "/solr"
    defaultCollection: "my_collection"
  }
}

Parameters

url
string[]
One or more Solr base URLs (e.g., http://localhost:8983/solr). Used for standalone Solr instances. Example: ["http://solr1:8983/solr", "http://solr2:8983/solr"]
useCloudClient
boolean
default: false
Whether to use the SolrCloud client. Set to true when connecting to a SolrCloud cluster with ZooKeeper.
zkHosts
string[]
ZooKeeper host:port addresses used when connecting to SolrCloud. Required when useCloudClient is true. Example: ["zk1:2181", "zk2:2181", "zk3:2181"]
zkChroot
string
ZooKeeper chroot path used with SolrCloud. Typically /solr or similar.
defaultCollection
string
Default Solr collection to index documents into when no indexOverrideField is present on the document.
userName
string
Username for HTTP basic authentication.
password
string
Password for HTTP basic authentication.
acceptInvalidCert
boolean
default: false
Accept invalid TLS certificates. This disables certificate checks, so use it only for development or testing.
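The dependency between useCloudClient, zkHosts, and url can be checked before a pipeline starts. The following is a minimal sketch, not Lucille's own validation code: SolrSettingsCheck and its problems method are hypothetical names, and the rule that a standalone setup needs at least one url is an assumption rather than something stated above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucille): sanity-checks the values of a solr { ... } block. */
final class SolrSettingsCheck {

  /** Returns human-readable problems; an empty list means the settings look consistent. */
  static List<String> problems(boolean useCloudClient, List<String> url, List<String> zkHosts) {
    List<String> problems = new ArrayList<>();
    if (useCloudClient) {
      // Documented rule: zkHosts is required when useCloudClient is true.
      if (zkHosts == null || zkHosts.isEmpty()) {
        problems.add("zkHosts is required when useCloudClient is true");
      }
    } else {
      // Assumption: a standalone setup needs at least one Solr base URL.
      if (url == null || url.isEmpty()) {
        problems.add("url is required when useCloudClient is false");
      }
    }
    return problems;
  }
}
```

Returning a list instead of throwing on the first problem lets a caller report every misconfiguration at once.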

Features

Multi-Collection Support

Route documents to different collections using the indexOverrideField:
indexer {
  indexOverrideField: "target_collection"
  
  solr {
    defaultCollection: "default_docs"
  }
}
Documents that have a target_collection field are sent to the collection named by that field's value instead of the default. Documents without the field go to default_docs.
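The routing rule can be sketched in a few lines of Java. This models a document as a plain Map rather than Lucille's Document class. CollectionRouter and resolveCollection are hypothetical names, and falling back to the default when the override value is empty is an assumption.

```java
import java.util.Map;

/** Hypothetical illustration of per-document collection routing. */
final class CollectionRouter {

  /** Picks the target collection for one document. */
  static String resolveCollection(Map<String, Object> doc,
                                  String indexOverrideField,
                                  String defaultCollection) {
    if (indexOverrideField != null) {
      Object override = doc.get(indexOverrideField);
      // Assumption: an empty override value is treated like a missing one.
      if (override != null && !override.toString().isEmpty()) {
        return override.toString();
      }
    }
    return defaultCollection;
  }
}
```

With the configuration above, a document carrying target_collection = "archive" would resolve to archive, and a document without that field would resolve to default_docs.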

Child Documents

Solr’s nested document feature is supported. A document can carry child documents in its children field:
{
  "id": "parent1",
  "title": "Parent Document",
  "children": [
    {
      "id": "child1",
      "content": "Child document content"
    }
  ]
}
Child documents are automatically converted to Solr’s _childDocuments_ format.
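To make the conversion concrete, here is a simplified model of it using plain Maps. It is a sketch only: the real indexer works with SolrJ document objects, and ChildDocumentSketch and toSolrShape are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model: moves entries of "children" under Solr's _childDocuments_ key. */
final class ChildDocumentSketch {

  static Map<String, Object> toSolrShape(Map<String, Object> doc) {
    Map<String, Object> out = new LinkedHashMap<>();
    List<Map<String, Object>> children = new ArrayList<>();
    for (Map.Entry<String, Object> e : doc.entrySet()) {
      if ("children".equals(e.getKey()) && e.getValue() instanceof List) {
        // Collect each child map; non-map entries are skipped in this sketch.
        for (Object child : (List<?>) e.getValue()) {
          if (child instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> c = (Map<String, Object>) child;
            children.add(new LinkedHashMap<>(c));
          }
        }
      } else {
        // Ordinary fields are copied through unchanged.
        out.put(e.getKey(), e.getValue());
      }
    }
    if (!children.isEmpty()) {
      out.put("_childDocuments_", children);
    }
    return out;
  }
}
```

Applied to the parent1 example above, the result keeps id and title and holds child1 in a one-element list under _childDocuments_.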

Deletion Support

Configure a marker field and the value that flags a document for deletion:
indexer {
  deletionMarkerField: "delete"
  deletionMarkerFieldValue: "true"
}
With this configuration, documents whose delete field has the value "true" are removed from Solr by ID using solrClient.deleteById() instead of being indexed.
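The marker check itself amounts to a field comparison. A minimal sketch, assuming the field value is compared by its string form; DeletionMarker and isMarkedForDeletion are hypothetical names, and the document is modeled as a plain Map:

```java
import java.util.Map;

/** Hypothetical illustration of the deletion-marker test. */
final class DeletionMarker {

  /** True when the document carries markerField with exactly markerValue. */
  static boolean isMarkedForDeletion(Map<String, Object> doc,
                                     String markerField,
                                     String markerValue) {
    // Without both settings, no document is ever treated as a deletion.
    if (markerField == null || markerValue == null) {
      return false;
    }
    Object value = doc.get(markerField);
    // Assumption: values are compared by their string representation.
    return value != null && markerValue.equals(value.toString());
  }
}
```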

SSL/TLS Configuration

Connect to Solr over HTTPS with custom SSL settings:
solr {
  url: ["https://solr.example.com:8983"]
  
  # SSL settings
  ssl.trustStorePath: "/path/to/truststore.jks"
  ssl.trustStorePassword: "password"
  ssl.keyStorePath: "/path/to/keystore.jks"
  ssl.keyStorePassword: "password"
}

Connection Validation

The indexer validates connectivity during startup:
  • Standalone Solr: Uses solrClient.ping() to verify the connection
  • SolrCloud: Checks cluster status via CollectionAdminRequest.ClusterStatus()
If validation fails, the pipeline will not start.

Batch Processing

Documents are sent to Solr in batches:
  1. Documents accumulate up to batchSize (default 100)
  2. Within each batch, documents are grouped by collection
  3. Add/update operations are sent first
  4. Delete operations are sent after adds/updates
  5. If an ID appears in both adds and deletes, operations are ordered correctly
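Steps 2 through 4 can be sketched as a grouping pass over one batch. This is an illustration under assumptions, not the indexer's actual code: BatchPlanner, Op, and CollectionPlan are hypothetical names, and the special handling in step 5 for an ID that appears in both adds and deletes is deliberately not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: groups one batch by collection and separates adds from deletes. */
final class BatchPlanner {

  /** One pending operation: the target collection, the document ID, and whether it is a delete. */
  static final class Op {
    final String collection;
    final String id;
    final boolean delete;

    Op(String collection, String id, boolean delete) {
      this.collection = collection;
      this.id = id;
      this.delete = delete;
    }
  }

  /** Per-collection work; a caller would send addIds first, then deleteIds. */
  static final class CollectionPlan {
    final List<String> addIds = new ArrayList<>();
    final List<String> deleteIds = new ArrayList<>();
  }

  static Map<String, CollectionPlan> plan(List<Op> batch) {
    // LinkedHashMap keeps collections in first-seen order.
    Map<String, CollectionPlan> plans = new LinkedHashMap<>();
    for (Op op : batch) {
      CollectionPlan p = plans.computeIfAbsent(op.collection, k -> new CollectionPlan());
      (op.delete ? p.deleteIds : p.addIds).add(op.id);
    }
    return plans;
  }
}
```

Keeping adds and deletes in separate lists per collection makes the send order explicit: for each collection, one add/update request followed by one delete request.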

Error Handling

Failed documents are tracked and reported:
// Documents that fail are returned with error details
Set<Pair<Document, String>> failedDocs = sendToIndex(documents);
Errors include:
  • Connection failures
  • Invalid field types
  • Collection not found
  • Authentication failures

Example Configurations

Standalone Solr

indexer {
  type: "solr"
  batchSize: 500
  
  solr {
    url: ["http://localhost:8983/solr"]
    defaultCollection: "documents"
  }
}

SolrCloud with Authentication

indexer {
  type: "solr"
  batchSize: 1000
  
  solr {
    useCloudClient: true
    zkHosts: ["zk1:2181", "zk2:2181", "zk3:2181"]
    zkChroot: "/solr"
    defaultCollection: "main_docs"
    userName: "solr_user"
    password: "secret_password"
  }
}

Multi-Collection Routing with Deletion

indexer {
  type: "solr"
  indexOverrideField: "collection_name"
  deletionMarkerField: "deleted"
  deletionMarkerFieldValue: "yes"
  
  solr {
    useCloudClient: true
    zkHosts: ["localhost:2181"]
    defaultCollection: "default"
  }
}

Best Practices

SolrCloud

SolrCloud provides high availability, automatic failover, and distributed indexing. To use it, set useCloudClient: true and connect via ZooKeeper.

Batch Size

  • Start with 500-1000 documents per batch
  • Reduce the batch size for documents with large text fields
  • Monitor Solr’s heap usage and adjust accordingly

Nested Documents and Field Types

  • Child documents cannot have their own children (only one level of nesting)
  • Map types must be flattened to fields
  • Child IDs should be unique across the entire index

Field-Based Deletion

Field-based deletion (deleteByFieldField) executes a query before deleting. For high-volume deletions, consider:
  • Deleting by ID when possible
  • Batching deletions
  • Using Solr’s time-to-live (TTL) features

Troubleshooting

Connection Failures

  • Verify Solr is running: curl http://localhost:8983/solr/my_collection/admin/ping (Solr serves the ping handler per collection; replace my_collection with your collection name)
  • Check firewall rules
  • Confirm correct port and protocol (HTTP vs HTTPS)

Collection Not Found

  • Create the collection in Solr first
  • Verify defaultCollection matches exactly (case-sensitive)
  • For SolrCloud, ensure collection exists in ZooKeeper

Authentication Failures

  • Verify credentials are correct
  • Check Solr’s security.json configuration
  • Ensure the user has write permissions to the collection

Invalid Field Types

  • Ensure Solr schema matches document fields
  • Use ignoreFields to exclude problematic fields
  • Check for Map/Object fields (not supported directly)
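Because Map/Object fields are not supported directly, one way to prepare such documents is to flatten nested maps into top-level fields before indexing. The sketch below is one possible approach, not a Lucille feature: MapFlattener is a hypothetical name, and joining key segments with an underscore is an assumed naming convention that your schema may or may not follow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: flattens nested maps into top-level fields such as meta_author. */
final class MapFlattener {

  static Map<String, Object> flatten(Map<String, Object> doc) {
    Map<String, Object> out = new LinkedHashMap<>();
    flattenInto("", doc, out);
    return out;
  }

  private static void flattenInto(String prefix, Map<?, ?> source, Map<String, Object> out) {
    for (Map.Entry<?, ?> e : source.entrySet()) {
      // Assumption: nested key segments are joined with an underscore.
      String key = prefix.isEmpty() ? String.valueOf(e.getKey()) : prefix + "_" + e.getKey();
      if (e.getValue() instanceof Map) {
        // Recurse so that maps nested at any depth end up as flat fields.
        flattenInto(key, (Map<?, ?>) e.getValue(), out);
      } else {
        out.put(key, e.getValue());
      }
    }
  }
}
```

For example, a document with meta = {author: "a"} would come out with a single meta_author field, which can then be matched against the Solr schema like any other field.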

See Also