SolrConnector - Lucille

Overview

The SolrConnector issues requests to Apache Solr and optionally publishes query results as Documents. It supports pre/post actions for index management, query execution with cursor-based pagination, and flexible request formatting (JSON or XML).

Class: com.kmwllc.lucille.connector.SolrConnector
Extends: AbstractConnector

Key Features

Execute Solr queries and publish results as Documents
Pre and post actions for index management (commits, deletes, etc.)
Cursor-based pagination for efficient large result set handling
JSON or XML request formatting
RunId wildcard substitution in actions
Configurable query parameters
Custom field mapping and ID handling

Class Signature

package com.kmwllc.lucille.connector;

public class SolrConnector extends AbstractConnector {
  public SolrConnector(Config config);
  public SolrConnector(Config config, SolrClient client);
  
  @Override
  public void preExecute(String runId) throws ConnectorException;
  
  @Override
  public void execute(Publisher publisher) throws ConnectorException;
  
  @Override
  public void postExecute(String runId) throws ConnectorException;
  
  @Override
  public void close() throws ConnectorException;
  
  public List<String> getLastExecutedPreActions();
  public List<String> getLastExecutedPostActions();
}

Configuration Parameters

Required Parameters

solr

Map

required

Solr connection parameters. See SolrUtils.SOLR_PARENT_SPEC for details.

Common parameters:

url: Solr base URL (e.g., "http://localhost:8983/solr/collection1")
type: Client type ("http" or "cloud")
zkHost: Zookeeper host (for cloud mode)

Example:

solr: {
  url: "http://localhost:8983/solr/documents"
  type: "http"
}

Optional Parameters

preActions

List<String>

List of Solr requests to issue before execution. Executed in preExecute(). Supports the {runId} wildcard, which is replaced with the current run ID.

Use cases: deletes, commits, index preparation.

Example:

preActions: [
  "{\"delete\":{\"query\":\"run_id:{runId}\"}}",
  "{\"commit\":{}}"
]

postActions

List<String>

List of Solr requests to issue after execution. Executed in postExecute(). Supports the {runId} wildcard.

Use cases: commits, optimize, cleanup.

Example:

postActions: [
  "{\"commit\":{}}",
  "{\"optimize\":{}}"
]

solrParams

Map<String, Object>

Solr query parameters for the query issued in query mode. Only used if pipeline is set. Supports standard Solr query parameters such as q, fq, fl, and rows.

Example:

solrParams: {
  q: "*:*"
  fq: ["status:published", "date:[NOW-1DAY TO NOW]"]
  fl: "id,title,content,author"
  rows: 1000
}

The sort and cursorMark parameters are automatically set for cursor-based pagination.
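One way to picture this is as a merge in which the connector-managed values always win. The sketch below is only an illustration under that assumption: SolrParamsSketch and effectiveParams are hypothetical names, not part of Lucille, and the connector's real handling of a user-supplied sort or cursorMark may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucille's API.
final class SolrParamsSketch {

  // Copies the configured solrParams, then forces the two cursor-related
  // parameters so that any user-supplied values are replaced.
  static Map<String, Object> effectiveParams(Map<String, Object> solrParams, String idField) {
    Map<String, Object> params = new LinkedHashMap<>(solrParams);
    params.put("sort", idField + " asc"); // cursors need a sort on a unique field
    params.put("cursorMark", "*");        // "*" marks the start of a cursor traversal
    return params;
  }

  public static void main(String[] args) {
    Map<String, Object> configured = new LinkedHashMap<>();
    configured.put("q", "*:*");
    configured.put("rows", 1000);
    configured.put("sort", "date desc"); // replaced by the managed sort
    System.out.println(effectiveParams(configured, "id"));
  }
}
```

Copying into a new map keeps the configured solrParams untouched, so the same configuration can be reused across runs.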

useXml

Boolean

default:"false"

Send XML requests instead of JSON for actions.

Recommendation: Use JSON requests. Solr performs more validation on JSON commands than XML.

idField

String

default:"id"

Solr field to use as the Document ID when publishing. Only used if pipeline is set.

Example: "documentId", "uuid"

Query Mode vs. Action-Only Mode

The connector operates in two modes:

Action-Only Mode (No Pipeline)

When pipeline is not configured:

Only preActions and postActions are executed
No query is issued
No documents are published
Use for index management tasks

Example: Commit and optimize an index

connector: {
  name: "solr-commit"
  class: "com.kmwllc.lucille.connector.SolrConnector"
  # No pipeline
  
  solr: {
    url: "http://localhost:8983/solr/myindex"
  }
  
  postActions: [
    "{\"commit\":{}}",
    "{\"optimize\":{\"maxSegments\":1}}"
  ]
}

Query Mode (With Pipeline)

When pipeline is configured:

Execute preActions (if configured)
Query Solr using solrParams
Publish each result as a Document
Execute postActions (if configured)

Example: Query and reindex documents

connector: {
  name: "solr-reindex"
  class: "com.kmwllc.lucille.connector.SolrConnector"
  pipeline: "reindex-pipeline"
  
  solr: {
    url: "http://localhost:8983/solr/source"
  }
  
  solrParams: {
    q: "status:active"
    fq: "created:[NOW-30DAYS TO NOW]"
    fl: "*"
    rows: 5000
  }
  
  idField: "doc_id"
}

Configuration Examples

Basic Query and Publish

connector: {
  name: "solr-query"
  class: "com.kmwllc.lucille.connector.SolrConnector"
  pipeline: "process-pipeline"
  
  solr: {
    url: "http://localhost:8983/solr/documents"
    type: "http"
  }
  
  solrParams: {
    q: "*:*"
    fq: "published:true"
    fl: "id,title,body,author,date"
    rows: 1000
  }
  
  idField: "id"
}

Delete by Query with RunId

connector: {
  name: "solr-delete-old"
  class: "com.kmwllc.lucille.connector.SolrConnector"
  # No pipeline - action only
  
  solr: {
    url: "http://localhost:8983/solr/documents"
  }
  
  preActions: [
    "{\"delete\":{\"query\":\"run_id:{runId}\"}}"
  ]
  
  postActions: [
    "{\"commit\":{}}"
  ]
}

SolrCloud with Complex Query

connector: {
  name: "solrcloud-query"
  class: "com.kmwllc.lucille.connector.SolrConnector"
  pipeline: "enrichment-pipeline"
  
  solr: {
    type: "cloud"
    zkHost: "zk1:2181,zk2:2181,zk3:2181"
    collection: "documents"
  }
  
  solrParams: {
    q: "category:technology"
    fq: [
      "lang:en",
      "quality:[5 TO *]",
      "date:[NOW-7DAYS TO NOW]"
    ]
    fl: "id,title,content,category,quality,date,tags"
    rows: 10000
  }
  
  idField: "id"
  docIdPrefix: "solr-"
}

Reindex with Pre/Post Actions

connector: {
  name: "solr-reindex"
  class: "com.kmwllc.lucille.connector.SolrConnector"
  pipeline: "transform-pipeline"
  
  solr: {
    url: "http://solr.example.com:8983/solr/source_index"
  }
  
  preActions: [
    # Mark documents for this run
    "{\"add\":{\"doc\":{\"id\":\"run_marker_{runId}\",\"type\":\"run_marker\",\"run_id\":\"{runId}\"}}}",
    "{\"commit\":{}}"
  ]
  
  solrParams: {
    q: "*:*"
    fq: "-type:run_marker"
    fl: "*"
    rows: 5000
  }
  
  postActions: [
    # Delete the run marker
    "{\"delete\":{\"query\":\"id:run_marker_{runId}\"}}",
    "{\"commit\":{}}"
  ]
}

XML Actions

connector: {
  name: "solr-xml-actions"
  class: "com.kmwllc.lucille.connector.SolrConnector"
  
  solr: {
    url: "http://localhost:8983/solr/documents"
  }
  
  useXml: true
  
  preActions: [
    "<delete><query>run_id:{runId}</query></delete>"
  ]
  
  postActions: [
    "<commit/>",
    "<optimize maxSegments='1'/>"
  ]
}

Cursor-Based Pagination

When querying Solr (with a pipeline), the connector uses cursor-based pagination for efficient traversal of large result sets:

1. Add sort: idField asc to the query
2. Set initial cursorMark to *
3. Execute query
4. Process results (publish documents)
5. If cursorMark changed, update and repeat from step 3
6. If cursorMark unchanged, all results have been processed
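The loop above can be sketched without a running Solr instance. The following is a simulation, not the connector's code: a sorted in-memory set stands in for the index, the last ID on a page stands in for Solr's opaque cursorMark string, and every name (CursorPaginationSketch, Page, query, fetchAll) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Simulation only: a sorted in-memory set stands in for a Solr index.
// All names here are hypothetical and not part of Lucille or SolrJ.
final class CursorPaginationSketch {
  static final String CURSOR_START = "*";

  // One page of IDs plus the cursor to send with the next request.
  static final class Page {
    final List<String> ids;
    final String nextCursor;

    Page(List<String> ids, String nextCursor) {
      this.ids = ids;
      this.nextCursor = nextCursor;
    }
  }

  // Stand-in for one request sorted by ID ascending: returns up to `rows` IDs
  // strictly after the cursor. An empty page returns the cursor unchanged.
  static Page query(NavigableSet<String> index, String cursor, int rows) {
    NavigableSet<String> remaining =
        CURSOR_START.equals(cursor) ? index : index.tailSet(cursor, false);
    List<String> ids = new ArrayList<>();
    for (String id : remaining) {
      if (ids.size() == rows) {
        break;
      }
      ids.add(id);
    }
    String next = ids.isEmpty() ? cursor : ids.get(ids.size() - 1);
    return new Page(ids, next);
  }

  // Start at "*", fetch a page, "publish" it, and stop once the cursor stops moving.
  static List<String> fetchAll(NavigableSet<String> index, int rows) {
    List<String> published = new ArrayList<>();
    String cursor = CURSOR_START;
    while (true) {
      Page page = query(index, cursor, rows);
      published.addAll(page.ids);
      if (page.nextCursor.equals(cursor)) {
        break; // cursor unchanged: all results have been processed
      }
      cursor = page.nextCursor;
    }
    return published;
  }

  public static void main(String[] args) {
    NavigableSet<String> index = new TreeSet<>(List.of("a", "b", "c", "d", "e"));
    System.out.println(fetchAll(index, 2));
  }
}
```

Note that the loop always issues one final request that returns no results; that empty page is what leaves the cursor unchanged and ends the traversal.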

Advantages

Efficient for large result sets
No deep pagination performance penalty
Consistent results even with index updates

Requirements

The idField must be unique and indexed
Results are sorted by idField in ascending order

You cannot override the sort or cursorMark parameters in solrParams. They are automatically managed for cursor pagination.

Document Field Mapping

For each Solr document in the query results:

Extract the ID from the idField (apply docIdPrefix)
Create a Lucille Document with that ID
For each Solr field (except the ID field):
- Convert field name to lowercase
- Skip if it matches the ID field name or “id”
- Add field values to the Document
Publish the Document

Field name handling:

All field names are lowercased
The ID field is not duplicated in the Document
Multi-valued Solr fields become multi-valued Document fields
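As a rough illustration of these rules, the sketch below maps a plain Map standing in for a Solr result onto a minimal stand-in for a Lucille Document. FieldMappingSketch, Doc, and toDoc are hypothetical names, and comparing against the lowercased idField is an assumption about how the ID field is matched.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; Doc is a stand-in, not Lucille's Document class.
final class FieldMappingSketch {

  static final class Doc {
    final String id;
    final Map<String, Object> fields = new LinkedHashMap<>();

    Doc(String id) {
      this.id = id;
    }
  }

  static Doc toDoc(Map<String, Object> solrDoc, String idField, String docIdPrefix) {
    // The Document ID is the prefix followed by the value of the ID field.
    Doc doc = new Doc(docIdPrefix + solrDoc.get(idField));
    String idFieldLower = idField.toLowerCase(Locale.ROOT); // assumption: match after lowercasing
    for (Map.Entry<String, Object> entry : solrDoc.entrySet()) {
      String name = entry.getKey().toLowerCase(Locale.ROOT);
      if (name.equals(idFieldLower) || name.equals("id")) {
        continue; // the ID is carried by the Document itself, not duplicated as a field
      }
      doc.fields.put(name, entry.getValue()); // list values stay multi-valued
    }
    return doc;
  }

  public static void main(String[] args) {
    Map<String, Object> solrDoc = new LinkedHashMap<>();
    solrDoc.put("doc_id", "42");
    solrDoc.put("Title", "Hello");
    solrDoc.put("Tags", List.of("a", "b"));
    Doc doc = toDoc(solrDoc, "doc_id", "solr-");
    System.out.println(doc.id + " " + doc.fields);
  }
}
```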

Action Format

Actions are Solr update requests formatted as JSON or XML strings.

JSON Actions

preActions: [
  # Delete by query
  "{\"delete\":{\"query\":\"status:archived\"}}",
  
  # Add a document
  "{\"add\":{\"doc\":{\"id\":\"doc1\",\"title\":\"Test\"}}}",
  
  # Commit
  "{\"commit\":{}}",
  
  # Optimize
  "{\"optimize\":{\"maxSegments\":5}}"
]

XML Actions

useXml: true
preActions: [
  # Delete by query
  "<delete><query>status:archived</query></delete>",
  
  # Add a document
  "<add><doc><field name='id'>doc1</field><field name='title'>Test</field></doc></add>",
  
  # Commit
  "<commit/>",
  
  # Optimize
  "<optimize maxSegments='5'/>"
]

RunId Wildcard

The {runId} wildcard is replaced with the current run ID in preExecute() and postExecute():

preActions: [
  # Before: "{\"delete\":{\"query\":\"run_id:{runId}\"}}"
  # After:  "{\"delete\":{\"query\":\"run_id:run-2024-01-15-123456\"}}"
]

Use this to:

Delete documents from previous runs
Tag documents with the current run ID
Create run-specific markers
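The substitution itself amounts to a literal string replacement. A minimal sketch, assuming every occurrence of {runId} in an action is replaced; RunIdWildcardSketch and substitute are hypothetical names, not Lucille's API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Lucille's API.
final class RunIdWildcardSketch {
  private static final String WILDCARD = "{runId}";

  // Replaces every literal {runId} occurrence in each action string.
  static List<String> substitute(List<String> actions, String runId) {
    return actions.stream()
        .map(action -> action.replace(WILDCARD, runId))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> actions = List.of(
        "{\"delete\":{\"query\":\"run_id:{runId}\"}}",
        "{\"commit\":{}}");
    substitute(actions, "run-2024-01-15-123456").forEach(System.out::println);
  }
}
```

String.replace treats the wildcard as literal text rather than a regular expression, so the braces need no escaping.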

Lifecycle Methods

preExecute(String runId)

Replace {runId} wildcard in preActions
Execute each action sequentially
Log responses

execute(Publisher publisher)

If pipeline is configured:

1. Build Solr query from solrParams
2. Add sort: idField asc and cursorMark: *
3. Execute query
4. For each result:
   - Create Document from Solr document
   - Publish Document
5. If cursor changed, update and repeat from step 3
6. If cursor unchanged, complete

If pipeline is not configured, this method does nothing.

postExecute(String runId)

Replace {runId} wildcard in postActions
Execute each action sequentially
Log responses

close()

Closes the Solr client and releases resources.
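To show how the four methods fit together, here is a sketch of a caller that invokes them in the order listed above. The Lifecycle interface is a simplified, hypothetical stand-in (the real execute takes a Publisher and the real methods throw ConnectorException), and closing in a finally block is an assumption about the caller, not documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of call order; not Lucille's runner or interfaces.
final class LifecycleSketch {

  // Simplified stand-in for the connector lifecycle methods.
  interface Lifecycle {
    void preExecute(String runId);
    void execute();
    void postExecute(String runId);
    void close();
  }

  // Runs the lifecycle in order; close() is attempted even if a step fails (assumption).
  static void run(Lifecycle connector, String runId) {
    try {
      connector.preExecute(runId);  // pre-actions, before any document is published
      connector.execute();          // query and publish (no-op without a pipeline)
      connector.postExecute(runId); // post-actions, after all documents are published
    } finally {
      connector.close();
    }
  }

  // A fake connector that records which methods were called, in order.
  static Lifecycle recording(List<String> calls) {
    return new Lifecycle() {
      @Override public void preExecute(String runId) { calls.add("preExecute"); }
      @Override public void execute() { calls.add("execute"); }
      @Override public void postExecute(String runId) { calls.add("postExecute"); }
      @Override public void close() { calls.add("close"); }
    };
  }

  public static void main(String[] args) {
    List<String> calls = new ArrayList<>();
    run(recording(calls), "run-1");
    System.out.println(calls);
  }
}
```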

Error Handling

Action Failures

If an action fails, a ConnectorException is thrown with details:

Failed to perform action: {"delete":{"query":"invalid query syntax"}}

Query Failures

If the query fails, a ConnectorException is thrown:

Unable to query Solr: [error details]

Publishing Failures

If publishing a document fails, a ConnectorException is thrown:

Unable to publish document: [error details]

Performance Considerations

Query Performance

Set rows to a reasonable batch size (Solr returns only 10 rows per request by default)
Use filter queries (fq) for better caching
Limit fields with fl to reduce data transfer
Add appropriate indexes for query and filter fields

Cursor Pagination

Cursor-based pagination is efficient for large result sets, but:

Requires sorting by a unique field
Cannot jump to arbitrary pages
May not reflect real-time index changes

Action Timing

Pre-actions execute before any documents are published
Post-actions execute after all documents are published
Use commits strategically to balance consistency and performance

Limitations

Cannot override sort or cursorMark in solrParams
Only supports single-collection queries (no join queries)
No built-in support for child documents from Solr
Actions are executed sequentially (no parallelization)
Field names are always lowercased
