Overview
The SolrConnector issues requests to Apache Solr and optionally publishes query results as Documents. It supports pre/post actions for index management, query execution with cursor-based pagination, and flexible request formatting (JSON or XML).
Class: com.kmwllc.lucille.connector.SolrConnector
Extends: AbstractConnector
Key Features
- Execute Solr queries and publish results as Documents
- Pre and post actions for index management (commits, deletes, etc.)
- Cursor-based pagination for efficient large result set handling
- JSON or XML request formatting
- RunId wildcard substitution in actions
- Configurable query parameters
- Custom field mapping and ID handling
Class Signature
Configuration Parameters
Required Parameters
Solr connection parameters. See SolrUtils.SOLR_PARENT_SPEC for details.
Common parameters:
- url: Solr base URL (e.g., "http://localhost:8983/solr/collection1")
- type: Client type ("http" or "cloud")
- zkHost: ZooKeeper host (for cloud mode)
Optional Parameters
preActions: List of Solr requests to issue before execution. Executed in preExecute(). Supports the {runId} wildcard, which is replaced with the current run ID.
Use cases: deletes, commits, index preparation.
postActions: List of Solr requests to issue after execution. Executed in postExecute(). Supports the {runId} wildcard.
Use cases: commits, optimize, cleanup.
solrParams: Query parameters to use when a pipeline is configured. Only used if pipeline is set. Supports Solr query parameters like q, fq, fl, rows, etc.
The sort and cursorMark parameters are automatically set for cursor-based pagination.
Send XML requests instead of JSON for actions.
Recommendation: Use JSON requests. Solr performs more validation on JSON commands than XML.
idField: Solr field to use as the Document ID when publishing. Only used if pipeline is set.
Example: "documentId", "uuid"
Query Mode vs. Action-Only Mode
The connector operates in two modes:
Action-Only Mode (No Pipeline)
When pipeline is not configured:
- Only preActions and postActions are executed
- No query is issued
- No documents are published
- Use for index management tasks
Query Mode (With Pipeline)
When pipeline is configured:
- Execute preActions (if configured)
- Query Solr using solrParams
- Publish each result as a Document
- Execute postActions (if configured)
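The difference between the two modes can be sketched as a function that lists the steps performed for a given configuration. This is a minimal illustration, not the connector's code: the class name ConnectorModeSketch and the method plannedSteps are hypothetical, and the actions are treated as opaque strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in used only to illustrate the two modes; not the Lucille SolrConnector.
class ConnectorModeSketch {

  /** Returns the ordered steps that would run for the given configuration. */
  static List<String> plannedSteps(String pipeline, List<String> preActions, List<String> postActions) {
    List<String> steps = new ArrayList<>();
    for (String action : preActions) {
      steps.add("preAction: " + action);
    }
    // A query is issued and documents are published only when a pipeline is configured.
    if (pipeline != null) {
      steps.add("query and publish to: " + pipeline);
    }
    for (String action : postActions) {
      steps.add("postAction: " + action);
    }
    return steps;
  }

  public static void main(String[] args) {
    List<String> pre = List.of("{\"delete\":{\"query\":\"*:*\"}}");
    List<String> post = List.of("{\"commit\":{}}");
    System.out.println(plannedSteps(null, pre, post));        // action-only mode
    System.out.println(plannedSteps("pipeline1", pre, post)); // query mode
  }
}
```

In action-only mode the returned list contains only the pre and post actions, which matches the index management use case described above.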
Configuration Examples
Basic Query and Publish
Delete by Query with RunId
SolrCloud with Complex Query
Reindex with Pre/Post Actions
XML Actions
Cursor-Based Pagination
When querying Solr (with a pipeline), the connector uses cursor-based pagination for efficient traversal of large result sets:
1. Add sort: idField asc to the query
2. Set initial cursorMark to *
3. Execute query
4. Process results (publish documents)
5. If cursorMark changed, update and repeat from step 3
6. If cursorMark unchanged, all results have been processed
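The loop above can be sketched against an in-memory stand-in for Solr. Everything here is an assumption made for illustration: fetchPage plays the role of the server, the cursor is simply the last ID returned (real Solr cursor marks are opaque strings), and none of the names come from SolrJ or Lucille.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: an in-memory stand-in for Solr, not SolrJ and not the Lucille implementation.
class CursorPaginationSketch {

  /** One page of results plus the cursor to send with the next request. */
  static final class Page {
    final List<String> ids;
    final String nextCursorMark;

    Page(List<String> ids, String nextCursorMark) {
      this.ids = ids;
      this.nextCursorMark = nextCursorMark;
    }
  }

  /** Hypothetical "server": returns up to `rows` ids that sort after the cursor. */
  static Page fetchPage(List<String> sortedIds, String cursorMark, int rows) {
    List<String> ids = new ArrayList<>();
    for (String id : sortedIds) {
      boolean afterCursor = cursorMark.equals("*") || id.compareTo(cursorMark) > 0;
      if (afterCursor && ids.size() < rows) {
        ids.add(id);
      }
    }
    // When nothing is left, hand back the incoming cursor unchanged.
    String next = ids.isEmpty() ? cursorMark : ids.get(ids.size() - 1);
    return new Page(ids, next);
  }

  /** Client loop: start at "*", stop when the cursor stops changing. */
  static List<String> fetchAll(List<String> sortedIds, int rows) {
    List<String> published = new ArrayList<>();
    String cursorMark = "*";
    while (true) {
      Page page = fetchPage(sortedIds, cursorMark, rows);
      published.addAll(page.ids); // stands in for publishing Documents
      if (page.nextCursorMark.equals(cursorMark)) {
        break; // cursor unchanged: all results processed
      }
      cursorMark = page.nextCursorMark;
    }
    return published;
  }

  public static void main(String[] args) {
    System.out.println(fetchAll(List.of("a", "b", "c", "d", "e"), 2));
  }
}
```

Because each request resumes after the last ID seen, no request has to skip over earlier rows, which is why this approach avoids the cost of deep offsets.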
Advantages
- Efficient for large result sets
- No deep pagination performance penalty
- Consistent results even with index updates
Requirements
- The idField must be unique and indexed
- Results are sorted by idField in ascending order
You cannot override the sort or cursorMark parameters in solrParams. They are automatically managed for cursor pagination.
Document Field Mapping
For each Solr document in the query results:
1. Extract the ID from the idField (applying docIdPrefix)
2. Create a Lucille Document with that ID
3. For each Solr field (except the ID field):
   - Convert field name to lowercase
   - Skip if it matches the ID field name or "id"
   - Add field values to the Document
4. Publish the Document
As a result:
- All field names are lowercased
- The ID field is not duplicated in the Document
- Multi-valued Solr fields become multi-valued Document fields
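The mapping rules above can be sketched with plain maps standing in for Solr results and Lucille Documents. The class and method names are hypothetical, and comparing the lowercased field name against the lowercased idField is an assumption about how the skip rule is applied.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for SolrDocument and the Lucille Document.
class FieldMappingSketch {

  /** Builds the Document ID from the configured ID field, with the prefix applied. */
  static String documentId(Map<String, Object> solrDoc, String idField, String docIdPrefix) {
    return docIdPrefix + solrDoc.get(idField);
  }

  /** Copies every non-ID field under a lowercased name; values are always stored as lists. */
  static Map<String, List<Object>> toFields(Map<String, Object> solrDoc, String idField) {
    Map<String, List<Object>> fields = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : solrDoc.entrySet()) {
      String name = entry.getKey().toLowerCase(Locale.ROOT);
      // Assumption: the ID check is done on the lowercased name.
      if (name.equals(idField.toLowerCase(Locale.ROOT)) || name.equals("id")) {
        continue;
      }
      List<Object> values = fields.computeIfAbsent(name, k -> new ArrayList<>());
      Object value = entry.getValue();
      if (value instanceof Collection) {
        values.addAll((Collection<?>) value); // multi-valued field
      } else {
        values.add(value);
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    Map<String, Object> solrDoc = new LinkedHashMap<>();
    solrDoc.put("uuid", "42");
    solrDoc.put("Title", "Hello");
    solrDoc.put("Tags", List.of("a", "b"));
    System.out.println(documentId(solrDoc, "uuid", "solr-"));
    System.out.println(toFields(solrDoc, "uuid"));
  }
}
```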
Action Format
Actions are Solr update requests formatted as JSON or XML strings.
JSON Actions
XML Actions
RunId Wildcard
The {runId} wildcard is replaced with the current run ID in preExecute() and postExecute(). Typical uses:
- Delete documents from previous runs
- Tag documents with the current run ID
- Create run-specific markers
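A minimal sketch of the substitution, assuming it is a literal string replacement applied to each action before it is sent. The class name, the method substitute, and the Solr field run_id are hypothetical; the delete and commit commands use standard Solr update syntax in JSON and XML form.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of {runId} substitution; not the connector's actual code.
class RunIdWildcardSketch {

  static final String RUN_ID_WILDCARD = "{runId}";

  /** Returns a copy of the actions with every {runId} occurrence replaced literally. */
  static List<String> substitute(List<String> actions, String runId) {
    List<String> resolved = new ArrayList<>();
    for (String action : actions) {
      resolved.add(action.replace(RUN_ID_WILDCARD, runId));
    }
    return resolved;
  }

  public static void main(String[] args) {
    List<String> actions = List.of(
        // JSON form: delete everything not written by the current run (run_id is an assumed field).
        "{\"delete\":{\"query\":\"*:* -run_id:{runId}\"}}",
        // XML form of a delete by query, followed by a commit.
        "<delete><query>run_id:{runId}</query></delete>",
        "<commit/>");
    for (String resolved : substitute(actions, "run-123")) {
      System.out.println(resolved);
    }
  }
}
```

Actions without the wildcard, such as the commit, pass through unchanged.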
Lifecycle Methods
preExecute(String runId)
- Replace {runId} wildcard in preActions
- Execute each action sequentially
- Log responses
execute(Publisher publisher)
If pipeline is configured:
1. Build Solr query from solrParams
2. Add sort: idField asc and cursorMark: *
3. Execute query
4. For each result:
   - Create Document from Solr document
   - Publish Document
5. If cursor changed, update and repeat from step 3
6. If cursor unchanged, complete
If pipeline is not configured, this method does nothing.
postExecute(String runId)
- Replace {runId} wildcard in postActions
- Execute each action sequentially
- Log responses
close()
Closes the Solr client and releases resources.
Error Handling
Action Failures
If an action fails, a ConnectorException is thrown with details of the failure.
Query Failures
If the query fails, a ConnectorException is thrown.
Publishing Failures
If publishing a document fails, a ConnectorException is thrown.
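All three failure cases share one pattern: the underlying exception is wrapped in a ConnectorException that says which step failed. The sketch below illustrates that pattern only. The nested ConnectorException is a stand-in so the example compiles on its own, and the Step interface, runStep helper, and message wording are hypothetical.

```java
// Hypothetical sketch of wrapping failures; not the connector's actual code.
class ErrorHandlingSketch {

  /** Stand-in for Lucille's ConnectorException so this example is self-contained. */
  static class ConnectorException extends Exception {
    ConnectorException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** A unit of work (action, query, or publish) that may fail. */
  interface Step {
    void run() throws Exception;
  }

  /** Runs the step and rethrows any failure with a description of what was attempted. */
  static void runStep(String description, Step step) throws ConnectorException {
    try {
      step.run();
    } catch (Exception e) {
      throw new ConnectorException("Failed to " + description, e);
    }
  }

  public static void main(String[] args) {
    try {
      runStep("issue pre-action", () -> {
        throw new IllegalStateException("Solr returned an error");
      });
    } catch (ConnectorException e) {
      System.out.println(e.getMessage() + " (cause: " + e.getCause().getMessage() + ")");
    }
  }
}
```

Keeping the original exception as the cause preserves the Solr-side detail for whoever inspects the stack trace.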
Performance Considerations
Query Performance
- Set rows to a reasonable batch size (Solr returns 10 rows by default)
- Use filter queries (fq) for better caching
- Limit fields with fl to reduce data transfer
- Add appropriate indexes for query and filter fields
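As an illustration of the first three tips, the sketch below assembles a parameter set with an explicit rows value, two filter queries, and a restricted field list, then prints it as a URL query string. The field names type, lang, and title are made up, and toQueryString is a hypothetical helper rather than a SolrJ or Lucille method.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; field names and values are made up.
class QueryParamsSketch {

  /** Encodes multi-valued parameters as key=value pairs joined with '&'. */
  static String toQueryString(Map<String, List<String>> params) {
    StringJoiner joiner = new StringJoiner("&");
    for (Map.Entry<String, List<String>> entry : params.entrySet()) {
      for (String value : entry.getValue()) {
        joiner.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
            + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8));
      }
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    Map<String, List<String>> params = new LinkedHashMap<>();
    params.put("q", List.of("*:*"));
    params.put("fq", List.of("type:article", "lang:en")); // each fq is a separate filter
    params.put("fl", List.of("id,title"));                // return only the fields needed
    params.put("rows", List.of("500"));                   // batch size per request
    System.out.println(toQueryString(params));
  }
}
```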
Cursor Pagination
Cursor-based pagination is efficient for large result sets, but:
- Requires sorting by a unique field
- Cannot jump to arbitrary pages
- May not reflect real-time index changes
Action Timing
- Pre-actions execute before any documents are published
- Post-actions execute after all documents are published
- Use commits strategically to balance consistency and performance
Limitations
- Cannot override sort or cursorMark in solrParams
- Only supports single-collection queries (no join queries)
- No built-in support for child documents from Solr
- Actions are executed sequentially (no parallelization)
- Field names are always lowercased