Documents - Lucille

Overview

A Document is the fundamental unit of data in Lucille. Documents flow through Connectors, Pipelines, and Indexers, accumulating fields and transformations along the way before being indexed into search engines.

public interface Document {
    String ID_FIELD = "id";
    String RUNID_FIELD = "run_id";
    String CHILDREN_FIELD = ".children";
    String DROP_FIELD = ".dropped";
}

Documents are represented internally as JSON-like structures with named fields that can contain single or multi-valued data.

Document Structure

Required Fields

Every document has a unique identifier:

Document doc = Document.create("doc-12345");
String id = doc.getId();  // "doc-12345"

Reserved Fields

Certain field names are reserved for internal use:

id

Unique document identifier. Cannot be modified after creation.

run_id

Identifier for the batch ingest that produced this document.

.children

Internal storage for child documents. Not directly accessible.

.dropped

Flag indicating that the document should not be indexed.

Attempting to set or modify reserved fields (except via their dedicated methods) throws IllegalArgumentException.

Creating Documents

Documents can be created in several ways:

From ID

Document doc = Document.create("article-123");

From ID and Run ID

Document doc = Document.create("article-123", "run-abc-def");

From JSON String

String json = "{\"id\":\"article-123\",\"title\":\"Hello World\"}";
Document doc = Document.createFromJson(json);

From JSON with ID Transformation

Document doc = Document.createFromJson(json, id -> "prefix-" + id);

From JSON ObjectNode

ObjectMapper mapper = new ObjectMapper();  // Jackson ObjectMapper
ObjectNode node = mapper.createObjectNode();
node.put("id", "article-123");
node.put("title", "Hello World");
Document doc = Document.create(node);

Supported Field Types

Documents support a rich set of data types:

String

doc.setField("title", "Introduction to Lucille");
String title = doc.getString("title");

Boolean

doc.setField("published", true);
Boolean published = doc.getBoolean("published");

Integer

doc.setField("views", 1000);
Integer views = doc.getInt("views");

Long

doc.setField("user_id", 12345678901L);
Long userId = doc.getLong("user_id");

Float

doc.setField("score", 95.5f);
Float score = doc.getFloat("score");

Double

doc.setField("price", 99.99);
Double price = doc.getDouble("price");

Instant

doc.setField("created_at", Instant.now());
Instant created = doc.getInstant("created_at");
// Stored as ISO-8601 string internally
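
The ISO-8601 storage mentioned in the comment above can be illustrated with java.time alone. This is a minimal sketch, not Lucille code: the class and method names are hypothetical, and it is an assumption that the stored form matches Instant.toString(). It only shows what an ISO-8601 instant string looks like and that parsing it recovers the same Instant.

```java
import java.time.Instant;

class InstantIsoSketch {
    // Format an Instant as an ISO-8601 string in UTC (with a 'Z' suffix).
    static String toIso(Instant instant) {
        return instant.toString();
    }

    // Parse an ISO-8601 instant string back into an Instant.
    static Instant fromIso(String iso) {
        return Instant.parse(iso);
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2024-01-15T10:30:00Z");
        String stored = toIso(created);
        System.out.println(stored);                          // 2024-01-15T10:30:00Z
        System.out.println(fromIso(stored).equals(created)); // true
    }
}
```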

Date

doc.setField("publish_date", new Date());
Date publishDate = doc.getDate("publish_date");

Timestamp

doc.setField("modified", new Timestamp(System.currentTimeMillis()));
Timestamp modified = doc.getTimestamp("modified");

byte[]

doc.setField("thumbnail", imageBytes);
byte[] thumbnail = doc.getBytes("thumbnail");

JsonNode

ObjectNode metadata = mapper.createObjectNode();
metadata.put("author", "Jane Doe");
metadata.put("version", 2);
doc.setField("metadata", metadata);
JsonNode meta = doc.getJson("metadata");

Field Operations

Setting Fields

Single value (overwrites existing):

doc.setField("title", "New Title");

Object parameter (auto-detects type):

Object value = "some value";  // Could be any supported type
doc.setField("field_name", value);

Adding to Fields

Add to existing field (converts to multi-valued):

doc.setField("tags", "java");
doc.addToField("tags", "lucille");
doc.addToField("tags", "search");
// Result: tags = ["java", "lucille", "search"]

Set or Add

Create or append depending on field existence:

doc.setOrAdd("category", "technology");
// First call: creates single-valued field

doc.setOrAdd("category", "programming");
// Second call: converts to multi-valued ["technology", "programming"]

Update with Mode

Flexible update based on mode:

// Overwrite existing value
doc.update("title", UpdateMode.OVERWRITE, "Title 1", "Title 2");
// Result: title = ["Title 1", "Title 2"]

// Skip if field exists
doc.update("author", UpdateMode.SKIP, "Default Author");
// Only sets if author field doesn't exist

// Append values
doc.update("keywords", UpdateMode.APPEND, "keyword1", "keyword2");
// Adds to existing values

Update modes:

OVERWRITE: Replace existing values
SKIP: Only set if field doesn’t exist
APPEND: Add to existing values (same as setOrAdd)
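
The three modes listed above can be modeled without Lucille on the classpath. The following is a rough sketch under stated assumptions: fields are represented as a plain map from name to a list of values, and the Mode enum and UpdateModeSketch class are hypothetical stand-ins, not Lucille's UpdateMode or Document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UpdateModeSketch {
    // Hypothetical stand-in for the update modes described above.
    enum Mode { OVERWRITE, SKIP, APPEND }

    // Each field is modeled as a name mapped to its list of values.
    final Map<String, List<Object>> fields = new LinkedHashMap<>();

    void update(String name, Mode mode, Object... values) {
        switch (mode) {
            case SKIP:
                // Leave the field untouched if it already exists.
                if (fields.containsKey(name)) {
                    return;
                }
                fields.put(name, new ArrayList<>(Arrays.asList(values)));
                break;
            case OVERWRITE:
                // Discard any previous values and store the new ones.
                fields.put(name, new ArrayList<>(Arrays.asList(values)));
                break;
            case APPEND:
                // Create the field if needed, then add to its values.
                fields.computeIfAbsent(name, k -> new ArrayList<>())
                      .addAll(Arrays.asList(values));
                break;
        }
    }

    public static void main(String[] args) {
        UpdateModeSketch doc = new UpdateModeSketch();
        doc.update("title", Mode.OVERWRITE, "Old Title");
        doc.update("title", Mode.OVERWRITE, "Title 1", "Title 2");
        doc.update("title", Mode.SKIP, "Ignored");
        doc.update("keywords", Mode.APPEND, "keyword1");
        doc.update("keywords", Mode.APPEND, "keyword2", "keyword3");
        // {title=[Title 1, Title 2], keywords=[keyword1, keyword2, keyword3]}
        System.out.println(doc.fields);
    }
}
```

Modeling every field as a list keeps the sketch short; it does not reproduce the single-valued versus multi-valued distinction that real documents make.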

Multi-Valued Fields

Creating Multi-Valued Fields

// Multiple addToField calls
doc.addToField("authors", "Alice");
doc.addToField("authors", "Bob");
doc.addToField("authors", "Charlie");

// Or multiple setOrAdd calls
doc.setOrAdd("tags", "java");
doc.setOrAdd("tags", "etl");
doc.setOrAdd("tags", "search");

Accessing Multi-Valued Fields

Get first value:

String firstAuthor = doc.getString("authors");  // "Alice"

Get all values:

List<String> allAuthors = doc.getStringList("authors");
// ["Alice", "Bob", "Charlie"]

Check if multi-valued:

boolean isMulti = doc.isMultiValued("authors");  // true
int count = doc.length("authors");  // 3

Working with Lists

All supported types have list variants:

List<String> strings = doc.getStringList("tags");
List<Integer> numbers = doc.getIntList("scores");
List<Boolean> flags = doc.getBooleanList("features");
List<Instant> timestamps = doc.getInstantList("events");
List<JsonNode> objects = doc.getJsonList("items");

Field Utilities

Checking Field Existence

// Check if field exists (even if null)
boolean exists = doc.has("title");

// Check if field exists AND is not null
boolean hasValue = doc.hasNonNull("title");
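
The difference between the two checks is easiest to see on a field that is present but holds null. The sketch below mimics it with a plain java.util.Map; the FieldPresenceSketch class is hypothetical and only mirrors the semantics described in the comments above, not Lucille's implementation.

```java
import java.util.HashMap;
import java.util.Map;

class FieldPresenceSketch {
    // True if the field is present, even when its value is null.
    static boolean has(Map<String, Object> fields, String name) {
        return fields.containsKey(name);
    }

    // True only if the field is present and its value is not null.
    static boolean hasNonNull(Map<String, Object> fields, String name) {
        return fields.get(name) != null;
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new HashMap<>();
        fields.put("title", null);       // present, but null
        fields.put("author", "Jane Doe");

        System.out.println(has(fields, "title"));         // true
        System.out.println(hasNonNull(fields, "title"));  // false
        System.out.println(hasNonNull(fields, "author")); // true
        System.out.println(has(fields, "missing"));       // false
    }
}
```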

Removing Fields

// Remove entire field
doc.removeField("temp_data");

// Remove specific array index
doc.removeFromArray("tags", 2);  // Remove third tag

Renaming Fields

doc.renameField("old_name", "new_name", UpdateMode.OVERWRITE);

Removing Duplicates

doc.addToField("tags", "java");
doc.addToField("tags", "lucille");
doc.addToField("tags", "java");  // Duplicate

doc.removeDuplicateValues("tags", "tags_unique");
// tags_unique = ["java", "lucille"]

// Or remove in-place:
doc.removeDuplicateValues("tags", null);
// tags = ["java", "lucille"]
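
The result shown above keeps the first occurrence of each value in its original position. One common way to get that behavior in plain Java is a LinkedHashSet, sketched below. The DedupSketch class is hypothetical, and it is an assumption that Lucille deduplicates in a comparable way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class DedupSketch {
    // Drop repeated values while keeping first-occurrence order.
    static <T> List<T> removeDuplicates(List<T> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("java", "lucille", "java");
        System.out.println(removeDuplicates(tags)); // [java, lucille]
    }
}
```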

Field Length

int length = doc.length("tags");
// Returns number of values (1 for single-valued, N for multi-valued)

Getting All Field Names

Set<String> fieldNames = doc.getFieldNames();
// Preserves insertion order

Nested JSON Fields

Documents support nested JSON path access:

Getting Nested Values

doc.setField("user", mapper.createObjectNode()
    .put("name", "John")
    .put("age", 30));

// Access nested field
JsonNode name = doc.getNestedJson("user.name");

// Access array element (assumes a "metadata" field that contains a "tags" array)
ArrayNode tags = (ArrayNode) doc.getNestedJson("metadata.tags");
JsonNode firstTag = doc.getNestedJson("metadata.tags[0]");

Setting Nested Values

// Creates the nested structure if it doesn't exist
doc.setNestedJson("user.address.city", TextNode.valueOf("Boston"));

// Set array element
doc.setNestedJson("metadata.tags[0]", TextNode.valueOf("important"));

Removing Nested Values

doc.removeNestedJson("user.address.zipcode");

Path Syntax

// Object paths use dots
"user.profile.name"

// Array indices use brackets
"items[0].title"

// Combined
"metadata.authors[2].contact.email"

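To make the path syntax above concrete, the sketch below splits a path into object keys and array indices. It is an illustration only: the PathSyntaxSketch class is hypothetical, it is not Lucille's parser, and it does not handle field names that themselves contain dots or brackets.

```java
import java.util.ArrayList;
import java.util.List;

class PathSyntaxSketch {
    // Split a path into segments: plain names for object keys and
    // "[n]" for array indices, in traversal order.
    static List<String> parse(String path) {
        List<String> segments = new ArrayList<>();
        for (String part : path.split("\\.")) {
            int bracket = part.indexOf('[');
            if (bracket < 0) {
                segments.add(part);
                continue;
            }
            if (bracket > 0) {
                segments.add(part.substring(0, bracket));
            }
            // A part may carry several indices, e.g. "grid[1][2]".
            int pos = bracket;
            while (pos < part.length()) {
                int close = part.indexOf(']', pos);
                if (part.charAt(pos) != '[' || close < 0) {
                    throw new IllegalArgumentException("Malformed path: " + path);
                }
                int index = Integer.parseInt(part.substring(pos + 1, close));
                segments.add("[" + index + "]");
                pos = close + 1;
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // [items, [0], title]
        System.out.println(parse("items[0].title"));
        // [metadata, authors, [2], contact, email]
        System.out.println(parse("metadata.authors[2].contact.email"));
    }
}
```
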
Document Metadata

Run ID

// Set during creation
Document doc = Document.create("id", "run-123");
String runId = doc.getRunId();  // "run-123"

// Or initialize later, on a document created without a run ID
Document other = Document.create("other-id");
other.initializeRunId("run-123");

// Clear run ID
doc.clearRunId();

Dropped Status

// Mark document as dropped (won't be indexed)
doc.setDropped(true);

boolean isDropped = doc.isDropped();

// Check in stages
if (doc.isDropped()) {
    // Skip processing for dropped documents
    return null;
}

Child Documents

Documents can have nested child documents:

Adding Children

Document parent = Document.create("parent-1");

Document child1 = Document.create("child-1");
child1.setField("type", "chapter");
parent.addChild(child1);

Document child2 = Document.create("child-2");
child2.setField("type", "chapter");
parent.addChild(child2);

Accessing Children

boolean hasKids = parent.hasChildren();
List<Document> children = parent.getChildren();

for (Document child : children) {
    System.out.println(child.getId());
}

Removing Children

parent.removeChildren();

Child Document Flow

When a Stage generates children:

@Override
public Iterator<Document> processDocument(Document doc) {
    Document child1 = Document.create("child-1", doc.getRunId());
    Document child2 = Document.create("child-2", doc.getRunId());
    
    return Arrays.asList(child1, child2).iterator();
}

Children automatically:

Inherit the parent’s run ID (if not already set)
Flow through downstream stages
Get indexed separately
Generate their own events

Child documents generated during pipeline processing are not stored in the parent’s .children field - they flow independently through the system.

Document Copying

Deep Copy

Document original = Document.create("doc-1");
original.setField("title", "Original");

Document copy = original.deepCopy();
copy.setField("title", "Modified");

// original remains unchanged
System.out.println(original.getString("title"));  // "Original"
System.out.println(copy.getString("title"));       // "Modified"
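
Why a deep copy matters is easiest to see next to a shallow one. The sketch below uses plain maps of lists as a stand-in for documents; the DeepCopySketch class is hypothetical and says nothing about how Lucille implements deepCopy().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DeepCopySketch {
    // Shallow copy: a new map, but the value lists are shared with the source.
    static Map<String, List<String>> shallowCopy(Map<String, List<String>> src) {
        return new HashMap<>(src);
    }

    // Deep copy: a new map and a new list for every field.
    static Map<String, List<String>> deepCopy(Map<String, List<String>> src) {
        Map<String, List<String>> copy = new HashMap<>();
        for (Map.Entry<String, List<String>> e : src.entrySet()) {
            copy.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return copy;
    }

    public static void main(String[] args) {
        Map<String, List<String>> original = new HashMap<>();
        original.put("tags", new ArrayList<>(List.of("java")));

        deepCopy(original).get("tags").add("search");
        System.out.println(original.get("tags")); // [java]

        shallowCopy(original).get("tags").add("etl");
        System.out.println(original.get("tags")); // [java, etl]
    }
}
```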

Merging Documents

Add all fields from another document:

Document target = Document.create("target");
target.setField("title", "Target Title");

Document source = Document.create("source");
source.setField("author", "Jane Doe");
source.setField("tags", "java");

target.setOrAddAll(source);
// target now has: title, author, tags

Add specific field:

target.setOrAdd("field_name", source);
// Copies the value of "field_name" from source to target

JSON Transformation

Apply JSONata expressions to transform documents:

import com.dashjoin.jsonata.Jsonata;

Document doc = Document.create("doc-1");
doc.setField("firstName", "John");
doc.setField("lastName", "Doe");

Jsonata expr = Jsonata.jsonata(
    "{ 'fullName': firstName & ' ' & lastName }"
);

doc.transform(expr);
// Document now has: fullName = "John Doe"
// firstName and lastName are removed

Transformation expressions cannot modify reserved fields (id, run_id, etc.) or return non-object results.

Converting to Map

Get document as a Map for external APIs:

Map<String, Object> docMap = doc.asMap();

// Use with Jackson
ObjectMapper mapper = new ObjectMapper();
String json = mapper.writeValueAsString(docMap);

// Or pass to external libraries
thirdPartyApi.processDocument(docMap);

Common Patterns

Building Documents from Database Results

ResultSet rs = statement.executeQuery("SELECT * FROM products");
while (rs.next()) {
    Document doc = Document.create(rs.getString("id"), runId);
    doc.setField("name", rs.getString("name"));
    doc.setField("price", rs.getDouble("price"));
    doc.setField("in_stock", rs.getBoolean("in_stock"));
    doc.setField("created_at", rs.getTimestamp("created_at"));
    publisher.publish(doc);
}

Conditional Field Setting

String description = getDescription();
if (description != null && !description.isEmpty()) {
    doc.setField("description", description);
}

// setOrAdd creates the field or appends to it in one call; keep the
// null check above if null values should not be stored
doc.setOrAdd("description", description);

Building Hierarchical Data

ObjectNode address = mapper.createObjectNode();
address.put("street", "123 Main St");
address.put("city", "Boston");
address.put("state", "MA");
doc.setField("address", address);

// Or use nested paths
doc.setNestedJson("address.street", TextNode.valueOf("123 Main St"));
doc.setNestedJson("address.city", TextNode.valueOf("Boston"));
doc.setNestedJson("address.state", TextNode.valueOf("MA"));

Accumulating Values

for (String tag : tags) {
    doc.addToField("tags", tag);
}

for (String author : authors) {
    doc.setOrAdd("authors", author);
}

Best Practices

Unique IDs: Ensure document IDs are stable and unique.
Set Run ID: Always include a run ID for tracking.
Null Checks: Check for null before using field values.
Type Safety: Use typed getters (getString, getInt, etc.).
Field Naming: Use consistent, descriptive field names.
Reserved Fields: Never set reserved fields directly; use their dedicated methods.
Multi-Valued Fields: Use getStringList(), not getString(), to read all values of a multi-valued field.
Child Documents: Give each child a unique ID and pass along the parent’s run ID.
Deep Copy: Use deepCopy() when you need an independent copy of a document.
Validate Early: Check required fields in early pipeline stages.

Next Steps

Stages

Learn how stages transform documents

Pipelines

Understand document flow through pipelines

Connectors

See how documents are created

Indexers

Learn how documents are indexed
