Extraction → Graph Mutations

Overview

This guide shows how the Extraction bounded context produces mutation operations that the Graph bounded context consumes. You’ll learn how to generate IDs, construct operations, and produce JSONL files that drive graph updates.

Quick Example

Here’s what a complete extraction workflow looks like:

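import json

from shared_kernel.graph_primitives import EntityIdGenerator

# GraphExtractionReadOnlyRepository and graph_client are assumed to be
# imported and configured elsewhere in your extraction code.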

# 1. Create a repository scoped to your data source
repo = GraphExtractionReadOnlyRepository(
    client=graph_client,
    data_source_id="github-repo-123"
)

# 2. Generate deterministic IDs
alice_id = EntityIdGenerator.generate("Person", "alice-smith")
# Returns: "person:1a2b3c4d5e6f7890"

bob_id = EntityIdGenerator.generate("Person", "bob-jones")
# Returns: "person:abcdef0123456789"

relationship_id = EntityIdGenerator.generate_edge_id("KNOWS", alice_id, bob_id)
# Returns: "knows:9f8e7d6c5b4a3210"

# 3. Check what already exists
existing_alice = repo.find_nodes_by_slug("alice-smith", "Person")

# 4. Produce mutation operations
mutations = [
    {
        "op": "CREATE",
        "type": "node",
        "id": alice_id,
        "label": "Person",
        "set_properties": {
            "data_source_id": "github-repo-123",
            "source_path": "MAINTAINERS.md",
            "slug": "alice-smith",
            "name": "Alice Smith",
            "email": "alice@example.com",
        }
    },
    {
        "op": "CREATE",
        "type": "edge",
        "id": relationship_id,
        "label": "KNOWS",
        "start_id": alice_id,
        "end_id": bob_id,
        "set_properties": {
            "data_source_id": "github-repo-123",
            "source_path": "MAINTAINERS.md",
            "since": 2020,
            "context": "colleagues",
        }
    }
]

# 5. Write to JSONL
with open("mutations.jsonl", "w") as f:
    for mutation in mutations:
        f.write(json.dumps(mutation) + "\n")

ID Generation

Do:

Use lowercase for entity types: "Person" → "person:..."
Use consistent slugs: "alice-smith", not "Alice Smith"
Use the node label and slug to check whether an entity already exists

Don't:

Use random values in slugs (this breaks determinism)
Mix casing: "Alice-Smith" vs "alice-smith"
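One way to keep slugs consistent is to normalize every source string through a single helper before it reaches the ID generator. The sketch below is illustrative only: `to_slug` is a hypothetical helper, not part of `shared_kernel`, and the exact normalization rules (accent stripping, hyphen separators) are assumptions you may need to adapt.

```python
import re
import unicodedata


def to_slug(text: str) -> str:
    """Normalize free text into a lowercase, hyphen-separated slug (hypothetical helper)."""
    # Strip accents so that accented and unaccented spellings map to the same slug.
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


print(to_slug("Alice Smith"))     # alice-smith
print(to_slug("  alice_SMITH "))  # alice-smith
```

Because the helper is a pure function of its input, the same source string always yields the same slug, which is what keeps the generated IDs deterministic across runs.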

Mutation Operations

0. DEFINE (Schema Declaration)

DEFINE is required for every type you use. It creates a self-documenting ontology that helps agents understand:

What this type represents
When to use it
Where it’s typically found
What properties are required vs optional

Node type, followed by edge type:

{
  "op": "DEFINE",
  "type": "node",
  "label": "Person",
  "description": "A person entity representing an individual contributor, maintainer, or team member. Extracted from MAINTAINERS.md, git commit authors, @-mentions in pull requests, and people/ directory markdown files.",
  "required_properties": ["name"]
}

{
  "op": "DEFINE",
  "type": "edge",
  "label": "KNOWS",
  "description": "Represents a professional relationship or acquaintance between two people, typically colleagues or collaborators. Extracted from co-authorship on pull requests, shared repository maintainership, or explicit mentions in people profiles.",
  "required_properties": ["since"]
}

Required fields:

label - The graph label, i.e. the entity type or relationship type (PascalCase for node labels such as "Person"; the edge labels in this guide are uppercase, such as "KNOWS")
description - What this type is and when to use it
required_properties - Array of property names that MUST be present, in addition to any globally required properties (such as slug and data_source_id)
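The field list above can be checked before a DEFINE operation is written to the JSONL file. The following is a minimal sketch, assuming a hypothetical `validate_define` helper; it only mirrors the fields described in this guide and is not the validation performed by the Graph bounded context.

```python
def validate_define(op: dict) -> list[str]:
    """Return a list of problems found in a DEFINE operation (empty if none)."""
    errors = []
    if op.get("op") != "DEFINE":
        errors.append("op must be 'DEFINE'")
    if op.get("type") not in ("node", "edge"):
        errors.append("type must be 'node' or 'edge'")
    # label and description must both be non-empty strings.
    for field in ("label", "description"):
        value = op.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{field} must be a non-empty string")
    # required_properties must be a list of property names (it may be empty).
    props = op.get("required_properties")
    if not isinstance(props, list) or not all(isinstance(p, str) for p in props):
        errors.append("required_properties must be a list of property names")
    return errors


define_person = {
    "op": "DEFINE",
    "type": "node",
    "label": "Person",
    "description": "A person entity representing an individual contributor.",
    "required_properties": ["name"],
}
print(validate_define(define_person))  # []
```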

1. CREATE (Idempotent)

CREATE is idempotent - you can run it multiple times safely. It uses MERGE under the hood.

Node, followed by edge:

{
  "op": "CREATE",
  "type": "node",
  "id": "person:1a2b3c4d5e6f7890",
  "label": "Person",
  "set_properties": {
    "slug": "alice-smith",
    "name": "Alice Smith",
    "github_username": "asmith",
    "data_source_id": "github-repo-123",
    "source_path": "MAINTAINERS.md"
  }
}

{
  "op": "CREATE",
  "type": "edge",
  "id": "knows:9f8e7d6c5b4a3210",
  "label": "KNOWS",
  "start_id": "person:1a2b3c4d5e6f7890",
  "end_id": "person:abcdef0123456789",
  "set_properties": {
    "since": 2020,
    "confidence": 0.95,
    "data_source_id": "github-repo-123",
    "source_path": "MAINTAINERS.md"
  }
}

Required fields:

label - Graph label (PascalCase: "Person", "Repository")
set_properties must include:
- data_source_id - Your data source identifier
- source_path - Which file this entity came from

Additionally required for edges:

start_id - ID of source node
end_id - ID of target node
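A pre-flight check along these lines can catch missing fields before the file is handed to the Graph bounded context. This is a sketch under assumptions: `missing_create_fields` and `GLOBAL_PROPERTIES` are hypothetical names, and the check covers only the fields listed above plus whatever the type's DEFINE declared in `required_properties`.

```python
# Properties this guide says every CREATE must carry in set_properties.
GLOBAL_PROPERTIES = ("data_source_id", "source_path")


def missing_create_fields(op: dict, type_required: tuple = ()) -> list[str]:
    """List the required fields that a CREATE operation does not provide."""
    missing = [] if "label" in op else ["label"]
    # Edges additionally need both endpoints.
    if op.get("type") == "edge":
        missing += [f for f in ("start_id", "end_id") if f not in op]
    props = op.get("set_properties", {})
    # Global properties first, then the ones declared by the type's DEFINE.
    missing += [
        f"set_properties.{name}"
        for name in (*GLOBAL_PROPERTIES, *type_required)
        if name not in props
    ]
    return missing


edge = {
    "op": "CREATE",
    "type": "edge",
    "id": "knows:9f8e7d6c5b4a3210",
    "label": "KNOWS",
    "start_id": "person:1a2b3c4d5e6f7890",
    "set_properties": {"since": 2020, "data_source_id": "github-repo-123"},
}
print(missing_create_fields(edge, type_required=("since",)))
# ['end_id', 'set_properties.source_path']
```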

2. UPDATE (Partial)

UPDATE changes specific properties without affecting others. The examples below set properties, remove properties, and do both in a single operation.

{
  "op": "UPDATE",
  "type": "node",
  "id": "person:1a2b3c4d5e6f7890",
  "set_properties": {
    "name": "Alice Smith-Jones",
    "email": "alice.jones@example.com"
  }
}

{
  "op": "UPDATE",
  "type": "node",
  "id": "person:1a2b3c4d5e6f7890",
  "remove_properties": ["old_email", "temp_field"]
}

{
  "op": "UPDATE",
  "type": "node",
  "id": "person:1a2b3c4d5e6f7890",
  "set_properties": {
    "name": "Alice Smith-Jones"
  },
  "remove_properties": ["maiden_name"]
}
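To make the partial-update semantics concrete, here is a small in-memory model of what an UPDATE does to a node's property map. It is a sketch, not the Graph bounded context's implementation: `apply_update` is a hypothetical function, and applying `set_properties` before `remove_properties` is an assumption (the two lists in the examples above never name the same property, so the order does not matter there).

```python
def apply_update(properties: dict, op: dict) -> dict:
    """Return a copy of `properties` with an UPDATE operation applied."""
    updated = dict(properties)
    # Only the named properties are overwritten; all others are left untouched.
    updated.update(op.get("set_properties", {}))
    # Removing a property that is not present is treated as a no-op here.
    for name in op.get("remove_properties", []):
        updated.pop(name, None)
    return updated


before = {"name": "Alice Smith", "maiden_name": "Smith", "slug": "alice-smith"}
op = {
    "op": "UPDATE",
    "type": "node",
    "id": "person:1a2b3c4d5e6f7890",
    "set_properties": {"name": "Alice Smith-Jones"},
    "remove_properties": ["maiden_name"],
}
print(apply_update(before, op))
# {'name': 'Alice Smith-Jones', 'slug': 'alice-smith'}
```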

3. DELETE (Cascade)

DELETE automatically removes connected edges (uses DETACH DELETE).

{
  "op": "DELETE",
  "type": "node",
  "id": "person:obsolete123456"
}

When to use:

File was deleted from source
Entity no longer exists in external system
Cleanup during re-extraction
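For the first case, an extractor might compare the source paths of previously extracted nodes against the files that still exist and emit a DELETE for everything that has gone. The sketch below assumes each known node is available as a dict with `id` and `source_path` keys; that shape, and the helper name `delete_ops_for_removed_files`, are hypothetical.

```python
def delete_ops_for_removed_files(nodes: list[dict], current_paths: set[str]) -> list[dict]:
    """Build DELETE operations for nodes whose source file no longer exists."""
    return [
        {"op": "DELETE", "type": "node", "id": node["id"]}
        for node in nodes
        if node["source_path"] not in current_paths
    ]


known_nodes = [
    {"id": "person:1a2b3c4d5e6f7890", "source_path": "people/alice.md"},
    {"id": "person:abcdef0123456789", "source_path": "people/bob.md"},
]
print(delete_ops_for_removed_files(known_nodes, {"people/alice.md"}))
# [{'op': 'DELETE', 'type': 'node', 'id': 'person:abcdef0123456789'}]
```

Since DELETE cascades to connected edges, no separate edge deletions are generated here.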

Operation Ordering

Operations do not need to be ordered in the JSONL data. The Graph bounded context will execute operations in the following order:

DEFINE
DELETE <edge>
DELETE <node>
CREATE <node>
CREATE <edge>
UPDATE <node>
UPDATE <edge>
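Although the Graph bounded context does this sorting for you, it can be useful to preview the order in which a batch will run. The following sketch sorts operations by the list above; `EXECUTION_RANK` and `execution_rank` are hypothetical names introduced for illustration.

```python
# Rank of each (op, type) pair, following the execution order listed above.
EXECUTION_RANK = {
    ("DELETE", "edge"): 1,
    ("DELETE", "node"): 2,
    ("CREATE", "node"): 3,
    ("CREATE", "edge"): 4,
    ("UPDATE", "node"): 5,
    ("UPDATE", "edge"): 6,
}


def execution_rank(op: dict) -> int:
    """Sort key: DEFINE always runs first, regardless of type."""
    if op["op"] == "DEFINE":
        return 0
    return EXECUTION_RANK[(op["op"], op["type"])]


ops = [
    {"op": "UPDATE", "type": "node", "id": "person:1a2b3c4d5e6f7890"},
    {"op": "CREATE", "type": "edge", "id": "knows:9f8e7d6c5b4a3210"},
    {"op": "CREATE", "type": "node", "id": "person:abcdef0123456789"},
    {"op": "DEFINE", "type": "edge", "label": "KNOWS"},
    {"op": "DELETE", "type": "node", "id": "person:obsolete123456"},
]
# sorted() is stable, so operations with the same rank keep their file order.
for op in sorted(ops, key=execution_rank):
    print(op["op"], op["type"])
```

Creating nodes before edges means an edge's `start_id` and `end_id` can refer to nodes introduced in the same file.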

JSONL Output Format

The extraction process produces a JSONL file (one JSON object per line). For example:

{"op": "DEFINE","type": "node","label": "Person","description": "A person entity representing an individual contributor, maintainer, or team member. Extracted from MAINTAINERS.md, git commit authors, @-mentions in pull requests, and people/ directory markdown files.","required_properties": ["name"]}
{"op": "DEFINE","type": "edge","label": "KNOWS","description": "Represents a professional relationship or acquaintance between two people, typically colleagues or collaborators. Extracted from co-authorship on pull requests, shared repository maintainership, or explicit mentions in people profiles.","required_properties": ["since"]}
{"op": "CREATE","type": "node","id": "person:1a2b3c4d5e6f7890","label": "Person","set_properties": {"slug": "alice-smith","name": "Alice Smith","data_source_id": "ds-123","source_path": "people/alice.md"}}
{"op": "CREATE","type": "node","id": "person:abcdef0123456789","label": "Person","set_properties": {"slug": "bob-jones","name": "Bob Jones","data_source_id": "ds-123","source_path": "people/bob.md"}}
{"op": "CREATE","type": "edge","id": "knows:9f8e7d6c5b4a3210","label": "KNOWS","start_id": "person:1a2b3c4d5e6f7890","end_id": "person:abcdef0123456789","set_properties": {"since": 2020,"data_source_id": "ds-123","source_path": "people/alice.md"}}
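Reading a file like this back is a quick way to sanity-check it before submission. The sketch below parses JSONL line by line and reports the line number of the first malformed entry; `read_mutations` is a hypothetical helper, and skipping blank lines is an assumption about what is convenient, not a rule of the format.

```python
import io
import json
from collections import Counter


def read_mutations(lines) -> list[dict]:
    """Parse an iterable of JSONL lines into a list of operations."""
    ops = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            ops.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: invalid JSON ({exc.msg})") from exc
    return ops


# An in-memory file stands in for open("mutations.jsonl").
sample = io.StringIO(
    '{"op": "DEFINE", "type": "node", "label": "Person"}\n'
    '{"op": "CREATE", "type": "node", "id": "person:1a2b3c4d5e6f7890"}\n'
)
ops = read_mutations(sample)
print(Counter(op["op"] for op in ops))
```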

FAQs

Can I create multiple edges between the same nodes?

Yes! Different edge labels generate different IDs:

from shared_kernel.graph_primitives import EntityIdGenerator

alice_id = EntityIdGenerator.generate("Person", "alice-smith")
bob_id = EntityIdGenerator.generate("Person", "bob-jones")

# Different edge labels → different IDs
knows_id = EntityIdGenerator.generate_edge_id("KNOWS", alice_id, bob_id)
collaborates_id = EntityIdGenerator.generate_edge_id("COLLABORATES", alice_id, bob_id)

# knows_id ≠ collaborates_id (edge label is part of the hash)

What if I don’t know if an entity exists?

Use CREATE. It is idempotent: if the entity already exists, its properties are updated instead of a duplicate being created.
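The behavior can be pictured with a dictionary standing in for the graph. This is only a model of merge-style semantics, with `apply_create` as a hypothetical function; it is not how the Graph bounded context is implemented.

```python
def apply_create(store: dict, op: dict) -> None:
    """Merge-style CREATE: insert the entity if absent, then set its properties."""
    entity = store.setdefault(op["id"], {"label": op["label"]})
    entity.update(op["set_properties"])


op = {
    "op": "CREATE",
    "type": "node",
    "id": "person:1a2b3c4d5e6f7890",
    "label": "Person",
    "set_properties": {"slug": "alice-smith", "name": "Alice Smith"},
}

store = {}
apply_create(store, op)
first_run = dict(store[op["id"]])

apply_create(store, op)  # running the same CREATE again
print(len(store))                      # 1
print(store[op["id"]] == first_run)    # True
```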

Can I batch operations?

Yes! All operations in a JSONL file execute in a single transaction.

What happens if one operation fails?

The entire batch rolls back. Fix the error and re-run.
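All-or-nothing behavior can be modeled by applying a batch to a working copy and keeping the copy only if every operation succeeds. The sketch below is illustrative: `apply_one` and `apply_batch` are hypothetical, and treating an UPDATE of an unknown ID as a failure is an assumption made purely to trigger the rollback.

```python
import copy


def apply_one(store: dict, op: dict) -> None:
    """Apply a single CREATE or UPDATE to an in-memory store."""
    if op["op"] == "CREATE":
        store.setdefault(op["id"], {}).update(op["set_properties"])
    elif op["op"] == "UPDATE":
        if op["id"] not in store:
            raise KeyError(f"cannot update unknown id {op['id']}")
        store[op["id"]].update(op.get("set_properties", {}))
    else:
        raise ValueError(f"unsupported op {op['op']}")


def apply_batch(store: dict, ops: list[dict]) -> dict:
    """Return a new store with all ops applied; the input is never modified."""
    working = copy.deepcopy(store)
    for op in ops:
        apply_one(working, op)  # any exception abandons the working copy
    return working


store = {}
batch = [
    {"op": "CREATE", "id": "person:1a2b3c4d5e6f7890", "set_properties": {"name": "Alice Smith"}},
    {"op": "UPDATE", "id": "person:obsolete123456", "set_properties": {"name": "Nobody"}},
]
try:
    store = apply_batch(store, batch)
except KeyError as exc:
    print("batch failed:", exc)
print(store)  # {} : the successful CREATE was not kept either
```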

Do I need to DEFINE types every time I run extraction?

No! DEFINE once when you first introduce a type. You can skip DEFINE in subsequent runs unless adding new types or updating definitions.