Knowledge Graph Module¶
Status: Proposal
Module: aico.ai.knowledge_graph (core AI module)
First Consumer: Semantic Memory
Research-Validated: 2025 industry best practices (GraphRAG, Graphusion, LlamaIndex, Semantic ER)
Overview¶
This document proposes a core knowledge graph module for AICO that provides property graph construction, entity resolution, and graph fusion capabilities. The module is domain-agnostic and reusable across multiple AI features.
First Application: Semantic memory deduplication (critical bug fix)
Future Applications: Relationship intelligence, autonomous agency, conversation context, emotional memory
Problem Statement (Semantic Memory)¶
Current semantic memory has a critical deduplication failure: running the same conversation twice creates duplicate facts because extraction is non-deterministic.
Run 1: "I moved to SF" → Extract → 14 facts
Run 2: "I moved to SF" → Extract → 28 facts (duplicates!)
Run 3: "I moved to SF" → Extract → 42 facts (unbounded growth)
Root Cause: Facts are stored as unstructured text without normalization. Same information expressed differently creates different embeddings, defeating similarity-based deduplication.
Proposed Solution: Core Knowledge Graph Module¶
A property graph represents knowledge as nodes (entities) and edges (relationships), both with typed properties. This provides deterministic structure for deduplication while maintaining rich metadata.
Module Architecture¶
aico/
  ai/
    knowledge_graph/           # Core module (domain-agnostic)
      __init__.py
      models.py                # PropertyGraph, Node, Edge
      extractor.py             # Multi-pass extraction
      entity_resolution.py     # Semantic blocking, LLM matching/merging
      fusion.py                # Graph fusion, conflict resolution
      storage.py               # ChromaDB + libSQL backend
      query.py                 # Graph traversal, filtering
    memory/
      semantic.py              # Uses knowledge_graph module
Pipeline Architecture¶
Conversation → Multi-Pass Extraction (extractor.py) → Property Graph (models.py) → Semantic Entity Resolution (entity_resolution.py) → Graph Fusion (fusion.py) → Storage (storage.py)
Core Algorithms (Conceptual)¶
1. Multi-Pass Extraction (Gleanings)¶
Problem: Single-pass extraction misses information; repeated conversations create duplicate facts.
Concept:
- Pass 1: Extract entities and relations.
- Pass 2+: Ask the model "what did we miss?" and add only genuinely new facts.
- Final step: Infer implicit relations from accumulated context.
This produces a deterministic property graph for a conversation and significantly reduces missed information without mirroring the full Python implementation here.
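To make the control flow concrete, here is a minimal sketch of the gleaning loop; extract_pass, glean_pass, and infer_implicit_relations are hypothetical helpers standing in for the GLiNER and LLM calls made through the modelservice, not the shipped extractor.py API.
```python
# Minimal sketch of the gleaning loop; helper functions are placeholders.
async def extract_with_gleanings(text: str, max_gleanings: int = 2):
    nodes: dict = {}    # node_id -> node
    edges: set = set()  # (source_id, relation_type, target_id)

    def merge(new_nodes: dict, new_edges: set) -> int:
        """Keep only genuinely new facts; return how many were added."""
        added = 0
        for node_id, node in new_nodes.items():
            if node_id not in nodes:
                nodes[node_id] = node
                added += 1
        before = len(edges)
        edges.update(new_edges)
        return added + (len(edges) - before)

    # Pass 1: entities and relations
    merge(*await extract_pass(text))
    # Pass 2+: ask the model "what did we miss?" until nothing new turns up
    for _ in range(max_gleanings):
        if merge(*await glean_pass(text, nodes, edges)) == 0:
            break
    # Final step: infer implicit relations from the accumulated context
    merge(*await infer_implicit_relations(nodes, edges))
    return nodes, edges
```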
2. Property Graph Model¶
Problem: Simple triplets [subject, relation, object] are too limited for rich metadata.
Conceptual model:
Node:
id: string
label: PERSON | PLACE | ORGANIZATION | EVENT | ...
properties: map<string, any> # name, age, city, etc.
embedding: vector<float>
Edge:
source_id: Node.id
target_id: Node.id
relation_type: string # WORKS_AT, MOVED_TO, KNOWS, ...
properties: map<string, any> # since, until, reason, etc.
Typed nodes and edges with flexible properties make the graph expressive and easy to extend without encoding full class definitions here.
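As a rough illustration of how this conceptual model could map onto the models.py dataclasses, the sketch below uses plain Python dataclasses; field names mirror the schema above, but the real module may differ.
```python
# Illustrative dataclasses mirroring the conceptual model above.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class Node:
    id: str
    label: str                                                 # PERSON, PLACE, ORGANIZATION, EVENT, ...
    properties: Dict[str, Any] = field(default_factory=dict)   # name, age, city, etc.
    embedding: Optional[List[float]] = None

@dataclass
class Edge:
    source_id: str = ""                                        # Node.id
    target_id: str = ""                                        # Node.id
    relation_type: str = ""                                    # WORKS_AT, MOVED_TO, KNOWS, ...
    properties: Dict[str, Any] = field(default_factory=dict)   # since, until, reason, etc.
```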
3. Semantic Entity Resolution (Multi-Tier)¶
Problem: Simple string or embedding similarity is either too weak or too noisy.
Concept:
- Exact matching: Case-insensitive comparison on canonical names to merge obvious duplicates (fast, 100% precision).
- Semantic blocking: Use embeddings to create candidate pairs that might match.
- LLM verification: Only for ambiguous pairs, to decide if they are truly the same real-world entity.
- Canonical merge: Choose a canonical node and remap edges to preserve referential integrity.
This preserves accuracy while dramatically reducing LLM calls and maintaining a clean graph, without embedding the full function implementations here.
Implementation Status: ✅ Fully Implemented with enhancements beyond original design
Research Basis: "The Rise of Semantic Entity Resolution" (TDS, Jan 2025) + production optimizations
4. Graph Fusion (Global Perspective)¶
Problem: Per-message extraction misses global relationships across history.
Concept:
- Merge new graph slices into the existing graph by reusing entity resolution.
- Resolve conflicting edges between the same endpoints.
- Optionally infer new edges from the global structure and conversation history.
This turns incremental extractions into a coherent, evolving knowledge graph.
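The sketch below illustrates the fusion step under the same assumptions; resolve_entities_against, resolve_conflict, and infer_global_edges are hypothetical helpers rather than the fusion.py API.
```python
# Conceptual fusion of a new graph slice into the user's global graph.
async def fuse(global_graph, new_slice):
    # Reuse entity resolution to map slice node ids onto existing nodes
    id_map = await resolve_entities_against(global_graph, new_slice)
    for edge in new_slice.edges:
        src = id_map.get(edge.source_id, edge.source_id)
        tgt = id_map.get(edge.target_id, edge.target_id)
        existing = global_graph.find_edges(src, tgt, edge.relation_type)
        if existing:
            # Conflicting edges between the same endpoints: keep one (e.g. by recency/confidence)
            await resolve_conflict(existing[0], edge)
        else:
            global_graph.add_edge(src, tgt, edge.relation_type, edge.properties)
    # Optionally infer new edges from the merged global structure
    for inferred in await infer_global_edges(global_graph):
        global_graph.add_edge(*inferred)
    return global_graph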
Storage Strategy¶
Hybrid: ChromaDB + libSQL¶
The knowledge graph module uses a hybrid storage approach leveraging AICO's existing stack:
ChromaDB Collections (Vector Search)¶
# Collection: kg_nodes
# Purpose: Semantic search over entities
node_doc = f"{node.label}: {node.properties.get('name', '')} {node.source_text}"
node_metadata = {
    'node_id': node.id,
    'label': node.label,
    'properties': json.dumps(node.properties),
    'confidence': node.confidence,
    'user_id': user_id,
    'created_at': node.created_at.isoformat()
}
chromadb.get_collection('kg_nodes').add(
    documents=[node_doc],
    embeddings=[node.embedding],
    metadatas=[node_metadata],
    ids=[node.id]
)
libSQL Tables (Relational Index)¶
Properties are stored both as JSON (source of truth) and as flattened key/value rows for efficient filtering. Database-level triggers keep these representations in sync.
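A minimal sketch of the dual representation and a sync trigger, using Python's built-in sqlite3 for brevity (libSQL is SQLite-compatible); the table and trigger names are illustrative rather than the exact Schema Version 7 DDL.
```python
# Illustrative DDL: JSON properties as source of truth plus a flattened
# key/value index kept in sync by a trigger. Names are examples only.
import sqlite3

conn = sqlite3.connect("aico_kg_example.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS kg_nodes (
    id         TEXT PRIMARY KEY,
    user_id    TEXT NOT NULL,
    label      TEXT NOT NULL,
    properties TEXT NOT NULL,          -- JSON source of truth
    created_at TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS kg_node_properties (
    node_id TEXT NOT NULL,
    key     TEXT NOT NULL,
    value   TEXT,
    FOREIGN KEY (node_id) REFERENCES kg_nodes(id) ON DELETE CASCADE
);

-- Keep the flattened key/value rows in sync with the JSON properties
CREATE TRIGGER IF NOT EXISTS kg_nodes_index_properties
AFTER INSERT ON kg_nodes
BEGIN
    INSERT INTO kg_node_properties (node_id, key, value)
    SELECT NEW.id, je.key, je.value FROM json_each(NEW.properties) AS je;
END;
""")
conn.commit()
```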
Module Integration: Semantic Memory¶
At a high level, semantic memory uses the knowledge graph module by:
- Running multi-pass extraction on new conversation text.
- Resolving entities against the user-specific graph.
- Fusing the results into the existing graph.
- Persisting to the hybrid ChromaDB + libSQL backend.
- Querying facts via semantic search and graph traversal.
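A sketch of how semantic.py might orchestrate these steps; the SemanticMemory class and the kg facade methods shown here are illustrative, not the final API surface.
```python
# Illustrative orchestration of the knowledge_graph module from semantic memory.
class SemanticMemory:
    def __init__(self, kg):
        self.kg = kg  # knowledge_graph facade: extractor, resolver, fusion, storage, query

    async def ingest(self, user_id: str, conversation_text: str):
        slice_graph = await self.kg.extract(conversation_text)         # multi-pass extraction
        resolved = await self.kg.resolve(user_id, slice_graph)         # against the user's graph
        fused = await self.kg.fuse(user_id, resolved)                  # merge into existing graph
        await self.kg.store(user_id, fused)                            # ChromaDB + libSQL

    async def recall(self, user_id: str, query: str, hops: int = 2):
        seeds = await self.kg.semantic_search(user_id, query)          # vector search over kg_nodes
        return await self.kg.traverse(user_id, seeds, max_depth=hops)  # graph expansion
```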
Future Applications (Beyond Semantic Memory)¶
The knowledge graph module is designed for reuse across AICO's AI features:
1. Relationship Intelligence¶
Use case: multi-dimensional relationship understanding, privacy boundaries, context-appropriate responses via graph traversal over KNOWS, FAMILY_MEMBER, and related relations.
2. Autonomous Agency¶
Use case: proactive suggestions, goal generation, and multi-step planning by querying the user’s context graph and inferring candidate goals.
3. Conversation Context Assembly¶
Use case: multi-hop reasoning to assemble rich conversational context from the user’s graph, reducing hallucination.
4. Emotional Memory¶
Use case: attaching emotional metadata to interactions and relationships so that future responses can respect emotional history.
Performance Optimization Strategies¶
The full pipeline takes roughly 2500ms, which is too high for conversational UX. Rather than simplifying the pipeline, we decouple processing time from conversation flow using the architectural strategies below.
Strategy 1: Progressive Response with Cognitive States¶
Approach: Stream AI's cognitive process to user, deliver response early, process graph in background.
Timeline:
0ms: "Listening..." (instant feedback)
200ms: "Understanding..." (entity extraction)
600ms: AI RESPONSE DELIVERED ✅ (user can continue)
↓ [Background processing, non-blocking]
1800ms: "Storing in memory..." (graph construction)
2500ms: Complete (memory fully processed)
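A possible shape for this in asyncio: the reply is delivered as soon as it is ready, and graph construction continues as a background task. All function names here (quick_entity_pass, generate_response, build_knowledge_graph, store_graph, and the ui object) are placeholders.
```python
# Sketch: decouple the reply from graph construction with background tasks.
import asyncio

async def handle_turn(user_id: str, message: str, ui):
    await ui.set_state("Listening...")
    entities_task = asyncio.create_task(quick_entity_pass(message))   # fast first pass
    await ui.set_state("Understanding...")
    reply = await generate_response(message, await entities_task)
    await ui.deliver(reply)                                           # user can continue here
    # Graph construction continues in the background, non-blocking;
    # a real implementation would track/cancel this task on shutdown.
    asyncio.create_task(build_and_store_graph(user_id, message, ui))

async def build_and_store_graph(user_id: str, message: str, ui):
    await ui.set_state("Storing in memory...")
    graph = await build_knowledge_graph(message)   # full multi-pass pipeline
    await store_graph(user_id, graph)
    await ui.set_state("Complete")
```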
Strategy 2: Lazy Graph Construction¶
Approach: Respond immediately with simple extraction, then build the full knowledge graph during conversation pauses (idle time). This keeps perceived latency low while still building a rich graph in the background.
Strategy 3: Incremental Graph Construction¶
Approach: Build graph incrementally across multiple turns, not all at once.
Strategy 4: Parallel Processing¶
Approach: Run extraction passes and entity resolution in parallel where possible to reduce background latency, without changing the conversational path.
Strategy 5: Smart Caching¶
Approach: Cache expensive LLM operations for repeated entities.
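One way to realize this is a small LRU cache keyed by normalized entity mention, as sketched below; the class is illustrative and maps onto the caching_enabled / cache_size settings shown later.
```python
# Simple in-process LRU cache for entity-resolution decisions (illustrative).
from collections import OrderedDict
from typing import Optional

class EntityResolutionCache:
    def __init__(self, max_size: int = 1000):
        self._cache: OrderedDict[str, str] = OrderedDict()  # mention -> canonical node id
        self.max_size = max_size

    def get(self, mention: str) -> Optional[str]:
        key = mention.strip().lower()
        if key in self._cache:
            self._cache.move_to_end(key)      # LRU: keep hot entries alive
            return self._cache[key]
        return None

    def put(self, mention: str, canonical_id: str) -> None:
        key = mention.strip().lower()
        self._cache[key] = canonical_id
        self._cache.move_to_end(key)
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)   # evict least recently used
```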
Strategy 6: Cognitive State UI¶
Approach: Show users AICO's "inner monologue" (what is being extracted and stored) to make processing time feel intentional and transparent, e.g. "└─ Connecting to previous locations".
**Benefits:** Transforms wait time into feature, builds trust, reduces perceived latency.
---
## Recommended Approach
**Hybrid Strategy:**
### **Implementation (Required)**
Combines progressive response delivery with parallel processing for optimal user experience.
**Features:**
- Progressive response with cognitive states (600ms user-perceived latency)
- Parallel extraction and entity resolution (1200ms background vs 2340ms sequential)
- Background graph processing (non-blocking)
- Transparent UX with visible cognitive states
**Benefits:**
- ✅ 600ms user response time
- ✅ 1200ms background processing (49% faster than sequential)
- ✅ Full state-of-the-art quality, zero dumbing down
- ✅ Transparent UX builds trust
### **Enhancement (Optional)**
Additional optimizations for scaling and performance improvements.
**Features:**
- Smart caching for repeated entities (50ms vs 640ms, 92% faster)
- Lazy graph construction during idle time (2-3s gaps between messages)
- Incremental graph building across conversation turns
**Benefits:**
- ✅ Dramatic speedup for common entities (names, places)
- ✅ Zero user-perceived impact (processing during pauses)
- ✅ Distributed compute load across conversation
- ✅ Scales with conversation length
**Configuration:**
```yaml
core:
  ai:
    knowledge_graph:
      processing:
        # Implementation features are hardcoded (progressive response, parallel processing, background processing)
        # Only performance tuning is configurable:
        caching_enabled: true   # Enable entity resolution caching (recommended)
        cache_size: 1000        # Number of entities to cache (adjust based on memory)
```
Result: 600ms user response, 1200ms background processing, full state-of-the-art quality with optional caching for scaling.
Note: Implementation features are required for MVP. Enhancement features can be added later based on real-world performance testing and scaling needs.
Implementation Roadmap¶
Phase 0: Cleanup & Preparation¶
Scope: Remove obsolete code and prepare codebase for knowledge graph implementation
Deliverables:
Code Cleanup:
- Remove legacy fact extraction code from aico/ai/memory/semantic.py
- Remove obsolete AdvancedFactExtractor class and related utilities
- Clean up any placeholder/simulation code in memory evaluation
- Remove unused imports and dead code paths
libSQL Database Changes¶
❌ Remove (Schema Version 6):
- facts_metadata table → replaced by kg_nodes
- fact_relationships table → replaced by kg_edges
- session_metadata table → no longer needed (LMDB coordination)
✅ Add (Schema Version 7):
- kg_nodes table (with triggers for property indexing)
- kg_edges table (with triggers for property indexing)
- kg_node_properties table (property index)
- kg_edge_properties table (property index)
📝 Notes:
- No data migration needed (not in production)
- Ensure aico db init remains idempotent
ChromaDB Changes¶
❌ Remove:
- user_facts collection (fact-based semantic memory)
- Collection initialization code in cli/commands/database.py (lines 87-105)
✅ Add:
- kg_nodes collection (semantic search over entities)
- kg_edges collection (semantic search over relationships)
📝 Notes:
- No data migration needed (not in production)
- Complete replacement of old collection structure
LMDB Changes¶
❌ Remove:
- user_sessions named database from config (unused - not referenced in code)
✅ Keep (No Changes):
- LMDB working memory database (separate system for conversation context)
- session_memory named database (actively used by WorkingMemoryStore)
- LMDB initialization in cli/commands/database.py (line 350)
- cli/utils/lmdb_utils.py (working memory management)
🔄 Update:
- config/defaults/core.yaml: Remove user_sessions from core.memory.working.named_databases list
📝 Notes:
- Working memory (LMDB) handles short-term conversation context (TTL: 24 hours)
- Semantic memory (knowledge graph) handles long-term facts (permanent)
- These are separate, independent systems
Documentation:
- Document what was removed and why
- Update architecture diagrams to reflect new knowledge graph approach
- Create migration guide for any existing deployments
Validation:
- Verify aico db init still works (idempotent)
- Verify no broken imports or references to removed code
- Run existing tests to ensure nothing breaks
- Confirm clean slate for knowledge graph implementation
Note: This phase follows the principle of complete cleanup - no backwards compatibility, no "just in case" code. All obsolete components are fully removed before implementing the new system.
Phase 1: Core Module Foundation¶
Scope: Basic module structure and data models
Deliverables:
- Create aico/ai/knowledge_graph/ module structure
- Implement models.py (PropertyGraph, Node, Edge dataclasses)
- Implement storage.py (ChromaDB + libSQL hybrid backend)
- Create libSQL Schema Version 7:
- kg_nodes table with JSON properties
- kg_edges table with JSON properties
- kg_node_properties table (property index)
- kg_edge_properties table (property index)
- Database triggers for automatic property indexing (verified working in libsql 0.1.8)
- Initialize ChromaDB collections (kg_nodes, kg_edges)
- Implement dual-write to both ChromaDB and libSQL
- Basic CRUD operations (create, read, update, delete)
- Document property conventions for future applications
- Unit tests for data models and storage
Phase 2: Multi-Pass Extraction¶
Scope: Implement extraction pipeline
Deliverables:
- Implement extractor.py (multi-pass extraction with gleanings)
- GLiNER entity extraction (Pass 1)
- LLM relation extraction (Pass 1)
- Gleaning extraction (Pass 2+)
- Novel inference from conversation history (Pass N)
- Benchmark completeness improvements (60-70% → 90%+)
- Unit tests for extraction
Phase 3: Semantic Entity Resolution¶
Scope: Implement deduplication
Deliverables:
- Implement entity_resolution.py
- Semantic blocking (HDBSCAN clustering)
- LLM-based matching (with chain-of-thought)
- LLM-based merging (with conflict resolution)
- Test deduplication accuracy (target: 95%+)
- Unit tests for entity resolution
Future Enhancements¶
Priority 1: Critical Additions (Phase 1.5)¶
1. Temporal/Bi-Temporal Data Model ⭐ HIGH PRIORITY¶
Problem: Current schema only tracks when facts were recorded (created_at, updated_at), not when they were valid in real life.
Why Critical:
- Relationship evolution: "Sarah was my girlfriend" → "Sarah is my wife" (temporal validity)
- Historical context: "What was I working on last month?" requires point-in-time queries
- Autonomous agency: planning requires understanding temporal sequences
- Emotional memory: "How did I feel about X over time?" needs temporal tracking
Implementation:
Add indexed temporal fields to tables:
ALTER TABLE kg_nodes ADD COLUMN valid_from TEXT;
ALTER TABLE kg_nodes ADD COLUMN valid_until TEXT;
ALTER TABLE kg_nodes ADD COLUMN is_current INTEGER DEFAULT 1;
ALTER TABLE kg_edges ADD COLUMN valid_from TEXT;
ALTER TABLE kg_edges ADD COLUMN valid_until TEXT;
ALTER TABLE kg_edges ADD COLUMN is_current INTEGER DEFAULT 1;
-- Indexes for temporal queries
CREATE INDEX idx_kg_nodes_temporal ON kg_nodes(user_id, is_current, valid_from);
CREATE INDEX idx_kg_edges_temporal ON kg_edges(user_id, is_current, valid_from);
Property conventions (stored in JSON properties field):
temporal:
valid_from: "2024-01-01T00:00:00Z" # When fact became true (event time)
valid_until: "2025-12-31T23:59:59Z" # When fact stopped being true (null = current)
recorded_at: "2024-01-15T10:30:00Z" # When AICO learned about it (ingestion time)
is_current: true # Quick filter for active facts
Benefits:
- Point-in-time queries: "Show my relationships as of 6 months ago"
- Temporal reasoning: "What changed since last week?"
- Real-time incremental updates without batch reprocessing
- Foundation for autonomous agency temporal planning
Research Basis: Graphiti/Zep's bi-temporal model (2025) - state-of-the-art for agent memory
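As an example of the kind of point-in-time query these columns enable (sqlite3 is used here for brevity; real access would go through storage.py):
```python
# Hedged example: edges that were valid at a given moment for one user.
import sqlite3

def relationships_as_of(conn: sqlite3.Connection, user_id: str, as_of_iso: str):
    """Return (source_id, relation_type, target_id) valid at as_of_iso, e.g. '2025-04-01T00:00:00Z'."""
    return conn.execute(
        """
        SELECT source_id, relation_type, target_id
        FROM kg_edges
        WHERE user_id = ?
          AND valid_from <= ?
          AND (valid_until IS NULL OR valid_until > ?)
        """,
        (user_id, as_of_iso, as_of_iso),
    ).fetchall()
```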
2. Personal Graph Layer ⭐ HIGH PRIORITY¶
Problem: Current proposal focuses on knowledge graph (facts about the world) but lacks personal graph (user's activities, projects, goals).
Why Critical:
- Autonomous agency: requires understanding the user's active projects, priorities, goals
- Proactive engagement: "You mentioned wanting to learn piano, here's a practice reminder"
- Context assembly: "What am I currently working on?" needs activity tracking
- Relationship intelligence: collaboration patterns, interaction frequency
Implementation:
New node labels (use existing kg_nodes.label field):
# Personal graph entities
- PROJECT: User's active projects
- GOAL: User's objectives (short/long-term)
- TASK: Actionable items
- ACTIVITY: User actions (created doc, attended meeting, etc.)
- INTEREST: User's developing interests
- PRIORITY: User's current priorities
New edge types (use existing kg_edges.relation_type field):
# Personal graph relationships
- WORKING_ON: User → Project
- HAS_GOAL: User → Goal
- CONTRIBUTES_TO: Task → Goal
- DEPENDS_ON: Task → Task (dependencies)
- COLLABORATES_WITH: User → Person (on Project)
- INTERESTED_IN: User → Topic
- PRIORITIZES: User → Priority
Property conventions for personal graph:
# Project properties
project:
status: "active" # active/paused/completed
progress: 0.6 # Float 0-1
deadline: "2025-12-31T23:59:59Z" # Target completion
priority: 1 # Int 1-5 (1=highest)
# Goal properties
goal:
type: "short_term" # short_term/long_term
status: "in_progress" # pending/in_progress/achieved/abandoned
motivation: "personal_growth" # Why user wants this
# Activity properties
activity:
activity_type: "document_created" # Type of activity
timestamp: "2025-10-30T10:00:00Z" # When activity occurred
duration_minutes: 45 # How long it took
context: "work" # work/personal/learning
Benefits:
- Proactive assistance: surface priorities, detect conflicts
- Collaboration pattern detection
- Personalized context assembly
- Foundation for autonomous goal generation
Research Basis: Glean's Personal Graph (2025) - activity tracking + LLM reasoning for work context
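For illustration, a small personal-graph slice expressed with the node/edge conventions above; the ids and property values are made up, and Node/Edge follow the dataclass sketch from the Property Graph Model section.
```python
# Illustrative personal-graph slice using the label/relation_type conventions above.
piano_goal = Node(
    id="goal_learn_piano",
    label="GOAL",
    properties={
        "name": "Learn piano",
        "type": "short_term",
        "status": "in_progress",
        "motivation": "personal_growth",
    },
)
practice_task = Node(
    id="task_daily_practice",
    label="TASK",
    properties={"name": "Practice 30 minutes daily", "context": "personal"},
)
edges = [
    Edge(source_id="user_1", target_id=piano_goal.id, relation_type="HAS_GOAL"),
    Edge(source_id=practice_task.id, target_id=piano_goal.id, relation_type="CONTRIBUTES_TO"),
]
```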
3. Graph Traversal & Multi-Hop Reasoning ⭐ MEDIUM PRIORITY¶
Problem: Current proposal has basic CRUD but lacks graph traversal algorithms for multi-hop queries.
Why Critical:
- Context assembly: "Find all information related to Sarah's piano recital" (multi-hop)
- Relationship intelligence: "How do I know John?" (path finding)
- Autonomous agency: "What dependencies block this goal?" (dependency chains)
Implementation:
Add to aico/ai/knowledge_graph/query.py:
async def traverse(
    start_node: str,
    relation_types: List[str],
    max_depth: int = 3,
    filters: Dict = None
) -> PropertyGraph:
    """Multi-hop graph traversal with filtering

    Example: Find all entities connected to "Sarah" within 2 hops
    via KNOWS or FAMILY_MEMBER relationships
    """

async def find_path(
    source: str,
    target: str,
    max_depth: int = 5
) -> List[Path]:
    """Find shortest paths between entities

    Example: "How do I know John?" → User → Sarah → John
    """

async def get_neighborhood(
    node_id: str,
    depth: int = 2,
    relation_filter: List[str] = None
) -> PropertyGraph:
    """Get local subgraph around entity

    Example: Get all entities within 2 hops of "piano_recital" event
    """

async def find_dependencies(
    node_id: str,
    relation_type: str = "DEPENDS_ON"
) -> List[Node]:
    """Find dependency chains for tasks/goals

    Example: "What blocks this goal?" → Task1 → Task2 → Task3
    """
SQL Implementation:
-- Recursive CTE for graph traversal (libSQL supports this)
WITH RECURSIVE graph_traversal(node_id, depth, path) AS (
SELECT id, 0, id FROM kg_nodes WHERE id = ?
UNION ALL
SELECT e.target_id, gt.depth + 1, gt.path || ',' || e.target_id
FROM graph_traversal gt
JOIN kg_edges e ON gt.node_id = e.source_id
WHERE gt.depth < ? AND e.relation_type IN (?)
)
SELECT DISTINCT node_id FROM graph_traversal;
Benefits:
- Rich context retrieval with relationship awareness
- Path finding for relationship intelligence
- Dependency analysis for autonomous agency
- Foundation for complex reasoning
Priority 2: Important Enhancements (Phase 2)¶
4. Graph-Based Context Ranking ⭐ MEDIUM PRIORITY¶
Problem: Current proposal uses semantic similarity for retrieval but doesn't leverage graph structure for ranking.
Why Important:
- Better context: entities with more connections are more central/important
- Relationship-aware retrieval: "Sarah" (close friend) ranks higher than "Sarah" (mentioned once)
- Temporal relevance: recent facts rank higher than old facts
Implementation:
Add graph metrics to nodes:
# Computed metrics (stored in properties JSON or separate fields)
graph_metrics:
degree_centrality: 0.85 # How many connections? (0-1)
temporal_recency: 0.92 # How recently discussed? (0-1)
interaction_frequency: 45 # How often mentioned? (count)
emotional_salience: 0.75 # Emotional intensity (0-1)
importance_score: 0.87 # Combined importance (0-1)
Context ranking algorithm:
def rank_context(nodes: List[Node], query: str) -> List[Node]:
    """Rank retrieved nodes by combined score"""
    for node in nodes:
        semantic_sim = compute_similarity(query, node.embedding)
        graph_centrality = node.properties.get('graph_metrics', {}).get('degree_centrality', 0)
        temporal_recency = compute_recency(node.updated_at)
        emotional_salience = node.properties.get('graph_metrics', {}).get('emotional_salience', 0)
        # Weighted combination
        node.context_score = (
            0.4 * semantic_sim +
            0.3 * graph_centrality +
            0.2 * temporal_recency +
            0.1 * emotional_salience
        )
    return sorted(nodes, key=lambda n: n.context_score, reverse=True)
Benefits:
- More relevant context retrieval
- Relationship-aware ranking
- Temporal and emotional awareness
- Better than pure semantic similarity
5. Entity Disambiguation & Canonical IDs ⭐ MEDIUM PRIORITY¶
Problem: Entity resolution merges duplicates but doesn't maintain canonical entity IDs for disambiguation.
Why Important:
- Multi-modal recognition: voice says "Sarah" → which Sarah? (use graph context)
- Relationship intelligence: "Sarah" (daughter) vs "Sarah" (colleague)
- Cross-conversation consistency: same entity across sessions
Implementation:
Add to kg_nodes table:
ALTER TABLE kg_nodes ADD COLUMN canonical_id TEXT;
ALTER TABLE kg_nodes ADD COLUMN aliases_json TEXT; -- ["SF", "San Francisco", "The City"]
CREATE INDEX idx_kg_nodes_canonical ON kg_nodes(canonical_id);
Property conventions:
disambiguation:
canonical_id: "person_sarah_001" # Stable ID across merges
aliases: ["Sarah", "Sarah M.", "Mom"] # Known variations
disambiguation_context:
relationship: "daughter" # How related to user
age: 8 # Disambiguating attribute
primary_context: "family" # Main context for this entity
Entity resolution enhancement:
async def resolve_entity(
    mention: str,
    context: Dict
) -> Node:
    """Resolve ambiguous entity mention using graph context

    Example: "Sarah" + context{"conversation_topic": "piano"}
    → Sarah (daughter) not Sarah (colleague)
    """
    candidates = await search_nodes(mention)
    if len(candidates) == 1:
        return candidates[0]
    # Use graph context for disambiguation
    for candidate in candidates:
        score = compute_context_match(candidate, context)
        candidate.disambiguation_score = score
    return max(candidates, key=lambda c: c.disambiguation_score)
Benefits:
- Accurate entity resolution in conversations
- Multi-modal recognition support
- Cross-session consistency
- Foundation for relationship intelligence
6. Conflict Resolution & Fact Versioning ⭐ LOW PRIORITY¶
Problem: Current fusion has conflict resolution but no version history for facts.
Why Useful:
- Debugging: "Why does AICO think I live in SF?" (trace fact provenance)
- Correction: "Actually, I moved to NYC" (update with history)
- Trust: show users how facts evolved over time
- Audit trail: track how knowledge changed
Implementation:
Add optional history table:
CREATE TABLE IF NOT EXISTS kg_node_history (
    id TEXT PRIMARY KEY,
    node_id TEXT NOT NULL,
    version INTEGER NOT NULL,
    properties JSON NOT NULL,
    valid_from TEXT NOT NULL,
    valid_until TEXT,
    created_at TEXT NOT NULL,
    change_reason TEXT, -- "user_correction", "conflict_resolution", "new_information"
    FOREIGN KEY (node_id) REFERENCES kg_nodes(id) ON DELETE CASCADE
);
CREATE INDEX idx_node_history_node ON kg_node_history(node_id, version);

CREATE TABLE IF NOT EXISTS kg_edge_history (
    id TEXT PRIMARY KEY,
    edge_id TEXT NOT NULL,
    version INTEGER NOT NULL,
    properties JSON NOT NULL,
    valid_from TEXT NOT NULL,
    valid_until TEXT,
    created_at TEXT NOT NULL,
    change_reason TEXT,
    FOREIGN KEY (edge_id) REFERENCES kg_edges(id) ON DELETE CASCADE
);
CREATE INDEX idx_edge_history_edge ON kg_edge_history(edge_id, version);
Versioning logic:
async def update_node_with_history(
    node_id: str,
    new_properties: Dict,
    change_reason: str
):
    """Update node and preserve history"""
    # Get current version
    current = await get_node(node_id)
    # Archive current version
    await archive_node_version(
        node_id=node_id,
        version=current.version,
        properties=current.properties,
        change_reason=change_reason
    )
    # Update to new version
    await update_node(node_id, new_properties, version=current.version + 1)
Benefits:
- Fact provenance tracking
- User trust through transparency
- Debugging and correction support
- Audit trail for compliance
Priority 3: Advanced Features (Phase 3)¶
7. Graph Analytics & Insights ⭐ LOW PRIORITY¶
Problem: No graph algorithms for discovering patterns and insights.
Why Useful:
- Autonomous agency: detect emerging interests, suggest goals
- Relationship intelligence: identify relationship clusters, detect drift
- Proactive engagement: "You haven't talked to John in 3 months"
- Self-awareness: help the user understand their own patterns
Implementation:
Add analytics module aico/ai/knowledge_graph/analytics.py:
async def detect_communities(
    user_id: str
) -> List[Community]:
    """Identify relationship clusters using community detection

    Example: Family cluster, work cluster, hobby cluster
    """

async def compute_centrality(
    user_id: str,
    metric: str = "degree"  # degree, betweenness, closeness
) -> Dict[str, float]:
    """Compute node importance using centrality measures

    Example: Most important people, topics, projects
    """

async def detect_anomalies(
    user_id: str,
    time_window: str = "7d"
) -> List[Anomaly]:
    """Detect unusual patterns in user's graph

    Example: Sudden drop in communication with close friend
    """

async def analyze_trends(
    user_id: str,
    entity_type: str = "INTEREST"
) -> List[Trend]:
    """Analyze emerging or declining patterns

    Example: Growing interest in photography, declining interest in gaming
    """

async def suggest_goals(
    user_id: str
) -> List[Goal]:
    """Generate goal suggestions based on graph patterns

    Example: User talks about learning piano → Suggest "Learn piano" goal
    """
Benefits:
- Proactive goal suggestions
- Relationship health monitoring
- Pattern discovery and insights
- Foundation for true autonomous agency
Summary: Enhancement Priorities¶
Phase 1.5 (Critical - Add to MVP):
1. ✅ Temporal/bi-temporal data model (HIGH)
2. ✅ Personal graph layer (HIGH)
3. ✅ Graph traversal & multi-hop reasoning (MEDIUM)
Phase 2 (Important - Post-MVP):
4. Graph-based context ranking (MEDIUM)
5. Entity disambiguation & canonical IDs (MEDIUM)
6. Conflict resolution & fact versioning (LOW)
Phase 3 (Advanced - Future):
7. Graph analytics & insights (LOW)
Research Foundation:
- Graphiti/Zep (2025): bi-temporal knowledge graphs for agent memory (state-of-the-art)
- Glean Personal Graph (2025): activity tracking + LLM reasoning for work context
- Industry best practices: multi-hop reasoning, graph-based ranking, entity disambiguation
Coreference Resolution Optimizations for Property Graph¶
Current Implementation Limitations¶
The current coreference resolution approach has some challenges for optimal property graph construction:
1. Over-Resolution Problem¶
# Input: "John and I are working together. We think it will succeed."
# Current: "John and Michael are working together. John and Michael think it will succeed."
# Issue: Loses collective relationship nature ("We" becomes individual actions)
2. Relationship Ambiguity¶
# Property Graph Issue:
# Creates: Michael -[THINKS]-> "it will succeed"
# John -[THINKS]-> "it will succeed"
# Should be: (John, Michael) -[COLLECTIVELY_THINK]-> "it will succeed"
Required Optimizations¶
1. Selective Resolution Modes¶
- Full Resolution: Complete pronoun resolution (current approach)
- Selective Resolution: Only resolve entity-referring pronouns
- Graph-Optimized: Preserve collective pronouns and relationship context
2. Enhanced Output Structure¶
{
'resolved_text': str, # Fully resolved text
'partial_resolution': str, # Selectively resolved text
'entity_mappings': dict, # Pronoun -> Entity mappings
'collective_pronouns': list, # Preserved group pronouns
'relationship_context': dict # Relationship preservation info
}
3. Relationship Preservation¶
- Maintain collective actions ("we", "us", "our")
- Preserve temporal/causal sequences
- Keep relationship context for proper graph edge creation
4. Implementation Strategy¶
- Phase 1: Current cross-turn resolution (✅ implemented)
- Phase 2: Add property graph optimization modes
- Phase 3: Direct integration with graph construction
This ensures clean entity resolution while preserving the relationship semantics needed for accurate property graph representation.
Phase 4: Graph Fusion¶
Scope: Implement graph fusion
Deliverables:
- Implement fusion.py (graph fusion with conflict resolution)
- Validate global perspective
- Unit tests for fusion
Phase 5: Semantic Memory Integration¶
Scope: Integrate knowledge graph with semantic memory
Deliverables:
- Refactor aico/ai/memory/semantic.py to use knowledge graph module
- Implement progressive response with parallel processing (required features)
- Implement background graph processing (non-blocking)
- Add property convention validation (temporal, provenance, emotional, etc.)
- Store properties following documented conventions
- Integration tests
- Deduplication test: stable fact count across runs (14 → 14 → 14, not 14 → 28 → 42)
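A hedged sketch of what the deduplication acceptance test could look like (pytest with pytest-asyncio assumed; ingest and count_facts are placeholder methods on the semantic memory API):
```python
# Illustrative acceptance test: repeated ingestion must not grow the graph.
import pytest

@pytest.mark.asyncio
async def test_repeated_ingestion_is_idempotent(semantic_memory):
    text = "I moved to SF last spring and started a new job."
    await semantic_memory.ingest(user_id="u1", conversation_text=text)
    baseline = await semantic_memory.count_facts(user_id="u1")

    # Re-ingesting the same conversation keeps the fact count stable (14 -> 14 -> 14)
    for _ in range(2):
        await semantic_memory.ingest(user_id="u1", conversation_text=text)
        assert await semantic_memory.count_facts(user_id="u1") == baseline
```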
Phase 6: Testing & Optimization¶
Scope: Comprehensive testing and performance optimization
Deliverables:
- End-to-end tests (full pipeline)
- Performance benchmarks (latency, cost, accuracy)
- Implement caching (optional enhancement)
- Documentation (API docs, usage examples)
- Update configuration with property conventions
Configuration¶
Note: Knowledge graph uses existing model configuration from core.yaml. Models are managed centrally:
- Entity extraction: Uses modelservice.transformers.models.entity_extraction (GLiNER)
- Embeddings: Uses modelservice.default_models.embedding (paraphrase-multilingual)
- LLM operations: Uses modelservice.default_models.conversation (hermes3:8b)
```yaml
memory:
  semantic:
    # Knowledge graph configuration
    knowledge_graph:
      # Extraction settings
      max_gleanings: 2                # Number of gleaning passes (0-2 recommended)
      # Entity resolution (deduplication)
      deduplication:
        enabled: true
        similarity_threshold: 0.85    # Cosine similarity for semantic blocking
      # Performance tuning
      caching:
        enabled: true                 # Cache entity resolution results
        cache_size: 1000              # Number of entities to cache
```
What's NOT configurable (hardcoded in implementation):
- Storage backend: chromadb+libsql (hybrid, required)
- Processing mode: Progressive response with parallel processing
- Property conventions: Documented in Phase 0
- Model selection: Uses existing modelservice configuration
- Collections: kg_nodes, kg_edges (both ChromaDB and libSQL)
- Paths: Resolved via AICOPaths.get_semantic_memory_path()
Success Criteria¶
Deduplication Test¶
Fact count stays stable when the same conversation is ingested repeatedly (14 → 14 → 14, not 14 → 28 → 42).
Completeness Test¶
Multi-pass extraction raises completeness from roughly 60-70% (single pass) to 90%+.
Performance Test¶
Total latency: 2500ms ✅ (under 3s target)
Cost per conversation: $0.003 ✅ (acceptable)
Deduplication accuracy: 95%+ ✅
Alignment with AICO Principles¶
Local-First, Privacy-First¶
- ✅ All processing local except LLM matching/merging (gpt-4o-mini, optional)
- ✅ Property graph enables fine-grained privacy boundaries
- ✅ Hybrid storage (ChromaDB + libSQL) keeps data local
- ✅ No external dependencies for core functionality
Modular, Message-Driven Design¶
- ✅ Core knowledge graph module (aico.ai.knowledge_graph)
- ✅ Domain-agnostic, reusable across features
- ✅ Clean interfaces (PropertyGraph, Extractor, EntityResolver, GraphFusion)
- ✅ Feature flag for gradual rollout (pipeline_mode)
- ✅ Follows System > Domain > Module > Component hierarchy
Extensibility¶
- ✅ Plugin-ready: other modules can use knowledge graph
- ✅ Storage backend abstraction (ChromaDB+libSQL now, Neo4j future)
- ✅ Model abstraction (swap LLMs via modelservice)
Autonomous Agency (Future)¶
- ✅ Graph structure enables goal planning
- ✅ Multi-hop reasoning for proactive suggestions
- ✅ Context-aware decision making
Real-Time Emotional Intelligence (Future)¶
- ✅ Emotional context in relationships
- ✅ Emotional memory integration
- ✅ Relationship-appropriate empathy
Natural Family Recognition (Future)¶
- ✅ Rich relationship modeling
- ✅ Multi-dimensional understanding
- ✅ Dynamic learning from interactions
Research Foundation¶
- Microsoft GraphRAG (2024-2025) - Multi-pass extraction, hierarchical clustering
- Graphusion (ACL 2024) - Global perspective fusion, conflict resolution
- LlamaIndex PropertyGraph (2024-2025) - Property graph model, schema-guided extraction
- Semantic Entity Resolution (Jan 2025) - Embedding clustering + LLM validation
- Ditto (2020) - Deep entity matching with pre-trained LLMs (29% improvement)
Current Limitations & Future Enhancements¶
1. Additional Data Sources¶
Current: Knowledge graph only extracts from user conversations
Limitation: Cannot incorporate external knowledge or structured data sources
Future Enhancement:
- Import from calendar events, emails, documents
- Integration with external APIs (LinkedIn, Google Calendar, etc.)
- Manual fact entry via CLI/UI
- Bulk import from structured data (CSV, JSON)
2. Extended Relationship Types¶
Current: Basic relationship types extracted from conversation (WORKS_AT, LIVES_IN, KNOWS)
Limitation: Limited semantic expressiveness for complex relationships
Future Enhancement:
- Hierarchical relationship taxonomy (IS_A, PART_OF, BELONGS_TO)
- Temporal relationships (WORKED_AT_FROM_TO, LIVED_IN_UNTIL)
- Causal relationships (CAUSED_BY, RESULTED_IN)
- Emotional relationships (FEELS_ABOUT, REMINDS_OF)
- Probabilistic relationships with confidence scores
- Custom user-defined relationship types
3. Graph Analytics¶
Current: Simple node/edge retrieval via semantic search
Limitation: No graph-level analysis or pattern detection
Future Enhancement:
- Centrality Analysis: identify the most important entities in the user's life
- Community Detection: discover clusters of related entities (work, family, hobbies)
- Path Finding: multi-hop reasoning ("How is X connected to Y?")
- Anomaly Detection: identify unusual patterns or contradictions
- Trend Analysis: track how relationships evolve over time
- Influence Propagation: understand how changes affect connected entities
4. Cross-User Knowledge Sharing¶
Current: Each user has isolated knowledge graph (user_id scoping)
Status: ✅ Already Implemented - Data is user-bound via user_id in all tables
Clarification: "Single-user graphs" refers to lack of cross-user knowledge sharing, not lack of user isolation
Future Enhancement:
- Shared Knowledge Base: Common facts accessible to all users (e.g., "Paris is in France")
- Privacy-Aware Sharing: Users can opt-in to share specific facts
- Collaborative Learning: System learns from aggregate patterns across users
- Entity Disambiguation: Leverage cross-user data to resolve ambiguous entities
- Collective Intelligence: Improve extraction quality using multi-user validation
Example Use Cases:
- Shared organizational knowledge (company structure, policies)
- Public figure information (celebrities, politicians)
- Common knowledge facts (geography, history)
- Collaborative workspaces (team projects, shared goals)
Privacy Considerations:
- Default: all user data is private and isolated
- Opt-in: users explicitly choose what to share
- Anonymization: shared data stripped of personal identifiers
- Access Control: fine-grained permissions for shared knowledge
Query Language: GQL/Cypher via GrandCypher¶
Why GQL?¶
GQL (Graph Query Language) is the new ISO standard (ISO/IEC 39075:2024) for property graphs, published April 2024. It's the first new ISO database language since SQL, designed specifically for graph databases.
Implementation: GrandCypher¶
AICO uses GrandCypher, a pure Python implementation of Cypher (90% compatible with GQL):
Benefits:
- ✅ ISO Standard Syntax: future-proof, industry-wide adoption expected
- ✅ No Neo4j Dependency: works with our libSQL + ChromaDB backend
- ✅ Pure Python: easy installation, no C compilation required
- ✅ Extensible: GQL features can be added as the standard evolves
- ✅ Production-Ready: used by research labs and production systems
Example Query:
MATCH (user:PERSON {name: 'Geralt'})-[:WORKS_FOR]->(company)
RETURN company.name, company.properties
Supported Features:
- Pattern matching: MATCH (a)-[r]->(b)
- Filtering: WHERE a.property = 'value'
- Aggregations: COUNT, SUM, AVG, MIN, MAX
- Ordering: ORDER BY, LIMIT, SKIP
- Type filtering: (:PERSON), [:WORKS_FOR]
- Multi-hop traversal: (a)-[]->(b)-[]->(c)
Security:
- All queries automatically scoped to user_id
- Query validation prevents injection attacks
- Execution timeouts prevent DoS
- Result size limits prevent memory exhaustion
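A rough sketch of wiring GrandCypher over the stored graph, assuming the GrandCypher(graph).run(query) entry point over a NetworkX graph; the loader, attribute names, and query are illustrative and simplified relative to the example above.
```python
# Illustrative glue: build a NetworkX graph from (already user-scoped)
# kg_nodes / kg_edges rows and run a Cypher query via GrandCypher.
import networkx as nx
from grandcypher import GrandCypher  # pip install grand-cypher (assumed API)

def load_user_graph(node_rows, edge_rows) -> nx.DiGraph:
    """node_rows: (id, label, props); edge_rows: (source, relation, target, props)."""
    g = nx.DiGraph()
    for node_id, label, props in node_rows:
        g.add_node(node_id, label=label, **props)
    for src, rel, dst, props in edge_rows:
        g.add_edge(src, dst, relation_type=rel, **props)
    return g

graph = load_user_graph(node_rows, edge_rows)  # rows come from storage.py queries
results = GrandCypher(graph).run(
    "MATCH (p)-[r]->(c) WHERE p.name = 'Geralt' RETURN p, c"
)
```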
API Design: GQL-First Approach¶
Design Philosophy¶
AICO uses a GQL-first API design following industry best practices:
- Single powerful query endpoint (GQL/Cypher) for all read operations
- Minimal REST endpoints for common operations and statistics
- Programmatic access via GQL for complex queries
Why GQL-First?
- ✅ Flexibility: one endpoint handles all query patterns
- ✅ Efficiency: the client requests exactly what it needs
- ✅ Simplicity: 2 endpoints vs 31 specialized endpoints
- ✅ Standards-based: ISO GQL standard (ISO/IEC 39075:2024)
- ✅ Future-proof: the query language evolves without API changes
Industry Examples:
- Neo4j: single Cypher endpoint + minimal REST
- GraphQL: single endpoint for all queries
- Dgraph: single GraphQL+- endpoint
Implemented Endpoints¶
Core API (2 endpoints)¶
1. POST /api/v1/kg/query - Execute GQL/Cypher queries ✅
# All operations via GQL:
# Get full graph
MATCH (n) RETURN n
# List nodes by type
MATCH (n:PERSON) RETURN n.name, n.properties
# Semantic search (via properties)
MATCH (n) WHERE n.name CONTAINS 'Sarah' RETURN n
# Get neighbors
MATCH (n {id: 'node_123'})-[r]-(m) RETURN n, r, m
# Find path
MATCH path = shortestPath((a)-[*]-(b))
WHERE a.name = 'Sarah' AND b.name = 'TechCorp'
RETURN path
# Analytics - centrality
MATCH (n)-[r]-()
RETURN n.name, count(r) as degree
ORDER BY degree DESC
# Temporal queries
MATCH (n)
WHERE n.is_current = 1 AND n.valid_from <= '2025-01-01'
RETURN n
# Complex multi-hop
MATCH (p:PERSON)-[:WORKS_FOR]->(c:ORG)-[:LOCATED_IN]->(city)
RETURN p.name, c.name, city.name
Supported Features:
- Pattern matching: MATCH (a)-[r]->(b)
- Filtering: WHERE, AND, OR
- Aggregations: COUNT, SUM, AVG, MIN, MAX
- Ordering: ORDER BY, LIMIT, SKIP
- Multi-hop: (a)-[*1..3]->(b)
- Shortest path: shortestPath()
2. GET /api/v1/kg/stats - Graph statistics ✅
{
"total_nodes": 150,
"total_edges": 320,
"node_types": {"PERSON": 45, "ORG": 30, "PROJECT": 25},
"edge_types": {"WORKS_FOR": 80, "KNOWS": 120}
}
Why Not 31 Endpoints?¶
Original Design Issues:
1. ❌ Over-engineering: 31 specialized endpoints for one feature
2. ❌ Maintenance burden: each endpoint needs docs, tests, versioning
3. ❌ Inflexibility: new query patterns require new endpoints
4. ❌ Client complexity: clients must learn 31 different APIs
GQL-First Benefits:
1. ✅ Single learning curve: learn GQL once, query anything
2. ✅ Composability: combine operations in one query
3. ✅ Efficiency: reduce round-trips (get related data in one call)
4. ✅ Maintainability: one endpoint to secure, test, document
Example Comparison:
# REST approach (3 requests):
GET /api/v1/kg/nodes/123
GET /api/v1/kg/neighbors/123
GET /api/v1/kg/edges?source_id=123
# GQL approach (1 request):
MATCH (n {id: '123'})-[r]-(m)
RETURN n, r, m
Future Additions (If Needed)¶
Only add REST endpoints if:
1. High-frequency operations would benefit from caching
2. Non-technical users need simple URLs
3. External integrations require REST
Potential additions:
- GET /api/v1/kg/export - Export full graph
- POST /api/v1/kg/import - Bulk import
- GET /api/v1/kg/schema - Schema introspection
Total: 2 core endpoints (vs 31 proposed) following modern API design best practices.
Implementation Status¶
✅ Completed (Production-Ready)¶
Phase 1-5: Core Implementation (100%)
- ✅ Property graph models with temporal support
- ✅ Hybrid storage (ChromaDB + libSQL)
- ✅ Multi-pass extraction with gleanings
- ✅ Enhanced multi-tier entity resolution (60-80% LLM reduction)
- ✅ Graph fusion with conflict resolution
- ✅ Edge integrity with atomic updates (bonus)
- ✅ KG consolidation scheduler
Phase 1.5: High-Priority Enhancements (100%)
- ✅ Bi-temporal model (valid_from/until, is_current)
- ✅ Personal graph layer (PROJECT, GOAL, TASK labels)
- ✅ Graph traversal (BFS/DFS, path finding)
Phase 2: Medium-Priority Enhancements (100%)
- ✅ Property index tables with automatic triggers
- ✅ Canonical IDs and entity disambiguation
Phase 3: Advanced Features (100%)
- ✅ Graph analytics (PageRank, centrality, clustering)
- ✅ GQL/Cypher query support (GrandCypher)
API Layer (Minimal by Design)
- ✅ GQL query endpoint (primary interface)
- ✅ Statistics endpoint
- ⚠️ Additional REST endpoints: not needed (GQL-first approach)
🎉 Key Achievements¶
1. Multi-Tier Entity Resolution (Beyond Original Design)
   - Tier 1: exact matching (60-80% of cases, instant)
   - Tier 2: LLM verification (fuzzy cases only)
   - Result: 60-80% fewer LLM calls, same accuracy
2. Edge Integrity (Critical Addition)
   - Node mapping during merge
   - Atomic edge updates
   - 100% referential integrity
3. Simplified API (Best Practice)
   - GQL-first approach (2 endpoints vs 31)
   - ISO standard query language
   - More flexible, easier to maintain
📊 Performance Results¶
Deduplication:
- ✅ Stable entity count across runs (14 → 14 → 14)
- ✅ 95%+ accuracy maintained
- ✅ 60-80% fewer LLM calls
- ✅ 44% faster processing (146s vs 260s)
Edge Integrity:
- ✅ 100% of edges point to current nodes
- ✅ 0 broken references (was 4/7)
- ✅ Atomic updates with transactions
Extraction Completeness:
- ✅ Multi-pass extraction working
- ✅ Personal graph labels recognized
- ✅ Semantic label correction active
Conclusion¶
The knowledge graph implementation exceeds the original proposal with production-grade enhancements:
- Multi-tier entity resolution reduces costs while maintaining accuracy
- Edge integrity ensures data consistency
- GQL-first API follows modern best practices
- Complete backend with all core features implemented
The system is production-ready and solves the deduplication problem while providing foundation for relationship intelligence, autonomous agency, and emotional memory.
Status: ✅ Implementation Complete - Ready for production use