Knowledge Graph Module¶
Status: Proposal
Module: aico.ai.knowledge_graph (core AI module)
First Consumer: Semantic Memory
Research-Validated: 2025 industry best practices (GraphRAG, Graphusion, LlamaIndex, Semantic ER)
Overview¶
This document proposes a core knowledge graph module for AICO that provides property graph construction, entity resolution, and graph fusion capabilities. The module is domain-agnostic and reusable across multiple AI features.
First Application: Semantic memory deduplication (critical bug fix)
Future Applications: Relationship intelligence, autonomous agency, conversation context, emotional memory
Problem Statement (Semantic Memory)¶
Current semantic memory has a critical deduplication failure: running the same conversation twice creates duplicate facts because extraction is non-deterministic.
Run 1: "I moved to SF" → Extract → 14 facts
Run 2: "I moved to SF" → Extract → 28 facts (duplicates!)
Run 3: "I moved to SF" → Extract → 42 facts (unbounded growth)
Root Cause: Facts are stored as unstructured text without normalization. Same information expressed differently creates different embeddings, defeating similarity-based deduplication.
Proposed Solution: Core Knowledge Graph Module¶
A property graph represents knowledge as nodes (entities) and edges (relationships), both with typed properties. This provides deterministic structure for deduplication while maintaining rich metadata.
Module Architecture¶
aico/
  ai/
    knowledge_graph/           # Core module (domain-agnostic)
      __init__.py
      models.py                # PropertyGraph, Node, Edge
      extractor.py             # Multi-pass extraction
      entity_resolution.py     # Semantic blocking, LLM matching/merging
      fusion.py                # Graph fusion, conflict resolution
      storage.py               # ChromaDB + libSQL backend
      query.py                 # Graph traversal, filtering
    memory/
      semantic.py              # Uses knowledge_graph module
Pipeline Architecture¶
Conversation → Multi-Pass Extraction (extractor.py) → Property Graph (models.py) → Semantic Entity Resolution (entity_resolution.py) → Graph Fusion (fusion.py) → Storage (storage.py)
Core Algorithms (Conceptual)¶
1. Multi-Pass Extraction (Gleanings)¶
Problem: Single-pass extraction misses information; repeated conversations create duplicate facts.
Concept:
- Pass 1: Extract entities and relations.
- Pass 2+: Ask the model "what did we miss?" and add only genuinely new facts.
- Final step: Infer implicit relations from accumulated context.
This produces a deterministic property graph for a conversation and significantly reduces missed information without mirroring the full Python implementation here.
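To make the control flow concrete, here is a minimal sketch of the gleaning loop; extract_pass, glean_pass, and infer_implicit_relations are hypothetical helpers standing in for the GLiNER and LLM calls made through the modelservice, not the shipped extractor.py API.
```python
# Minimal sketch of the gleaning loop; helper functions are placeholders.
async def extract_with_gleanings(text: str, max_gleanings: int = 2):
    nodes: dict = {}    # node_id -> node
    edges: set = set()  # (source_id, relation_type, target_id)

    def merge(new_nodes: dict, new_edges: set) -> int:
        """Keep only genuinely new facts; return how many were added."""
        added = 0
        for node_id, node in new_nodes.items():
            if node_id not in nodes:
                nodes[node_id] = node
                added += 1
        before = len(edges)
        edges.update(new_edges)
        return added + (len(edges) - before)

    # Pass 1: entities and relations
    merge(*await extract_pass(text))
    # Pass 2+: ask the model "what did we miss?" until nothing new turns up
    for _ in range(max_gleanings):
        if merge(*await glean_pass(text, nodes, edges)) == 0:
            break
    # Final step: infer implicit relations from the accumulated context
    merge(*await infer_implicit_relations(nodes, edges))
    return nodes, edges
```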
2. Property Graph Model¶
Problem: Simple triplets [subject, relation, object] are too limited for rich metadata.
Conceptual model:
Node:
id: string
label: PERSON | PLACE | ORGANIZATION | EVENT | ...
properties: map<string, any> # name, age, city, etc.
embedding: vector<float>
Edge:
source_id: Node.id
target_id: Node.id
relation_type: string # WORKS_AT, MOVED_TO, KNOWS, ...
properties: map<string, any> # since, until, reason, etc.
Typed nodes and edges with flexible properties make the graph expressive and easy to extend without encoding full class definitions here.
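As a rough illustration of how this conceptual model could map onto the models.py dataclasses, the sketch below uses plain Python dataclasses; field names mirror the schema above, but the real module may differ.
```python
# Illustrative dataclasses mirroring the conceptual model above.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class Node:
    id: str
    label: str                                                 # PERSON, PLACE, ORGANIZATION, EVENT, ...
    properties: Dict[str, Any] = field(default_factory=dict)   # name, age, city, etc.
    embedding: Optional[List[float]] = None

@dataclass
class Edge:
    source_id: str = ""                                        # Node.id
    target_id: str = ""                                        # Node.id
    relation_type: str = ""                                    # WORKS_AT, MOVED_TO, KNOWS, ...
    properties: Dict[str, Any] = field(default_factory=dict)   # since, until, reason, etc.
```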
3. Semantic Entity Resolution (Multi-Tier)¶
Problem: Simple string or embedding similarity is either too weak or too noisy.
Concept:
- Exact matching: Case-insensitive comparison on canonical names to merge obvious duplicates (fast, 100% precision).
- Semantic blocking: Use embeddings to create candidate pairs that might match.
- LLM verification: Only for ambiguous pairs, to decide if they are truly the same real-world entity.
- Canonical merge: Choose a canonical node and remap edges to preserve referential integrity.
This preserves accuracy while dramatically reducing LLM calls and maintaining a clean graph, without embedding the full function implementations here.
Implementation Status: ✅ Fully Implemented with enhancements beyond original design
Research Basis: "The Rise of Semantic Entity Resolution" (TDS, Jan 2025) + production optimizations
4. Graph Fusion (Global Perspective)¶
Problem: Per-message extraction misses global relationships across history.
Concept:
- Merge new graph slices into the existing graph by reusing entity resolution.
- Resolve conflicting edges between the same endpoints.
- Optionally infer new edges from the global structure and conversation history.
This turns incremental extractions into a coherent, evolving knowledge graph.
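The sketch below illustrates the fusion step under the same assumptions; resolve_entities_against, resolve_conflict, and infer_global_edges are hypothetical helpers rather than the fusion.py API.
```python
# Conceptual fusion of a new graph slice into the user's global graph.
async def fuse(global_graph, new_slice):
    # Reuse entity resolution to map slice node ids onto existing nodes
    id_map = await resolve_entities_against(global_graph, new_slice)
    for edge in new_slice.edges:
        src = id_map.get(edge.source_id, edge.source_id)
        tgt = id_map.get(edge.target_id, edge.target_id)
        existing = global_graph.find_edges(src, tgt, edge.relation_type)
        if existing:
            # Conflicting edges between the same endpoints: keep one (e.g. by recency/confidence)
            await resolve_conflict(existing[0], edge)
        else:
            global_graph.add_edge(src, tgt, edge.relation_type, edge.properties)
    # Optionally infer new edges from the merged global structure
    for inferred in await infer_global_edges(global_graph):
        global_graph.add_edge(*inferred)
    return global_graph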
Storage Strategy¶
Hybrid: ChromaDB + libSQL¶
The knowledge graph module uses a hybrid storage approach leveraging AICO's existing stack:
ChromaDB Collections (Vector Search)¶
# Collection: kg_nodes
# Purpose: Semantic search over entities
node_doc = f"{node.label}: {node.properties.get('name', '')} {node.source_text}"
node_metadata = {
    'node_id': node.id,
    'label': node.label,
    'properties': json.dumps(node.properties),
    'confidence': node.confidence,
    'user_id': user_id,
    'created_at': node.created_at.isoformat()
}
chromadb.get_collection('kg_nodes').add(
    documents=[node_doc],
    embeddings=[node.embedding],
    metadatas=[node_metadata],
    ids=[node.id]
)
libSQL Tables (Relational Index)¶
Properties are stored both as JSON (source of truth) and as flattened key/value rows for efficient filtering. Database-level triggers keep these representations in sync.
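A minimal sketch of the dual representation and a sync trigger, using Python's built-in sqlite3 for brevity (libSQL is SQLite-compatible); the table and trigger names are illustrative rather than the exact Schema Version 7 DDL.
```python
# Illustrative DDL: JSON properties as source of truth plus a flattened
# key/value index kept in sync by a trigger. Names are examples only.
import sqlite3

conn = sqlite3.connect("aico_kg_example.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS kg_nodes (
    id         TEXT PRIMARY KEY,
    user_id    TEXT NOT NULL,
    label      TEXT NOT NULL,
    properties TEXT NOT NULL,          -- JSON source of truth
    created_at TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS kg_node_properties (
    node_id TEXT NOT NULL,
    key     TEXT NOT NULL,
    value   TEXT,
    FOREIGN KEY (node_id) REFERENCES kg_nodes(id) ON DELETE CASCADE
);

-- Keep the flattened key/value rows in sync with the JSON properties
CREATE TRIGGER IF NOT EXISTS kg_nodes_index_properties
AFTER INSERT ON kg_nodes
BEGIN
    INSERT INTO kg_node_properties (node_id, key, value)
    SELECT NEW.id, je.key, je.value FROM json_each(NEW.properties) AS je;
END;
""")
conn.commit()
```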
Module Integration: Semantic Memory¶
At a high level, semantic memory uses the knowledge graph module by:
- Running multi-pass extraction on new conversation text.
- Resolving entities against the user-specific graph.
- Fusing the results into the existing graph.
- Persisting to the hybrid ChromaDB + libSQL backend.
- Querying facts via semantic search and graph traversal.
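A sketch of how semantic.py might orchestrate these steps; the SemanticMemory class and the kg facade methods shown here are illustrative, not the final API surface.
```python
# Illustrative orchestration of the knowledge_graph module from semantic memory.
class SemanticMemory:
    def __init__(self, kg):
        self.kg = kg  # knowledge_graph facade: extractor, resolver, fusion, storage, query

    async def ingest(self, user_id: str, conversation_text: str):
        slice_graph = await self.kg.extract(conversation_text)         # multi-pass extraction
        resolved = await self.kg.resolve(user_id, slice_graph)         # against the user's graph
        fused = await self.kg.fuse(user_id, resolved)                  # merge into existing graph
        await self.kg.store(user_id, fused)                            # ChromaDB + libSQL

    async def recall(self, user_id: str, query: str, hops: int = 2):
        seeds = await self.kg.semantic_search(user_id, query)          # vector search over kg_nodes
        return await self.kg.traverse(user_id, seeds, max_depth=hops)  # graph expansion
```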
Future Applications (Beyond Semantic Memory)¶
The knowledge graph module is designed for reuse across AICO's AI features:
1. Relationship Intelligence¶
Use case: multi-dimensional relationship understanding, privacy boundaries, context-appropriate responses via graph traversal over KNOWS, FAMILY_MEMBER, and related relations.
2. Autonomous Agency¶
Use case: proactive suggestions, goal generation, and multi-step planning by querying the user’s context graph and inferring candidate goals.
3. Conversation Context Assembly¶
Use case: multi-hop reasoning to assemble rich conversational context from the user’s graph, reducing hallucination.
4. Emotional Memory¶
Use case: attaching emotional metadata to interactions and relationships so that future responses can respect emotional history.
Performance Optimization Strategies¶
The full pipeline takes roughly 2500ms, which is too high for conversational UX. Rather than simplifying the pipeline, we decouple processing time from conversation flow using the architectural strategies below.
Strategy 1: Progressive Response with Cognitive States¶
Approach: Stream AI's cognitive process to user, deliver response early, process graph in background.
Timeline:
0ms: "Listening..." (instant feedback)
200ms: "Understanding..." (entity extraction)
600ms: AI RESPONSE DELIVERED ✅ (user can continue)
↓ [Background processing, non-blocking]
1800ms: "Storing in memory..." (graph construction)
2500ms: Complete (memory fully processed)
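A possible shape for this in asyncio: the reply is delivered as soon as it is ready, and graph construction continues as a background task. All function names here (quick_entity_pass, generate_response, build_knowledge_graph, store_graph, and the ui object) are placeholders.
```python
# Sketch: decouple the reply from graph construction with background tasks.
import asyncio

async def handle_turn(user_id: str, message: str, ui):
    await ui.set_state("Listening...")
    entities_task = asyncio.create_task(quick_entity_pass(message))   # fast first pass
    await ui.set_state("Understanding...")
    reply = await generate_response(message, await entities_task)
    await ui.deliver(reply)                                           # user can continue here
    # Graph construction continues in the background, non-blocking;
    # a real implementation would track/cancel this task on shutdown.
    asyncio.create_task(build_and_store_graph(user_id, message, ui))

async def build_and_store_graph(user_id: str, message: str, ui):
    await ui.set_state("Storing in memory...")
    graph = await build_knowledge_graph(message)   # full multi-pass pipeline
    await store_graph(user_id, graph)
    await ui.set_state("Complete")
```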
Strategy 2: Lazy Graph Construction¶
Approach: Respond immediately with simple extraction, then build the full knowledge graph during conversation pauses (idle time). This keeps perceived latency low while still building a rich graph in the background.
Strategy 3: Incremental Graph Construction¶
Approach: Build graph incrementally across multiple turns, not all at once.
Strategy 4: Parallel Processing¶
Approach: Run extraction passes and entity resolution in parallel where possible to reduce background latency, without changing the conversational path.
Strategy 5: Smart Caching¶
Approach: Cache expensive LLM operations for repeated entities.
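One way to realize this is a small LRU cache keyed by normalized entity mention, as sketched below; the class is illustrative and maps onto the caching_enabled / cache_size settings shown later.
```python
# Simple in-process LRU cache for entity-resolution decisions (illustrative).
from collections import OrderedDict
from typing import Optional

class EntityResolutionCache:
    def __init__(self, max_size: int = 1000):
        self._cache: OrderedDict[str, str] = OrderedDict()  # mention -> canonical node id
        self.max_size = max_size

    def get(self, mention: str) -> Optional[str]:
        key = mention.strip().lower()
        if key in self._cache:
            self._cache.move_to_end(key)      # LRU: keep hot entries alive
            return self._cache[key]
        return None

    def put(self, mention: str, canonical_id: str) -> None:
        key = mention.strip().lower()
        self._cache[key] = canonical_id
        self._cache.move_to_end(key)
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)   # evict least recently used
```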
Strategy 6: Cognitive State UI¶
Approach: Show users AICO's "inner monologue" (what is being extracted and stored) to make processing time feel intentional and transparent, e.g. "└─ Connecting to previous locations".
**Benefits:** Transforms wait time into feature, builds trust, reduces perceived latency.
---
## Recommended Approach
**Hybrid Strategy:**
### **Implementation (Required)**
Combines progressive response delivery with parallel processing for optimal user experience.
**Features:**
- Progressive response with cognitive states (600ms user-perceived latency)
- Parallel extraction and entity resolution (1200ms background vs 2340ms sequential)
- Background graph processing (non-blocking)
- Transparent UX with visible cognitive states
**Benefits:**
- ✅ 600ms user response time
- ✅ 1200ms background processing (49% faster than sequential)
- ✅ Full state-of-the-art quality, zero dumbing down
- ✅ Transparent UX builds trust
### **Enhancement (Optional)**
Additional optimizations for scaling and performance improvements.
**Features:**
- Smart caching for repeated entities (50ms vs 640ms, 92% faster)
- Lazy graph construction during idle time (2-3s gaps between messages)
- Incremental graph building across conversation turns
**Benefits:**
- ✅ Dramatic speedup for common entities (names, places)
- ✅ Zero user-perceived impact (processing during pauses)
- ✅ Distributed compute load across conversation
- ✅ Scales with conversation length
**Configuration:**
```yaml
core:
  ai:
    knowledge_graph:
      processing:
        # Implementation features are hardcoded (progressive response, parallel processing, background processing)
        # Only performance tuning is configurable:
        caching_enabled: true   # Enable entity resolution caching (recommended)
        cache_size: 1000        # Number of entities to cache (adjust based on memory)
```
Result: 600ms user response, 1200ms background processing, full state-of-the-art quality with optional caching for scaling.
Note: Implementation features are required for MVP. Enhancement features can be added later based on real-world performance testing and scaling needs.
Implementation Roadmap¶
Phase 0: Cleanup & Preparation¶
Scope: Remove obsolete code and prepare codebase for knowledge graph implementation
Deliverables:
Code Cleanup:
- Remove legacy fact extraction code from aico/ai/memory/semantic.py
- Remove obsolete AdvancedFactExtractor class and related utilities
- Clean up any placeholder/simulation code in memory evaluation
- Remove unused imports and dead code paths
libSQL Database Changes¶
❌ Remove (Schema Version 6):
- facts_metadata table → replaced by kg_nodes
- fact_relationships table → replaced by kg_edges
- session_metadata table → no longer needed (LMDB coordination)
✅ Add (Schema Version 7):
- kg_nodes table (with triggers for property indexing)
- kg_edges table (with triggers for property indexing)
- kg_node_properties table (property index)
- kg_edge_properties table (property index)
📝 Notes:
- No data migration needed (not in production)
- Ensure aico db init remains idempotent
ChromaDB Changes¶
❌ Remove:
- user_facts collection (fact-based semantic memory)
- Collection initialization code in cli/commands/database.py (lines 87-105)
✅ Add:
- kg_nodes collection (semantic search over entities)
- kg_edges collection (semantic search over relationships)
📝 Notes:
- No data migration needed (not in production)
- Complete replacement of old collection structure
LMDB Changes¶
❌ Remove:
- user_sessions named database from config (unused - not referenced in code)
✅ Keep (No Changes):
- LMDB working memory database (separate system for conversation context)
- session_memory named database (actively used by WorkingMemoryStore)
- LMDB initialization in cli/commands/database.py (line 350)
- cli/utils/lmdb_utils.py (working memory management)
🔄 Update:
- config/defaults/core.yaml: Remove user_sessions from core.memory.working.named_databases list
📝 Notes:
- Working memory (LMDB) handles short-term conversation context (TTL: 24 hours)
- Semantic memory (knowledge graph) handles long-term facts (permanent)
- These are separate, independent systems
Documentation:
- Document what was removed and why
- Update architecture diagrams to reflect new knowledge graph approach
- Create migration guide for any existing deployments
Validation:
- Verify aico db init still works (idempotent)
- Verify no broken imports or references to removed code
- Run existing tests to ensure nothing breaks
- Confirm clean slate for knowledge graph implementation
Note: This phase follows the principle of complete cleanup - no backwards compatibility, no "just in case" code. All obsolete components are fully removed before implementing the new system.
Phase 1: Core Module Foundation¶
Scope: Basic module structure and data models
Deliverables:
- Create aico/ai/knowledge_graph/ module structure
- Implement models.py (PropertyGraph, Node, Edge dataclasses)
- Implement storage.py (ChromaDB + libSQL hybrid backend)
- Create libSQL Schema Version 7:
- kg_nodes table with JSON properties
- kg_edges table with JSON properties
- kg_node_properties table (property index)
- kg_edge_properties table (property index)
- Database triggers for automatic property indexing (verified working in libsql 0.1.8)
- Initialize ChromaDB collections (kg_nodes, kg_edges)
- Implement dual-write to both ChromaDB and libSQL
- Basic CRUD operations (create, read, update, delete)
- Document property conventions for future applications
- Unit tests for data models and storage
Phase 2: Multi-Pass Extraction¶
Scope: Implement extraction pipeline
Deliverables:
- Implement extractor.py (multi-pass extraction with gleanings)
- GLiNER entity extraction (Pass 1)
- LLM relation extraction (Pass 1)
- Gleaning extraction (Pass 2+)
- Novel inference from conversation history (Pass N)
- Benchmark completeness improvements (60-70% → 90%+)
- Unit tests for extraction
Phase 3: Semantic Entity Resolution¶
Scope: Implement deduplication
Deliverables:
- Implement entity_resolution.py
- Semantic blocking (HDBSCAN clustering)
- LLM-based matching (with chain-of-thought)
- LLM-based merging (with conflict resolution)
- Test deduplication accuracy (target: 95%+)
- Unit tests for entity resolution
Future Enhancements¶
Priority 1: Critical Additions (Phase 1.5)¶
1. Temporal/Bi-Temporal Data Model ⭐ HIGH PRIORITY¶
Problem: Current schema only tracks when facts were recorded (created_at, updated_at), not when they were valid in real life.
Why Critical:
- Relationship evolution: "Sarah was my girlfriend" → "Sarah is my wife" (temporal validity)
- Historical context: "What was I working on last month?" requires point-in-time queries
- Autonomous agency: planning requires understanding temporal sequences
- Emotional memory: "How did I feel about X over time?" needs temporal tracking
Implementation:
Add indexed temporal fields to tables:
ALTER TABLE kg_nodes ADD COLUMN valid_from TEXT;
ALTER TABLE kg_nodes ADD COLUMN valid_until TEXT;
ALTER TABLE kg_nodes ADD COLUMN is_current INTEGER DEFAULT 1;
ALTER TABLE kg_edges ADD COLUMN valid_from TEXT;
ALTER TABLE kg_edges ADD COLUMN valid_until TEXT;
ALTER TABLE kg_edges ADD COLUMN is_current INTEGER DEFAULT 1;
-- Indexes for temporal queries
CREATE INDEX idx_kg_nodes_temporal ON kg_nodes(user_id, is_current, valid_from);
CREATE INDEX idx_kg_edges_temporal ON kg_edges(user_id, is_current, valid_from);
Property conventions (stored in JSON properties field):
temporal:
valid_from: "2024-01-01T00:00:00Z" # When fact became true (event time)
valid_until: "2025-12-31T23:59:59Z" # When fact stopped being true (null = current)
recorded_at: "2024-01-15T10:30:00Z" # When AICO learned about it (ingestion time)
is_current: true # Quick filter for active facts
Benefits:
- Point-in-time queries: "Show my relationships as of 6 months ago"
- Temporal reasoning: "What changed since last week?"
- Real-time incremental updates without batch reprocessing
- Foundation for autonomous agency temporal planning
Research Basis: Graphiti/Zep's bi-temporal model (2025) - state-of-the-art for agent memory
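As an example of the kind of point-in-time query these columns enable (sqlite3 is used here for brevity; real access would go through storage.py):
```python
# Hedged example: edges that were valid at a given moment for one user.
import sqlite3

def relationships_as_of(conn: sqlite3.Connection, user_id: str, as_of_iso: str):
    """Return (source_id, relation_type, target_id) valid at as_of_iso, e.g. '2025-04-01T00:00:00Z'."""
    return conn.execute(
        """
        SELECT source_id, relation_type, target_id
        FROM kg_edges
        WHERE user_id = ?
          AND valid_from <= ?
          AND (valid_until IS NULL OR valid_until > ?)
        """,
        (user_id, as_of_iso, as_of_iso),
    ).fetchall()
```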
2. Personal Graph Layer ⭐ HIGH PRIORITY¶
Problem: Current proposal focuses on knowledge graph (facts about the world) but lacks personal graph (user's activities, projects, goals).
Why Critical:
- Autonomous agency: requires understanding the user's active projects, priorities, goals
- Proactive engagement: "You mentioned wanting to learn piano, here's a practice reminder"
- Context assembly: "What am I currently working on?" needs activity tracking
- Relationship intelligence: collaboration patterns, interaction frequency
Implementation:
New node labels (use existing kg_nodes.label field):
# Personal graph entities
- PROJECT: User's active projects
- GOAL: User's objectives (short/long-term)
- TASK: Actionable items
- ACTIVITY: User actions (created doc, attended meeting, etc.)
- INTEREST: User's developing interests
- PRIORITY: User's current priorities
New edge types (use existing kg_edges.relation_type field):
# Personal graph relationships
- WORKING_ON: User → Project
- HAS_GOAL: User → Goal
- CONTRIBUTES_TO: Task → Goal
- DEPENDS_ON: Task → Task (dependencies)
- COLLABORATES_WITH: User → Person (on Project)
- INTERESTED_IN: User → Topic
- PRIORITIZES: User → Priority
Property conventions for personal graph:
# Project properties
project:
status: "active" # active/paused/completed
progress: 0.6 # Float 0-1
deadline: "2025-12-31T23:59:59Z" # Target completion
priority: 1 # Int 1-5 (1=highest)
# Goal properties
goal:
type: "short_term" # short_term/long_term
status: "in_progress" # pending/in_progress/achieved/abandoned
motivation: "personal_growth" # Why user wants this
# Activity properties
activity:
activity_type: "document_created" # Type of activity
timestamp: "2025-10-30T10:00:00Z" # When activity occurred
duration_minutes: 45 # How long it took
context: "work" # work/personal/learning
Benefits:
- Proactive assistance: surface priorities, detect conflicts
- Collaboration pattern detection
- Personalized context assembly
- Foundation for autonomous goal generation
Research Basis: Glean's Personal Graph (2025) - activity tracking + LLM reasoning for work context
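For illustration, a small personal-graph slice expressed with the node/edge conventions above; the ids and property values are made up, and Node/Edge follow the dataclass sketch from the Property Graph Model section.
```python
# Illustrative personal-graph slice using the label/relation_type conventions above.
piano_goal = Node(
    id="goal_learn_piano",
    label="GOAL",
    properties={
        "name": "Learn piano",
        "type": "short_term",
        "status": "in_progress",
        "motivation": "personal_growth",
    },
)
practice_task = Node(
    id="task_daily_practice",
    label="TASK",
    properties={"name": "Practice 30 minutes daily", "context": "personal"},
)
edges = [
    Edge(source_id="user_1", target_id=piano_goal.id, relation_type="HAS_GOAL"),
    Edge(source_id=practice_task.id, target_id=piano_goal.id, relation_type="CONTRIBUTES_TO"),
]
```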
3. Graph Traversal & Multi-Hop Reasoning ⭐ MEDIUM PRIORITY¶
Problem: Current proposal has basic CRUD but lacks graph traversal algorithms for multi-hop queries.
Why Critical:
- Context assembly: "Find all information related to Sarah's piano recital" (multi-hop)
- Relationship intelligence: "How do I know John?" (path finding)
- Autonomous agency: "What dependencies block this goal?" (dependency chains)
Implementation:
Add to aico/ai/knowledge_graph/query.py:
async def traverse(
    start_node: str,
    relation_types: List[str],
    max_depth: int = 3,
    filters: Dict = None
) -> PropertyGraph:
    """Multi-hop graph traversal with filtering

    Example: Find all entities connected to "Sarah" within 2 hops
    via KNOWS or FAMILY_MEMBER relationships
    """

async def find_path(
    source: str,
    target: str,
    max_depth: int = 5
) -> List[Path]:
    """Find shortest paths between entities

    Example: "How do I know John?" → User → Sarah → John
    """

async def get_neighborhood(
    node_id: str,
    depth: int = 2,
    relation_filter: List[str] = None
) -> PropertyGraph:
    """Get local subgraph around entity

    Example: Get all entities within 2 hops of "piano_recital" event
    """

async def find_dependencies(
    node_id: str,
    relation_type: str = "DEPENDS_ON"
) -> List[Node]:
    """Find dependency chains for tasks/goals

    Example: "What blocks this goal?" → Task1 → Task2 → Task3
    """
SQL Implementation:
-- Recursive CTE for graph traversal (libSQL supports this)
WITH RECURSIVE graph_traversal(node_id, depth, path) AS (
SELECT id, 0, id FROM kg_nodes WHERE id = ?
UNION ALL
SELECT e.target_id, gt.depth + 1, gt.path || ',' || e.target_id
FROM graph_traversal gt
JOIN kg_edges e ON gt.node_id = e.source_id
WHERE gt.depth < ? AND e.relation_type IN (?)
)
SELECT DISTINCT node_id FROM graph_traversal;
Benefits:
- Rich context retrieval with relationship awareness
- Path finding for relationship intelligence
- Dependency analysis for autonomous agency
- Foundation for complex reasoning
Priority 2: Important Enhancements (Phase 2)¶
4. Graph-Based Context Ranking ⭐ MEDIUM PRIORITY¶
Problem: Current proposal uses semantic similarity for retrieval but doesn't leverage graph structure for ranking.
Why Important:
- Better context: entities with more connections are more central/important
- Relationship-aware retrieval: "Sarah" (close friend) ranks higher than "Sarah" (mentioned once)
- Temporal relevance: recent facts rank higher than old facts
Implementation:
Add graph metrics to nodes:
# Computed metrics (stored in properties JSON or separate fields)
graph_metrics:
degree_centrality: 0.85 # How many connections? (0-1)
temporal_recency: 0.92 # How recently discussed? (0-1)
interaction_frequency: 45 # How often mentioned? (count)
emotional_salience: 0.75 # Emotional intensity (0-1)
importance_score: 0.87 # Combined importance (0-1)
Context ranking algorithm:
def rank_context(nodes: List[Node], query: str) -> List[Node]:
    """Rank retrieved nodes by combined score"""
    for node in nodes:
        semantic_sim = compute_similarity(query, node.embedding)
        graph_centrality = node.properties.get('graph_metrics', {}).get('degree_centrality', 0)
        temporal_recency = compute_recency(node.updated_at)
        emotional_salience = node.properties.get('graph_metrics', {}).get('emotional_salience', 0)
        # Weighted combination
        node.context_score = (
            0.4 * semantic_sim +
            0.3 * graph_centrality +
            0.2 * temporal_recency +
            0.1 * emotional_salience
        )
    return sorted(nodes, key=lambda n: n.context_score, reverse=True)
Benefits:
- More relevant context retrieval
- Relationship-aware ranking
- Temporal and emotional awareness
- Better than pure semantic similarity
5. Entity Disambiguation & Canonical IDs ⭐ MEDIUM PRIORITY¶
Problem: Entity resolution merges duplicates but doesn't maintain canonical entity IDs for disambiguation.
Why Important:
- Multi-modal recognition: voice says "Sarah" → which Sarah? (use graph context)
- Relationship intelligence: "Sarah" (daughter) vs "Sarah" (colleague)
- Cross-conversation consistency: same entity across sessions
Implementation:
Add to kg_nodes table:
ALTER TABLE kg_nodes ADD COLUMN canonical_id TEXT;
ALTER TABLE kg_nodes ADD COLUMN aliases_json TEXT; -- ["SF", "San Francisco", "The City"]
CREATE INDEX idx_kg_nodes_canonical ON kg_nodes(canonical_id);
Property conventions:
disambiguation:
canonical_id: "person_sarah_001" # Stable ID across merges
aliases: ["Sarah", "Sarah M.", "Mom"] # Known variations
disambiguation_context:
relationship: "daughter" # How related to user
age: 8 # Disambiguating attribute
primary_context: "family" # Main context for this entity
Entity resolution enhancement:
async def resolve_entity(
    mention: str,
    context: Dict
) -> Node:
    """Resolve ambiguous entity mention using graph context

    Example: "Sarah" + context{"conversation_topic": "piano"}
    → Sarah (daughter) not Sarah (colleague)
    """
    candidates = await search_nodes(mention)
    if len(candidates) == 1:
        return candidates[0]
    # Use graph context for disambiguation
    for candidate in candidates:
        score = compute_context_match(candidate, context)
        candidate.disambiguation_score = score
    return max(candidates, key=lambda c: c.disambiguation_score)
Benefits:
- Accurate entity resolution in conversations
- Multi-modal recognition support
- Cross-session consistency
- Foundation for relationship intelligence
6. Conflict Resolution & Fact Versioning ⭐ LOW PRIORITY¶
Problem: Current fusion has conflict resolution but no version history for facts.
Why Useful:
- Debugging: "Why does AICO think I live in SF?" (trace fact provenance)
- Correction: "Actually, I moved to NYC" (update with history)
- Trust: show users how facts evolved over time
- Audit trail: track how knowledge changed
Implementation:
Add optional history table:
CREATE TABLE IF NOT EXISTS kg_node_history (
    id TEXT PRIMARY KEY,
    node_id TEXT NOT NULL,
    version INTEGER NOT NULL,
    properties JSON NOT NULL,
    valid_from TEXT NOT NULL,
    valid_until TEXT,
    created_at TEXT NOT NULL,
    change_reason TEXT, -- "user_correction", "conflict_resolution", "new_information"
    FOREIGN KEY (node_id) REFERENCES kg_nodes(id) ON DELETE CASCADE
);
CREATE INDEX idx_node_history_node ON kg_node_history(node_id, version);

CREATE TABLE IF NOT EXISTS kg_edge_history (
    id TEXT PRIMARY KEY,
    edge_id TEXT NOT NULL,
    version INTEGER NOT NULL,
    properties JSON NOT NULL,
    valid_from TEXT NOT NULL,
    valid_until TEXT,
    created_at TEXT NOT NULL,
    change_reason TEXT,
    FOREIGN KEY (edge_id) REFERENCES kg_edges(id) ON DELETE CASCADE
);
CREATE INDEX idx_edge_history_edge ON kg_edge_history(edge_id, version);
Versioning logic:
async def update_node_with_history(
    node_id: str,
    new_properties: Dict,
    change_reason: str
):
    """Update node and preserve history"""
    # Get current version
    current = await get_node(node_id)
    # Archive current version
    await archive_node_version(
        node_id=node_id,
        version=current.version,
        properties=current.properties,
        change_reason=change_reason
    )
    # Update to new version
    await update_node(node_id, new_properties, version=current.version + 1)
Benefits:
- Fact provenance tracking
- User trust through transparency
- Debugging and correction support
- Audit trail for compliance
Priority 3: Advanced Features (Phase 3)¶
7. Graph Analytics & Insights ⭐ LOW PRIORITY¶
Problem: No graph algorithms for discovering patterns and insights.
Why Useful:
- Autonomous agency: detect emerging interests, suggest goals
- Relationship intelligence: identify relationship clusters, detect drift
- Proactive engagement: "You haven't talked to John in 3 months"
- Self-awareness: help the user understand their own patterns
Implementation:
Add analytics module aico/ai/knowledge_graph/analytics.py:
async def detect_communities(
    user_id: str
) -> List[Community]:
    """Identify relationship clusters using community detection

    Example: Family cluster, work cluster, hobby cluster
    """

async def compute_centrality(
    user_id: str,
    metric: str = "degree"  # degree, betweenness, closeness
) -> Dict[str, float]:
    """Compute node importance using centrality measures

    Example: Most important people, topics, projects
    """

async def detect_anomalies(
    user_id: str,
    time_window: str = "7d"
) -> List[Anomaly]:
    """Detect unusual patterns in user's graph

    Example: Sudden drop in communication with close friend
    """

async def analyze_trends(
    user_id: str,
    entity_type: str = "INTEREST"
) -> List[Trend]:
    """Analyze emerging or declining patterns

    Example: Growing interest in photography, declining interest in gaming
    """

async def suggest_goals(
    user_id: str
) -> List[Goal]:
    """Generate goal suggestions based on graph patterns

    Example: User talks about learning piano → Suggest "Learn piano" goal
    """
Benefits:
- Proactive goal suggestions
- Relationship health monitoring
- Pattern discovery and insights
- Foundation for true autonomous agency
Summary: Enhancement Priorities¶
Phase 1.5 (Critical - Add to MVP):
1. ✅ Temporal/bi-temporal data model (HIGH)
2. ✅ Personal graph layer (HIGH)
3. ✅ Graph traversal & multi-hop reasoning (MEDIUM)
Phase 2 (Important - Post-MVP):
4. Graph-based context ranking (MEDIUM)
5. Entity disambiguation & canonical IDs (MEDIUM)
6. Conflict resolution & fact versioning (LOW)
Phase 3 (Advanced - Future):
7. Graph analytics & insights (LOW)
Research Foundation:
- Graphiti/Zep (2025): bi-temporal knowledge graphs for agent memory (state-of-the-art)
- Glean Personal Graph (2025): activity tracking + LLM reasoning for work context
- Industry best practices: multi-hop reasoning, graph-based ranking, entity disambiguation
Coreference Resolution Optimizations for Property Graph¶
Current Implementation Limitations¶
The current coreference resolution approach has some challenges for optimal property graph construction:
1. Over-Resolution Problem¶
# Input: "John and I are working together. We think it will succeed."
# Current: "John and Michael are working together. John and Michael think it will succeed."
# Issue: Loses collective relationship nature ("We" becomes individual actions)
2. Relationship Ambiguity¶
# Property Graph Issue:
# Creates: Michael -[THINKS]-> "it will succeed"
# John -[THINKS]-> "it will succeed"
# Should be: (John, Michael) -[COLLECTIVELY_THINK]-> "it will succeed"
Required Optimizations¶
1. Selective Resolution Modes¶
- Full Resolution: Complete pronoun resolution (current approach)
- Selective Resolution: Only resolve entity-referring pronouns
- Graph-Optimized: Preserve collective pronouns and relationship context
2. Enhanced Output Structure¶
{
'resolved_text': str, # Fully resolved text
'partial_resolution': str, # Selectively resolved text
'entity_mappings': dict, # Pronoun -> Entity mappings
'collective_pronouns': list, # Preserved group pronouns
'relationship_context': dict # Relationship preservation info
}
3. Relationship Preservation¶
- Maintain collective actions ("we", "us", "our")
- Preserve temporal/causal sequences
- Keep relationship context for proper graph edge creation
4. Implementation Strategy¶
- Phase 1: Current cross-turn resolution (✅ implemented)
- Phase 2: Add property graph optimization modes
- Phase 3: Direct integration with graph construction
This ensures clean entity resolution while preserving the relationship semantics needed for accurate property graph representation.
Phase 4: Graph Fusion¶
Scope: Implement graph fusion
Deliverables:
- Implement fusion.py (graph fusion with conflict resolution)
- Validate global perspective
- Unit tests for fusion
Phase 5: Semantic Memory Integration¶
Scope: Integrate knowledge graph with semantic memory
Deliverables:
- Refactor aico/ai/memory/semantic.py to use knowledge graph module
- Implement progressive response with parallel processing (required features)
- Implement background graph processing (non-blocking)
- Add property convention validation (temporal, provenance, emotional, etc.)
- Store properties following documented conventions
- Integration tests
- Deduplication test: stable fact count across runs (14 → 14 → 14, not 14 → 28 → 42)
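A hedged sketch of what the deduplication acceptance test could look like (pytest with pytest-asyncio assumed; ingest and count_facts are placeholder methods on the semantic memory API):
```python
# Illustrative acceptance test: repeated ingestion must not grow the graph.
import pytest

@pytest.mark.asyncio
async def test_repeated_ingestion_is_idempotent(semantic_memory):
    text = "I moved to SF last spring and started a new job."
    await semantic_memory.ingest(user_id="u1", conversation_text=text)
    baseline = await semantic_memory.count_facts(user_id="u1")

    # Re-ingesting the same conversation keeps the fact count stable (14 -> 14 -> 14)
    for _ in range(2):
        await semantic_memory.ingest(user_id="u1", conversation_text=text)
        assert await semantic_memory.count_facts(user_id="u1") == baseline
```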
Phase 6: Testing & Optimization¶
Scope: Comprehensive testing and performance optimization
Deliverables:
- End-to-end tests (full pipeline)
- Performance benchmarks (latency, cost, accuracy)
- Implement caching (optional enhancement)
- Documentation (API docs, usage examples)
- Update configuration with property conventions
Configuration¶
Note: Knowledge graph uses existing model configuration from core.yaml. Models are managed centrally:
- Entity extraction: Uses modelservice.transformers.models.entity_extraction (GLiNER)
- Embeddings: Uses modelservice.default_models.embedding (paraphrase-multilingual)
- LLM operations: Uses modelservice.default_models.conversation (hermes3:8b)
```yaml
memory:
  semantic:
    # Knowledge graph configuration
    knowledge_graph:
      # Extraction settings
      max_gleanings: 2                # Number of gleaning passes (0-2 recommended)
      # Entity resolution (deduplication)
      deduplication:
        enabled: true
        similarity_threshold: 0.85    # Cosine similarity for semantic blocking
      # Performance tuning
      caching:
        enabled: true                 # Cache entity resolution results
        cache_size: 1000              # Number of entities to cache
```
What's NOT configurable (hardcoded in implementation):
- Storage backend: chromadb+libsql (hybrid, required)
- Processing mode: Progressive response with parallel processing
- Property conventions: Documented in Phase 0
- Model selection: Uses existing modelservice configuration
- Collections: kg_nodes, kg_edges (both ChromaDB and libSQL)
- Paths: Resolved via AICOPaths.get_semantic_memory_path()
Success Criteria¶
Deduplication Test¶
Fact count stays stable when the same conversation is ingested repeatedly (14 → 14 → 14, not 14 → 28 → 42).
Completeness Test¶
Multi-pass extraction raises completeness from roughly 60-70% (single pass) to 90%+.
Performance Test¶
Total latency: 2500ms ✅ (under 3s target)
Cost per conversation: $0.003 ✅ (acceptable)
Deduplication accuracy: 95%+ ✅
Alignment with AICO Principles¶
Local-First, Privacy-First¶
- ✅ All processing local except LLM matching/merging (gpt-4o-mini, optional)
- ✅ Property graph enables fine-grained privacy boundaries
- ✅ Hybrid storage (ChromaDB + libSQL) keeps data local
- ✅ No external dependencies for core functionality
Modular, Message-Driven Design¶
- ✅ Core knowledge graph module (aico.ai.knowledge_graph)
- ✅ Domain-agnostic, reusable across features
- ✅ Clean interfaces (PropertyGraph, Extractor, EntityResolver, GraphFusion)
- ✅ Feature flag for gradual rollout (pipeline_mode)
- ✅ Follows System > Domain > Module > Component hierarchy
Extensibility¶
- ✅ Plugin-ready: other modules can use knowledge graph
- ✅ Storage backend abstraction (ChromaDB+libSQL now, Neo4j future)
- ✅ Model abstraction (swap LLMs via modelservice)
Autonomous Agency (Future)¶
- ✅ Graph structure enables goal planning
- ✅ Multi-hop reasoning for proactive suggestions
- ✅ Context-aware decision making
Real-Time Emotional Intelligence (Future)¶
- ✅ Emotional context in relationships
- ✅ Emotional memory integration
- ✅ Relationship-appropriate empathy
Natural Family Recognition (Future)¶
- ✅ Rich relationship modeling
- ✅ Multi-dimensional understanding
- ✅ Dynamic learning from interactions
Research Foundation¶
- Microsoft GraphRAG (2024-2025) - Multi-pass extraction, hierarchical clustering
- Graphusion (ACL 2024) - Global perspective fusion, conflict resolution
- LlamaIndex PropertyGraph (2024-2025) - Property graph model, schema-guided extraction
- Semantic Entity Resolution (Jan 2025) - Embedding clustering + LLM validation
- Ditto (2020) - Deep entity matching with pre-trained LLMs (29% improvement)
Current Limitations & Future Enhancements¶
1. Additional Data Sources¶
Current: Knowledge graph only extracts from user conversations
Limitation: Cannot incorporate external knowledge or structured data sources
Future Enhancement:
- Import from calendar events, emails, documents
- Integration with external APIs (LinkedIn, Google Calendar, etc.)
- Manual fact entry via CLI/UI
- Bulk import from structured data (CSV, JSON)
2. Extended Relationship Types¶
Current: Basic relationship types extracted from conversation (WORKS_AT, LIVES_IN, KNOWS)
Limitation: Limited semantic expressiveness for complex relationships
Future Enhancement:
- Hierarchical relationship taxonomy (IS_A, PART_OF, BELONGS_TO)
- Temporal relationships (WORKED_AT_FROM_TO, LIVED_IN_UNTIL)
- Causal relationships (CAUSED_BY, RESULTED_IN)
- Emotional relationships (FEELS_ABOUT, REMINDS_OF)
- Probabilistic relationships with confidence scores
- Custom user-defined relationship types
3. Graph Analytics¶
Current: Simple node/edge retrieval via semantic search
Limitation: No graph-level analysis or pattern detection
Future Enhancement:
- Centrality Analysis: identify the most important entities in the user's life
- Community Detection: discover clusters of related entities (work, family, hobbies)
- Path Finding: multi-hop reasoning ("How is X connected to Y?")
- Anomaly Detection: identify unusual patterns or contradictions
- Trend Analysis: track how relationships evolve over time
- Influence Propagation: understand how changes affect connected entities
4. Cross-User Knowledge Sharing¶
Current: Each user has isolated knowledge graph (user_id scoping)
Status: ✅ Already Implemented - Data is user-bound via user_id in all tables
Clarification: "Single-user graphs" refers to lack of cross-user knowledge sharing, not lack of user isolation
Future Enhancement:
- Shared Knowledge Base: Common facts accessible to all users (e.g., "Paris is in France")
- Privacy-Aware Sharing: Users can opt-in to share specific facts
- Collaborative Learning: System learns from aggregate patterns across users
- Entity Disambiguation: Leverage cross-user data to resolve ambiguous entities
- Collective Intelligence: Improve extraction quality using multi-user validation
Example Use Cases:
- Shared organizational knowledge (company structure, policies)
- Public figure information (celebrities, politicians)
- Common knowledge facts (geography, history)
- Collaborative workspaces (team projects, shared goals)
Privacy Considerations:
- Default: all user data is private and isolated
- Opt-in: users explicitly choose what to share
- Anonymization: shared data stripped of personal identifiers
- Access Control: fine-grained permissions for shared knowledge
Query Language: GQL/Cypher via GrandCypher¶
Why GQL?¶
GQL (Graph Query Language) is the new ISO standard (ISO/IEC 39075:2024) for property graphs, published April 2024. It's the first new ISO database language since SQL, designed specifically for graph databases.
Implementation: GrandCypher¶
AICO uses GrandCypher, a pure Python implementation of Cypher (90% compatible with GQL):
Benefits:
- ✅ ISO Standard Syntax: future-proof, industry-wide adoption expected
- ✅ No Neo4j Dependency: works with our libSQL + ChromaDB backend
- ✅ Pure Python: easy installation, no C compilation required
- ✅ Extensible: GQL features can be added as the standard evolves
- ✅ Production-Ready: used by research labs and production systems
Example Query:
MATCH (user:PERSON {name: 'Geralt'})-[:WORKS_FOR]->(company)
RETURN company.name, company.properties
Supported Features:
- Pattern matching: MATCH (a)-[r]->(b)
- Filtering: WHERE a.property = 'value'
- Aggregations: COUNT, SUM, AVG, MIN, MAX
- Ordering: ORDER BY, LIMIT, SKIP
- Type filtering: (:PERSON), [:WORKS_FOR]
- Multi-hop traversal: (a)-[]->(b)-[]->(c)
Security:
- All queries automatically scoped to user_id
- Query validation prevents injection attacks
- Execution timeouts prevent DoS
- Result size limits prevent memory exhaustion
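A rough sketch of wiring GrandCypher over the stored graph, assuming the GrandCypher(graph).run(query) entry point over a NetworkX graph; the loader, attribute names, and query are illustrative and simplified relative to the example above.
```python
# Illustrative glue: build a NetworkX graph from (already user-scoped)
# kg_nodes / kg_edges rows and run a Cypher query via GrandCypher.
import networkx as nx
from grandcypher import GrandCypher  # pip install grand-cypher (assumed API)

def load_user_graph(node_rows, edge_rows) -> nx.DiGraph:
    """node_rows: (id, label, props); edge_rows: (source, relation, target, props)."""
    g = nx.DiGraph()
    for node_id, label, props in node_rows:
        g.add_node(node_id, label=label, **props)
    for src, rel, dst, props in edge_rows:
        g.add_edge(src, dst, relation_type=rel, **props)
    return g

graph = load_user_graph(node_rows, edge_rows)  # rows come from storage.py queries
results = GrandCypher(graph).run(
    "MATCH (p)-[r]->(c) WHERE p.name = 'Geralt' RETURN p, c"
)
```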
API Design: GQL-First Approach¶
Design Philosophy¶
AICO uses a GQL-first API design following industry best practices:
- Single powerful query endpoint (GQL/Cypher) for all read operations
- Minimal REST endpoints for common operations and statistics
- Programmatic access via GQL for complex queries
Why GQL-First?
- ✅ Flexibility: one endpoint handles all query patterns
- ✅ Efficiency: the client requests exactly what it needs
- ✅ Simplicity: 2 endpoints vs 31 specialized endpoints
- ✅ Standards-based: ISO GQL standard (ISO/IEC 39075:2024)
- ✅ Future-proof: the query language evolves without API changes
Industry Examples:
- Neo4j: single Cypher endpoint + minimal REST
- GraphQL: single endpoint for all queries
- Dgraph: single GraphQL+- endpoint
Implemented Endpoints¶
Core API (2 endpoints)¶
1. POST /api/v1/kg/query - Execute GQL/Cypher queries ✅
# All operations via GQL:
# Get full graph
MATCH (n) RETURN n
# List nodes by type
MATCH (n:PERSON) RETURN n.name, n.properties
# Semantic search (via properties)
MATCH (n) WHERE n.name CONTAINS 'Sarah' RETURN n
# Get neighbors
MATCH (n {id: 'node_123'})-[r]-(m) RETURN n, r, m
# Find path
MATCH path = shortestPath((a)-[*]-(b))
WHERE a.name = 'Sarah' AND b.name = 'TechCorp'
RETURN path
# Analytics - centrality
MATCH (n)-[r]-()
RETURN n.name, count(r) as degree
ORDER BY degree DESC
# Temporal queries
MATCH (n)
WHERE n.is_current = 1 AND n.valid_from <= '2025-01-01'
RETURN n
# Complex multi-hop
MATCH (p:PERSON)-[:WORKS_FOR]->(c:ORG)-[:LOCATED_IN]->(city)
RETURN p.name, c.name, city.name
Supported Features:
- Pattern matching: MATCH (a)-[r]->(b)
- Filtering: WHERE, AND, OR
- Aggregations: COUNT, SUM, AVG, MIN, MAX
- Ordering: ORDER BY, LIMIT, SKIP
- Multi-hop: (a)-[*1..3]->(b)
- Shortest path: shortestPath()
2. GET /api/v1/kg/stats - Graph statistics ✅
{
"total_nodes": 150,
"total_edges": 320,
"node_types": {"PERSON": 45, "ORG": 30, "PROJECT": 25},
"edge_types": {"WORKS_FOR": 80, "KNOWS": 120}
}
Why Not 31 Endpoints?¶
Original Design Issues:
1. ❌ Over-engineering: 31 specialized endpoints for one feature
2. ❌ Maintenance burden: each endpoint needs docs, tests, versioning
3. ❌ Inflexibility: new query patterns require new endpoints
4. ❌ Client complexity: clients must learn 31 different APIs
GQL-First Benefits:
1. ✅ Single learning curve: learn GQL once, query anything
2. ✅ Composability: combine operations in one query
3. ✅ Efficiency: reduce round-trips (get related data in one call)
4. ✅ Maintainability: one endpoint to secure, test, document
Example Comparison:
# REST approach (3 requests):
GET /api/v1/kg/nodes/123
GET /api/v1/kg/neighbors/123
GET /api/v1/kg/edges?source_id=123
# GQL approach (1 request):
MATCH (n {id: '123'})-[r]-(m)
RETURN n, r, m
Future Additions (If Needed)¶
Only add REST endpoints if:
1. High-frequency operations would benefit from caching
2. Non-technical users need simple URLs
3. External integrations require REST
Potential additions:
- GET /api/v1/kg/export - Export full graph
- POST /api/v1/kg/import - Bulk import
- GET /api/v1/kg/schema - Schema introspection
Total: 2 core endpoints (vs 31 proposed) following modern API design best practices.
Implementation Status¶
✅ Completed (Production-Ready)¶
Phase 1-5: Core Implementation (100%)
- ✅ Property graph models with temporal support
- ✅ Hybrid storage (ChromaDB + libSQL)
- ✅ Multi-pass extraction with gleanings
- ✅ Enhanced multi-tier entity resolution (60-80% LLM reduction)
- ✅ Graph fusion with conflict resolution
- ✅ Edge integrity with atomic updates (bonus)
- ✅ KG consolidation scheduler
Phase 1.5: High-Priority Enhancements (100%)
- ✅ Bi-temporal model (valid_from/until, is_current)
- ✅ Personal graph layer (PROJECT, GOAL, TASK labels)
- ✅ Graph traversal (BFS/DFS, path finding)
Phase 2: Medium-Priority Enhancements (100%)
- ✅ Property index tables with automatic triggers
- ✅ Canonical IDs and entity disambiguation
Phase 3: Advanced Features (100%)
- ✅ Graph analytics (PageRank, centrality, clustering)
- ✅ GQL/Cypher query support (GrandCypher)
API Layer (Minimal by Design)
- ✅ GQL query endpoint (primary interface)
- ✅ Statistics endpoint
- ⚠️ Additional REST endpoints: not needed (GQL-first approach)
🎉 Key Achievements¶
1. Multi-Tier Entity Resolution (Beyond Original Design)
   - Tier 1: exact matching (60-80% of cases, instant)
   - Tier 2: LLM verification (fuzzy cases only)
   - Result: 60-80% fewer LLM calls, same accuracy
2. Edge Integrity (Critical Addition)
   - Node mapping during merge
   - Atomic edge updates
   - 100% referential integrity
3. Simplified API (Best Practice)
   - GQL-first approach (2 endpoints vs 31)
   - ISO standard query language
   - More flexible, easier to maintain
📊 Performance Results¶
Deduplication:
- ✅ Stable entity count across runs (14 → 14 → 14)
- ✅ 95%+ accuracy maintained
- ✅ 60-80% fewer LLM calls
- ✅ 44% faster processing (146s vs 260s)
Edge Integrity:
- ✅ 100% of edges point to current nodes
- ✅ 0 broken references (was 4/7)
- ✅ Atomic updates with transactions
Extraction Completeness:
- ✅ Multi-pass extraction working
- ✅ Personal graph labels recognized
- ✅ Semantic label correction active
Conclusion¶
The knowledge graph implementation exceeds the original proposal with production-grade enhancements:
- Multi-tier entity resolution reduces costs while maintaining accuracy
- Edge integrity ensures data consistency
- GQL-first API follows modern best practices
- Complete backend with all core features implemented
The system is production-ready and solves the deduplication problem while providing foundation for relationship intelligence, autonomous agency, and emotional memory.
Status: ✅ Implementation Complete - Ready for production use