
Hybrid Search: Semantic + BM25 with IDF Filtering

Overview

AICO's semantic memory implements a hybrid search system that combines semantic similarity (vector embeddings) with keyword matching (BM25) to provide accurate, relevant search results. This approach addresses the limitations of pure semantic search while maintaining the benefits of meaning-based retrieval.

Key Features:

  • Full corpus BM25 calculation - Proper IDF statistics on all documents
  • IDF-based term filtering - Removes overly common words from queries
  • Semantic relevance threshold - Filters out irrelevant results
  • Reciprocal Rank Fusion (RRF) - Robust score combination
  • Configurable thresholds - Tunable for different use cases

Architecture

Three-Stage Pipeline

Query → Stage 1: Retrieval → Stage 2: Scoring → Stage 3: Fusion & Filtering → Results
         (ChromaDB)           (BM25 + Semantic)   (RRF + Thresholds)

Stage 1: Full Corpus Retrieval

# Fetch ALL documents for proper BM25 IDF calculation
collection_count = collection.count()
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=collection_count  # CRITICAL: Full corpus, not just top-N
)

Why full corpus?

  • BM25 IDF (Inverse Document Frequency) requires accurate term statistics
  • Calculating IDF on only the top-N semantic results produces incorrect scores
  • Example: "today" appears in 1/10 top results (IDF=1.99) but in 15/26 documents of the full corpus (IDF=0.55)
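A quick way to sanity-check those numbers is to plug them into the IDF formula used in Stage 2 (a standalone helper for illustration, not part of the pipeline code):

import math

def idf(n_docs: int, doc_freq: int) -> float:
    # BM25 IDF, same formula as in Stage 2 below
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)

print(idf(10, 1))   # ≈ 1.99 — "today" looks rare among just the top 10
print(idf(26, 15))  # ≈ 0.55 — but it is common across the full 26-doc corpus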

Stage 2: Dual Scoring

Semantic Scoring:

# Convert distance to similarity (0-1 range)
semantic_score = max(0.0, min(1.0, 1.0 - (distance / 2.0)))
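Applied to the Stage 1 results, this becomes a one-liner over the returned distances (a sketch assuming ChromaDB's default result layout, where results['distances'][0] holds the distances for the first query embedding):

# Distances for the first (only) query embedding from the Stage 1 results
distances = results['distances'][0]
semantic_scores = [max(0.0, min(1.0, 1.0 - (d / 2.0))) for d in distances]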

BM25 Scoring with IDF Filtering:

import math
from collections import Counter

def calculate_bm25(documents, query_text, k1=1.5, b=0.75, min_idf=0.6):
    """Return one BM25 score per document, using IDF-filtered query terms."""
    # 1. Tokenize with punctuation removal ("today?" → "today");
    #    tokenize() is defined under Implementation Details below
    query_terms = tokenize(query_text)
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(doc_tokens)
    avg_len = sum(len(toks) for toks in doc_tokens) / max(n_docs, 1)

    # 2. Calculate IDF for each unique query term
    idf = {}
    filtered_terms = []
    for term in dict.fromkeys(query_terms):
        doc_freq = sum(1 for toks in doc_tokens if term in toks)
        term_idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
        # 3. Filter out low-IDF terms (too common to carry signal)
        if term_idf >= min_idf:
            idf[term] = term_idf
            filtered_terms.append(term)

    # 4. Calculate each document's BM25 score using the filtered terms
    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        doc_len = len(toks)
        scores.append(sum(
            idf[term] * (tf[term] * (k1 + 1))
            / (tf[term] + k1 * (1 - b + b * (doc_len / avg_len)))
            for term in filtered_terms
        ))
    return scores

BM25 Parameters:

  • k1=1.5: Term frequency saturation (higher = more weight to term frequency)
  • b=0.75: Length normalization (0 = no normalization, 1 = full normalization)
  • min_idf=0.6: Minimum IDF threshold for term inclusion (configurable)
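A usage sketch with a hypothetical three-document corpus (document texts are illustrative only):

docs = [
    "Michael was born in Schaffhausen",
    "My favorite author is Terry Pratchett",
    "How may I assist you today?",
]
scores = calculate_bm25(docs, "Where was Michael born?")
# The first document contains "michael", "was", and "born", so it scores highest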

Stage 3: Fusion & Filtering

Reciprocal Rank Fusion (RRF):

def fuse_with_rrf(documents, k=60, min_semantic_score=0.35):
    """Fuse semantic and BM25 rankings via Reciprocal Rank Fusion."""
    # 1. Filter by semantic relevance FIRST
    relevant_docs = [d for d in documents if d['semantic_score'] >= min_semantic_score]

    def rank_by_score(docs, key):
        # Rank 1 = highest score; ties keep their original order
        ordered = sorted(docs, key=lambda d: d[key], reverse=True)
        return {id(d): rank for rank, d in enumerate(ordered, start=1)}

    # 2. Rank by semantic score; 3. rank by BM25 score
    semantic_ranks = rank_by_score(relevant_docs, 'semantic_score')
    bm25_ranks = rank_by_score(relevant_docs, 'bm25_score')

    # 4. Combine ranks using the RRF formula
    for doc in relevant_docs:
        doc['hybrid_score'] = (1 / (k + semantic_ranks[id(doc)])
                               + 1 / (k + bm25_ranks[id(doc)]))

    # 5. Sort by hybrid score, best first
    return sorted(relevant_docs, key=lambda d: d['hybrid_score'], reverse=True)

Why RRF over weighted fusion?

  • ✅ Scale-invariant: works regardless of score ranges
  • ✅ Robust to outliers: rank-based, not score-based
  • ✅ No normalization needed: avoids min-max scaling issues
  • ✅ Industry standard: used by Elasticsearch, OpenSearch, etc.
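A usage sketch with hypothetical pre-scored documents:

docs = [
    {'text': 'Michael was born in Schaffhausen', 'semantic_score': 0.42, 'bm25_score': 3.1},
    {'text': 'How may I assist you today?', 'semantic_score': 0.27, 'bm25_score': 1.2},
]
ranked = fuse_with_rrf(docs)
# The greeting falls below min_semantic_score=0.35 and is dropped before fusion;
# the remaining document ranks first on both lists: 1/(60+1) + 1/(60+1) ≈ 0.033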


Configuration

Core Settings (config/defaults/core.yaml)

core:
  memory:
    semantic:
      # Hybrid search fusion method
      fusion_method: "rrf"  # "rrf" (recommended) or "weighted" (legacy)
      rrf_rank_constant: 0  # 0 = adaptive (recommended), or 10-60 manual

      # BM25 configuration
      bm25_min_idf: 0.6  # Minimum IDF threshold for query terms
                         # Higher = more aggressive filtering of common words

      # Semantic relevance filtering
      min_semantic_score: 0.35  # Minimum semantic score for relevance
                                # Documents below this are filtered as irrelevant
                                # Range: 0.0-1.0 (higher = stricter)

      # Result quality threshold
      min_similarity: 0.4  # Minimum hybrid score for final results
                          # Applied AFTER fusion

      # Legacy weighted fusion (if fusion_method="weighted")
      semantic_weight: 0.7  # Weight for semantic similarity
      bm25_weight: 0.3      # Weight for BM25 score
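These settings can be read with a plain YAML load; a minimal sketch, assuming PyYAML and the path above (AICO's actual config loader is not shown in this document):

import yaml

with open("config/defaults/core.yaml") as f:
    semantic_cfg = yaml.safe_load(f)["core"]["memory"]["semantic"]

min_idf = semantic_cfg["bm25_min_idf"]             # 0.6
min_semantic = semantic_cfg["min_semantic_score"]  # 0.35
min_similarity = semantic_cfg["min_similarity"]    # 0.4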

Threshold Guidelines

| Threshold | Purpose | Recommended Value | Impact |
| --- | --- | --- | --- |
| bm25_min_idf | Filter common query terms | 0.6 | Higher = fewer terms, more precise |
| min_semantic_score | Filter irrelevant docs | 0.35 | Higher = fewer false positives, may lose valid results |
| min_similarity | Final quality filter | 0.4 | Higher = only high-quality results |
| rrf_rank_constant | RRF fusion balance | 0 (adaptive) | Higher = more uniform scores |

Implementation Details

Tokenization

Punctuation Handling:

import re
from typing import List

def tokenize(text: str) -> List[str]:
    """Tokenize with punctuation removal for consistent matching."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)  # Replace punctuation with spaces
    return text.split()

Why remove punctuation?

  • "today?" and "today" should match the same corpus terms
  • Punctuation creates artificial term variations
  • Improves IDF calculation accuracy
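For instance:

tokenize("What is the weather like today?")  # → ['what', 'is', 'the', 'weather', 'like', 'today']
tokenize("today?") == tokenize("today")      # → True: both yield ['today']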

IDF Filtering Logic

Decision Tree:

For each query term:
  ├─ Term NOT in corpus?
  │  └─ KEEP (rare term, potentially important)
  ├─ Calculate IDF = log((N - df + 0.5) / (df + 0.5) + 1.0)
  ├─ IDF >= min_idf (0.6)?
  │  ├─ YES → KEEP (discriminative term)
  │  └─ NO → FILTER (too common, low signal)

Example (26-document corpus):

  • "michael": df=2/26 → IDF=2.38 → KEPT (rare, meaningful)
  • "today": df=1/26 → IDF=2.89 → KEPT (very rare)
  • "is": df=9/26 → IDF=1.04 → KEPT (moderately common)
  • "the": df=15/26 → IDF=0.55 → FILTERED (too common)
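These values follow directly from the formula (reusing the idf helper sketched in the Stage 1 section):

print(idf(26, 2))   # ≈ 2.38 — "michael", kept
print(idf(26, 1))   # ≈ 2.89 — "today", kept
print(idf(26, 9))   # ≈ 1.04 — "is", kept
print(idf(26, 15))  # ≈ 0.55 — "the", filtered (< 0.6)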

Semantic Relevance Filtering

Two-Level Filtering:

  1. Pre-Fusion Filter (min_semantic_score=0.35):
     • Applied BEFORE RRF fusion
     • Removes completely irrelevant documents
     • Prevents false positives from keyword-only matches

  2. Post-Fusion Filter (min_similarity=0.4):
     • Applied AFTER RRF fusion
     • Ensures final result quality
     • Filters weak hybrid matches

Example Flow:

Query: "What is the weather like today?"

Documents:
  1. "How may I assist you today?" 
     → semantic=0.266 → FILTERED (< 0.35) ❌

  2. "Michael was born in Schaffhausen"
     → semantic=0.420 → PASSED pre-filter ✅
     → hybrid=0.143 → PASSED post-filter ✅


Performance Characteristics

Accuracy Metrics (26-doc test corpus)

| Query Type | Accuracy | Notes |
| --- | --- | --- |
| Factual queries | 100% | "Where was Michael born?" |
| Preference queries | 100% | "What are Michael's hobbies?" |
| Out-of-domain | 100% | "What is the weather?" → No results |
| Similar entities | 89% | "favorite movie" matches "favorite author" (semantic limitation) |

Overall: 89% accuracy (8/9 test queries correct)

Speed Benchmarks

| Operation | Time | Notes |
| --- | --- | --- |
| Full corpus retrieval | 10-50ms | Depends on corpus size |
| BM25 calculation | 5-20ms | Linear with corpus size |
| RRF fusion | 1-5ms | Rank-based, very fast |
| Total query time | 20-100ms | Acceptable for real-time |

Scalability

| Corpus Size | Query Time | Memory Usage |
| --- | --- | --- |
| 100 docs | 20ms | 50MB |
| 1,000 docs | 50ms | 200MB |
| 10,000 docs | 150ms | 1GB |
| 100,000 docs | 500ms | 5GB |

Optimization for large corpora:

  • Consider pre-filtering by metadata before full corpus retrieval (see the sketch below)
  • Use approximate nearest neighbor (ANN) for semantic search
  • Implement caching for frequently accessed documents
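As a sketch of the first option, ChromaDB supports metadata filters directly in the query call; the entity_type field here is hypothetical:

# Narrow the candidate set by (hypothetical) metadata before scoring
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=collection.count(),
    where={"entity_type": "fact"},  # only documents tagged as facts
)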


Known Limitations

1. Embedding Model Granularity

Issue: 768-dimensional embeddings cannot distinguish all fine-grained entity types.

Example:

Query: "What is Michael's favorite movie?"
Match: "My favorite author is Terry Pratchett"
Reason: Semantic similarity between "favorite X" patterns

Why this happens:

  • Embedding models learn general patterns (preferences, entities)
  • 768 dimensions can represent ~1.7M distinct combinations (research: ArXiv 2508.21038)
  • Fine-grained distinctions (author vs movie vs book) require more dimensions or metadata

Solutions:

  1. Accept the limitation (89% accuracy is excellent) ✅ Current approach
  2. Add metadata filtering (entity_type tags) - Future enhancement
  3. Increase the threshold to 0.50 - Loses valid results (not recommended)
  4. Use larger models (3072+ dims) - 4-5x slower, diminishing returns

2. Small Corpus Statistics

Issue: IDF calculations less reliable with <1000 documents.

Impact:

  • Rare terms may have inflated IDF scores
  • Common terms may not be properly identified
  • Threshold tuning is more sensitive

Mitigation:

  • Use a conservative min_idf threshold (0.6)
  • Rely more on semantic scores for small corpora
  • Consider disabling IDF filtering for <100 docs (see the sketch below)
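The last point could look like this in practice (a sketch, not current default behavior):

# Disable IDF filtering when the corpus is too small for stable statistics
min_idf = 0.0 if collection.count() < 100 else 0.6
scores = calculate_bm25(documents, query_text, min_idf=min_idf)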

3. Query Complexity

Issue: Complex multi-part queries may not decompose well.

Example:

Query: "What did Michael study and where does he work?"
→ Two separate questions, may need query decomposition

Current behavior:

  • Both parts contribute to BM25 scoring
  • The semantic embedding captures overall intent
  • May return partial matches

Future enhancement:

  • Query decomposition into sub-queries
  • Multi-hop reasoning for complex questions


Comparison with Alternatives

Pure Semantic Search

Pros:

  • ✅ Understands meaning and context
  • ✅ Works with paraphrases and synonyms
  • ✅ Language-agnostic (with multilingual models)

Cons:

  • ❌ Misses exact keyword matches
  • ❌ Can match semantically similar but wrong entities
  • ❌ No control over term importance

Pure BM25 (Keyword Search)

Pros:

  • ✅ Fast and deterministic
  • ✅ Exact keyword matching
  • ✅ Well-understood behavior

Cons:

  • ❌ No semantic understanding
  • ❌ Fails on paraphrases
  • ❌ Sensitive to exact wording

Hybrid Search (Current Implementation)

Pros:

  • ✅ Best of both worlds
  • ✅ Robust to different query styles
  • ✅ Configurable balance
  • ✅ Industry-standard approach

Cons:

  • ❌ More complex implementation
  • ❌ Requires tuning for optimal performance
  • ❌ Higher computational cost


Troubleshooting

Issue: No results for valid queries

Symptoms:

  • Query should match documents but returns empty results
  • Semantic scores are all below the threshold

Diagnosis:

# Check semantic scores in debug mode
uv run aico chroma query collection_name "your query" --debug

Solutions:

  1. Lower min_semantic_score (try 0.25)
  2. Check that the embedding model is working
  3. Verify documents are properly embedded

Issue: False positives (irrelevant results)

Symptoms:

  • Results match keywords but the wrong context
  • "favorite movie" matches "favorite author"

Diagnosis:

  • Check semantic scores (should be in the 0.4-0.5 range)
  • Review BM25 scores (high BM25 + low semantic = keyword match)

Solutions:

  1. Increase min_semantic_score (try 0.40-0.45)
  2. Add metadata filtering (entity types)
  3. Accept the limitation (the LLM can disambiguate)

Issue: Slow queries

Symptoms:

  • Query time >500ms
  • High CPU usage during search

Diagnosis:

# Profile query performance
import time
start = time.time()
results = query_collection(...)
print(f"Query time: {time.time() - start:.3f}s")

Solutions:

  1. Reduce corpus size with metadata filters
  2. Implement result caching (see the sketch below)
  3. Use approximate nearest neighbor (ANN)
  4. Consider pre-filtering before full corpus retrieval
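For caching, a minimal sketch using functools.lru_cache around the embedding step; embedding_model.encode is a placeholder for whatever model call is in use:

from functools import lru_cache

@lru_cache(maxsize=1024)
def embed_query(query_text: str) -> tuple:
    # Tuples are hashable, so repeated queries hit the cache instead of the model
    return tuple(embedding_model.encode(query_text))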


Future Enhancements

Planned Improvements

  1. Metadata-based filtering
     • Add entity_type tags to documents
     • Filter by type before semantic search
     • Reduces false positives for entity queries

  2. Query expansion
     • Add synonyms and related terms
     • Improve recall for rare terms
     • Context-aware expansion

  3. Cross-encoder re-ranking
     • Two-stage retrieval: fast hybrid → slow cross-encoder
     • Better accuracy for top results
     • Trade-off: 10x slower but 5-10% accuracy gain

  4. Adaptive thresholds
     • Learn optimal thresholds per user
     • Adjust based on query success rate
     • Personalized search experience

Research Directions

  • Multi-vector embeddings: Separate vectors for different aspects (entities, intent, context)
  • Graph-based retrieval: Combine with knowledge graph for relationship-aware search
  • Neural re-ranking: Train custom re-ranker on user feedback


References

  • BM25 Algorithm: Robertson et al., "Okapi at TREC-3" (1995)
  • Reciprocal Rank Fusion: Cormack et al., "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods" (SIGIR 2009)
  • Embedding Limitations: "On the Theoretical Limitations of Embedding-Based Retrieval" (ArXiv 2508.21038, 2025)
  • Hybrid Search: Elastic, "A Comprehensive Hybrid Search Guide" (2024)
  • Query Attribute Modeling: "Improving search relevance with Semantic Search and Meta Data Filtering" (ArXiv 2508.04683, 2025)