Avatar Lip-Sync System¶
Overview¶
Real-time lip-sync for the 3D avatar using Web Audio API frequency analysis. The system analyzes audio amplitude and frequency bands to estimate phonemes and map them to ARKit blend shapes for natural mouth movement.
Current Implementation¶
Architecture¶
- Input: Base64 WAV audio from TTS backend
- Analysis: Web Audio API AnalyserNode (FFT size: 2048)
- Detection: RMS amplitude + 3-band frequency analysis (low/mid/high)
- Output: 12 visemes mapped to ARKit blend shapes
- Interpolation: Smooth LERP transitions between visemes (factor 0.5 per frame; see the sketch below)
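The analyser setup can be sketched as follows. This is a minimal illustration rather than the exact viewer.js code; `playWithLipSync` and the variable names are illustrative.

```javascript
// Minimal sketch of the analyser setup (illustrative names, not the exact viewer.js code).
const audioCtx = new AudioContext();
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 2048;               // frequency resolution (see Configuration)
analyser.smoothingTimeConstant = 0.8;  // temporal smoothing (see Configuration)

// Decode the base64 WAV payload from the TTS backend and route it through the
// analyser so viseme detection can run while the clip plays.
async function playWithLipSync(base64Wav) {
  const bytes = Uint8Array.from(atob(base64Wav), c => c.charCodeAt(0));
  const buffer = await audioCtx.decodeAudioData(bytes.buffer);
  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(analyser);
  analyser.connect(audioCtx.destination);
  source.start();
}
```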
Viseme Set (12 total)¶
- Vowels (5): aa, E, I, O, U
- Consonants (7): PP, FF, TH, DD, kk, SS, CH
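As a constant, the set is simply the following (names as used throughout this document):

```javascript
// The 12 visemes produced by the detector.
const VISEMES = [
  'aa', 'E', 'I', 'O', 'U',                   // vowels
  'PP', 'FF', 'TH', 'DD', 'kk', 'SS', 'CH',   // consonants
];
```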
Blend Shape Strategy¶
Each viseme uses multiple ARKit blend shapes for natural 3D movement:
- Vertical: jawOpen (primary mouth opening)
- Lateral: mouthStretch (horizontal width)
- Depth: jawForward, mouthPucker, mouthFunnel (3D roundedness)
- Detail: mouthSmile, tongueOut, mouthUpperUp, mouthLowerDown
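An illustrative mapping is shown below; the weights are placeholders that show the shape of the data, not the tuned values in viewer.js.

```javascript
// Illustrative viseme → ARKit blend shape weights (placeholder values, not the tuned ones).
const VISEME_BLEND_SHAPES = {
  aa: { jawOpen: 0.7, mouthStretch: 0.2, jawForward: 0.1 },       // open vowel
  O:  { jawOpen: 0.5, mouthPucker: 0.6, mouthFunnel: 0.4 },       // rounded vowel
  I:  { jawOpen: 0.15, mouthStretch: 0.6, mouthSmile: 0.3 },      // closed, spread vowel
  PP: { jawOpen: 0.0, mouthPucker: 0.3 },                         // bilabial closure
  FF: { jawOpen: 0.1, mouthLowerDown: 0.4, mouthUpperUp: 0.2 },   // labiodental
  TH: { jawOpen: 0.2, tongueOut: 0.5 },                           // dental, tongue visible
  SS: { jawOpen: 0.1, mouthStretch: 0.5 },                        // sibilant
  // ...remaining visemes (E, U, DD, kk, CH) follow the same pattern
};
```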
Detection Method¶
Frequency-based heuristics across 9 amplitude levels:
- High amplitude + low freq dominance → Open vowels (aa, O)
- Medium amplitude + high freq dominance → Closed vowels (I) or sibilants (SS)
- Low amplitude + high freq → Consonants (PP, FF, TH, CH)
- Mid freq dominance → Alveolar/velar consonants (DD, kk)
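A rough sketch of the heuristic follows. The thresholds and branch order are illustrative, and the real detector distinguishes 9 amplitude levels rather than the handful shown here.

```javascript
// Rough sketch of the frequency heuristic (illustrative thresholds, simplified branching).
function detectViseme(analyser) {
  const data = new Uint8Array(analyser.frequencyBinCount);   // 1024 bins at fftSize 2048
  analyser.getByteFrequencyData(data);

  // RMS amplitude across all bins, normalised to 0..1.
  const rms = Math.sqrt(data.reduce((s, v) => s + v * v, 0) / data.length) / 255;

  // Average energy in the low / mid / high thirds of the spectrum.
  const third = Math.floor(data.length / 3);
  const band = (from, to) => data.slice(from, to).reduce((s, v) => s + v, 0) / (to - from) / 255;
  const low = band(0, third), mid = band(third, 2 * third), high = band(2 * third, data.length);

  if (rms < 0.02) return null;                       // silence → relax to neutral
  if (rms > 0.3) return low > high ? 'aa' : 'O';     // high amplitude, low-dominant → open vowels
  if (high > low) return rms > 0.15 ? 'I' : 'FF';    // bright: medium amplitude → closed vowel, quiet → consonant
  if (mid > low && mid > high) return 'DD';          // mid-dominant → alveolar/velar
  return 'E';                                        // fallback
}
```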
Performance¶
- Update rate: 60 FPS (animation loop)
- Processing budget: <2ms per frame
- Accuracy: ~75-80% phoneme approximation
- Latency: <16ms (real-time)
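The per-frame loop and the 2 ms budget check might look roughly like this, reusing `detectViseme` from the sketch above and `applyViseme` from the sketch further below; the overrun logging is an assumption, not documented behaviour.

```javascript
// Sketch of the 60 FPS update loop with a 2 ms processing budget check.
const MAX_LIPSYNC_TIME_MS = 2;

function updateLipSync() {
  const start = performance.now();

  const viseme = detectViseme(analyser);   // frequency analysis → viseme (sketched above)
  applyViseme(viseme);                     // LERP blend shapes toward the target (sketched below)

  // Surface budget overruns so performance regressions are visible during development.
  const elapsed = performance.now() - start;
  if (elapsed > MAX_LIPSYNC_TIME_MS) {
    console.warn(`Lip-sync frame took ${elapsed.toFixed(2)} ms (budget: ${MAX_LIPSYNC_TIME_MS} ms)`);
  }

  requestAnimationFrame(updateLipSync);    // once per display frame (~60 FPS)
}
```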
Limitations¶
Current Constraints¶
- No true phoneme analysis - frequency heuristics approximate phonemes
- Missing visemes - No RR (r) or nn (nasal) detection
- Consonant ambiguity - Similar frequencies for different consonants
- No coarticulation - Each viseme independent, no context awareness
- Language-specific - Tuned for English phonemes
Comparison to Professional Systems¶
- Oculus OVR LipSync: 15 visemes, 95% accuracy, phoneme-based
- Our system: 12 visemes, 75-80% accuracy, frequency-based
- Coverage: ~80% of professional viseme set
Future Enhancement: Rhubarb Lip-Sync¶
Overview¶
Rhubarb Lip-Sync is an MIT-licensed command-line tool for phoneme-accurate lip-sync analysis.
Integration Plan¶
- Backend processing: Run Rhubarb on TTS-generated audio
- Output format: JSON with viseme IDs + timestamps
- Frontend playback: Sync viseme changes to audio timeline
- Fallback: Keep current frequency analysis as backup
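A playback sketch under this plan is shown below. The cue structure follows Rhubarb's documented JSON export (`mouthCues` entries with `start`, `end`, and a mouth-shape `value`); the mapping from Rhubarb's A-X shapes onto the current 12-viseme set and the `startTimelinePlayback` helper are assumptions for illustration.

```javascript
// Backend (conceptually): rhubarb -f json -o cues.json speech.wav
// Frontend sketch: drive the existing blend-shape path from Rhubarb's timed mouth cues.

// Assumed mapping from Rhubarb mouth shapes (A-H, X) to the current viseme set.
const RHUBARB_TO_VISEME = {
  A: 'PP', B: 'kk', C: 'E', D: 'aa', E: 'O', F: 'U', G: 'FF', H: 'TH',
  X: null,   // idle → relax to neutral
};

function startTimelinePlayback(audioElement, rhubarbResult) {
  const cues = rhubarbResult.mouthCues;   // [{ start, end, value }, ...] in seconds
  let index = 0;

  function tick() {
    const t = audioElement.currentTime;
    while (index < cues.length - 1 && t >= cues[index].end) index++;   // advance to the current cue
    applyViseme(RHUBARB_TO_VISEME[cues[index].value]);                 // reuse the existing blend-shape path
    if (!audioElement.ended) requestAnimationFrame(tick);
  }
  requestAnimationFrame(tick);
}
```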
Expected Improvements¶
- Accuracy: 75-80% → 85-90%
- Phoneme coverage: True phoneme detection vs. frequency guessing
- Consistency: Deterministic results vs. heuristic variation
- Offline: No API dependencies, fully local
Implementation Effort¶
- Backend: Add Rhubarb binary + audio processing pipeline
- Frontend: Replace real-time detection with timeline playback
- Migration: Gradual - keep current system during transition
- Timeline: 2-3 weeks for full integration
Technical Details¶
Files¶
- `/frontend/assets/avatar/viewer.js` - Main lip-sync implementation
- `/frontend/lib/data/repositories/tts_repository_impl.dart` - Audio data pass-through
Key Functions¶
- `initLipSync()` - Initialize Web Audio API analyser
- `detectViseme()` - Frequency analysis → viseme detection
- `applyViseme()` - Viseme → ARKit blend shapes with interpolation
- `updateLipSync()` - 60 FPS animation loop
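For example, the interpolation step might look like the sketch below, assuming the avatar head mesh (`avatarMesh`) exposes the ARKit morph targets through three.js's `morphTargetDictionary` / `morphTargetInfluences` and reusing the `VISEME_BLEND_SHAPES` mapping sketched earlier; this is not the exact viewer.js code.

```javascript
// Sketch of applyViseme(): lerp every ARKit morph target toward the weights for the
// requested viseme (assumes `avatarMesh` and VISEME_BLEND_SHAPES from the sketches above).
const LERP_SPEED = 0.5;   // move 50% of the remaining distance each frame

function applyViseme(viseme) {
  const target = viseme ? VISEME_BLEND_SHAPES[viseme] : {};   // null → relax toward neutral
  for (const [name, index] of Object.entries(avatarMesh.morphTargetDictionary)) {
    const goal = target[name] ?? 0;                           // unused shapes ease back to 0
    const current = avatarMesh.morphTargetInfluences[index];
    avatarMesh.morphTargetInfluences[index] = current + (goal - current) * LERP_SPEED;
  }
}
```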
Configuration¶
- `LERP_SPEED`: 0.5 (interpolation speed)
- `MAX_LIPSYNC_TIME_MS`: 2ms (performance budget)
- FFT size: 2048 (frequency resolution)
- Smoothing: 0.8 (temporal smoothing)
Design Principles¶
- Offline-first: No external API dependencies
- Real-time: <16ms latency for responsive lip-sync
- TTS-agnostic: Works with any audio source
- Natural movement: Multiple blend shapes per viseme
- Performance: <2ms processing per frame
- Graceful degradation: Falls back to silence on errors