Avatar Lip-Sync System¶
Overview¶
Real-time lip-sync for the 3D avatar using Web Audio API frequency analysis. The system analyzes audio amplitude and frequency bands to estimate phonemes and map them to ARKit blend shapes for natural mouth movement.
Current Implementation¶
Architecture¶
- Input: Base64 WAV audio from TTS backend
- Analysis: Web Audio API AnalyserNode (FFT size: 2048)
- Detection: RMS amplitude + 3-band frequency analysis (low/mid/high)
- Output: 12 visemes mapped to ARKit blend shapes
- Interpolation: Smooth LERP transitions between visemes (factor 0.5 per frame; see the sketch below)
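The analyser setup can be sketched as follows. This is a minimal illustration rather than the exact viewer.js code; `playWithLipSync` and the variable names are illustrative.

```javascript
// Minimal sketch of the analyser setup (illustrative names, not the exact viewer.js code).
const audioCtx = new AudioContext();
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 2048;               // frequency resolution (see Configuration)
analyser.smoothingTimeConstant = 0.8;  // temporal smoothing (see Configuration)

// Decode the base64 WAV payload from the TTS backend and route it through the
// analyser so viseme detection can run while the clip plays.
async function playWithLipSync(base64Wav) {
  const bytes = Uint8Array.from(atob(base64Wav), c => c.charCodeAt(0));
  const buffer = await audioCtx.decodeAudioData(bytes.buffer);
  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(analyser);
  analyser.connect(audioCtx.destination);
  source.start();
}
```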
Viseme Set (12 total)¶
- Vowels (5): aa, E, I, O, U
- Consonants (7): PP, FF, TH, DD, kk, SS, CH
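As a constant, the set is simply the following (names as used throughout this document):

```javascript
// The 12 visemes produced by the detector.
const VISEMES = [
  'aa', 'E', 'I', 'O', 'U',                   // vowels
  'PP', 'FF', 'TH', 'DD', 'kk', 'SS', 'CH',   // consonants
];
```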
Blend Shape Strategy¶
Each viseme uses multiple ARKit blend shapes for natural 3D movement:
- Vertical: jawOpen (primary mouth opening)
- Lateral: mouthStretch (horizontal width)
- Depth: jawForward, mouthPucker, mouthFunnel (3D roundedness)
- Detail: mouthSmile, tongueOut, mouthUpperUp, mouthLowerDown
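An illustrative mapping is shown below; the weights are placeholders that show the shape of the data, not the tuned values in viewer.js.

```javascript
// Illustrative viseme → ARKit blend shape weights (placeholder values, not the tuned ones).
const VISEME_BLEND_SHAPES = {
  aa: { jawOpen: 0.7, mouthStretch: 0.2, jawForward: 0.1 },       // open vowel
  O:  { jawOpen: 0.5, mouthPucker: 0.6, mouthFunnel: 0.4 },       // rounded vowel
  I:  { jawOpen: 0.15, mouthStretch: 0.6, mouthSmile: 0.3 },      // closed, spread vowel
  PP: { jawOpen: 0.0, mouthPucker: 0.3 },                         // bilabial closure
  FF: { jawOpen: 0.1, mouthLowerDown: 0.4, mouthUpperUp: 0.2 },   // labiodental
  TH: { jawOpen: 0.2, tongueOut: 0.5 },                           // dental, tongue visible
  SS: { jawOpen: 0.1, mouthStretch: 0.5 },                        // sibilant
  // ...remaining visemes (E, U, DD, kk, CH) follow the same pattern
};
```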
Detection Method¶
Frequency-based heuristics across 9 amplitude levels:
- High amplitude + low freq dominance → Open vowels (aa, O)
- Medium amplitude + high freq dominance → Closed vowels (I) or sibilants (SS)
- Low amplitude + high freq → Consonants (PP, FF, TH, CH)
- Mid freq dominance → Alveolar/velar consonants (DD, kk)
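A rough sketch of the heuristic follows. The thresholds and branch order are illustrative, and the real detector distinguishes 9 amplitude levels rather than the handful shown here.

```javascript
// Rough sketch of the frequency heuristic (illustrative thresholds, simplified branching).
function detectViseme(analyser) {
  const data = new Uint8Array(analyser.frequencyBinCount);   // 1024 bins at fftSize 2048
  analyser.getByteFrequencyData(data);

  // RMS amplitude across all bins, normalised to 0..1.
  const rms = Math.sqrt(data.reduce((s, v) => s + v * v, 0) / data.length) / 255;

  // Average energy in the low / mid / high thirds of the spectrum.
  const third = Math.floor(data.length / 3);
  const band = (from, to) => data.slice(from, to).reduce((s, v) => s + v, 0) / (to - from) / 255;
  const low = band(0, third), mid = band(third, 2 * third), high = band(2 * third, data.length);

  if (rms < 0.02) return null;                       // silence → relax to neutral
  if (rms > 0.3) return low > high ? 'aa' : 'O';     // high amplitude, low-dominant → open vowels
  if (high > low) return rms > 0.15 ? 'I' : 'FF';    // bright: medium amplitude → closed vowel, quiet → consonant
  if (mid > low && mid > high) return 'DD';          // mid-dominant → alveolar/velar
  return 'E';                                        // fallback
}
```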
Performance¶
- Update rate: 60 FPS (animation loop)
- Processing budget: <2ms per frame
- Accuracy: ~75-80% phoneme approximation
- Latency: <16ms (real-time)
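The per-frame loop and the 2 ms budget check might look roughly like this, reusing `detectViseme` from the sketch above and `applyViseme` from the sketch further below; the overrun logging is an assumption, not documented behaviour.

```javascript
// Sketch of the 60 FPS update loop with a 2 ms processing budget check.
const MAX_LIPSYNC_TIME_MS = 2;

function updateLipSync() {
  const start = performance.now();

  const viseme = detectViseme(analyser);   // frequency analysis → viseme (sketched above)
  applyViseme(viseme);                     // LERP blend shapes toward the target (sketched below)

  // Surface budget overruns so performance regressions are visible during development.
  const elapsed = performance.now() - start;
  if (elapsed > MAX_LIPSYNC_TIME_MS) {
    console.warn(`Lip-sync frame took ${elapsed.toFixed(2)} ms (budget: ${MAX_LIPSYNC_TIME_MS} ms)`);
  }

  requestAnimationFrame(updateLipSync);    // once per display frame (~60 FPS)
}
```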
Limitations¶
Current Constraints¶
- No true phoneme analysis - frequency heuristics approximate phonemes
- Missing visemes - No RR (r) or nn (nasal) detection
- Consonant ambiguity - Similar frequencies for different consonants
- No coarticulation - Each viseme independent, no context awareness
- Language-specific - Tuned for English phonemes
Comparison to Professional Systems¶
- Oculus OVR LipSync: 15 visemes, 95% accuracy, phoneme-based
- Our system: 12 visemes, 75-80% accuracy, frequency-based
- Coverage: ~80% of professional viseme set
Future Enhancement: Rhubarb Lip-Sync¶
Overview¶
Rhubarb Lip-Sync is an MIT-licensed command-line tool for phoneme-accurate lip-sync analysis.
Integration Plan¶
- Backend processing: Run Rhubarb on TTS-generated audio
- Output format: JSON with viseme IDs + timestamps
- Frontend playback: Sync viseme changes to audio timeline
- Fallback: Keep current frequency analysis as backup
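A playback sketch under this plan is shown below. The cue structure follows Rhubarb's documented JSON export (`mouthCues` entries with `start`, `end`, and a mouth-shape `value`); the mapping from Rhubarb's A-X shapes onto the current 12-viseme set and the `startTimelinePlayback` helper are assumptions for illustration.

```javascript
// Backend (conceptually): rhubarb -f json -o cues.json speech.wav
// Frontend sketch: drive the existing blend-shape path from Rhubarb's timed mouth cues.

// Assumed mapping from Rhubarb mouth shapes (A-H, X) to the current viseme set.
const RHUBARB_TO_VISEME = {
  A: 'PP', B: 'kk', C: 'E', D: 'aa', E: 'O', F: 'U', G: 'FF', H: 'TH',
  X: null,   // idle → relax to neutral
};

function startTimelinePlayback(audioElement, rhubarbResult) {
  const cues = rhubarbResult.mouthCues;   // [{ start, end, value }, ...] in seconds
  let index = 0;

  function tick() {
    const t = audioElement.currentTime;
    while (index < cues.length - 1 && t >= cues[index].end) index++;   // advance to the current cue
    applyViseme(RHUBARB_TO_VISEME[cues[index].value]);                 // reuse the existing blend-shape path
    if (!audioElement.ended) requestAnimationFrame(tick);
  }
  requestAnimationFrame(tick);
}
```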
Expected Improvements¶
- Accuracy: 75-80% → 85-90%
- Phoneme coverage: True phoneme detection vs. frequency guessing
- Consistency: Deterministic results vs. heuristic variation
- Offline: No API dependencies, fully local
Implementation Effort¶
- Backend: Add Rhubarb binary + audio processing pipeline
- Frontend: Replace real-time detection with timeline playback
- Migration: Gradual - keep current system during transition
- Timeline: 2-3 weeks for full integration
Technical Details¶
Files¶
- `/frontend/assets/avatar/viewer.js` - Main lip-sync implementation
- `/frontend/lib/data/repositories/tts_repository_impl.dart` - Audio data pass-through
Key Functions¶
- `initLipSync()` - Initialize Web Audio API analyser
- `detectViseme()` - Frequency analysis → viseme detection
- `applyViseme()` - Viseme → ARKit blend shapes with interpolation
- `updateLipSync()` - 60 FPS animation loop
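For example, the interpolation step might look like the sketch below, assuming the avatar head mesh (`avatarMesh`) exposes the ARKit morph targets through three.js's `morphTargetDictionary` / `morphTargetInfluences` and reusing the `VISEME_BLEND_SHAPES` mapping sketched earlier; this is not the exact viewer.js code.

```javascript
// Sketch of applyViseme(): lerp every ARKit morph target toward the weights for the
// requested viseme (assumes `avatarMesh` and VISEME_BLEND_SHAPES from the sketches above).
const LERP_SPEED = 0.5;   // move 50% of the remaining distance each frame

function applyViseme(viseme) {
  const target = viseme ? VISEME_BLEND_SHAPES[viseme] : {};   // null → relax toward neutral
  for (const [name, index] of Object.entries(avatarMesh.morphTargetDictionary)) {
    const goal = target[name] ?? 0;                           // unused shapes ease back to 0
    const current = avatarMesh.morphTargetInfluences[index];
    avatarMesh.morphTargetInfluences[index] = current + (goal - current) * LERP_SPEED;
  }
}
```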
Configuration¶
- `LERP_SPEED`: 0.5 (interpolation speed)
- `MAX_LIPSYNC_TIME_MS`: 2ms (performance budget)
- FFT size: 2048 (frequency resolution)
- Smoothing: 0.8 (temporal smoothing)
Design Principles¶
- Offline-first: No external API dependencies
- Real-time: <16ms latency for responsive lip-sync
- TTS-agnostic: Works with any audio source
- Natural movement: Multiple blend shapes per viseme
- Performance: <2ms processing per frame
- Graceful degradation: Falls back to silence on errors