Text-to-Speech (TTS) Architecture

Overview

Current implementation: AICO uses a backend TTS service in the modelservice (Piper TTS and Coqui XTTS v2) and streams ready-to-play audio to the Flutter frontend, which plays it via just_audio.

Avatar Lip-Sync Integration

  • TTS audio is also passed as base64 WAV data into the avatar WebView.
  • The WebView uses the Web Audio API (AnalyserNode) to perform frequency-based analysis and estimate visemes in real time (a conceptual sketch follows this list).
  • Estimated visemes drive ARKit blend shapes on the Ready Player Me avatar (Phase 1 lip-sync).
  • A future Phase 2 enhancement will evaluate Rhubarb Lip Sync in the backend to generate phoneme-accurate viseme timings, which can be streamed to the avatar for higher-accuracy lip-sync.
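The frequency-band heuristic above is easy to illustrate offline. In production the analysis runs in JavaScript inside the WebView on AnalyserNode data, so the following Python sketch is conceptual only: the band edges, thresholds, and viseme labels are illustrative assumptions, not AICO's actual mapping.

import numpy as np

# Conceptual only: band edges, thresholds, and viseme labels are assumptions;
# the real mapping runs in the WebView on Web Audio AnalyserNode data.
def estimate_viseme(frame, sample_rate=22050):
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    def band_energy(lo, hi):
        return float(spectrum[(freqs >= lo) & (freqs < hi)].sum())

    low = band_energy(80, 500)
    mid = band_energy(500, 2000)
    high = band_energy(2000, 8000)
    total = low + mid + high
    if total < 1e-3:        # near-silence: mouth closed
        return "viseme_sil"
    if low / total > 0.6:   # low-band dominant: open vowel shape
        return "viseme_aa"
    if high / total > 0.4:  # high-band dominant: fricative shape
        return "viseme_ss"
    return "viseme_E"       # otherwise a neutral mid-vowel shape

# A 50ms frame of a 220Hz tone classifies as an open vowel.
t = np.linspace(0, 0.05, int(22050 * 0.05), endpoint=False)
print(estimate_viseme(np.sin(2 * np.pi * 220 * t)))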

Design Principles

Local-First Operation

TTS synthesis runs in the local backend modelservice; audio is then streamed to the frontend. Text and audio remain on the user's device.

Thin Client Alignment

Synthesis is handled by the backend modelservice; audio playback is client-side.

Low Latency

Local backend processing eliminates external network round-trips, providing fast audio responses with minimal buffering and streaming delay.

Cross-Platform Consistency

Using a centralized backend TTS engine (Piper/XTTS and potentially Kokoro) ensures consistent voice quality and behavior across all frontend platforms.

Architecture Components

Primary (Current): Backend TTS (Modelservice)

TTS is implemented in the backend modelservice using Piper TTS (recommended) and Coqui XTTS v2; the Flutter app receives only the audio stream and plays it via just_audio.

Future Backend Engine (Planned): Kokoro TTS

Status: Under evaluation as a potential third backend engine (alongside Piper and XTTS). Not implemented yet.

Conceptual Technology: Kokoro-82M via ONNX Runtime (Apache 2.0 License)

Target Characteristics (if adopted):

  • 82M parameter model
  • Fast inference on local backend
  • Consistent voice across platforms
  • ~100MB model download
  • Multi-language support
  • Emotional prosody control

Use Cases:

  • Premium voice quality
  • Character voice consistency (Eve personality)
  • Emotional expressiveness
  • Cross-platform voice uniformity
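Because Kokoro is only under evaluation, there is no AICO integration to show. The sketch below merely demonstrates how a Kokoro ONNX export could be inspected with ONNX Runtime before committing to an integration; the model filename is hypothetical.

import onnxruntime as ort

# Hypothetical filename: no Kokoro model ships with AICO today.
session = ort.InferenceSession("kokoro-82m.onnx")

# Enumerate the graph's inputs and outputs to learn the expected tensor
# shapes and dtypes before wiring the model into the modelservice.
for tensor in session.get_inputs():
    print("input:", tensor.name, tensor.shape, tensor.type)
for tensor in session.get_outputs():
    print("output:", tensor.name, tensor.shape, tensor.type)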

Custom Voice Creation (Planned Backend Models)

Training Custom Character Voices

For consistent cross-platform character voices (e.g., Eve personality), AICO envisions custom backend voice model training:

Approach: Fine-tune Kokoro-82M on character-specific voice samples

Requirements:

  • 10-30 minutes of clean audio samples
  • Consistent recording environment
  • Single speaker (target character voice)
  • Diverse emotional range samples

Training Process:

  1. Data Preparation (a dataset-validation sketch follows this list)
     • Record or source voice samples
     • Clean audio (noise reduction, normalization)
     • Segment into 5-15 second clips
     • Transcribe all audio accurately

  2. Model Fine-Tuning
     • Use Kokoro training scripts
     • Fine-tune on character voice dataset
     • Validate emotional range
     • Export ONNX model for mobile deployment

  3. Model Distribution
     • Package ONNX model (~100MB)
     • Distribute via app assets or download
     • Cache locally on device
     • Version control for updates
Tools & Resources:

  • Kokoro training repository: github.com/hexgrad/kokoro
  • ONNX Runtime for mobile deployment
  • Audio preprocessing: Audacity, Adobe Audition
  • Dataset curation: Coqui dataset tools

Voice Consistency Strategy

Single Source of Truth: One trained backend model deployed across all platforms via ONNX ensures identical voice output regardless of device.

Model Management (Planned):

  • Models stored on the local backend or downloaded on-demand
  • Versioned model files for updates
  • Automatic backend-side fallback if a model is unavailable

Future Enhancements (Planned Backend Capabilities)

Emotional Prosody Control

  • Dynamic pitch/rate adjustment based on emotion state
  • Integration with emotion simulation system
  • Real-time prosody modulation
  • SSML support for fine-grained control
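Since emotional prosody control is a planned capability, the sketch below is purely illustrative: it maps a hypothetical emotion state onto standard SSML prosody attributes. The emotion names and offsets are invented, and it assumes the active engine accepts SSML, which is not yet the case.

from xml.sax.saxutils import escape

# Hypothetical emotion-to-prosody table; values are illustrative, not tuned.
PROSODY = {
    "joy":     {"pitch": "+10%", "rate": "105%"},
    "sadness": {"pitch": "-8%",  "rate": "90%"},
    "neutral": {"pitch": "+0%",  "rate": "100%"},
}

def to_ssml(text, emotion):
    p = PROSODY.get(emotion, PROSODY["neutral"])
    return (f'<speak><prosody pitch="{p["pitch"]}" rate="{p["rate"]}">'
            f"{escape(text)}</prosody></speak>")

print(to_ssml("I'm glad that worked!", "joy"))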

Voice Cloning

  • User-provided voice samples
  • Personal voice model training
  • Privacy-preserving on-device training
  • Custom voice library management

Multi-Speaker Conversations

  • Different voices for different characters
  • Conversation mode with speaker switching
  • Voice mixing for group scenarios

Streaming Synthesis

  • Sentence-by-sentence generation (see the sketch after this list)
  • Reduced perceived latency
  • Interrupt and resume support
  • Background audio processing
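A minimal sketch of the sentence-by-sentence idea, assuming a synthesize(sentence) callable that wraps whichever engine is active (the function name and the naive splitting regex are illustrative):

import re

def stream_sentences(text, synthesize):
    # Naive splitter; production code would handle abbreviations, ellipses, etc.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            # Yield each sentence's audio as soon as it is ready, so playback
            # can begin before the full reply has been rendered.
            yield synthesize(sentence)

# Usage with a stand-in synthesizer:
for chunk in stream_sentences("Hello there. How are you today?", lambda s: s.encode()):
    print(len(chunk), "bytes")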

Dependencies

Flutter Packages (Current)

  • just_audio - Audio playback for streamed WAV from backend modelservice TTS

Testing Strategy

Unit Tests

  • TTS repository implementations
  • Voice type selection logic
  • Fallback mechanism validation
  • State management transitions

Integration Tests

  • Backend TTS request/response
  • Audio streaming over network (ZeroMQ/WebSocket)
  • Audio output verification on client (via just_audio)
  • Error handling and recovery

Monitoring & Observability

Metrics

  • TTS invocation count
  • Voice type usage distribution
  • Latency measurements
  • Error rates by platform
  • Model download success rate

Error Handling

  • Graceful fallback between TTS engines (e.g., XTTS → Piper; see the sketch after this list)
  • Model loading failure recovery
  • Audio output device errors (client-side)
  • Network errors (model download or audio streaming)
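A minimal sketch of the engine-fallback chain, assuming synthesize_xtts and synthesize_piper wrappers exist in the modelservice (both names, and the stub bodies, are illustrative):

# Illustrative stand-ins for the real engine wrappers:
def synthesize_xtts(text):
    raise RuntimeError("XTTS model not loaded")

def synthesize_piper(text):
    return b"...WAV bytes..."

def synthesize_with_fallback(text):
    # Order is illustrative: prefer XTTS quality, fall back to Piper speed.
    last_error = None
    for name, engine in [("xtts", synthesize_xtts), ("piper", synthesize_piper)]:
        try:
            return engine(text)
        except Exception as exc:  # model missing, out of memory, etc.
            print(f"{name} failed ({exc}); trying next engine")
            last_error = exc
    raise RuntimeError("all TTS engines failed") from last_error

print(synthesize_with_fallback("Hello"))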

Backend TTS (Modelservice)

AICO's backend supports two TTS engines: Piper TTS (ultra-fast, local) and Coqui XTTS v2 (high-quality, voice cloning). The system automatically detects language and applies appropriate text preprocessing.

Piper TTS (Recommended)

Technology: Piper - fast, local neural TTS via ONNX Runtime

Features:

  • Ultra-fast synthesis (~300ms for full sentences)
  • 15-24x faster than XTTS
  • Multiple quality levels (x_low, low, medium, high)
  • 100+ voices across 40+ languages
  • On-device processing, no cloud dependencies
  • Automatic voice model download

Performance:

  • Synthesis: ~300ms per sentence
  • Model size: 5-30MB per voice (quality dependent)
  • Sample rates: 16kHz (low quality) or 22.05kHz (medium/high)
  • Memory: ~100MB RAM per loaded voice

Voice Quality Levels:

  • x_low: 16kHz, 5-7M params (fastest, lowest quality)
  • low: 16kHz, 15-20M params (fast, good quality)
  • medium: 22.05kHz, 15-20M params (balanced)
  • high: 22.05kHz, 28-32M params (best quality, slower)
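Piper can also be exercised standalone before going through the modelservice. A minimal sketch, assuming the piper CLI is on PATH and the en_US-amy-medium voice files have already been downloaded to the working directory:

import subprocess

# Piper reads text on stdin and writes a WAV file.
subprocess.run(
    ["piper", "--model", "en_US-amy-medium.onnx", "--output_file", "hello.wav"],
    input="Hello from AICO.".encode("utf-8"),
    check=True,
)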

Coqui XTTS v2 (High-Quality Alternative)

Technology: Coqui XTTS v2 - Neural TTS with voice cloning

Features:

  • 58 built-in multilingual speakers
  • 17 supported languages
  • Voice cloning from 6-second samples
  • Streaming audio synthesis
  • WAV format output
  • Excellent quality, natural prosody

Performance:

  • Synthesis: ~500ms per chunk
  • Model size: 1.8GB (auto-downloaded)
  • Sample rate: 22.05kHz
  • Memory: ~2GB RAM
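A minimal standalone sketch using the Coqui TTS Python API. The speaker and model IDs match those referenced in this document, but verify them against the installed TTS version, as the API has changed between releases:

from TTS.api import TTS

# Downloads the ~1.8GB XTTS v2 model on first use.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello from AICO.",
    speaker="Daisy Studious",  # one of the 58 built-in speakers
    language="en",
    file_path="hello_xtts.wav",
)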

Configuration

Both engines are configured in config/defaults/core.yaml under the core.modelservice.tts section:

core:
  modelservice:
    tts:
      enabled: true
      engine: "piper"  # or "xtts"
      auto_detect_language: true

      # XTTS Configuration
      xtts:
        voices:
          en: "Daisy Studious"
          de: "Daisy Studious"
        custom_voice_path: null

      # Piper Configuration
      piper:
        voices:
          en: "en_US-amy-medium"
          de: "de_DE-kerstin-low"
        quality: "medium"

      speed: 1.0

Piper Available Voices

German Female Voices

  • de_DE-kerstin-low - Clear, professional (recommended, sped up 10%)
  • de_DE-ramona-low - Younger-sounding, natural
  • de_DE-pavoque-low - Alternative option

English Voices

  • en_US-amy-medium - Clear, professional (default)
  • en_US-lessac-medium - Warm, friendly
  • en_US-libritts-high - Highest quality, slower
  • en_GB-alan-medium - British male
  • Many more available in Piper voice samples

Audio Processing & Optimizations

Text Preprocessing:

  • Automatic markdown removal (bold, italic, links)
  • Emoji removal (comprehensive Unicode ranges)
  • Em-dash/en-dash conversion to periods (for proper pauses)
  • Abbreviation expansion (P.S. → Postscript, e.g. → for example)
  • Ensures space after punctuation (critical for Piper pause detection)
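A condensed sketch of these rules. The regex patterns and the abbreviation table are illustrative; the modelservice's actual implementation may differ:

import re

ABBREVIATIONS = {"P.S.": "Postscript", "e.g.": "for example"}  # excerpt only

def preprocess(text):
    text = re.sub(r"\*{1,2}([^*]+)\*{1,2}", r"\1", text)       # bold/italic markers
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)       # markdown links
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)  # common emoji ranges
    text = re.sub(r"\s*[\u2013\u2014]\s*", ". ", text)          # en/em dashes to periods
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    return re.sub(r"([.!?,;:])(?=\S)", r"\1 ", text)            # space after punctuation

print(preprocess("**Hello**\u2014see [docs](https://example.com), e.g. this."))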

Audio Post-Processing:

  • German voice speed-up: 10% faster using scipy polyphase resampling
    • Prevents a "Mickey Mouse" effect while improving pace
    • High-quality interpolation maintains audio fidelity
  • Trailing artifact removal: 300ms fade-out plus the final 500 samples forced to zero
    • Eliminates pop/click sounds at the end of playback
    • Ensures smooth audio termination at a zero crossing
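A sketch of both steps on 16-bit mono PCM held in a numpy array; the parameters mirror the description above, while the function layout is illustrative:

import numpy as np
from scipy.signal import resample_poly

def postprocess(audio, sample_rate):
    # 10% speed-up: keep 10 samples for every 11 via polyphase resampling.
    audio = resample_poly(audio.astype(np.float64), up=10, down=11)

    # 300ms linear fade-out to suppress trailing artifacts...
    fade = min(int(0.3 * sample_rate), len(audio))
    audio[-fade:] *= np.linspace(1.0, 0.0, fade)

    # ...and force the final 500 samples to zero to kill any residual click.
    audio[-500:] = 0.0
    return audio.astype(np.int16)

tone = (np.sin(2 * np.pi * 220 * np.linspace(0, 1, 22050)) * 20000).astype(np.int16)
print(postprocess(tone, 22050).shape)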

Sample Rate Handling:

  • Automatic detection from voice model (16kHz or 22.05kHz)
  • Proper WAV header construction with the correct sample rate
  • Backend buffers all audio before sending a complete WAV file
  • Frontend receives ready-to-play WAV with no additional processing
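Wrapping the samples in a WAV container is standard-library work. A minimal sketch using Python's wave module, assuming 16-bit mono PCM:

import io
import wave
import numpy as np

def to_wav_bytes(audio, sample_rate):
    # Build a WAV container whose header carries the detected sample rate.
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)  # 16000 or 22050 depending on the voice
        wav.writeframes(audio.astype(np.int16).tobytes())
    return buffer.getvalue()

silence = np.zeros(22050, dtype=np.int16)  # one second of silence
print(len(to_wav_bytes(silence, 22050)), "bytes including the 44-byte header")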

Known Limitations:

  • Piper doesn't respect comma pauses (only periods create pauses)
  • Voice synthesis is non-deterministic (slight variations between runs)
  • Low-quality German voices have inherent noise characteristics
  • No medium/high quality female German voices are available in Piper

XTTS Available Voices by Language

Female Voices

  • Claribel Dervla - Clear, professional
  • Daisy Studious - Warm, friendly (default)
  • Gracie Wise - Mature, authoritative
  • Tammie Ema - Energetic, young
  • Alison Dietlinde - Soft, gentle
  • Ana Florence - Elegant, refined
  • Annmarie Nele - Casual, approachable
  • Asya Anara - Exotic, mysterious
  • Brenda Stern - Strong, confident
  • Gitta Nikolina - Playful, cheerful
  • Henriette Usha - Sophisticated, calm
  • Sofia Hellen - Smooth, melodic
  • Tammy Grit - Determined, bold
  • Tanja Adelina - Sweet, caring
  • Vjollca Johnnie - Unique, distinctive

Male Voices

  • Andrew Chipper - Upbeat, friendly
  • Badr Odhiambo - Deep, resonant
  • Dionisio Schuyler - Theatrical, expressive
  • Royston Min - Calm, measured
  • Viktor Eka - Strong, commanding
  • Abrahan Mack - Warm, trustworthy
  • Adde Michal - Young, energetic
  • Baldur Sanjin - Mature, wise
  • Craig Gutsy - Bold, adventurous
  • Damien Black - Mysterious, dark
  • Gilberto Mathias - Friendly, approachable
  • Ilkin Urbano - Urban, modern
  • Kazuhiko Atallah - Precise, technical
  • Ludvig Milivoj - Noble, refined
  • Suad Qasim - Authoritative, serious
  • Torcull Diarmuid - Rugged, strong
  • Viktor Menelaos - Heroic, brave
  • Zacharie Aimilios - Gentle, kind

Neutral/Androgynous Voices

  • Nova Hogarth - Futuristic, neutral
  • Maja Ruoho - Balanced, clear
  • Uta Obando - Versatile, adaptable

Supported Languages

All voices work across all 17 supported languages: en (English), de (German), es (Spanish), fr (French), it (Italian), pt (Portuguese), pl (Polish), tr (Turkish), ru (Russian), nl (Dutch), cs (Czech), ar (Arabic), zh-cn (Chinese, Simplified), hu (Hungarian), ko (Korean), ja (Japanese), hi (Hindi).

Voice Cloning

For custom voices, provide a 6-30 second WAV file:

  1. Place WAV file in modelservice/assets/voices/
  2. Set custom_voice_path in config
  3. Restart modelservice

Custom voice overrides built-in speakers for all languages.

Performance Comparison

Metric         | Piper TTS           | XTTS v2
---------------|---------------------|------------------
Model Size     | 5-30MB per voice    | 1.8GB
Synthesis Time | ~300ms per sentence | ~500ms per chunk
Speed vs XTTS  | 15-24x faster       | Baseline
Quality        | Good to excellent   | Excellent
Memory         | ~100MB RAM          | ~2GB RAM
Languages      | 40+                 | 17
Voices         | 100+                | 58 built-in
Voice Cloning  | No                  | Yes (6s samples)
Sample Rate    | 16-22.05kHz         | 22.05kHz

Conclusion

AICO's TTS system focuses on a single, consistent architecture:

Current Backend (Modelservice):

  • Piper TTS (recommended): ultra-fast synthesis (15-24x faster than XTTS) with 100+ voices across 40+ languages
  • XTTS v2 (alternative): high-quality synthesis with voice cloning capabilities
  • Automatic language detection and text preprocessing
  • Optimized audio processing with speed adjustments and artifact removal
  • Audio streamed as WAV to the Flutter client and played via just_audio

Planned Backend Extension: Kokoro TTS is under consideration as a potential third engine for the modelservice and for custom character voices.

The system prioritizes speed and quality while maintaining privacy and local-first operation on the user's machine. Piper TTS currently provides the best balance of performance and quality for most use cases, with XTTS available for scenarios requiring voice cloning or maximum quality, and Kokoro evaluated as a future backend option.