Stream TTS audio as it’s generated for faster time-to-first-audio, especially with longer text.

Overview

Streaming TTS starts playing audio before the entire synthesis is complete. This is particularly useful for:
  • Long text passages
  • Voice assistants responding in real time
  • Reducing perceived latency

Basic Concept
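
When autoPlayTTS is enabled, the Voice Agent pipeline synthesizes and plays each response as it is generated, so playback can begin without waiting for the full reply: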

// The Voice Agent pipeline handles streaming TTS automatically
final session = await RunAnywhere.startVoiceSession(
  config: VoiceSessionConfig(
    autoPlayTTS: true,  // Automatically plays synthesized audio
  ),
);

session.events.listen((event) {
  if (event is VoiceSessionSpeaking) {
    print('Playing audio response...');
  }
});

Chunked Synthesis

For manual control, synthesize text in chunks:
Future<void> speakInChunks(String longText) async {
  // Split into sentences
  final sentences = longText.split(RegExp(r'(?<=[.!?])\s+'));

  for (final sentence in sentences) {
    final result = await RunAnywhere.synthesize(sentence);
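    // playAudio stands in for your app's own playback helper for the synthesized audio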
    await playAudio(result);
  }
}
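
To cut the gap between sentences, you can also overlap synthesis with playback: while one sentence is playing, start synthesizing the next. A minimal sketch, reusing the playAudio helper above and assuming the SDK allows a synthesize call to run while audio is playing:
Future<void> speakInChunksPipelined(String longText) async {
  // Split into sentences
  final sentences = longText.split(RegExp(r'(?<=[.!?])\s+'));
  if (sentences.isEmpty) return;

  // Start synthesizing the first sentence
  var nextSynthesis = RunAnywhere.synthesize(sentences.first);

  for (var i = 0; i < sentences.length; i++) {
    // Wait for the current sentence's audio
    final result = await nextSynthesis;

    // Kick off synthesis of the following sentence before playback begins
    if (i + 1 < sentences.length) {
      nextSynthesis = RunAnywhere.synthesize(sentences[i + 1]);
    }

    // Play the current sentence while the next one is being synthesized
    await playAudio(result);
  }
}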

With Voice Agent Pipeline

The Voice Agent provides the best streaming TTS experience:
// Initialize all components
await RunAnywhere.loadSTTModel('sherpa-onnx-whisper-tiny.en');
await RunAnywhere.loadModel('smollm2-360m-q8_0');
await RunAnywhere.loadTTSVoice('vits-piper-en_US-lessac-medium');

// Start session with auto-play
final session = await RunAnywhere.startVoiceSession(
  config: VoiceSessionConfig(
    autoPlayTTS: true,
    continuousMode: true,
  ),
);

// The pipeline automatically:
// 1. Detects speech (VAD)
// 2. Transcribes audio (STT)
// 3. Generates response (LLM)
// 4. Synthesizes and plays audio (TTS)
session.events.listen((event) {
  switch (event) {
    case VoiceSessionTranscribed(:final text):
      print('User: $text');
    case VoiceSessionResponded(:final text):
      print('AI: $text');
    case VoiceSessionSpeaking():
      print('Playing response...');
    case VoiceSessionTurnCompleted():
      print('Ready for next turn');
  }
});

Latency Optimization Tips

Load the TTS voice during app startup or idle time, not when the user first needs it.
// In app initialization
await RunAnywhere.loadTTSVoice('vits-piper-en_US-lessac-medium');

Smaller voice models synthesize faster. Choose based on your quality/speed tradeoff.
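For example, if a lower-quality variant of the same Piper voice is available in your build (the '-low' voice ID below is illustrative, not guaranteed to be packaged), it will synthesize faster at some cost in audio quality:
// Hypothetical smaller, faster variant of the voice used above
await RunAnywhere.loadTTSVoice('vits-piper-en_US-lessac-low');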

For very long responses, synthesize and play sentence by sentence rather than waiting for the complete response, as in the Chunked Synthesis example above.

Start playback as soon as you have enough audio buffered (typically 100-200 ms).
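As a rough rule of thumb (the 22,050 Hz sample rate below is an assumption; use whatever rate your voice model actually outputs), a 150 ms buffer works out to:
// Minimum samples to buffer before starting playback, assuming 22,050 Hz mono output
const sampleRate = 22050;
final minBufferedSamples = (sampleRate * 0.15).round(); // about 3,300 samples for 150 ms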

See Also