Early Beta — The Web SDK is in early beta. APIs may change between releases.
Overview
The Voice Pipeline provides complete STT -> LLM -> TTS orchestration for building voice conversation experiences in the browser. It handles the full flow: transcribe user speech, generate an AI response with streaming, and synthesize spoken audio.

Package Imports

The Voice Pipeline uses classes from all three packages:
import { VoicePipeline, ModelManager, ModelCategory } from '@runanywhere/web'
import { AudioCapture, AudioPlayback, VAD, SpeechActivity } from '@runanywhere/web-onnx'
Required Models
The voice pipeline requires exactly 4 models loaded simultaneously. Each serves a distinct purpose:

| Role | Model Category | Example Model | What It Does |
|---|---|---|---|
| VAD | ModelCategory.Audio | Silero VAD v5 (~5MB) | Detects when the user starts/stops speaking |
| STT | ModelCategory.SpeechRecognition | Whisper Tiny EN (~105MB) | Transcribes spoken audio to text |
| LLM | ModelCategory.Language | LFM2 350M (~250MB) | Generates the AI text response |
| TTS | ModelCategory.SpeechSynthesis | Piper TTS (~65MB) | Converts the AI response to spoken audio |
VAD is NOT a replacement for STT. This is a common source of confusion. VAD (Voice Activity
Detection) only detects when someone is speaking — it outputs speech boundaries (started/ended),
not text. STT (Speech-to-Text) performs the actual transcription of audio to text. You need both:
VAD to segment the audio stream into speech chunks, and STT to transcribe those chunks into text.

If your voice feature can detect speech activity but produces no transcription text, you are likely
missing the STT model. A 5MB VAD model cannot do speech recognition — you need a Whisper model
(~105MB+) for that.
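To make the distinction concrete, here is a minimal, self-contained sketch (plain TypeScript, no SDK calls) of the kind of output a VAD produces: frame boundaries derived from per-frame speech probabilities. The function and types are illustrative, not SDK exports — note that nothing text-like appears anywhere in the result.

```typescript
// Illustrative only: a VAD reduces per-frame speech probabilities to
// (start, end) frame boundaries. There is no text in its output --
// transcription is STT's job.
interface SpeechSegmentBounds {
  startFrame: number
  endFrame: number // exclusive
}

function detectSegments(probs: number[], threshold = 0.5): SpeechSegmentBounds[] {
  const segments: SpeechSegmentBounds[] = []
  let start: number | null = null
  probs.forEach((p, i) => {
    if (p >= threshold && start === null) {
      start = i // speech started
    } else if (p < threshold && start !== null) {
      segments.push({ startFrame: start, endFrame: i }) // speech ended
      start = null
    }
  })
  if (start !== null) segments.push({ startFrame: start, endFrame: probs.length })
  return segments
}

// Frames 2-4 look like speech: the VAD can tell you *when*, never *what*.
const bounds = detectSegments([0.1, 0.2, 0.9, 0.95, 0.8, 0.1])
```

The real Silero VAD works on raw audio samples rather than precomputed probabilities, but the shape of the answer is the same: boundaries in, boundaries out.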
Setup
Before using the Voice Pipeline, ensure all 4 models are loaded with coexist: true so they stay in memory simultaneously:
import {
RunAnywhere,
SDKEnvironment,
ModelManager,
ModelCategory,
LLMFramework,
} from '@runanywhere/web'
import { LlamaCPP } from '@runanywhere/web-llamacpp'
import { ONNX } from '@runanywhere/web-onnx'
// 1. Initialize SDK and register backends
await RunAnywhere.initialize({ environment: SDKEnvironment.Development, debug: true })
await LlamaCPP.register()
await ONNX.register()
// 2. Register all voice models
RunAnywhere.registerModels([
{
id: 'lfm2-350m-q4_k_m',
name: 'LFM2 350M Q4_K_M',
repo: 'LiquidAI/LFM2-350M-GGUF',
files: ['LFM2-350M-Q4_K_M.gguf'],
framework: LLMFramework.LlamaCpp,
modality: ModelCategory.Language,
memoryRequirement: 250_000_000,
},
{
id: 'sherpa-onnx-whisper-tiny.en',
name: 'Whisper Tiny English',
url: 'https://huggingface.co/runanywhere/sherpa-onnx-whisper-tiny.en/resolve/main/sherpa-onnx-whisper-tiny.en.tar.gz',
framework: LLMFramework.ONNX,
modality: ModelCategory.SpeechRecognition,
memoryRequirement: 105_000_000,
artifactType: 'archive' as const,
},
{
id: 'vits-piper-en_US-lessac-medium',
name: 'Piper TTS',
url: 'https://huggingface.co/runanywhere/vits-piper-en_US-lessac-medium/resolve/main/vits-piper-en_US-lessac-medium.tar.gz',
framework: LLMFramework.ONNX,
modality: ModelCategory.SpeechSynthesis,
memoryRequirement: 65_000_000,
artifactType: 'archive' as const,
},
{
id: 'silero-vad-v5',
name: 'Silero VAD v5',
url: 'https://huggingface.co/runanywhere/silero-vad-v5/resolve/main/silero_vad.onnx',
files: ['silero_vad.onnx'],
framework: LLMFramework.ONNX,
modality: ModelCategory.Audio,
memoryRequirement: 5_000_000,
},
])
// 3. Download and load all models with coexist: true
for (const modelId of [
'silero-vad-v5',
'sherpa-onnx-whisper-tiny.en',
'lfm2-350m-q4_k_m',
'vits-piper-en_US-lessac-medium',
]) {
await ModelManager.downloadModel(modelId)
await ModelManager.loadModel(modelId, { coexist: true })
}
coexist: true is required. Without it, loading a new model unloads the previous one, and the
voice pipeline needs all 4 models (VAD, STT, LLM, TTS) in memory simultaneously.

Verify all 4 models are loaded before starting the pipeline. If STT is missing, the pipeline will
fail silently or produce empty transcriptions. If VAD is missing, speech detection won't trigger.
Use ModelManager.getLoadedModel() to check each category:
const vad = ModelManager.getLoadedModel(ModelCategory.Audio)
const stt = ModelManager.getLoadedModel(ModelCategory.SpeechRecognition)
const llm = ModelManager.getLoadedModel(ModelCategory.Language)
const tts = ModelManager.getLoadedModel(ModelCategory.SpeechSynthesis)
if (!vad || !stt || !llm || !tts) {
console.error('Missing models:', {
vad: !!vad,
stt: !!stt,
llm: !!llm,
tts: !!tts,
})
throw new Error('All 4 voice pipeline models must be loaded')
}
Basic Usage
import { VoicePipeline } from '@runanywhere/web'
import { AudioPlayback } from '@runanywhere/web-onnx'
const pipeline = new VoicePipeline()
const result = await pipeline.processTurn(
audioFloat32Array,
{
maxTokens: 60,
temperature: 0.7,
systemPrompt: 'You are a helpful voice assistant. Keep responses concise — 1-2 sentences max.',
},
{
onTranscription: (text) => console.log('User said:', text),
onResponseToken: (token, accumulated) => console.log('AI:', accumulated),
onResponseComplete: (text) => console.log('Full response:', text),
onSynthesisComplete: async (audio, sampleRate) => {
const player = new AudioPlayback({ sampleRate })
await player.play(audio, sampleRate)
player.dispose()
},
onStateChange: (state) => console.log('Pipeline state:', state),
}
)
console.log('Transcription:', result.transcription)
console.log('Response:', result.response)
API Reference
VoicePipeline
class VoicePipeline {
/** Process a complete voice turn: STT -> LLM -> TTS */
processTurn(
audioData: Float32Array,
options?: VoicePipelineOptions,
callbacks?: VoicePipelineCallbacks
): Promise<VoicePipelineTurnResult>
/** Cancel an in-progress turn */
cancel(): void
}
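cancel() is typically wired to a stop button or a deadline. Below is a sketch of a generic timeout guard; the withTimeout helper is our own convenience, not an SDK export, and it assumes processTurn settles promptly after cancel() is called.

```typescript
// Hypothetical helper (not part of the SDK): race in-flight work against a
// deadline, and call cancel() on its owner if the deadline wins.
interface Cancellable {
  cancel(): void
}

async function withTimeout<T>(
  work: Promise<T>,
  owner: Cancellable,
  ms: number
): Promise<T | null> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => {
      owner.cancel() // stop STT/LLM/TTS work in progress
      resolve(null)
    }, ms)
  })
  const winner = await Promise.race([work, timeout])
  if (timer !== undefined) clearTimeout(timer)
  return winner // null means the turn was cancelled
}

// Usage sketch:
// const result = await withTimeout(pipeline.processTurn(audio, opts), pipeline, 10_000)
// if (result === null) { /* turn timed out and was cancelled */ }
```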
VoicePipelineOptions
interface VoicePipelineOptions {
/** Maximum tokens for LLM response (default: 150) */
maxTokens?: number
/** LLM temperature (default: 0.7) */
temperature?: number
/** System prompt for the LLM */
systemPrompt?: string
/** TTS speech speed (default: 1.0) */
ttsSpeed?: number
/** Audio sample rate (default: 16000) */
sampleRate?: number
}
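The defaults documented above can be captured in a small local helper so partial option objects always resolve to complete ones. withDefaults is our own convenience function, not an SDK export; the pipeline applies these defaults internally, so this is only useful if you want to log or display the effective settings.

```typescript
interface VoicePipelineOptions {
  maxTokens?: number
  temperature?: number
  systemPrompt?: string
  ttsSpeed?: number
  sampleRate?: number
}

// Defaults as documented in the interface above.
const PIPELINE_DEFAULTS = {
  maxTokens: 150,
  temperature: 0.7,
  ttsSpeed: 1.0,
  sampleRate: 16000,
} as const

// Local helper: merge caller options over the documented defaults.
// Note: an explicit `undefined` in opts will override a default.
function withDefaults(opts: VoicePipelineOptions = {}): VoicePipelineOptions {
  return { ...PIPELINE_DEFAULTS, ...opts }
}
```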
VoicePipelineCallbacks
interface VoicePipelineCallbacks {
/** Called when pipeline state changes */
onStateChange?: (state: string) => void
// States: 'processingSTT', 'generatingResponse', 'playingTTS'
/** Called when transcription completes */
onTranscription?: (text: string) => void
/** Called for each generated token */
onResponseToken?: (token: string, accumulated: string) => void
/** Called when LLM response completes */
onResponseComplete?: (text: string) => void
/** Called when TTS synthesis completes */
onSynthesisComplete?: (audio: Float32Array, sampleRate: number) => void
/** Called on error */
onError?: (error: Error) => void
}
VoicePipelineTurnResult
interface VoicePipelineTurnResult {
/** Transcribed user speech */
transcription: string
/** LLM-generated response */
response: string
}
Complete Voice Assistant Example
This example combines VAD for automatic speech detection with the Voice Pipeline for full STT -> LLM -> TTS processing:
import { VoicePipeline, ModelManager, ModelCategory } from '@runanywhere/web'
import { AudioCapture, AudioPlayback, VAD, SpeechActivity } from '@runanywhere/web-onnx'
const pipeline = new VoicePipeline()
const mic = new AudioCapture({ sampleRate: 16000 })
// Set up VAD to detect speech segments
VAD.reset()
const vadUnsub = VAD.onSpeechActivity(async (activity) => {
if (activity === SpeechActivity.Ended) {
const segment = VAD.popSpeechSegment()
if (!segment || segment.samples.length < 1600) return
// Stop mic during processing
mic.stop()
vadUnsub()
// Process the voice turn
const result = await pipeline.processTurn(
segment.samples,
{
maxTokens: 60,
temperature: 0.7,
systemPrompt: 'You are a helpful assistant. Be concise.',
},
{
onTranscription: (text) => {
document.getElementById('user-text')!.textContent = text
},
onResponseToken: (_, accumulated) => {
document.getElementById('ai-text')!.textContent = accumulated
},
onResponseComplete: (text) => {
document.getElementById('ai-text')!.textContent = text
},
onSynthesisComplete: async (audio, sampleRate) => {
const player = new AudioPlayback({ sampleRate })
await player.play(audio, sampleRate)
player.dispose()
},
}
)
console.log('Turn complete:', result.transcription, '->', result.response)
}
})
// Feed microphone audio to VAD
await mic.start(
(chunk) => {
VAD.processSamples(chunk)
},
(level) => {
// Visualize audio level (0.0-1.0)
document.getElementById('orb')!.style.transform = `scale(${1 + level * 0.5})`
}
)
React Voice Assistant
VoiceAssistant.tsx
import { useState, useCallback, useRef, useEffect } from 'react'
import { VoicePipeline, ModelManager, ModelCategory } from '@runanywhere/web'
import { AudioCapture, AudioPlayback, VAD, SpeechActivity } from '@runanywhere/web-onnx'
type VoiceState = 'idle' | 'loading' | 'listening' | 'processing' | 'speaking'
export function VoiceAssistant() {
const [voiceState, setVoiceState] = useState<VoiceState>('idle')
const [transcript, setTranscript] = useState('')
const [response, setResponse] = useState('')
const [audioLevel, setAudioLevel] = useState(0)
const micRef = useRef<AudioCapture | null>(null)
const pipelineRef = useRef<VoicePipeline | null>(null)
const vadUnsubRef = useRef<(() => void) | null>(null)
useEffect(() => {
return () => {
micRef.current?.stop()
vadUnsubRef.current?.()
}
}, [])
const startListening = useCallback(async () => {
setTranscript('')
setResponse('')
setVoiceState('loading')
// Ensure all voice models are loaded
const categories = [
ModelCategory.Audio,
ModelCategory.SpeechRecognition,
ModelCategory.Language,
ModelCategory.SpeechSynthesis,
]
for (const cat of categories) {
if (!ModelManager.getLoadedModel(cat)) {
const models = ModelManager.getModels().filter((m) => m.modality === cat)
if (models[0]) {
await ModelManager.downloadModel(models[0].id)
await ModelManager.loadModel(models[0].id, { coexist: true })
}
}
}
setVoiceState('listening')
const mic = new AudioCapture({ sampleRate: 16000 })
micRef.current = mic
if (!pipelineRef.current) {
pipelineRef.current = new VoicePipeline()
}
VAD.reset()
vadUnsubRef.current = VAD.onSpeechActivity(async (activity) => {
if (activity === SpeechActivity.Ended) {
const segment = VAD.popSpeechSegment()
if (!segment || segment.samples.length < 1600) return
mic.stop()
vadUnsubRef.current?.()
setVoiceState('processing')
try {
await pipelineRef.current!.processTurn(
segment.samples,
{
maxTokens: 60,
temperature: 0.7,
systemPrompt: 'You are a helpful voice assistant. Keep responses concise.',
},
{
onTranscription: (text) => setTranscript(text),
onResponseToken: (_, acc) => setResponse(acc),
onResponseComplete: (text) => setResponse(text),
onSynthesisComplete: async (audio, sampleRate) => {
setVoiceState('speaking')
const player = new AudioPlayback({ sampleRate })
await player.play(audio, sampleRate)
player.dispose()
},
}
)
} catch (err) {
console.error('Voice pipeline error:', err)
}
setVoiceState('idle')
setAudioLevel(0)
}
})
await mic.start(
(chunk) => {
VAD.processSamples(chunk)
},
(level) => {
setAudioLevel(level)
}
)
}, [])
const stopListening = useCallback(() => {
micRef.current?.stop()
vadUnsubRef.current?.()
setVoiceState('idle')
setAudioLevel(0)
}, [])
return (
<div style={{ padding: 24, textAlign: 'center' }}>
<div
style={{
width: 80,
height: 80,
borderRadius: '50%',
background:
voiceState === 'listening'
? '#4caf50'
: voiceState === 'speaking'
? '#2196f3'
: '#e0e0e0',
margin: '0 auto 16px',
transform: `scale(${1 + audioLevel * 0.3})`,
transition: 'background-color 0.2s',
}}
/>
<p>
{voiceState === 'idle' && 'Tap to start listening'}
{voiceState === 'loading' && 'Loading models...'}
{voiceState === 'listening' && 'Listening... speak now'}
{voiceState === 'processing' && 'Processing...'}
{voiceState === 'speaking' && 'Speaking...'}
</p>
{voiceState === 'idle' || voiceState === 'loading' ? (
<button onClick={startListening} disabled={voiceState === 'loading'}>
Start Listening
</button>
) : voiceState === 'listening' ? (
<button onClick={stopListening}>Stop</button>
) : null}
{transcript && (
<p>
<strong>You:</strong> {transcript}
</p>
)}
{response && (
<p>
<strong>AI:</strong> {response}
</p>
)}
</div>
)
}
Performance
Latency Breakdown
| Stage | Typical Latency | Optimization |
|---|---|---|
| STT | 200-800ms | Use smaller Whisper models |
| LLM | 300-2000ms | Use smaller models, limit tokens |
| TTS | 100-400ms | Use medium-quality voices |
| Total | 600-3200ms | |
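To see where a slow turn spends its time, timestamp the state changes. The sketch below is a plain helper of our own, assuming onStateChange fires once per stage in the order listed in the callbacks reference ('processingSTT' → 'generatingResponse' → 'playingTTS'):

```typescript
// Turn a timestamped state-change log into per-stage durations (ms).
// Each stage lasts from its state-change event to the next one, or to
// turn completion for the final stage.
interface StateEvent {
  state: string
  t: number // e.g. Date.now() when onStateChange fired
}

function stageDurations(events: StateEvent[], turnEndedAt: number): Record<string, number> {
  const out: Record<string, number> = {}
  events.forEach((ev, i) => {
    const next = i + 1 < events.length ? events[i + 1].t : turnEndedAt
    out[ev.state] = next - ev.t
  })
  return out
}

// Wiring it up (sketch): push { state, t: Date.now() } in onStateChange,
// then call stageDurations(log, Date.now()) after processTurn resolves.
```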
For the fastest voice turns, use Whisper Tiny for STT, a 350M-500M LLM, and limit maxTokens to
50-60. The starter app uses maxTokens: 60 with a “1-2 sentences max” system prompt.
Error Handling
const result = await pipeline.processTurn(audio, options, {
onStateChange: (state) => console.log('State:', state),
onError: (error) => {
console.error('Pipeline error:', error.message)
},
})