Early Beta — The Web SDK is in early beta. APIs may change between releases.
Overview
The Voice Pipeline provides complete STT -> LLM -> TTS orchestration for building voice conversation experiences in the browser. It handles the full flow: transcribe user speech, generate an AI response with streaming, and synthesize spoken audio.

Package Imports

The Voice Pipeline uses classes from all three packages:
import { VoicePipeline, ModelManager, ModelCategory } from '@runanywhere/web'
import { AudioCapture, AudioPlayback, VAD, SpeechActivity } from '@runanywhere/web-onnx'
Required Models
The voice pipeline requires exactly 4 models loaded simultaneously. Each serves a distinct purpose:

| Role | Model Category | Example Model | What It Does |
|---|---|---|---|
| VAD | ModelCategory.Audio | Silero VAD v5 (~5MB) | Detects when the user starts/stops speaking |
| STT | ModelCategory.SpeechRecognition | Whisper Tiny EN (~105MB) | Transcribes spoken audio to text |
| LLM | ModelCategory.Language | LFM2 350M (~250MB) | Generates the AI text response |
| TTS | ModelCategory.SpeechSynthesis | Piper TTS (~65MB) | Converts the AI response to spoken audio |
VAD is NOT a replacement for STT. This is a common source of confusion. VAD (Voice Activity
Detection) only detects when someone is speaking — it outputs speech boundaries (started/ended),
not text. STT (Speech-to-Text) performs the actual transcription of audio to text. You need both:
VAD to segment the audio stream into speech chunks, and STT to transcribe those chunks into text.

If your voice feature can detect speech activity but produces no transcription text, you are likely
missing the STT model. A 5MB VAD model cannot do speech recognition — you need a Whisper model
(~105MB+) for that.
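To make the distinction concrete, here is a minimal, self-contained sketch (plain TypeScript, no SDK calls) of the kind of output a VAD produces: frame boundaries derived from per-frame speech probabilities. The function and types are illustrative, not SDK exports — note that nothing text-like appears anywhere in the result.

```typescript
// Illustrative only: a VAD reduces per-frame speech probabilities to
// (start, end) frame boundaries. There is no text in its output --
// transcription is STT's job.
interface SpeechSegmentBounds {
  startFrame: number
  endFrame: number // exclusive
}

function detectSegments(probs: number[], threshold = 0.5): SpeechSegmentBounds[] {
  const segments: SpeechSegmentBounds[] = []
  let start: number | null = null
  probs.forEach((p, i) => {
    if (p >= threshold && start === null) {
      start = i // speech started
    } else if (p < threshold && start !== null) {
      segments.push({ startFrame: start, endFrame: i }) // speech ended
      start = null
    }
  })
  if (start !== null) segments.push({ startFrame: start, endFrame: probs.length })
  return segments
}

// Frames 2-4 look like speech: the VAD can tell you *when*, never *what*.
const bounds = detectSegments([0.1, 0.2, 0.9, 0.95, 0.8, 0.1])
```

The real Silero VAD works on raw audio samples rather than precomputed probabilities, but the shape of the answer is the same: boundaries in, boundaries out.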
Setup
Before using the Voice Pipeline, ensure all 4 models are loaded with coexist: true so they stay in memory simultaneously:
import {
RunAnywhere,
SDKEnvironment,
ModelManager,
ModelCategory,
LLMFramework,
} from '@runanywhere/web'
import { LlamaCPP } from '@runanywhere/web-llamacpp'
import { ONNX } from '@runanywhere/web-onnx'
// 1. Initialize SDK and register backends
await RunAnywhere.initialize({ environment: SDKEnvironment.Development, debug: true })
await LlamaCPP.register()
await ONNX.register()
// 2. Register all voice models
RunAnywhere.registerModels([
{
id: 'lfm2-350m-q4_k_m',
name: 'LFM2 350M Q4_K_M',
repo: 'LiquidAI/LFM2-350M-GGUF',
files: ['LFM2-350M-Q4_K_M.gguf'],
framework: LLMFramework.LlamaCpp,
modality: ModelCategory.Language,
memoryRequirement: 250_000_000,
},
{
id: 'sherpa-onnx-whisper-tiny.en',
name: 'Whisper Tiny English',
url: 'https://huggingface.co/runanywhere/sherpa-onnx-whisper-tiny.en/resolve/main/sherpa-onnx-whisper-tiny.en.tar.gz',
framework: LLMFramework.ONNX,
modality: ModelCategory.SpeechRecognition,
memoryRequirement: 105_000_000,
artifactType: 'archive' as const,
},
{
id: 'vits-piper-en_US-lessac-medium',
name: 'Piper TTS',
url: 'https://huggingface.co/runanywhere/vits-piper-en_US-lessac-medium/resolve/main/vits-piper-en_US-lessac-medium.tar.gz',
framework: LLMFramework.ONNX,
modality: ModelCategory.SpeechSynthesis,
memoryRequirement: 65_000_000,
artifactType: 'archive' as const,
},
{
id: 'silero-vad-v5',
name: 'Silero VAD v5',
url: 'https://huggingface.co/runanywhere/silero-vad-v5/resolve/main/silero_vad.onnx',
files: ['silero_vad.onnx'],
framework: LLMFramework.ONNX,
modality: ModelCategory.Audio,
memoryRequirement: 5_000_000,
},
])
// 3. Download and load all models with coexist: true
for (const modelId of [
'silero-vad-v5',
'sherpa-onnx-whisper-tiny.en',
'lfm2-350m-q4_k_m',
'vits-piper-en_US-lessac-medium',
]) {
await ModelManager.downloadModel(modelId)
await ModelManager.loadModel(modelId, { coexist: true })
}
coexist: true is required. Without it, loading a new model unloads the previous one, and the
voice pipeline needs all 4 models (VAD, STT, LLM, TTS) in memory simultaneously.

Verify all 4 models are loaded before starting the pipeline. If STT is missing, the pipeline will
fail silently or produce empty transcriptions. If VAD is missing, speech detection won't trigger.
Use ModelManager.getLoadedModel() to check each category:
const vad = ModelManager.getLoadedModel(ModelCategory.Audio)
const stt = ModelManager.getLoadedModel(ModelCategory.SpeechRecognition)
const llm = ModelManager.getLoadedModel(ModelCategory.Language)
const tts = ModelManager.getLoadedModel(ModelCategory.SpeechSynthesis)
if (!vad || !stt || !llm || !tts) {
console.error('Missing models:', {
vad: !!vad,
stt: !!stt,
llm: !!llm,
tts: !!tts,
})
throw new Error('All 4 voice pipeline models must be loaded')
}
Basic Usage
import { VoicePipeline } from '@runanywhere/web'
import { AudioPlayback } from '@runanywhere/web-onnx'
const pipeline = new VoicePipeline()
const result = await pipeline.processTurn(
audioFloat32Array,
{
maxTokens: 60,
temperature: 0.7,
systemPrompt: 'You are a helpful voice assistant. Keep responses concise — 1-2 sentences max.',
},
{
onTranscription: (text) => console.log('User said:', text),
onResponseToken: (token, accumulated) => console.log('AI:', accumulated),
onResponseComplete: (text) => console.log('Full response:', text),
onSynthesisComplete: async (audio, sampleRate) => {
const player = new AudioPlayback({ sampleRate })
await player.play(audio, sampleRate)
player.dispose()
},
onStateChange: (state) => console.log('Pipeline state:', state),
}
)
console.log('Transcription:', result.transcription)
console.log('Response:', result.response)
API Reference
VoicePipeline
class VoicePipeline {
/** Process a complete voice turn: STT -> LLM -> TTS */
processTurn(
audioData: Float32Array,
options?: VoicePipelineOptions,
callbacks?: VoicePipelineCallbacks
): Promise<VoicePipelineTurnResult>
/** Cancel an in-progress turn */
cancel(): void
}
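cancel() is typically wired to a stop button or a deadline. Below is a sketch of a generic timeout guard; the withTimeout helper is our own convenience, not an SDK export, and it assumes processTurn settles promptly after cancel() is called.

```typescript
// Hypothetical helper (not part of the SDK): race in-flight work against a
// deadline, and call cancel() on its owner if the deadline wins.
interface Cancellable {
  cancel(): void
}

async function withTimeout<T>(
  work: Promise<T>,
  owner: Cancellable,
  ms: number
): Promise<T | null> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => {
      owner.cancel() // stop STT/LLM/TTS work in progress
      resolve(null)
    }, ms)
  })
  const winner = await Promise.race([work, timeout])
  if (timer !== undefined) clearTimeout(timer)
  return winner // null means the turn was cancelled
}

// Usage sketch:
// const result = await withTimeout(pipeline.processTurn(audio, opts), pipeline, 10_000)
// if (result === null) { /* turn timed out and was cancelled */ }
```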
VoicePipelineOptions
interface VoicePipelineOptions {
/** Maximum tokens for LLM response (default: 150) */
maxTokens?: number
/** LLM temperature (default: 0.7) */
temperature?: number
/** System prompt for the LLM */
systemPrompt?: string
/** TTS speech speed (default: 1.0) */
ttsSpeed?: number
/** Audio sample rate (default: 16000) */
sampleRate?: number
}
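The defaults documented above can be captured in a small local helper so partial option objects always resolve to complete ones. withDefaults is our own convenience function, not an SDK export; the pipeline applies these defaults internally, so this is only useful if you want to log or display the effective settings.

```typescript
interface VoicePipelineOptions {
  maxTokens?: number
  temperature?: number
  systemPrompt?: string
  ttsSpeed?: number
  sampleRate?: number
}

// Defaults as documented in the interface above.
const PIPELINE_DEFAULTS = {
  maxTokens: 150,
  temperature: 0.7,
  ttsSpeed: 1.0,
  sampleRate: 16000,
} as const

// Local helper: merge caller options over the documented defaults.
// Note: an explicit `undefined` in opts will override a default.
function withDefaults(opts: VoicePipelineOptions = {}): VoicePipelineOptions {
  return { ...PIPELINE_DEFAULTS, ...opts }
}
```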
VoicePipelineCallbacks
interface VoicePipelineCallbacks {
/** Called when pipeline state changes */
onStateChange?: (state: string) => void
// States: 'processingSTT', 'generatingResponse', 'playingTTS'
/** Called when transcription completes */
onTranscription?: (text: string) => void
/** Called for each generated token */
onResponseToken?: (token: string, accumulated: string) => void
/** Called when LLM response completes */
onResponseComplete?: (text: string) => void
/** Called when TTS synthesis completes */
onSynthesisComplete?: (audio: Float32Array, sampleRate: number) => void
/** Called on error */
onError?: (error: Error) => void
}
VoicePipelineTurnResult
interface VoicePipelineTurnResult {
/** Transcribed user speech */
transcription: string
/** LLM-generated response */
response: string
}
Complete Voice Assistant Example
This example combines VAD for automatic speech detection with the Voice Pipeline for full STT -> LLM -> TTS processing:
import { VoicePipeline, ModelManager, ModelCategory } from '@runanywhere/web'
import { AudioCapture, AudioPlayback, VAD, SpeechActivity } from '@runanywhere/web-onnx'
const pipeline = new VoicePipeline()
const mic = new AudioCapture({ sampleRate: 16000 })
// Set up VAD to detect speech segments
VAD.reset()
const vadUnsub = VAD.onSpeechActivity(async (activity) => {
if (activity === SpeechActivity.Ended) {
const segment = VAD.popSpeechSegment()
if (!segment || segment.samples.length < 1600) return
// Stop mic during processing
mic.stop()
vadUnsub()
// Process the voice turn
const result = await pipeline.processTurn(
segment.samples,
{
maxTokens: 60,
temperature: 0.7,
systemPrompt: 'You are a helpful assistant. Be concise.',
},
{
onTranscription: (text) => {
document.getElementById('user-text')!.textContent = text
},
onResponseToken: (_, accumulated) => {
document.getElementById('ai-text')!.textContent = accumulated
},
onResponseComplete: (text) => {
document.getElementById('ai-text')!.textContent = text
},
onSynthesisComplete: async (audio, sampleRate) => {
const player = new AudioPlayback({ sampleRate })
await player.play(audio, sampleRate)
player.dispose()
},
}
)
console.log('Turn complete:', result.transcription, '->', result.response)
}
})
// Feed microphone audio to VAD
await mic.start(
(chunk) => {
VAD.processSamples(chunk)
},
(level) => {
// Visualize audio level (0.0-1.0)
document.getElementById('orb')!.style.transform = `scale(${1 + level * 0.5})`
}
)
React Voice Assistant
VoiceAssistant.tsx
import { useState, useCallback, useRef, useEffect } from 'react'
import { VoicePipeline, ModelManager, ModelCategory } from '@runanywhere/web'
import { AudioCapture, AudioPlayback, VAD, SpeechActivity } from '@runanywhere/web-onnx'
type VoiceState = 'idle' | 'loading' | 'listening' | 'processing' | 'speaking'
export function VoiceAssistant() {
const [voiceState, setVoiceState] = useState<VoiceState>('idle')
const [transcript, setTranscript] = useState('')
const [response, setResponse] = useState('')
const [audioLevel, setAudioLevel] = useState(0)
const micRef = useRef<AudioCapture | null>(null)
const pipelineRef = useRef<VoicePipeline | null>(null)
const vadUnsubRef = useRef<(() => void) | null>(null)
useEffect(() => {
return () => {
micRef.current?.stop()
vadUnsubRef.current?.()
}
}, [])
const startListening = useCallback(async () => {
setTranscript('')
setResponse('')
setVoiceState('loading')
// Ensure all voice models are loaded
const categories = [
ModelCategory.Audio,
ModelCategory.SpeechRecognition,
ModelCategory.Language,
ModelCategory.SpeechSynthesis,
]
for (const cat of categories) {
if (!ModelManager.getLoadedModel(cat)) {
const models = ModelManager.getModels().filter((m) => m.modality === cat)
if (models[0]) {
await ModelManager.downloadModel(models[0].id)
await ModelManager.loadModel(models[0].id, { coexist: true })
}
}
}
setVoiceState('listening')
const mic = new AudioCapture({ sampleRate: 16000 })
micRef.current = mic
if (!pipelineRef.current) {
pipelineRef.current = new VoicePipeline()
}
VAD.reset()
vadUnsubRef.current = VAD.onSpeechActivity(async (activity) => {
if (activity === SpeechActivity.Ended) {
const segment = VAD.popSpeechSegment()
if (!segment || segment.samples.length < 1600) return
mic.stop()
vadUnsubRef.current?.()
setVoiceState('processing')
try {
await pipelineRef.current!.processTurn(
segment.samples,
{
maxTokens: 60,
temperature: 0.7,
systemPrompt: 'You are a helpful voice assistant. Keep responses concise.',
},
{
onTranscription: (text) => setTranscript(text),
onResponseToken: (_, acc) => setResponse(acc),
onResponseComplete: (text) => setResponse(text),
onSynthesisComplete: async (audio, sampleRate) => {
setVoiceState('speaking')
const player = new AudioPlayback({ sampleRate })
await player.play(audio, sampleRate)
player.dispose()
},
}
)
} catch (err) {
console.error('Voice pipeline error:', err)
}
setVoiceState('idle')
setAudioLevel(0)
}
})
await mic.start(
(chunk) => {
VAD.processSamples(chunk)
},
(level) => {
setAudioLevel(level)
}
)
}, [])
const stopListening = useCallback(() => {
micRef.current?.stop()
vadUnsubRef.current?.()
setVoiceState('idle')
setAudioLevel(0)
}, [])
return (
<div style={{ padding: 24, textAlign: 'center' }}>
<div
style={{
width: 80,
height: 80,
borderRadius: '50%',
background:
voiceState === 'listening'
? '#4caf50'
: voiceState === 'speaking'
? '#2196f3'
: '#e0e0e0',
margin: '0 auto 16px',
transform: `scale(${1 + audioLevel * 0.3})`,
transition: 'background-color 0.2s',
}}
/>
<p>
{voiceState === 'idle' && 'Tap to start listening'}
{voiceState === 'loading' && 'Loading models...'}
{voiceState === 'listening' && 'Listening... speak now'}
{voiceState === 'processing' && 'Processing...'}
{voiceState === 'speaking' && 'Speaking...'}
</p>
{voiceState === 'idle' || voiceState === 'loading' ? (
<button onClick={startListening} disabled={voiceState === 'loading'}>
Start Listening
</button>
) : voiceState === 'listening' ? (
<button onClick={stopListening}>Stop</button>
) : null}
{transcript && (
<p>
<strong>You:</strong> {transcript}
</p>
)}
{response && (
<p>
<strong>AI:</strong> {response}
</p>
)}
</div>
)
}
Performance
Latency Breakdown
| Stage | Typical Latency | Optimization |
|---|---|---|
| STT | 200-800ms | Use smaller Whisper models |
| LLM | 300-2000ms | Use smaller models, limit tokens |
| TTS | 100-400ms | Use medium-quality voices |
| Total | 600-3200ms | |
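To see where a slow turn spends its time, timestamp the state changes. The sketch below is a plain helper of our own, assuming onStateChange fires once per stage in the order listed in the callbacks reference ('processingSTT' → 'generatingResponse' → 'playingTTS'):

```typescript
// Turn a timestamped state-change log into per-stage durations (ms).
// Each stage lasts from its state-change event to the next one, or to
// turn completion for the final stage.
interface StateEvent {
  state: string
  t: number // e.g. Date.now() when onStateChange fired
}

function stageDurations(events: StateEvent[], turnEndedAt: number): Record<string, number> {
  const out: Record<string, number> = {}
  events.forEach((ev, i) => {
    const next = i + 1 < events.length ? events[i + 1].t : turnEndedAt
    out[ev.state] = next - ev.t
  })
  return out
}

// Wiring it up (sketch): push { state, t: Date.now() } in onStateChange,
// then call stageDurations(log, Date.now()) after processTurn resolves.
```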
For the fastest voice turns, use Whisper Tiny for STT, a 350M-500M LLM, and limit maxTokens to
50-60. The starter app uses maxTokens: 60 with a “1-2 sentences max” system prompt.
Error Handling
const result = await pipeline.processTurn(audio, options, {
onStateChange: (state) => console.log('State:', state),
onError: (error) => {
console.error('Pipeline error:', error.message)
},
})