Early Beta — The Web SDK is in early beta. APIs may change between releases.

Overview

The Voice Pipeline provides complete STT -> LLM -> TTS orchestration for building voice conversation experiences in the browser. It handles the full flow: transcribing the user's speech, streaming an LLM-generated response, and synthesizing spoken audio.

Package Imports

The Voice Pipeline uses classes from all three packages:
import { VoicePipeline, ModelManager, ModelCategory } from '@runanywhere/web'
import { AudioCapture, AudioPlayback, VAD, SpeechActivity } from '@runanywhere/web-onnx'
import { LlamaCPP } from '@runanywhere/web-llamacpp'

Required Models

The voice pipeline requires exactly 4 models loaded simultaneously. Each serves a distinct purpose:
| Role | Model Category | Example Model | What It Does |
|------|----------------|---------------|--------------|
| VAD | ModelCategory.Audio | Silero VAD v5 (~5MB) | Detects when the user starts/stops speaking |
| STT | ModelCategory.SpeechRecognition | Whisper Tiny EN (~105MB) | Transcribes spoken audio to text |
| LLM | ModelCategory.Language | LFM2 350M (~250MB) | Generates the AI text response |
| TTS | ModelCategory.SpeechSynthesis | Piper TTS (~65MB) | Converts the AI response to spoken audio |
VAD is NOT a replacement for STT. This is a common source of confusion. VAD (Voice Activity Detection) only detects when someone is speaking — it outputs speech boundaries (started/ended), not text. STT (Speech-to-Text) performs the actual transcription of audio to text. You need both: VAD to segment the audio stream into speech chunks, and STT to transcribe those chunks into text.

If your voice feature can detect speech activity but produces no transcription text, you are likely missing the STT model. A 5MB VAD model cannot do speech recognition — you need a Whisper model (~105MB+) for that.
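To make the distinction concrete, here is a toy energy-threshold detector (purely illustrative; the real Silero VAD is a neural model) that emits speech boundaries but never produces text:

```typescript
// Toy energy-threshold VAD: emits speech start/end boundaries, never text.
// Purely illustrative; the real Silero VAD is a neural model.
type Boundary = { type: 'started' | 'ended'; sampleIndex: number }

function detectBoundaries(
  samples: Float32Array,
  frameSize = 160,   // 10 ms per frame at 16 kHz
  threshold = 0.01   // RMS energy threshold (tuning assumption)
): Boundary[] {
  const boundaries: Boundary[] = []
  let inSpeech = false
  for (let i = 0; i + frameSize <= samples.length; i += frameSize) {
    let energy = 0
    for (let j = i; j < i + frameSize; j++) energy += samples[j] * samples[j]
    const rms = Math.sqrt(energy / frameSize)
    if (!inSpeech && rms > threshold) {
      inSpeech = true
      boundaries.push({ type: 'started', sampleIndex: i })
    } else if (inSpeech && rms <= threshold) {
      inSpeech = false
      boundaries.push({ type: 'ended', sampleIndex: i })
    }
  }
  return boundaries
}

// 300 ms of silence-tone-silence: expect exactly one started/ended pair.
const sig = new Float32Array(4800)
for (let i = 1600; i < 3200; i++) sig[i] = Math.sin(i * 0.1) * 0.5
console.log(detectBoundaries(sig))
// [ { type: 'started', sampleIndex: 1600 }, { type: 'ended', sampleIndex: 3200 } ]
```

The output is sample indices, not words: turning those boundaries into text is exactly the job of the STT model.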

Setup

Before using the Voice Pipeline, ensure all 4 models are loaded with coexist: true so they stay in memory simultaneously:
import {
  RunAnywhere,
  SDKEnvironment,
  ModelManager,
  ModelCategory,
  LLMFramework,
} from '@runanywhere/web'
import { LlamaCPP } from '@runanywhere/web-llamacpp'
import { ONNX } from '@runanywhere/web-onnx'

// 1. Initialize SDK and register backends
await RunAnywhere.initialize({ environment: SDKEnvironment.Development, debug: true })
await LlamaCPP.register()
await ONNX.register()

// 2. Register all voice models
RunAnywhere.registerModels([
  {
    id: 'lfm2-350m-q4_k_m',
    name: 'LFM2 350M Q4_K_M',
    repo: 'LiquidAI/LFM2-350M-GGUF',
    files: ['LFM2-350M-Q4_K_M.gguf'],
    framework: LLMFramework.LlamaCpp,
    modality: ModelCategory.Language,
    memoryRequirement: 250_000_000,
  },
  {
    id: 'sherpa-onnx-whisper-tiny.en',
    name: 'Whisper Tiny English',
    url: 'https://huggingface.co/runanywhere/sherpa-onnx-whisper-tiny.en/resolve/main/sherpa-onnx-whisper-tiny.en.tar.gz',
    framework: LLMFramework.ONNX,
    modality: ModelCategory.SpeechRecognition,
    memoryRequirement: 105_000_000,
    artifactType: 'archive' as const,
  },
  {
    id: 'vits-piper-en_US-lessac-medium',
    name: 'Piper TTS',
    url: 'https://huggingface.co/runanywhere/vits-piper-en_US-lessac-medium/resolve/main/vits-piper-en_US-lessac-medium.tar.gz',
    framework: LLMFramework.ONNX,
    modality: ModelCategory.SpeechSynthesis,
    memoryRequirement: 65_000_000,
    artifactType: 'archive' as const,
  },
  {
    id: 'silero-vad-v5',
    name: 'Silero VAD v5',
    url: 'https://huggingface.co/runanywhere/silero-vad-v5/resolve/main/silero_vad.onnx',
    files: ['silero_vad.onnx'],
    framework: LLMFramework.ONNX,
    modality: ModelCategory.Audio,
    memoryRequirement: 5_000_000,
  },
])

// 3. Download and load all models with coexist: true
for (const modelId of [
  'silero-vad-v5',
  'sherpa-onnx-whisper-tiny.en',
  'lfm2-350m-q4_k_m',
  'vits-piper-en_US-lessac-medium',
]) {
  await ModelManager.downloadModel(modelId)
  await ModelManager.loadModel(modelId, { coexist: true })
}
coexist: true is required. Without it, loading a new model unloads the previous one, and the voice pipeline needs all 4 models (VAD, STT, LLM, TTS) in memory simultaneously.
Verify all 4 models are loaded before starting the pipeline. Use ModelManager.getLoadedModel() to check each category:
const vad = ModelManager.getLoadedModel(ModelCategory.Audio)
const stt = ModelManager.getLoadedModel(ModelCategory.SpeechRecognition)
const llm = ModelManager.getLoadedModel(ModelCategory.Language)
const tts = ModelManager.getLoadedModel(ModelCategory.SpeechSynthesis)

if (!vad || !stt || !llm || !tts) {
  console.error('Missing models:', {
    vad: !!vad,
    stt: !!stt,
    llm: !!llm,
    tts: !!tts,
  })
  throw new Error('All 4 voice pipeline models must be loaded')
}
If STT is missing, the pipeline will fail silently or produce empty transcriptions. If VAD is missing, speech detection won’t trigger.
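Since all four models must be resident at once, it is worth budgeting their combined footprint up front. A quick sanity check, using the memoryRequirement values from the registration snippet above (actual runtime usage varies by backend):

```typescript
// Rough memory budget for the four coexisting voice models.
// Values copied from the registration snippet; real usage varies by backend.
const requirements: Record<string, number> = {
  vad: 5_000_000,    // Silero VAD v5
  stt: 105_000_000,  // Whisper Tiny EN
  llm: 250_000_000,  // LFM2 350M Q4_K_M
  tts: 65_000_000,   // Piper TTS
}

const totalBytes = Object.values(requirements).reduce((a, b) => a + b, 0)
console.log(`~${(totalBytes / 1_000_000).toFixed(0)} MB total`) // ~425 MB total
```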

Basic Usage

import { VoicePipeline } from '@runanywhere/web'
import { AudioPlayback } from '@runanywhere/web-onnx'

const pipeline = new VoicePipeline()

const result = await pipeline.processTurn(
  audioFloat32Array,
  {
    maxTokens: 60,
    temperature: 0.7,
    systemPrompt: 'You are a helpful voice assistant. Keep responses concise — 1-2 sentences max.',
  },
  {
    onTranscription: (text) => console.log('User said:', text),
    onResponseToken: (token, accumulated) => console.log('AI:', accumulated),
    onResponseComplete: (text) => console.log('Full response:', text),
    onSynthesisComplete: async (audio, sampleRate) => {
      const player = new AudioPlayback({ sampleRate })
      await player.play(audio, sampleRate)
      player.dispose()
    },
    onStateChange: (state) => console.log('Pipeline state:', state),
  }
)

console.log('Transcription:', result.transcription)
console.log('Response:', result.response)

API Reference

VoicePipeline

class VoicePipeline {
  /** Process a complete voice turn: STT -> LLM -> TTS */
  processTurn(
    audioData: Float32Array,
    options?: VoicePipelineOptions,
    callbacks?: VoicePipelineCallbacks
  ): Promise<VoicePipelineTurnResult>

  /** Cancel an in-progress turn */
  cancel(): void
}

VoicePipelineOptions

interface VoicePipelineOptions {
  /** Maximum tokens for LLM response (default: 150) */
  maxTokens?: number

  /** LLM temperature (default: 0.7) */
  temperature?: number

  /** System prompt for the LLM */
  systemPrompt?: string

  /** TTS speech speed (default: 1.0) */
  ttsSpeed?: number

  /** Audio sample rate (default: 16000) */
  sampleRate?: number
}
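processTurn expects mono Float32Array audio at the configured sampleRate (16 kHz by default). If your capture source runs at a different rate, such as 48 kHz, a linear-interpolation downsampler is a minimal sketch (a production resampler should low-pass filter first to avoid aliasing):

```typescript
// Linear-interpolation downsampler, e.g. 48 kHz capture to 16 kHz input.
// Sketch only: a production resampler should low-pass filter first.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number
): Float32Array {
  const outLength = Math.floor((input.length * toRate) / fromRate)
  const output = new Float32Array(outLength)
  const step = fromRate / toRate
  for (let i = 0; i < outLength; i++) {
    const pos = i * step
    const i0 = Math.floor(pos)
    const i1 = Math.min(i0 + 1, input.length - 1)
    const frac = pos - i0
    output[i] = input[i0] * (1 - frac) + input[i1] * frac
  }
  return output
}

const oneSecondAt48k = new Float32Array(48000)
console.log(resampleLinear(oneSecondAt48k, 48000, 16000).length) // 16000
```

Note that AudioCapture in the examples below is already configured with `sampleRate: 16000`, so manual resampling is only needed for audio obtained elsewhere.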

VoicePipelineCallbacks

interface VoicePipelineCallbacks {
  /** Called when pipeline state changes */
  onStateChange?: (state: string) => void
  // States: 'processingSTT', 'generatingResponse', 'playingTTS'

  /** Called when transcription completes */
  onTranscription?: (text: string) => void

  /** Called for each generated token */
  onResponseToken?: (token: string, accumulated: string) => void

  /** Called when LLM response completes */
  onResponseComplete?: (text: string) => void

  /** Called when TTS synthesis completes */
  onSynthesisComplete?: (audio: Float32Array, sampleRate: number) => void

  /** Called on error */
  onError?: (error: Error) => void
}
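onResponseToken receives both the new token and the accumulated text, so UI code can assign the second argument directly instead of concatenating tokens itself. A mock token stream (illustrative only, not the SDK internals) demonstrates the contract:

```typescript
// Mock token stream demonstrating the (token, accumulated) contract:
// `accumulated` is always the concatenation of all tokens emitted so far.
function emitTokens(
  tokens: string[],
  onResponseToken: (token: string, accumulated: string) => void
): string {
  let accumulated = ''
  for (const token of tokens) {
    accumulated += token
    onResponseToken(token, accumulated)
  }
  return accumulated
}

const seen: string[] = []
const full = emitTokens(['Hel', 'lo', ' world'], (_token, acc) => seen.push(acc))
console.log(seen) // ['Hel', 'Hello', 'Hello world']
console.log(full) // 'Hello world'
```

This is why the examples in this page write `onResponseToken: (_, acc) => setResponse(acc)` rather than appending to previous state.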

VoicePipelineTurnResult

interface VoicePipelineTurnResult {
  /** Transcribed user speech */
  transcription: string

  /** LLM-generated response */
  response: string
}

Complete Voice Assistant Example

This example combines VAD for automatic speech detection with the Voice Pipeline for full STT -> LLM -> TTS processing:
import { VoicePipeline } from '@runanywhere/web'
import { AudioCapture, AudioPlayback, VAD, SpeechActivity } from '@runanywhere/web-onnx'

const pipeline = new VoicePipeline()
const mic = new AudioCapture({ sampleRate: 16000 })

// Set up VAD to detect speech segments
VAD.reset()

const vadUnsub = VAD.onSpeechActivity(async (activity) => {
  if (activity === SpeechActivity.Ended) {
    const segment = VAD.popSpeechSegment()
    if (!segment || segment.samples.length < 1600) return

    // Stop mic during processing
    mic.stop()
    vadUnsub()

    // Process the voice turn
    const result = await pipeline.processTurn(
      segment.samples,
      {
        maxTokens: 60,
        temperature: 0.7,
        systemPrompt: 'You are a helpful assistant. Be concise.',
      },
      {
        onTranscription: (text) => {
          document.getElementById('user-text')!.textContent = text
        },
        onResponseToken: (_, accumulated) => {
          document.getElementById('ai-text')!.textContent = accumulated
        },
        onResponseComplete: (text) => {
          document.getElementById('ai-text')!.textContent = text
        },
        onSynthesisComplete: async (audio, sampleRate) => {
          const player = new AudioPlayback({ sampleRate })
          await player.play(audio, sampleRate)
          player.dispose()
        },
      }
    )

    console.log('Turn complete:', result.transcription, '->', result.response)
  }
})

// Feed microphone audio to VAD
await mic.start(
  (chunk) => {
    VAD.processSamples(chunk)
  },
  (level) => {
    // Visualize audio level (0.0-1.0)
    document.getElementById('orb')!.style.transform = `scale(${1 + level * 0.5})`
  }
)
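The `segment.samples.length < 1600` guard in the example above drops segments shorter than 100 ms at 16 kHz, which are usually clicks or stray noise rather than speech:

```typescript
// Convert the minimum-sample guard to a duration at the pipeline's rate.
const sampleRate = 16000
const minSamples = 1600

const minMs = (minSamples / sampleRate) * 1000
console.log(minMs) // 100: segments under 100 ms are discarded
```

Raise the threshold (e.g. to 4800 samples, 300 ms) if you find single-word noise triggering spurious turns; the exact value is an application-level tuning choice.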

React Voice Assistant

VoiceAssistant.tsx
import { useState, useCallback, useRef, useEffect } from 'react'
import { VoicePipeline, ModelManager, ModelCategory } from '@runanywhere/web'
import { AudioCapture, AudioPlayback, VAD, SpeechActivity } from '@runanywhere/web-onnx'

type VoiceState = 'idle' | 'loading' | 'listening' | 'processing' | 'speaking'

export function VoiceAssistant() {
  const [voiceState, setVoiceState] = useState<VoiceState>('idle')
  const [transcript, setTranscript] = useState('')
  const [response, setResponse] = useState('')
  const [audioLevel, setAudioLevel] = useState(0)

  const micRef = useRef<AudioCapture | null>(null)
  const pipelineRef = useRef<VoicePipeline | null>(null)
  const vadUnsubRef = useRef<(() => void) | null>(null)

  useEffect(() => {
    return () => {
      micRef.current?.stop()
      vadUnsubRef.current?.()
    }
  }, [])

  const startListening = useCallback(async () => {
    setTranscript('')
    setResponse('')
    setVoiceState('loading')

    // Ensure all voice models are loaded
    const categories = [
      ModelCategory.Audio,
      ModelCategory.SpeechRecognition,
      ModelCategory.Language,
      ModelCategory.SpeechSynthesis,
    ]

    for (const cat of categories) {
      if (!ModelManager.getLoadedModel(cat)) {
        const models = ModelManager.getModels().filter((m) => m.modality === cat)
        if (models[0]) {
          await ModelManager.downloadModel(models[0].id)
          await ModelManager.loadModel(models[0].id, { coexist: true })
        }
      }
    }

    setVoiceState('listening')

    const mic = new AudioCapture({ sampleRate: 16000 })
    micRef.current = mic

    if (!pipelineRef.current) {
      pipelineRef.current = new VoicePipeline()
    }

    VAD.reset()

    vadUnsubRef.current = VAD.onSpeechActivity(async (activity) => {
      if (activity === SpeechActivity.Ended) {
        const segment = VAD.popSpeechSegment()
        if (!segment || segment.samples.length < 1600) return

        mic.stop()
        vadUnsubRef.current?.()
        setVoiceState('processing')

        try {
          await pipelineRef.current!.processTurn(
            segment.samples,
            {
              maxTokens: 60,
              temperature: 0.7,
              systemPrompt: 'You are a helpful voice assistant. Keep responses concise.',
            },
            {
              onTranscription: (text) => setTranscript(text),
              onResponseToken: (_, acc) => setResponse(acc),
              onResponseComplete: (text) => setResponse(text),
              onSynthesisComplete: async (audio, sampleRate) => {
                setVoiceState('speaking')
                const player = new AudioPlayback({ sampleRate })
                await player.play(audio, sampleRate)
                player.dispose()
              },
            }
          )
        } catch (err) {
          console.error('Voice pipeline error:', err)
        }

        setVoiceState('idle')
        setAudioLevel(0)
      }
    })

    await mic.start(
      (chunk) => {
        VAD.processSamples(chunk)
      },
      (level) => {
        setAudioLevel(level)
      }
    )
  }, [])

  const stopListening = useCallback(() => {
    micRef.current?.stop()
    vadUnsubRef.current?.()
    setVoiceState('idle')
    setAudioLevel(0)
  }, [])

  return (
    <div style={{ padding: 24, textAlign: 'center' }}>
      <div
        style={{
          width: 80,
          height: 80,
          borderRadius: '50%',
          background:
            voiceState === 'listening'
              ? '#4caf50'
              : voiceState === 'speaking'
                ? '#2196f3'
                : '#e0e0e0',
          margin: '0 auto 16px',
          transform: `scale(${1 + audioLevel * 0.3})`,
          transition: 'background-color 0.2s',
        }}
      />

      <p>
        {voiceState === 'idle' && 'Tap to start listening'}
        {voiceState === 'loading' && 'Loading models...'}
        {voiceState === 'listening' && 'Listening... speak now'}
        {voiceState === 'processing' && 'Processing...'}
        {voiceState === 'speaking' && 'Speaking...'}
      </p>

      {voiceState === 'idle' || voiceState === 'loading' ? (
        <button onClick={startListening} disabled={voiceState === 'loading'}>
          Start Listening
        </button>
      ) : voiceState === 'listening' ? (
        <button onClick={stopListening}>Stop</button>
      ) : null}

      {transcript && (
        <p>
          <strong>You:</strong> {transcript}
        </p>
      )}
      {response && (
        <p>
          <strong>AI:</strong> {response}
        </p>
      )}
    </div>
  )
}

Performance

Latency Breakdown

| Stage | Typical Latency | Optimization |
|-------|-----------------|--------------|
| STT | 200-800ms | Use smaller Whisper models |
| LLM | 300-2000ms | Use smaller models, limit tokens |
| TTS | 100-400ms | Use medium-quality voices |
| Total | 600-3200ms | |
For the fastest voice turns, use Whisper Tiny for STT, a 350M-500M LLM, and limit maxTokens to 50-60. The starter app uses maxTokens: 60 with a “1-2 sentences max” system prompt.
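LLM latency grows roughly linearly with the number of tokens generated, so maxTokens is the main lever. Assuming an illustrative decode rate of 30 tokens/second (device-dependent; measure on your own hardware):

```typescript
// Rough LLM stage latency: tokens generated / decode rate.
// 30 tok/s is an illustrative assumption; measure on your target device.
function llmLatencyMs(maxTokens: number, tokensPerSecond: number): number {
  return (maxTokens / tokensPerSecond) * 1000
}

console.log(llmLatencyMs(60, 30))  // 2000 (ms) at the starter app's cap
console.log(llmLatencyMs(150, 30)) // 5000 (ms) at the default maxTokens of 150
```

A concise system prompt matters as much as the cap: a model that naturally stops after one or two sentences rarely hits the token limit at all.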

Error Handling

const result = await pipeline.processTurn(audio, options, {
  onStateChange: (state) => console.log('State:', state),
  onError: (error) => {
    console.error('Pipeline error:', error.message)
  },
})
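The examples on this page handle failures via both the onError callback and try/catch around the returned promise, but processTurn itself exposes no deadline. A generic timeout wrapper (a sketch; withTimeout is not part of the SDK) can race a turn against a deadline and cancel it on expiry:

```typescript
// Generic deadline wrapper: rejects if the wrapped promise outlives `ms`.
// `onTimeout` is where you would call pipeline.cancel() (hypothetical usage;
// withTimeout is not part of the SDK).
function withTimeout<T>(
  promise: Promise<T>,
  ms: number,
  onTimeout?: () => void
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      onTimeout?.()
      reject(new Error(`Voice turn timed out after ${ms}ms`))
    }, ms)
    promise.then(
      (value) => { clearTimeout(timer); resolve(value) },
      (err) => { clearTimeout(timer); reject(err) }
    )
  })
}
```

A call might look like `withTimeout(pipeline.processTurn(audio, options, callbacks), 15_000, () => pipeline.cancel())`, pairing the rejection with the pipeline's own cancel() method.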