
Overview

The Voice Agent provides a complete voice conversation pipeline that orchestrates VAD → STT → LLM → TTS in a seamless flow. Use it to build voice assistants, hands-free interfaces, and conversational AI applications.

Basic Usage

import { RunAnywhere } from '@runanywhere/core'

// Initialize voice agent
await RunAnywhere.initializeVoiceAgent({
  llmModelId: 'smollm2-360m',
  sttModelId: 'whisper-tiny-en',
  ttsModelId: 'piper-en-lessac',
  systemPrompt: 'You are a helpful voice assistant. Keep responses brief.',
})

// Process a complete voice turn
const result = await RunAnywhere.processVoiceTurn(audioData)

console.log('User said:', result.userTranscript)
console.log('AI response:', result.assistantResponse)
// result.audio contains the spoken response

Setup

1. Download Required Models

import { RunAnywhere, SDKEnvironment, ModelCategory } from '@runanywhere/core'
import { LlamaCPP } from '@runanywhere/llamacpp'
import { ONNX, ModelArtifactType } from '@runanywhere/onnx'

// Initialize SDK
await RunAnywhere.initialize({ environment: SDKEnvironment.Development })

// Register backends
LlamaCPP.register()
ONNX.register()

// Add models
await LlamaCPP.addModel({
  id: 'smollm2-360m',
  name: 'SmolLM2 360M',
  url: 'https://huggingface.co/.../SmolLM2-360M.Q8_0.gguf',
  memoryRequirement: 500_000_000,
})

await ONNX.addModel({
  id: 'whisper-tiny-en',
  name: 'Whisper Tiny English',
  url: 'https://github.com/.../sherpa-onnx-whisper-tiny.en.tar.gz',
  modality: ModelCategory.SpeechRecognition,
  artifactType: ModelArtifactType.TarGzArchive,
  memoryRequirement: 75_000_000,
})

await ONNX.addModel({
  id: 'piper-en-lessac',
  name: 'Piper English',
  url: 'https://github.com/.../vits-piper-en_US-lessac-medium.tar.gz',
  modality: ModelCategory.SpeechSynthesis,
  artifactType: ModelArtifactType.TarGzArchive,
  memoryRequirement: 65_000_000,
})

// Download all models
await Promise.all([
  RunAnywhere.downloadModel('smollm2-360m'),
  RunAnywhere.downloadModel('whisper-tiny-en'),
  RunAnywhere.downloadModel('piper-en-lessac'),
])

2. Initialize Voice Agent

await RunAnywhere.initializeVoiceAgent({
  llmModelId: 'smollm2-360m',
  sttModelId: 'whisper-tiny-en',
  ttsModelId: 'piper-en-lessac',
  systemPrompt: 'You are a helpful voice assistant. Be concise and friendly.',
  generationOptions: {
    maxTokens: 150,
    temperature: 0.7,
  },
})

API Reference

initializeVoiceAgent

Initialize the voice agent pipeline.
await RunAnywhere.initializeVoiceAgent(
  config: VoiceAgentConfig
): Promise<boolean>
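
Since the call resolves to a boolean, you can gate the rest of your startup logic on it. A minimal sketch, assuming `true` indicates the pipeline initialized successfully:

const ready = await RunAnywhere.initializeVoiceAgent({
  llmModelId: 'smollm2-360m',
  sttModelId: 'whisper-tiny-en',
  ttsModelId: 'piper-en-lessac',
})

if (!ready) {
  console.warn('Voice agent did not initialize; check that all models are downloaded')
}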

VoiceAgentConfig

interface VoiceAgentConfig {
  /** LLM model ID (must be downloaded) */
  llmModelId: string

  /** STT model ID (must be downloaded) */
  sttModelId: string

  /** TTS model ID (must be downloaded) */
  ttsModelId: string

  /** System prompt for LLM */
  systemPrompt?: string

  /** Generation options */
  generationOptions?: GenerationOptions
}

processVoiceTurn

Process a complete voice turn (STT → LLM → TTS) on a base64-encoded audio recording.
await RunAnywhere.processVoiceTurn(
  audioData: string
): Promise<VoiceTurnResult>

VoiceTurnResult

interface VoiceTurnResult {
  /** Transcribed user speech */
  userTranscript: string

  /** LLM-generated response */
  assistantResponse: string

  /** Synthesized audio (base64 float32 PCM) */
  audio: string

  /** Performance metrics */
  metrics: VoiceAgentMetrics
}

interface VoiceAgentMetrics {
  /** STT latency in ms */
  sttLatencyMs: number

  /** LLM latency in ms */
  llmLatencyMs: number

  /** TTS latency in ms */
  ttsLatencyMs: number

  /** Total turn latency in ms */
  totalLatencyMs: number

  /** Tokens generated */
  tokensGenerated: number
}

startVoiceSession

Start an interactive voice session with automatic VAD.
await RunAnywhere.startVoiceSession(
  config: VoiceSessionConfig,
  callback: (event: VoiceSessionEvent) => void
): Promise<VoiceSessionHandle>

VoiceSessionConfig

interface VoiceSessionConfig {
  /** Voice agent config */
  agentConfig: VoiceAgentConfig

  /** Enable VAD for automatic speech detection */
  enableVAD?: boolean

  /** VAD sensitivity (0.0-1.0) */
  vadSensitivity?: number

  /** Timeout for user speech (ms) */
  speechTimeout?: number
}
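
For reference, a configuration sketch that fills in all the optional fields; the sensitivity and timeout values are illustrative, not recommended defaults:

const sessionConfig: VoiceSessionConfig = {
  agentConfig: {
    llmModelId: 'smollm2-360m',
    sttModelId: 'whisper-tiny-en',
    ttsModelId: 'piper-en-lessac',
    systemPrompt: 'You are a helpful voice assistant.',
  },
  enableVAD: true,       // let VAD detect when the user starts and stops speaking
  vadSensitivity: 0.5,   // range 0.0-1.0
  speechTimeout: 5000,   // timeout for user speech, in ms (illustrative value)
}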

VoiceSessionEvent

interface VoiceSessionEvent {
  type:
    | 'sessionStarted'
    | 'listeningStarted'
    | 'speechDetected'
    | 'speechEnded'
    | 'transcribing'
    | 'transcriptionComplete'
    | 'generating'
    | 'generationComplete'
    | 'synthesizing'
    | 'synthesisComplete'
    | 'speaking'
    | 'turnComplete'
    | 'error'

  /** Event data */
  data?: {
    transcript?: string
    response?: string
    audio?: string
    error?: string
  }
}
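
The callback passed to startVoiceSession receives these events as the turn progresses. A minimal, UI-free logging callback (a sketch that only touches the event types and data fields defined above):

function logSessionEvent(event: VoiceSessionEvent) {
  switch (event.type) {
    case 'transcriptionComplete':
      console.log('User:', event.data?.transcript)
      break
    case 'generationComplete':
      console.log('Assistant:', event.data?.response)
      break
    case 'error':
      console.error('Session error:', event.data?.error)
      break
    default:
      console.log('Event:', event.type)
  }
}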

Examples

Complete Voice Assistant

VoiceAssistant.tsx
import React, { useState, useCallback, useEffect } from 'react'
import { View, Text, Button, StyleSheet } from 'react-native'
import { RunAnywhere, VoiceSessionEvent, VoiceSessionHandle } from '@runanywhere/core'

export function VoiceAssistant() {
  const [isActive, setIsActive] = useState(false)
  const [status, setStatus] = useState('Ready')
  const [transcript, setTranscript] = useState('')
  const [response, setResponse] = useState('')
  const sessionRef = React.useRef<VoiceSessionHandle | null>(null)

  const handleEvent = useCallback((event: VoiceSessionEvent) => {
    switch (event.type) {
      case 'sessionStarted':
        setStatus('Listening...')
        break
      case 'speechDetected':
        setStatus('Hearing you...')
        break
      case 'transcribing':
        setStatus('Processing...')
        break
      case 'transcriptionComplete':
        setTranscript(event.data?.transcript || '')
        setStatus('Thinking...')
        break
      case 'generationComplete':
        setResponse(event.data?.response || '')
        setStatus('Speaking...')
        break
      case 'speaking':
        // Play audio: event.data?.audio
        break
      case 'turnComplete':
        setStatus('Listening...')
        break
      case 'error':
        setStatus('Error: ' + event.data?.error)
        break
    }
  }, [])

  const startSession = useCallback(async () => {
    setIsActive(true)
    setTranscript('')
    setResponse('')

    sessionRef.current = await RunAnywhere.startVoiceSession(
      {
        agentConfig: {
          llmModelId: 'smollm2-360m',
          sttModelId: 'whisper-tiny-en',
          ttsModelId: 'piper-en-lessac',
          systemPrompt: 'You are a helpful voice assistant.',
        },
        enableVAD: true,
        vadSensitivity: 0.5,
      },
      handleEvent
    )
  }, [handleEvent])

  const stopSession = useCallback(async () => {
    if (sessionRef.current) {
      await sessionRef.current.stop()
      sessionRef.current = null
    }
    setIsActive(false)
    setStatus('Ready')
  }, [])

  return (
    <View style={styles.container}>
      <View style={[styles.statusBadge, isActive && styles.active]}>
        <Text style={styles.statusText}>{status}</Text>
      </View>

      {transcript && (
        <View style={styles.bubble}>
          <Text style={styles.bubbleLabel}>You</Text>
          <Text style={styles.bubbleText}>{transcript}</Text>
        </View>
      )}

      {response && (
        <View style={[styles.bubble, styles.assistantBubble]}>
          <Text style={styles.bubbleLabel}>Assistant</Text>
          <Text style={styles.bubbleText}>{response}</Text>
        </View>
      )}

      <Button
        title={isActive ? 'Stop' : 'Start Voice Assistant'}
        onPress={isActive ? stopSession : startSession}
      />
    </View>
  )
}

const styles = StyleSheet.create({
  container: { flex: 1, padding: 16 },
  statusBadge: {
    alignSelf: 'center',
    paddingHorizontal: 16,
    paddingVertical: 8,
    borderRadius: 20,
    backgroundColor: '#e0e0e0',
    marginBottom: 24,
  },
  active: { backgroundColor: '#4caf50' },
  statusText: { fontWeight: '600', color: '#333' },
  bubble: {
    padding: 12,
    borderRadius: 12,
    backgroundColor: '#f5f5f5',
    marginBottom: 12,
  },
  assistantBubble: { backgroundColor: '#e3f2fd' },
  bubbleLabel: { fontSize: 12, color: '#666', marginBottom: 4 },
  bubbleText: { fontSize: 16 },
})

Single Turn Processing

// For push-to-talk or single-turn interactions
async function processSingleTurn(audioBase64: string) {
  const result = await RunAnywhere.processVoiceTurn(audioBase64)

  console.log('=== Voice Turn Results ===')
  console.log('User:', result.userTranscript)
  console.log('Assistant:', result.assistantResponse)
  console.log('')
  console.log('=== Metrics ===')
  console.log('STT:', result.metrics.sttLatencyMs, 'ms')
  console.log('LLM:', result.metrics.llmLatencyMs, 'ms')
  console.log('TTS:', result.metrics.ttsLatencyMs, 'ms')
  console.log('Total:', result.metrics.totalLatencyMs, 'ms')

  // Play the response audio
  playAudio(result.audio)

  return result
}
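
The playAudio helper above is app-specific; playback depends on your audio stack. Since result.audio is base64-encoded float32 PCM, the decode step might look like this (a sketch assuming a Buffer polyfill, or another base64 decoder, is available):

import { Buffer } from 'buffer'

// Decode the base64 float32 PCM returned by processVoiceTurn into samples
// you can hand to whatever playback mechanism your app uses.
function decodeVoiceAudio(base64Audio: string): Float32Array {
  const bytes = Buffer.from(base64Audio, 'base64')
  // Copy into a fresh ArrayBuffer so the Float32Array view is 4-byte aligned
  const copy = new Uint8Array(bytes)
  return new Float32Array(copy.buffer)
}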

Custom Voice Pipeline

Build a custom pipeline for more control:
class CustomVoiceAgent {
  private conversationHistory: { role: string; content: string }[] = []

  async processTurn(audioPath: string): Promise<void> {
    // 1. Transcribe user speech
    const sttResult = await RunAnywhere.transcribeFile(audioPath, {
      language: 'en',
    })
    console.log('User:', sttResult.text)

    // 2. Add to conversation history
    this.conversationHistory.push({
      role: 'user',
      content: sttResult.text,
    })

    // 3. Generate LLM response with history
    const prompt = this.buildPrompt()
    const llmResult = await RunAnywhere.generate(prompt, {
      maxTokens: 150,
      temperature: 0.7,
      systemPrompt: 'You are a helpful assistant.',
    })
    console.log('Assistant:', llmResult.text)

    // 4. Add response to history
    this.conversationHistory.push({
      role: 'assistant',
      content: llmResult.text,
    })

    // 5. Synthesize and play response
    const ttsResult = await RunAnywhere.synthesize(llmResult.text)
    await playAudio(ttsResult.audio, ttsResult.sampleRate)
  }

  private buildPrompt(): string {
    return (
      this.conversationHistory
        .map((m) => `${m.role === 'user' ? 'User' : 'Assistant'}: ${m.content}`)
        .join('\n') + '\nAssistant:'
    )
  }

  clearHistory() {
    this.conversationHistory = []
  }
}
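
Usage, assuming the models from the setup section are downloaded and the audio file paths come from your own recording code (the paths here are hypothetical):

const agent = new CustomVoiceAgent()

// One turn per recorded audio file
await agent.processTurn('/path/to/user-recording.wav')
await agent.processTurn('/path/to/follow-up.wav')

// Reset the conversation when the user starts over
agent.clearHistory()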

Performance Optimization

Latency Breakdown

| Stage | Typical Latency | Optimization |
| ----- | --------------- | ------------ |
| VAD   | < 50 ms         | Use appropriate frame size |
| STT   | 100-500 ms      | Use smaller models |
| LLM   | 200-2000 ms     | Use smaller models, limit tokens |
| TTS   | 100-300 ms      | Use streaming |
| Total | 400-3000 ms     | |

Tips for Lower Latency

Use smaller models for faster response times. A 360M LLM with Whisper Tiny can achieve sub-second voice turns.
// Optimized for speed
await RunAnywhere.initializeVoiceAgent({
  llmModelId: 'smollm2-360m', // Small, fast model
  sttModelId: 'whisper-tiny-en', // Fastest STT
  ttsModelId: 'piper-en-lessac', // Neural TTS
  generationOptions: {
    maxTokens: 100, // Shorter responses
    temperature: 0.5, // More focused, deterministic output
  },
})

Error Handling

try {
  const session = await RunAnywhere.startVoiceSession(config, (event) => {
    if (event.type === 'error') {
      handleError(event.data?.error)
    }
  })
} catch (error) {
  if (isSDKError(error)) {
    switch (error.code) {
      case SDKErrorCode.modelNotFound:
        console.error('Download required models first')
        break
      case SDKErrorCode.voiceAgentFailed:
        console.error('Voice agent failed:', error.message)
        break
    }
  }
}
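
A common failure mode is initializing the voice agent before its models are downloaded. One recovery sketch, reusing the model IDs and calls from the setup section (the retry-once flow and the agentConfig variable are assumptions, not prescribed SDK behavior):

try {
  await RunAnywhere.initializeVoiceAgent(agentConfig)
} catch (error) {
  if (isSDKError(error) && error.code === SDKErrorCode.modelNotFound) {
    // Download the missing models, then retry initialization once
    await Promise.all([
      RunAnywhere.downloadModel('smollm2-360m'),
      RunAnywhere.downloadModel('whisper-tiny-en'),
      RunAnywhere.downloadModel('piper-en-lessac'),
    ])
    await RunAnywhere.initializeVoiceAgent(agentConfig)
  } else {
    throw error
  }
}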