Early Beta — The Web SDK is in early beta (v0.1.x). APIs may change between releases. We’d
love your feedback — report issues or share ideas on
GitHub.
Overview
The RunAnywhere Web SDK is a production-grade, on-device AI SDK for the browser. It compiles the same C++ inference engine used by the iOS and Android SDKs to WebAssembly, enabling developers to run LLMs, Speech-to-Text, Text-to-Speech, Vision, and Voice AI directly in the browser: private, offline-capable, and with zero server dependencies.

The SDK is split into three npm packages by backend:

| Package | Description |
|---|---|
| `@runanywhere/web` | Core SDK: initialization, model management, events, `VoicePipeline` |
| `@runanywhere/web-llamacpp` | LLM & VLM inference via llama.cpp WASM (`TextGeneration`, `VLMWorkerBridge`, `VideoCapture`) |
| `@runanywhere/web-onnx` | STT, TTS & VAD via sherpa-onnx WASM (`AudioCapture`, `AudioPlayback`, `VAD`) |
- LLM — Text generation with streaming support via llama.cpp WASM
- STT — Speech-to-text transcription with Whisper and sherpa-onnx
- TTS — Neural voice synthesis with Piper TTS via sherpa-onnx
- VAD — Real-time voice activity detection with Silero VAD
- VLM — Vision language models for image understanding
- Tool Calling — Function calling and structured JSON output
Key Capabilities
- Three focused packages — Core SDK + LlamaCpp backend (LLM/VLM) + ONNX backend (STT/TTS/VAD), install only what you need
- Zero runtime dependencies — Everything is self-contained via WebAssembly
- TypeScript-first — Full type safety with comprehensive type definitions
- Privacy by default — All inference runs in-browser via WASM, no data leaves the device
- Persistent storage — Models cached in OPFS (Origin Private File System) across sessions
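Because the packages are independent, installation is per backend; a typical setup with npm might look like this (package names from the table above):

```shell
# Core SDK (always required)
npm install @runanywhere/web

# Add the llama.cpp backend for LLM / VLM
npm install @runanywhere/web-llamacpp

# Add the sherpa-onnx backend for STT / TTS / VAD
npm install @runanywhere/web-onnx
```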
Core Philosophy
On-Device First
All AI inference runs locally in the browser via WebAssembly. Once models are downloaded, no
network connection is required for inference. Audio, text, and images never leave the device.
Modular Package Architecture
The Web SDK splits functionality across three packages by inference backend:
`@runanywhere/web` (core), `@runanywhere/web-llamacpp` (LLM/VLM via llama.cpp WASM), and `@runanywhere/web-onnx` (STT/TTS/VAD via sherpa-onnx WASM). This lets you install only the backends you need.
Privacy by Design
All data stays in the browser. No server calls, no API keys required for inference. Model files
are stored in the browser’s sandboxed OPFS storage.
Platform Parity
The Web SDK compiles the same C++ core as the iOS and Android SDKs to WebAssembly. Identical
inference logic, consistent results across all platforms.
Features
Language Models (LLM)
- On-device text generation with streaming support
- llama.cpp backend compiled to WASM (Liquid AI LFM2, Llama, Mistral, Qwen, SmolLM, and other GGUF models)
- Configurable system prompts, temperature, top-k/top-p, and max tokens
- Token streaming with async iterators and cancellation
- Result metrics: `tokensUsed`, `tokensPerSecond`, `latencyMs`
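Token streaming is exposed through standard async iterators, so generation output can be consumed and cancelled with ordinary language constructs. The sketch below shows the consumption pattern with a stand-in token source; `fakeTokenStream` and `consumeStream` are illustrative names, not the SDK's `TextGeneration` API:

```typescript
// Stand-in token source: the SDK's token stream has the same
// AsyncIterable<string> shape, so the consumption loop is identical.
async function* fakeTokenStream(): AsyncIterable<string> {
  for (const token of ["On-device ", "inference ", "keeps ", "data ", "local."]) {
    yield token;
  }
}

// Accumulate tokens, honouring an optional AbortSignal for cancellation.
async function consumeStream(
  stream: AsyncIterable<string>,
  signal?: AbortSignal,
): Promise<string> {
  let text = "";
  for await (const token of stream) {
    if (signal?.aborted) break; // cooperative cancellation
    text += token;
  }
  return text;
}

consumeStream(fakeTokenStream()).then((text) => console.log(text));
// logs "On-device inference keeps data local."
```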
Speech-to-Text (STT)
- Offline speech recognition via whisper.cpp and sherpa-onnx (WASM)
- Multiple model architectures: Whisper, Zipformer, Paraformer
- Batch transcription from `Float32Array` audio data
- Real-time streaming transcription sessions
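Batch transcription consumes raw PCM samples as a `Float32Array`. As a sketch, here is how to build a one-second 16 kHz mono buffer of the shape such an API expects (the 16 kHz rate is an assumption; check the requirements of the model you load):

```typescript
// 1 second of 16 kHz mono PCM in [-1, 1]; a 440 Hz sine tone stands in
// for captured microphone audio.
function makeSineBuffer(sampleRate = 16_000, freqHz = 440, seconds = 1): Float32Array {
  const samples = new Float32Array(Math.round(sampleRate * seconds));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = Math.sin((2 * Math.PI * freqHz * i) / sampleRate);
  }
  return samples;
}

const audio = makeSineBuffer(); // 16,000 samples, ready to hand to batch transcription
```

In a real application the buffer would come from `AudioCapture` or the Web Audio API rather than being synthesized.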
Text-to-Speech (TTS)
- Neural voice synthesis via sherpa-onnx Piper TTS (WASM)
- Multiple voice models with configurable speed and speaker
- PCM audio output (`Float32Array`) with sample rate metadata
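The synthesizer returns float PCM plus a sample rate. If you need 16-bit integer PCM (for example, to write a WAV file), the standard conversion is plain TypeScript, independent of the SDK:

```typescript
// Convert float PCM in [-1, 1] to 16-bit signed integer PCM.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range samples
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The resulting `Int16Array` can then be prefixed with a WAV header using the sample rate reported alongside the synthesized audio.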
Voice Activity Detection (VAD)
- Silero VAD model via sherpa-onnx (WASM)
- Real-time speech/silence detection from audio streams
- Speech segment extraction with configurable thresholds
- Callback-based speech activity events
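Threshold-based speech segment extraction can be pictured with a much simpler energy detector. The sketch below illustrates the thresholding idea only; it is not the Silero model the SDK actually runs:

```typescript
interface Segment { start: number; end: number } // frame indices, end exclusive

// Mark contiguous runs of frames whose RMS energy exceeds a threshold.
function extractSpeechSegments(frames: Float32Array[], threshold = 0.1): Segment[] {
  const segments: Segment[] = [];
  let start = -1;
  frames.forEach((frame, i) => {
    const rms = Math.sqrt(frame.reduce((acc, s) => acc + s * s, 0) / frame.length);
    const isSpeech = rms > threshold;
    if (isSpeech && start < 0) start = i; // segment opens
    if (!isSpeech && start >= 0) {        // segment closes
      segments.push({ start, end: i });
      start = -1;
    }
  });
  if (start >= 0) segments.push({ start, end: frames.length });
  return segments;
}
```

A neural VAD replaces the RMS check with a per-frame speech probability, but the segment bookkeeping is the same shape.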
Voice Pipeline
- Full STT -> LLM (streaming) -> TTS orchestration
- Callback-driven state transitions (transcription, generation, synthesis)
- Cancellation support for in-progress turns
- Multi-model coexistence via the `coexist` flag
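The callback-driven turn lifecycle can be modeled as a small state machine. The state names and transition table below are an illustration of the STT -> LLM -> TTS flow, not the SDK's actual internal types:

```typescript
type PipelineState = "idle" | "transcribing" | "generating" | "synthesizing";

// Legal transitions in one voice turn; any in-progress state may return
// to "idle" on cancellation.
const transitions: Record<PipelineState, PipelineState[]> = {
  idle: ["transcribing"],
  transcribing: ["generating", "idle"],
  generating: ["synthesizing", "idle"],
  synthesizing: ["idle"],
};

function canTransition(from: PipelineState, to: PipelineState): boolean {
  return transitions[from].includes(to);
}
```

Each transition corresponds to one of the pipeline's callbacks (transcription finished, generation finished, synthesis finished), which is why cancellation simply routes any state back to idle.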
Vision Language Models (VLM)
- Multimodal image+text inference via llama.cpp with mtmd backend
- Camera integration with the `VideoCapture` class
- Runs in a dedicated Web Worker via `VLMWorkerBridge` for a responsive UI
- Supports Liquid AI LFM2-VL, Qwen2-VL, SmolVLM, and LLaVA architectures
Tool Calling & Structured Output
- Function calling with typed tool definitions and parameter schemas
- Automatic tool orchestration loop (generate -> parse -> execute -> continue)
- JSON schema-guided generation with WASM-powered validation
- Supports default XML format and LFM2 Pythonic format
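A typed tool definition pairs a parameter schema with an executor, and the orchestration loop dispatches parsed calls against it. The shapes below are illustrative stand-ins; consult the API reference for the SDK's actual tool types:

```typescript
// Hypothetical tool shape: name, schema, and an executor.
interface ToolDef {
  name: string;
  description: string;
  parameters: Record<string, { type: string; description: string }>;
  execute: (args: Record<string, unknown>) => Promise<string>;
}

const getWeather: ToolDef = {
  name: "get_weather",
  description: "Look up current weather for a city",
  parameters: { city: { type: "string", description: "City name" } },
  execute: async (args) => `Sunny in ${args.city}`, // stub result
};

// One step of the generate -> parse -> execute -> continue loop:
// given a parsed tool call, run the matching tool and return its result
// (which would be fed back into the next generation turn).
async function dispatch(
  tools: ToolDef[],
  call: { name: string; args: Record<string, unknown> },
): Promise<string> {
  const tool = tools.find((t) => t.name === call.name);
  if (!tool) throw new Error(`Unknown tool: ${call.name}`);
  return tool.execute(call.args);
}
```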
System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| Browser | Chrome 96+ / Edge 96+ | Chrome 120+ / Edge 120+ |
| WebAssembly | Required | Required |
| SharedArrayBuffer | Optional (single-threaded fallback) | Enabled via COOP/COEP headers |
| OPFS | Required for model storage | Supported in all modern browsers |
| RAM | 2GB | 4GB+ for larger models |
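Multi-threaded WASM needs the page to be cross-origin isolated. As a sketch, here is how the required headers can be served from a Vite dev server (Vite is an assumption here; any server that can set response headers works the same way):

```typescript
// vite.config.ts: serve the COOP/COEP headers that enable SharedArrayBuffer
import { defineConfig } from "vite";

export default defineConfig({
  server: {
    headers: {
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
    },
  },
});
```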
Cross-Origin Isolation headers (COOP and COEP) are required for multi-threaded WASM via `SharedArrayBuffer`. Without them, the SDK falls back to single-threaded mode. See Configuration for setup details.
SDK Architecture
Key Differences from Native SDKs
| Aspect | Native SDKs (iOS/Android/RN) | Web SDK |
|---|---|---|
| Package | Multiple packages per backend | Three packages: web, web-llamacpp, web-onnx |
| Runtime | Native code | WebAssembly |
| Storage | File system | OPFS (browser sandbox) |
| Audio | Platform APIs | Web Audio API |
| GPU | Metal / Vulkan | WebGPU (when available) |
| Threading | OS threads | SharedArrayBuffer + COOP/COEP |
| Install | npm + native build | npm only |
Example App
A full-featured starter application is included in the SDK repository:

- Web Starter App — Chat, Vision, and Voice demos with React + Vite