Early Beta — The Web SDK is in early beta. APIs may change between releases.
Overview
This guide covers best practices for building performant, reliable, and user-friendly AI applications with the RunAnywhere Web SDK in the browser.
Model Selection
Choose the Right Model Size
| Model Size | RAM Required | Use Case | Speed |
|---|---|---|---|
| 360M-500M (Q4) | ~300-500MB | Quick responses, chat | Very Fast |
| 1B-3B (Q4) | 1-2GB | Balanced quality/speed | Fast |
| 7B (Q4) | 4-5GB | High quality | Slower |
Browser memory is more limited than native apps. Models larger than 2GB may cause tab crashes on
devices with limited RAM. Start with smaller models and test on target devices.
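As a sketch, the reported device memory (where the browser exposes it) can drive the initial choice. The helper and thresholds below are illustrative assumptions for this guide, not SDK-defined values:

```typescript
// Hypothetical helper: map reported device memory (GB) to a model tier
// from the table above. Thresholds are illustrative, not SDK-defined.
type ModelTier = 'small' | 'medium' | 'large'

function pickModelTier(deviceMemoryGB: number): ModelTier {
  if (deviceMemoryGB >= 8) return 'large'  // 7B Q4 (~4-5GB RAM)
  if (deviceMemoryGB >= 4) return 'medium' // 1B-3B Q4 (~1-2GB RAM)
  return 'small'                           // 360M-500M Q4 (~300-500MB RAM)
}
```

In the browser, navigator.deviceMemory (Chrome/Edge only, capped at 8) can feed this: pickModelTier((navigator as any).deviceMemory ?? 4).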
Quantization Trade-offs
| Quantization | Quality | Size | Speed |
|---|---|---|---|
| Q8_0 | Best | Largest | Slower |
| Q6_K | Great | Large | Fast |
| Q4_K_M | Good | Medium | Faster |
| Q4_0 | Acceptable | Small | Fastest |
For browser use, Q4_0 and Q4_K_M offer the best balance of quality and memory efficiency. Start
with smaller quantizations and only increase if output quality is insufficient.
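The same idea applies to quantization. A minimal sketch, assuming illustrative memory cutoffs (not SDK values), that follows the table's recommendations:

```typescript
// Illustrative helper: pick a quantization from the table above based on
// memory headroom. The cutoffs are assumptions for this sketch.
function pickQuantization(deviceMemoryGB: number, isMobile: boolean): string {
  if (isMobile || deviceMemoryGB < 4) return 'Q4_0' // smallest, fastest
  if (deviceMemoryGB < 8) return 'Q4_K_M'           // good quality, medium size
  return 'Q6_K'                                     // great quality if RAM allows
}
```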
Use Streaming for Better UX
import { TextGeneration } from '@runanywhere/web-llamacpp'
// Bad: User waits for entire response
const result = await TextGeneration.generate(prompt, { maxTokens: 500 })
document.getElementById('output')!.textContent = result.text
// Good: User sees response as it's generated
const { stream } = await TextGeneration.generateStream(prompt, { maxTokens: 500 })
let text = ''
for await (const token of stream) {
text += token
document.getElementById('output')!.textContent = text
}
Enable Cross-Origin Isolation
Multi-threaded WASM is significantly faster. Configure COOP/COEP headers on your server, then verify isolation at runtime:
import { LlamaCPP } from '@runanywhere/web-llamacpp'
// Check after backend registration
await LlamaCPP.register()
console.log('Acceleration:', LlamaCPP.accelerationMode) // 'webgpu' or 'cpu'
if (!crossOriginIsolated) {
console.warn('Running in single-threaded mode. Add COOP/COEP headers for better performance.')
}
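For a Vite project, the headers can be set on the dev and preview servers. A minimal sketch (COEP: credentialless rather than require-corp, for the reasons covered in the hosted-IDE section):

```typescript
// vite.config.ts — enable cross-origin isolation in dev and preview.
// COEP: credentialless (not require-corp) keeps Vite's module serving working.
import { defineConfig } from 'vite'

const isolationHeaders = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'credentialless',
}

export default defineConfig({
  server: { headers: isolationHeaders },
  preview: { headers: isolationHeaders },
})
```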
Exclude WASM Packages from Vite Pre-Bundling
This is the most common gotcha with Vite:
export default defineConfig({
optimizeDeps: {
exclude: ['@runanywhere/web-llamacpp', '@runanywhere/web-onnx'],
},
})
Without this, Vite pre-bundles the packages, import.meta.url resolves into the pre-bundle cache instead of the package directory, and the WASM files won't be found.
Limit Token Generation
import { TextGeneration } from '@runanywhere/web-llamacpp'
// For quick responses
const { stream } = await TextGeneration.generateStream(prompt, {
maxTokens: 100,
temperature: 0.5,
})
// For detailed responses
const { stream: detailedStream } = await TextGeneration.generateStream(prompt, {
maxTokens: 500,
temperature: 0.7,
})
Batch DOM Updates
For fast token generation, throttle UI updates to avoid rendering bottlenecks:
let pending = ''
let frameId: number | null = null
function appendToken(token: string) {
pending += token
if (!frameId) {
frameId = requestAnimationFrame(() => {
document.getElementById('output')!.textContent = pending
frameId = null
})
}
}
Model Management
Use OPFS for Persistent Storage
Download models to OPFS so they persist across browser sessions:
import { RunAnywhere, ModelManager, ModelCategory, LLMFramework, EventBus } from '@runanywhere/web'
RunAnywhere.registerModels([
{
id: 'lfm2-350m-q4_k_m',
name: 'LFM2 350M',
repo: 'LiquidAI/LFM2-350M-GGUF',
files: ['LFM2-350M-Q4_K_M.gguf'],
framework: LLMFramework.LlamaCpp,
modality: ModelCategory.Language,
memoryRequirement: 250_000_000,
},
])
// First visit: download model
await ModelManager.downloadModel('lfm2-350m-q4_k_m')
await ModelManager.loadModel('lfm2-350m-q4_k_m')
// Subsequent visits: model loads from OPFS (no re-download)
Show Download Progress
import { EventBus } from '@runanywhere/web'
EventBus.shared.on('model.downloadProgress', (evt) => {
const percent = ((evt.progress ?? 0) * 100).toFixed(0)
document.getElementById('progress')!.textContent = `Downloading: ${percent}%`
})
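Beyond a raw percentage, an estimated time remaining improves perceived reliability for large downloads. A sketch of the pure calculations (the helper names are this guide's, not SDK APIs), which can be wired into the event handler above:

```typescript
// Sketch: derive a percent label and a rough ETA from progress samples.
// progress is a fraction in [0, 1]; times are in milliseconds.
function formatProgress(progress: number): string {
  return `Downloading: ${(progress * 100).toFixed(0)}%`
}

function estimateEtaMs(
  startedAtMs: number,
  nowMs: number,
  progress: number
): number | null {
  if (progress <= 0) return null               // no data yet
  const elapsed = nowMs - startedAtMs
  return (elapsed / progress) * (1 - progress) // remaining time at current rate
}
```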
Handle Large Model Downloads
Models over ~200MB can crash the browser tab, especially on memory-constrained devices. Mitigations:
// Check available memory before downloading
if ('deviceMemory' in navigator) {
const memoryGB = (navigator as any).deviceMemory
const modelSizeMB = 250
if (memoryGB < 4 && modelSizeMB > 200) {
console.warn('Low memory device — large model downloads may fail.')
}
}
// Monitor for download stalls (no progress for 30s may indicate trouble)
let lastProgress = 0
let lastTime = Date.now()
EventBus.shared.on('model.downloadProgress', (evt) => {
const progress = evt.progress ?? 0
if (progress > lastProgress) {
lastProgress = progress
lastTime = Date.now()
} else if (Date.now() - lastTime > 30000) {
console.warn('Download appears stalled. Tab may be running low on memory.')
}
})
If the browser tab crashes during a model download, the partial download is stored in OPFS. On the
next attempt, ModelManager.downloadModel() will resume from where it left off. Recommend
starting with smaller models (LFM2 350M at ~250MB) before attempting larger ones.
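The stall check above can be factored into a small, testable helper. A sketch with an injectable clock so the 30-second threshold (the same assumption as above) can be verified without real timers:

```typescript
// Sketch: track download progress events and report stalls.
// now() is injectable so the logic can be tested without real timers.
class StallDetector {
  private lastProgress = -1
  private lastChangeAt: number

  constructor(
    private thresholdMs: number = 30_000,
    private now: () => number = () => Date.now()
  ) {
    this.lastChangeAt = this.now()
  }

  // Feed each progress event; returns true when no forward progress
  // has been observed within the threshold window.
  update(progress: number): boolean {
    const t = this.now()
    if (progress > this.lastProgress) {
      this.lastProgress = progress
      this.lastChangeAt = t
      return false
    }
    return t - this.lastChangeAt > this.thresholdMs
  }
}
```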
Use coexist for Multi-Model Loading
When loading multiple models (e.g., for voice pipeline), pass coexist: true:
await ModelManager.loadModel('silero-vad-v5', { coexist: true })
await ModelManager.loadModel('sherpa-onnx-whisper-tiny.en', { coexist: true })
await ModelManager.loadModel('lfm2-350m-q4_k_m', { coexist: true })
await ModelManager.loadModel('vits-piper-en_US-lessac-medium', { coexist: true })
Idempotent SDK Initialization
Wrap initialization in a cached-promise pattern so it’s safe to call from multiple components:
let _initPromise: Promise<void> | null = null
export async function initSDK(): Promise<void> {
if (_initPromise) return _initPromise
_initPromise = (async () => {
await RunAnywhere.initialize({ environment: SDKEnvironment.Development, debug: true })
await LlamaCPP.register()
await ONNX.register()
RunAnywhere.registerModels(MODELS)
})()
return _initPromise
}
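One caveat with the cached-promise pattern: if initialization throws, the rejected promise stays cached and every later call replays the failure. A variant that clears the cache on failure so callers can retry (a sketch, written generically over any async init function):

```typescript
// Sketch: cache the in-flight promise, but reset on failure so a later
// call retries initialization instead of replaying the cached rejection.
function makeIdempotent(init: () => Promise<void>): () => Promise<void> {
  let promise: Promise<void> | null = null
  return () => {
    if (!promise) {
      promise = init().catch((err) => {
        promise = null // allow retry on the next call
        throw err
      })
    }
    return promise
  }
}
```

Usage: export const initSDK = makeIdempotent(async () => { /* initialize and register backends as above */ }).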
Hosted IDE & Iframe Environments
Replit, CodeSandbox, StackBlitz
These platforms run your app inside an iframe, which has important implications:
| Limitation | Impact | Workaround |
|---|---|---|
| No SharedArrayBuffer | WASM runs single-threaded (slower) | Access app directly, not via iframe preview |
| Memory constraints | Large models (>200MB) may crash | Use smaller models (350M-500M params, Q4_0) |
| COOP header conflict | COOP: same-origin breaks iframe embedding | SDK falls back gracefully; no action needed |
| Preview URL differs | CORS/COEP may behave differently | Test with the published URL, not the preview |
Do not use COEP: require-corp in hosted IDE environments. It will block Vite’s internal
/@fs/ module serving and cause “non-JavaScript MIME type” errors for worker scripts and WASM
glue files. Always use COEP: credentialless.
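On a custom Express/Node server, the same headers can be set in a small middleware registered before static and catch-all routes. A sketch using plain functions so it is easy to unit-test:

```typescript
// Sketch: Express-style middleware adding cross-origin isolation headers.
// Uses COEP: credentialless per the note above.
type HeaderRes = { setHeader: (name: string, value: string) => void }

function isolationHeaders(
  _req: unknown,
  res: HeaderRes,
  next: () => void
): void {
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin')
  res.setHeader('Cross-Origin-Embedder-Policy', 'credentialless')
  next()
}

// app.use(isolationHeaders) // register before static/catch-all routes
```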
SPA Routing and Static Assets
When using a custom Express/Node server with SPA catch-all routing, static asset routes must come before the catch-all. Otherwise, .wasm, .js, and worker files get served as index.html:
// WRONG ORDER — catch-all swallows WASM requests:
app.get('*', (req, res) => res.sendFile('index.html'))
app.use(express.static('dist')) // Never reached for .wasm files
// CORRECT ORDER — static files served first:
app.use(
express.static('dist', {
setHeaders: (res, path) => {
if (path.endsWith('.wasm')) {
res.setHeader('Content-Type', 'application/wasm')
}
},
})
)
app.get('*', (req, res) => {
if (!req.path.match(/\.(js|css|wasm|json|png|svg|woff2?)$/)) {
res.sendFile('index.html', { root: 'dist' })
} else {
res.status(404).end()
}
})
Browser-Specific Considerations
Handle Tab Visibility
Cancel in-progress generation when the tab is hidden:
import { TextGeneration } from '@runanywhere/web-llamacpp'
document.addEventListener('visibilitychange', () => {
if (document.hidden) {
TextGeneration.cancel()
}
})
Handle Memory Pressure
if ('deviceMemory' in navigator) {
const memory = (navigator as any).deviceMemory // GB
if (memory < 4) {
console.warn('Low memory device. Use smaller models.')
}
}
Safari Considerations
- OPFS has known reliability issues in Safari — test thoroughly
- WebGPU support in Safari is limited — check LlamaCPP.accelerationMode at runtime and expect CPU fallback
- Prefer Chrome/Edge for the best experience
Mobile Browser Considerations
- Mobile browsers have stricter memory limits
- Models larger than 1GB may cause tab crashes
- Use Q4_0 quantization for mobile
- Test on actual mobile devices
Camera Permission Handling
Provide clear error messages for camera permissions (important for VLM):
try {
const camera = new VideoCapture({ facingMode: 'environment' })
await camera.start()
} catch (err) {
const msg = (err as Error).message
if (msg.includes('NotAllowed') || msg.includes('Permission')) {
alert('Camera permission denied. Allow camera access for this site in your browser settings.')
} else if (msg.includes('NotFound')) {
alert('No camera found on this device.')
} else if (msg.includes('NotReadable')) {
alert('Camera is in use by another application.')
}
}
Error Handling
Always Handle Errors Gracefully
import { SDKError, SDKErrorCode } from '@runanywhere/web'
import { TextGeneration } from '@runanywhere/web-llamacpp'
async function generateSafely(prompt: string): Promise<string> {
try {
const { stream } = await TextGeneration.generateStream(prompt, { maxTokens: 200 })
let text = ''
for await (const token of stream) {
text += token
}
return text
} catch (err) {
if (err instanceof SDKError) {
switch (err.code) {
case SDKErrorCode.ModelNotLoaded:
return 'Please load a model first.'
case SDKErrorCode.GenerationCancelled:
return ''
default:
return 'Sorry, an error occurred. Please try again.'
}
}
return 'An unexpected error occurred.'
}
}
Security & Privacy
All Data Stays Local
The Web SDK runs entirely in the browser via WebAssembly. No data is sent to any server. This is a key advantage for privacy-sensitive applications.
Use Correct Environment Mode
// Development: Full logging
await RunAnywhere.initialize({ environment: SDKEnvironment.Development, debug: true })
// Production: Minimal logging
await RunAnywhere.initialize({ environment: SDKEnvironment.Production })
Vite Gotchas Summary
| Issue | Fix |
|---|---|
| WASM files not found | Add optimizeDeps.exclude: ['@runanywhere/web-llamacpp', '@runanywhere/web-onnx'] |
| Web Workers not bundling | Add worker: { format: 'es' } |
| WASM not served in dev | Add assetsInclude: ['**/*.wasm'] |
| VLM Worker TypeScript error | Use @ts-ignore above ?worker&url import |
| WASM missing in production build | Add copyWasmPlugin() to copy WASM to dist/assets/ |
| Single-threaded mode | Add COOP/COEP headers to server.headers |
| WASM served as HTML in prod | Ensure static file serving comes before SPA catch-all route |
| Worker “MIME type” error | Use COEP: credentialless (NOT require-corp); fix SPA catch-all route ordering |
| Camera “source width is 0” | Wait for loadedmetadata event before calling captureFrame() |
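The Vite-side fixes from the table combine into one config. A sketch; copyWasmPlugin is the project-specific helper referenced above (its import path below is illustrative, not a Vite builtin):

```typescript
// vite.config.ts — the Vite fixes from the table above, combined.
import { defineConfig } from 'vite'
// import { copyWasmPlugin } from './build/copy-wasm' // illustrative path

export default defineConfig({
  // plugins: [copyWasmPlugin()], // copy WASM to dist/assets/ for production
  optimizeDeps: {
    exclude: ['@runanywhere/web-llamacpp', '@runanywhere/web-onnx'],
  },
  worker: { format: 'es' },          // bundle Web Workers as ES modules
  assetsInclude: ['**/*.wasm'],      // serve WASM in dev
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'credentialless', // NOT require-corp
    },
  },
})
```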
Summary Checklist
- Install all three packages: @runanywhere/web, web-llamacpp, web-onnx
- Register backends with LlamaCPP.register() and ONNX.register()
- Choose appropriate model size for browser memory constraints
- Use streaming for better perceived performance
- Configure Cross-Origin Isolation headers (COEP: credentialless, NOT require-corp)
- Add optimizeDeps.exclude in Vite config for WASM packages
- Add copyWasmPlugin() to copy WASM files for production builds
- Serve static assets (.wasm, .js) BEFORE SPA catch-all routes
- Set Content-Type: application/wasm for .wasm files on custom servers
- Use OPFS for persistent model storage via ModelManager
- Use coexist: true when loading multiple models simultaneously
- Wait for loadedmetadata before calling VideoCapture.captureFrame()
- For voice pipeline: load all 4 models (VAD + STT + LLM + TTS) — VAD alone is not STT
- Handle all error cases gracefully including WASM memory crashes
- Show progress during model downloads via EventBus
- Batch DOM updates during fast token streaming
- Test on target browsers and devices (not just iframe previews)