Early Beta — The Web SDK is in early beta. APIs may change between releases.

Overview

This guide covers best practices for building performant, reliable, and user-friendly AI applications with the RunAnywhere Web SDK in the browser.

Model Selection

Choose the Right Model Size

| Model Size | RAM Required | Use Case | Speed |
| --- | --- | --- | --- |
| 360M-500M (Q4) | ~300-500MB | Quick responses, chat | Very Fast |
| 1B-3B (Q4) | 1-2GB | Balanced quality/speed | Fast |
| 7B (Q4) | 4-5GB | High quality | Slower |
Browser tabs have tighter memory limits than native apps. Models larger than 2GB may crash the tab on devices with limited RAM. Start with smaller models and test on your target devices.
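A rough pre-flight check can gate model choice on reported device memory. The helper below is a hypothetical heuristic, not an SDK API; the ~25% headroom figure and the 500MB fallback limit are illustrative assumptions:

```typescript
// Hypothetical heuristic: estimate whether a model is safe to load in a tab.
// navigator.deviceMemory is unavailable in some browsers (e.g. Safari), so
// callers may pass undefined.
function modelLikelyFits(modelSizeMB: number, deviceMemoryGB?: number): boolean {
  // Unknown memory: stay conservative and only allow small models.
  if (deviceMemoryGB === undefined) return modelSizeMB <= 500
  // Leave headroom: keep the model under ~25% of reported RAM.
  return modelSizeMB <= deviceMemoryGB * 1024 * 0.25
}

// e.g. modelLikelyFits(250, (navigator as any).deviceMemory) before downloading
```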

Quantization Trade-offs

| Quantization | Quality | Size | Speed |
| --- | --- | --- | --- |
| Q8_0 | Best | Largest | Slower |
| Q6_K | Great | Large | Fast |
| Q4_K_M | Good | Medium | Faster |
| Q4_0 | Acceptable | Small | Fastest |
For browser use, Q4_0 and Q4_K_M offer the best balance of quality and memory efficiency. Start with smaller quantizations and only increase if output quality is insufficient.
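One way to apply this advice programmatically is a small picker keyed on navigator.deviceMemory. This helper is a sketch, not part of the SDK, and the memory cutoffs are illustrative assumptions:

```typescript
type Quant = 'Q4_0' | 'Q4_K_M' | 'Q6_K' | 'Q8_0'

// Sketch: choose a starting quantization from reported device memory (GB).
// Cutoffs are assumptions; move up only if output quality is insufficient.
function startingQuant(deviceMemoryGB: number): Quant {
  if (deviceMemoryGB >= 16) return 'Q6_K'  // plenty of headroom
  if (deviceMemoryGB >= 8) return 'Q4_K_M'
  return 'Q4_0'                            // constrained or mobile devices
}
```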

Performance Optimization

Use Streaming for Better UX

import { TextGeneration } from '@runanywhere/web-llamacpp'

// Bad: User waits for entire response
const result = await TextGeneration.generate(prompt, { maxTokens: 500 })
document.getElementById('output')!.textContent = result.text

// Good: User sees response as it's generated
const { stream } = await TextGeneration.generateStream(prompt, { maxTokens: 500 })
let text = ''
for await (const token of stream) {
  text += token
  document.getElementById('output')!.textContent = text
}

Enable Cross-Origin Isolation

Multi-threaded WASM is significantly faster. Always configure COOP/COEP headers:
import { LlamaCPP } from '@runanywhere/web-llamacpp'

// Check after backend registration
await LlamaCPP.register()
console.log('Acceleration:', LlamaCPP.accelerationMode) // 'webgpu' or 'cpu'

if (!crossOriginIsolated) {
  console.warn('Running in single-threaded mode. Add COOP/COEP headers for better performance.')
}
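The check above only reports the current mode. To actually enable multi-threading during Vite development, the headers might be configured like this (a sketch for vite.config.ts; credentialless is used rather than require-corp, per the hosted-IDE notes later in this guide):

```typescript
// vite.config.ts — sketch: dev-server headers for cross-origin isolation
import { defineConfig } from 'vite'

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'credentialless',
    },
  },
})
```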

Exclude WASM Packages from Vite Pre-Bundling

This is the most common gotcha with Vite:
vite.config.ts
export default defineConfig({
  optimizeDeps: {
    exclude: ['@runanywhere/web-llamacpp', '@runanywhere/web-onnx'],
  },
})
Without this, Vite pre-bundles the packages, import.meta.url resolves to the wrong paths, and the WASM files won’t be found.

Limit Token Generation

import { TextGeneration } from '@runanywhere/web-llamacpp'

// For quick responses
const { stream } = await TextGeneration.generateStream(prompt, {
  maxTokens: 100,
  temperature: 0.5,
})

// For detailed responses
const { stream: detailedStream } = await TextGeneration.generateStream(prompt, {
  maxTokens: 500,
  temperature: 0.7,
})

Batch DOM Updates

For fast token generation, throttle UI updates to avoid rendering bottlenecks:
let pending = ''
let frameId: number | null = null

function appendToken(token: string) {
  pending += token
  if (!frameId) {
    frameId = requestAnimationFrame(() => {
      document.getElementById('output')!.textContent = pending
      frameId = null
    })
  }
}

Model Management

Use OPFS for Persistent Storage

Download models to OPFS so they persist across browser sessions:
import { RunAnywhere, ModelManager, ModelCategory, LLMFramework } from '@runanywhere/web'

RunAnywhere.registerModels([
  {
    id: 'lfm2-350m-q4_k_m',
    name: 'LFM2 350M',
    repo: 'LiquidAI/LFM2-350M-GGUF',
    files: ['LFM2-350M-Q4_K_M.gguf'],
    framework: LLMFramework.LlamaCpp,
    modality: ModelCategory.Language,
    memoryRequirement: 250_000_000,
  },
])

// First visit: download model
await ModelManager.downloadModel('lfm2-350m-q4_k_m')
await ModelManager.loadModel('lfm2-350m-q4_k_m')

// Subsequent visits: model loads from OPFS (no re-download)

Show Download Progress

import { EventBus } from '@runanywhere/web'

EventBus.shared.on('model.downloadProgress', (evt) => {
  const percent = ((evt.progress ?? 0) * 100).toFixed(0)
  document.getElementById('progress')!.textContent = `Downloading: ${percent}%`
})

Handle Large Model Downloads

Models over ~200MB can crash the browser tab, especially on memory-constrained devices. Mitigations:
// Check available memory before downloading
if ('deviceMemory' in navigator) {
  const memoryGB = (navigator as any).deviceMemory
  const modelSizeMB = 250

  if (memoryGB < 4 && modelSizeMB > 200) {
    console.warn('Low memory device — large model downloads may fail.')
  }
}

// Monitor for download stalls (no progress for 30s may indicate trouble)
let lastProgress = 0
let lastTime = Date.now()

EventBus.shared.on('model.downloadProgress', (evt) => {
  const progress = evt.progress ?? 0
  if (progress > lastProgress) {
    lastProgress = progress
    lastTime = Date.now()
  } else if (Date.now() - lastTime > 30000) {
    console.warn('Download appears stalled. Tab may be running low on memory.')
  }
})
If the browser tab crashes during a model download, the partial download is stored in OPFS. On the next attempt, ModelManager.downloadModel() will resume from where it left off. Recommend starting with smaller models (LFM2 350M at ~250MB) before attempting larger ones.
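Because partial downloads resume from OPFS, a transient network failure can often be recovered with a plain retry. The wrapper below is a generic sketch, not an SDK API; the usage comment reuses ModelManager.downloadModel from the examples above:

```typescript
// Generic retry with linear backoff; works for any async task.
async function withRetry<T>(
  task: () => Promise<T>,
  attempts = 3,
  delayMs = 1000,
): Promise<T> {
  let lastErr: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await task()
    } catch (err) {
      lastErr = err
      // Linear backoff before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, delayMs * (i + 1)))
    }
  }
  throw lastErr
}

// Usage — each retry resumes from the OPFS partial:
// await withRetry(() => ModelManager.downloadModel('lfm2-350m-q4_k_m'))
```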

Use coexist for Multi-Model Loading

When loading multiple models (e.g., for voice pipeline), pass coexist: true:
await ModelManager.loadModel('silero-vad-v5', { coexist: true })
await ModelManager.loadModel('sherpa-onnx-whisper-tiny.en', { coexist: true })
await ModelManager.loadModel('lfm2-350m-q4_k_m', { coexist: true })
await ModelManager.loadModel('vits-piper-en_US-lessac-medium', { coexist: true })

Idempotent SDK Initialization

Wrap initialization in a cached-promise pattern so it’s safe to call from multiple components:
import { RunAnywhere, SDKEnvironment } from '@runanywhere/web'
import { LlamaCPP } from '@runanywhere/web-llamacpp'
import { ONNX } from '@runanywhere/web-onnx'

let _initPromise: Promise<void> | null = null

export async function initSDK(): Promise<void> {
  if (_initPromise) return _initPromise
  _initPromise = (async () => {
    await RunAnywhere.initialize({ environment: SDKEnvironment.Development, debug: true })
    await LlamaCPP.register()
    await ONNX.register()
    RunAnywhere.registerModels(MODELS)
  })()
  return _initPromise
}

Hosted IDE & Iframe Environments

Replit, CodeSandbox, StackBlitz

These platforms run your app inside an iframe, which has important implications:
| Limitation | Impact | Workaround |
| --- | --- | --- |
| No SharedArrayBuffer | WASM runs single-threaded (slower) | Access the app directly, not via the iframe preview |
| Memory constraints | Large models (>200MB) may crash | Use smaller models (350M-500M params, Q4_0) |
| COOP header conflict | same-origin breaks iframe embedding | SDK falls back gracefully; no action needed |
| Preview URL differs | CORS/COEP may behave differently | Test with the published URL, not the preview |
Do not use COEP: require-corp in hosted IDE environments. It will block Vite’s internal /@fs/ module serving and cause “non-JavaScript MIME type” errors for worker scripts and WASM glue files. Always use COEP: credentialless.

SPA Routing and Static Assets

When using a custom Express/Node server with SPA catch-all routing, static asset routes must come before the catch-all. Otherwise, .wasm, .js, and worker files get served as index.html:
// WRONG ORDER — catch-all swallows WASM requests:
app.get('*', (req, res) => res.sendFile('index.html'))
app.use(express.static('dist')) // Never reached for .wasm files

// CORRECT ORDER — static files served first:
app.use(
  express.static('dist', {
    setHeaders: (res, path) => {
      if (path.endsWith('.wasm')) {
        res.setHeader('Content-Type', 'application/wasm')
      }
    },
  })
)
app.get('*', (req, res) => {
  if (!req.path.match(/\.(js|css|wasm|json|png|svg|woff2?)$/)) {
    res.sendFile('index.html', { root: 'dist' })
  } else {
    res.status(404).end()
  }
})

Browser-Specific Considerations

Handle Tab Visibility

Cancel in-progress generation when the tab is hidden:
import { TextGeneration } from '@runanywhere/web-llamacpp'

document.addEventListener('visibilitychange', () => {
  if (document.hidden) {
    TextGeneration.cancel()
  }
})

Handle Memory Pressure

if ('deviceMemory' in navigator) {
  const memory = (navigator as any).deviceMemory // GB
  if (memory < 4) {
    console.warn('Low memory device. Use smaller models.')
  }
}

Safari Considerations

  • OPFS has known reliability issues in Safari — test persistence thoroughly
  • WebGPU support in Safari is limited — expect the CPU fallback path
  • Prefer Chrome or Edge for the best experience
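These differences can be feature-detected up front before choosing models or backends. The function below is an illustrative sketch: it takes an injectable global so it can run outside a browser, and the properties it probes (navigator.gpu, OPFS getDirectory, crossOriginIsolated) are standard web APIs:

```typescript
interface BrowserCaps {
  webgpu: boolean  // WebGPU acceleration available
  opfs: boolean    // Origin Private File System for persistent model storage
  threads: boolean // multi-threaded WASM (requires cross-origin isolation)
}

// Sketch: probe the browser features this guide depends on.
function detectCaps(g: any = globalThis): BrowserCaps {
  return {
    webgpu: typeof g.navigator?.gpu !== 'undefined',
    opfs: typeof g.navigator?.storage?.getDirectory === 'function',
    threads: g.crossOriginIsolated === true && typeof g.SharedArrayBuffer !== 'undefined',
  }
}
```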

Mobile Browser Considerations

  • Mobile browsers have stricter memory limits
  • Models larger than 1GB may cause tab crashes
  • Use Q4_0 quantization for mobile
  • Test on actual mobile devices

Camera Permission Handling

Provide clear error messages for camera permissions (important for VLM):
try {
  const camera = new VideoCapture({ facingMode: 'environment' })
  await camera.start()
} catch (err) {
  // Camera failures usually surface as DOMExceptions; prefer the error name
  // and fall back to the message for wrapped errors.
  const name = err instanceof DOMException ? err.name : ''
  const msg = (err as Error).message ?? ''
  if (name === 'NotAllowedError' || msg.includes('NotAllowed') || msg.includes('Permission')) {
    alert('Camera permission denied. Check System Settings → Privacy & Security → Camera.')
  } else if (name === 'NotFoundError' || msg.includes('NotFound')) {
    alert('No camera found on this device.')
  } else if (name === 'NotReadableError' || msg.includes('NotReadable')) {
    alert('Camera is in use by another application.')
  } else {
    alert(`Unable to start the camera: ${msg}`)
  }
}

Error Handling

Always Handle Errors Gracefully

import { SDKError, SDKErrorCode } from '@runanywhere/web'
import { TextGeneration } from '@runanywhere/web-llamacpp'

async function generateSafely(prompt: string): Promise<string> {
  try {
    const { stream } = await TextGeneration.generateStream(prompt, { maxTokens: 200 })
    let text = ''
    for await (const token of stream) {
      text += token
    }
    return text
  } catch (err) {
    if (err instanceof SDKError) {
      switch (err.code) {
        case SDKErrorCode.ModelNotLoaded:
          return 'Please load a model first.'
        case SDKErrorCode.GenerationCancelled:
          return ''
        default:
          return 'Sorry, an error occurred. Please try again.'
      }
    }
    return 'An unexpected error occurred.'
  }
}

Security & Privacy

All Data Stays Local

The Web SDK runs entirely in the browser via WebAssembly. No data is sent to any server. This is a key advantage for privacy-sensitive applications.

Use Correct Environment Mode

// Development: Full logging
await RunAnywhere.initialize({ environment: SDKEnvironment.Development, debug: true })

// Production: Minimal logging
await RunAnywhere.initialize({ environment: SDKEnvironment.Production })

Vite Gotchas Summary

| Issue | Fix |
| --- | --- |
| WASM files not found | Add optimizeDeps.exclude: ['@runanywhere/web-llamacpp', '@runanywhere/web-onnx'] |
| Web Workers not bundling | Add worker: { format: 'es' } |
| WASM not served in dev | Add assetsInclude: ['**/*.wasm'] |
| VLM Worker TypeScript error | Use @ts-ignore above the ?worker&url import |
| WASM missing in production build | Add copyWasmPlugin() to copy WASM to dist/assets/ |
| Single-threaded mode | Add COOP/COEP headers to server.headers |
| WASM served as HTML in prod | Ensure static file serving comes before the SPA catch-all route |
| Worker “MIME type” error | Use COEP: credentialless (NOT require-corp); fix SPA catch-all route ordering |
| Camera “source width is 0” | Wait for the loadedmetadata event before calling captureFrame() |

Summary Checklist

  • Install all three packages: @runanywhere/web, web-llamacpp, web-onnx
  • Register backends with LlamaCPP.register() and ONNX.register()
  • Choose an appropriate model size for browser memory constraints
  • Use streaming for better perceived performance
  • Configure Cross-Origin Isolation headers (COEP: credentialless, NOT require-corp)
  • Add optimizeDeps.exclude in Vite config for WASM packages
  • Add copyWasmPlugin() to copy WASM files for production builds
  • Serve static assets (.wasm, .js) BEFORE SPA catch-all routes
  • Set Content-Type: application/wasm for .wasm files on custom servers
  • Use OPFS for persistent model storage via ModelManager
  • Use coexist: true when loading multiple models simultaneously
  • Wait for loadedmetadata before calling VideoCapture.captureFrame()
  • For voice pipeline: load all 4 models (VAD + STT + LLM + TTS) — VAD alone is not STT
  • Handle all error cases gracefully, including WASM memory crashes
  • Show progress during model downloads via EventBus
  • Batch DOM updates during fast token streaming
  • Test on target browsers and devices (not just iframe previews)