Early Beta — The Web SDK is in early beta. APIs may change between releases.
Overview
The VLM (Vision Language Model) module enables multimodal inference — you can feed an image and a text prompt to get descriptions, answers, or analysis. It uses llama.cpp’s mtmd (multimodal) backend compiled to WebAssembly and runs inference in a dedicated Web Worker to keep the UI responsive.
Package Imports
VLM classes come from @runanywhere/web-llamacpp, while model management is in @runanywhere/web:
import { ModelManager, ModelCategory, RunAnywhere, LLMFramework } from '@runanywhere/web'
import {
VLMWorkerBridge,
VideoCapture,
LlamaCPP,
startVLMWorkerRuntime,
} from '@runanywhere/web-llamacpp'
Worker Setup
VLM inference runs in a dedicated Web Worker for responsiveness. You need to set up the worker bridge during SDK initialization:
import { RunAnywhere, SDKEnvironment } from '@runanywhere/web'
import { LlamaCPP, VLMWorkerBridge } from '@runanywhere/web-llamacpp'
// Vite bundles the worker as a standalone JS chunk and returns its URL
// @ts-ignore — Vite-specific ?worker&url query
import vlmWorkerUrl from './workers/vlm-worker?worker&url'
await RunAnywhere.initialize({ environment: SDKEnvironment.Development, debug: true })
await LlamaCPP.register()
// Wire up VLM worker
VLMWorkerBridge.shared.workerUrl = vlmWorkerUrl
RunAnywhere.setVLMLoader({
get isInitialized() {
return VLMWorkerBridge.shared.isInitialized
},
init: () => VLMWorkerBridge.shared.init(),
loadModel: (params) => VLMWorkerBridge.shared.loadModel(params),
unloadModel: () => VLMWorkerBridge.shared.unloadModel(),
})
// workers/vlm-worker.ts — the worker entry file itself just starts the runtime
import { startVLMWorkerRuntime } from '@runanywhere/web-llamacpp'
startVLMWorkerRuntime()
The ?worker&url import syntax is Vite-specific and not recognized by TypeScript, requiring a
@ts-ignore directive. For other bundlers, you may need to configure the worker URL differently.
“non-JavaScript MIME type” error for VLM worker: If you see Failed to load module script: The server responded with a non-JavaScript MIME type of "text/html", the worker URL is resolving to your SPA’s index.html instead of the actual JavaScript file. This typically happens when:
- Your server’s catch-all route intercepts .js file requests
- The worker file isn’t included in the production build output
- worker: { format: 'es' } is missing from your Vite config
Ensure static .js files are served before the SPA catch-all route. See Installation troubleshooting.
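If the error traces back to the bundler rather than the server, the relevant Vite setting is `worker.format`. A minimal configuration sketch (adapt to your project; only the `worker` block matters here):

```typescript
// vite.config.ts — sketch; emit workers as ES modules so the
// ?worker&url import resolves to a standalone .js chunk
import { defineConfig } from 'vite'

export default defineConfig({
  worker: {
    format: 'es',
  },
})
```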
Basic Usage
Worker-Based VLM (Recommended)
Use VLMWorkerBridge for the best user experience — inference runs in a Web Worker so the UI stays responsive:
import { ModelManager, ModelCategory } from '@runanywhere/web'
import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'
// Ensure VLM model is downloaded and loaded
await ModelManager.downloadModel('lfm2-vl-450m-q4_0')
await ModelManager.loadModel('lfm2-vl-450m-q4_0')
// Capture a frame from the camera
const camera = new VideoCapture({ facingMode: 'environment' })
await camera.start()
// IMPORTANT: Wait for video to be ready before capturing
await new Promise<void>((resolve) => {
const video = camera.videoElement
if (video.videoWidth > 0) {
resolve()
} else {
video.addEventListener('loadedmetadata', () => resolve(), { once: true })
}
})
const frame = camera.captureFrame(256) // downscale to 256px max
// Process the frame
if (frame && VLMWorkerBridge.shared.isModelLoaded) {
const result = await VLMWorkerBridge.shared.process(
frame.rgbPixels,
frame.width,
frame.height,
'Describe what you see briefly.',
{ maxTokens: 60, temperature: 0.7 }
)
console.log(result.text)
}
camera.stop()
Always wait for camera readiness before calling captureFrame(). The video stream takes time
to initialize after camera.start() resolves. If you call captureFrame() before the video has
valid dimensions, you’ll get Error: Failed to execute 'getImageData' on 'CanvasRenderingContext2D': The source width is 0. Wait for the loadedmetadata event or check
videoElement.videoWidth > 0.
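The readiness wait can be factored into a small reusable helper. This is a hypothetical utility, not part of the SDK; it assumes only that the video element exposes videoWidth/videoHeight and fires loadedmetadata:

```typescript
// Hypothetical helper (not provided by the SDK): resolve once the video
// reports non-zero dimensions, or reject after a timeout.
interface VideoLike {
  videoWidth: number
  videoHeight: number
  addEventListener(type: string, listener: () => void, options?: { once?: boolean }): void
}

function waitForVideoReady(video: VideoLike, timeoutMs = 5000): Promise<void> {
  return new Promise((resolve, reject) => {
    // Already ready: the stream attached before we were called
    if (video.videoWidth > 0 && video.videoHeight > 0) {
      resolve()
      return
    }
    const timer = setTimeout(
      () => reject(new Error(`Video not ready after ${timeoutMs}ms`)),
      timeoutMs
    )
    video.addEventListener(
      'loadedmetadata',
      () => {
        clearTimeout(timer)
        resolve()
      },
      { once: true }
    )
  })
}
```

With this in place, `await camera.start(); await waitForVideoReady(camera.videoElement)` guarantees that captureFrame() sees valid dimensions.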
API Reference
VLMWorkerBridge (Off-Thread, Recommended)
const VLMWorkerBridge: {
readonly shared: VLMWorkerBridge // singleton
/** Set the worker script URL before calling init() */
workerUrl: string
init(wasmJsUrl?: string): Promise<void>
loadModel(params: VLMLoadModelParams): Promise<void>
process(
rgbPixels: Uint8Array,
width: number,
height: number,
prompt: string,
options?: { maxTokens?: number; temperature?: number }
): Promise<{ text: string }>
cancel(): void
unloadModel(): Promise<void>
terminate(): void
readonly isInitialized: boolean
readonly isModelLoaded: boolean
}
VLMWorkerBridge.process() does NOT support systemPrompt in its options. The options only accept maxTokens and temperature. To include system-level instructions, prepend them to the prompt parameter directly:
// WRONG — systemPrompt is silently ignored:
await VLMWorkerBridge.shared.process(pixels, w, h, 'Describe this.', {
systemPrompt: 'You are a helpful assistant.', // ❌ Not supported
maxTokens: 60,
})
// CORRECT — include instructions in the prompt:
const prompt = 'You are a helpful assistant. Describe what you see in this image.'
await VLMWorkerBridge.shared.process(pixels, w, h, prompt, { maxTokens: 60 })
The standalone VLMGenerationOptions type (below) does include systemPrompt, but that is for
the native VLM API, not for VLMWorkerBridge.process() which uses a simplified options type.
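A tiny helper (hypothetical, not part of the SDK) can make this folding explicit at call sites:

```typescript
// Hypothetical helper: VLMWorkerBridge.process() has no systemPrompt option,
// so fold system-level instructions into the prompt string itself.
function buildVLMPrompt(userPrompt: string, systemPrompt?: string): string {
  if (!systemPrompt) return userPrompt.trim()
  return `${systemPrompt.trim()}\n\n${userPrompt.trim()}`
}
```

Then the call becomes `VLMWorkerBridge.shared.process(pixels, w, h, buildVLMPrompt('Describe this.', 'You are a helpful assistant.'), { maxTokens: 60 })`.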
Types
enum VLMImageFormat {
FilePath = 0,
RGBPixels = 1,
Base64 = 2,
}
enum VLMModelFamily {
Auto = 0,
Qwen2VL = 1,
SmolVLM = 2,
LLaVA = 3,
Custom = 99,
}
interface VLMImage {
format: VLMImageFormat
filePath?: string
pixelData?: Uint8Array
base64Data?: string
width?: number
height?: number
}
interface VLMGenerationOptions {
maxTokens?: number // default: 512
temperature?: number // default: 0.7
topP?: number // default: 0.9
systemPrompt?: string
modelFamily?: VLMModelFamily
streaming?: boolean
}
interface VLMGenerationResult {
text: string
promptTokens: number
imageTokens: number
completionTokens: number
totalTokens: number
timeToFirstTokenMs: number
imageEncodeTimeMs: number
totalTimeMs: number
tokensPerSecond: number
hardwareUsed: HardwareAcceleration
}
Camera Integration
VideoCapture
The VideoCapture class is in @runanywhere/web-llamacpp:
import { VideoCapture } from '@runanywhere/web-llamacpp'
const camera = new VideoCapture({
facingMode: 'environment', // or 'user' for selfie camera
})
await camera.start()
// Add the video preview to the DOM
document.getElementById('preview')!.appendChild(camera.videoElement)
// Capture a frame (downscaled to 256px max dimension)
const frame = camera.captureFrame(256)
// frame.rgbPixels: Uint8Array (RGB, no alpha)
// frame.width, frame.height: actual dimensions
// Check state
console.log('Capturing:', camera.isCapturing)
camera.stop()
Waiting for Camera Readiness
The camera stream takes a moment to initialize after start() resolves. Always guard against zero-dimension frames:
const camera = new VideoCapture({ facingMode: 'user' })
await camera.start()
// Option 1: Wait for loadedmetadata event
await new Promise<void>((resolve) => {
const video = camera.videoElement
if (video.videoWidth > 0 && video.videoHeight > 0) {
resolve()
return
}
video.addEventListener('loadedmetadata', () => resolve(), { once: true })
})
// Option 2: Guard in captureFrame calls
const frame = camera.captureFrame(256)
if (!frame || frame.width === 0 || frame.height === 0) {
console.warn('Camera not ready yet, skipping frame')
return
}
Calling captureFrame() before the video stream is fully initialized causes: Failed to execute 'getImageData' on 'CanvasRenderingContext2D': The source width is 0. This commonly happens in
React components that call captureFrame() immediately after start() without waiting for video
dimensions to be valid.
Examples
Live Camera Streaming
import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'
const camera = new VideoCapture({ facingMode: 'environment' })
await camera.start()
let processing = false
// Describe every 2.5 seconds — skip if still processing
const interval = setInterval(async () => {
if (processing || !VLMWorkerBridge.shared.isModelLoaded) return
processing = true
const frame = camera.captureFrame(256)
if (frame) {
try {
const result = await VLMWorkerBridge.shared.process(
frame.rgbPixels,
frame.width,
frame.height,
'Describe what you see in one sentence.',
{ maxTokens: 30 }
)
document.getElementById('description')!.textContent = result.text
} catch (err) {
const msg = (err as Error).message
if (msg.includes('memory access out of bounds') || msg.includes('RuntimeError')) {
// WASM memory crash — skip frame and retry next tick
console.warn('VLM WASM crash, retrying next frame...')
} else {
throw err
}
}
}
processing = false
}, 2500)
// Stop
clearInterval(interval)
camera.stop()
React Component
import { useState, useCallback, useRef, useEffect } from 'react'
import { ModelCategory } from '@runanywhere/web'
import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'
export function VisionChat() {
const [description, setDescription] = useState('')
const [isProcessing, setIsProcessing] = useState(false)
const cameraRef = useRef<VideoCapture | null>(null)
const videoRef = useRef<HTMLDivElement>(null)
useEffect(() => {
const camera = new VideoCapture({ facingMode: 'environment' })
cameraRef.current = camera
camera.start().then(() => {
if (videoRef.current) {
const el = camera.videoElement
el.style.width = '100%'
el.style.borderRadius = '12px'
videoRef.current.appendChild(el)
}
})
return () => camera.stop()
}, [])
const handleDescribe = useCallback(async () => {
const frame = cameraRef.current?.captureFrame(256)
if (!frame || !VLMWorkerBridge.shared.isModelLoaded) return
setIsProcessing(true)
try {
const result = await VLMWorkerBridge.shared.process(
frame.rgbPixels,
frame.width,
frame.height,
'Describe what you see.',
{ maxTokens: 80 }
)
setDescription(result.text)
} catch (err) {
const msg = (err as Error).message
if (msg.includes('memory access out of bounds')) {
setDescription('Recovering from memory error... try again.')
} else {
setDescription('Error: ' + msg)
}
} finally {
setIsProcessing(false)
}
}, [])
return (
<div>
<div ref={videoRef} />
<button onClick={handleDescribe} disabled={isProcessing}>
{isProcessing ? 'Analyzing...' : 'Describe'}
</button>
<p>{description}</p>
</div>
)
}
Supported Models
| Model | Architecture | HuggingFace Repo | Size | Quality |
|---|---|---|---|---|
| LFM2-VL 450M Q4_0 | Liquid AI | runanywhere/LFM2-VL-450M-GGUF | ~500MB | Good — ultra-compact, fast |
| LFM2-VL 450M Q8_0 | Liquid AI | runanywhere/LFM2-VL-450M-GGUF | ~600MB | Good — higher precision |
| SmolVLM 500M Q8_0 | SmolVLM | runanywhere/SmolVLM-500M-Instruct-GGUF | ~500MB | Good |
| Qwen2-VL 2B Q4_K_M | Qwen2VL | runanywhere/Qwen2-VL-2B-Instruct-GGUF | ~1.5GB | Better |
| LLaVA 7B Q4_0 | LLaVA | — | ~4GB | Best |
Registering VLM Models
VLM models require two files: the main model GGUF and a multimodal projector (mmproj) GGUF. Register them using the files array — the first file is the main model, the second is the projector:
import { RunAnywhere, ModelCategory, LLMFramework } from '@runanywhere/web'
RunAnywhere.registerModels([
{
id: 'lfm2-vl-450m-q4_0',
name: 'LFM2-VL 450M Q4_0',
repo: 'runanywhere/LFM2-VL-450M-GGUF',
files: ['LFM2-VL-450M-Q4_0.gguf', 'mmproj-LFM2-VL-450M-Q8_0.gguf'],
framework: LLMFramework.LlamaCpp,
modality: ModelCategory.Multimodal,
memoryRequirement: 500_000_000,
},
])
WASM Memory Crash Handling
VLM image encoding is computationally expensive in WASM and can occasionally trigger memory access out of bounds errors. Always wrap VLM calls in try/catch and handle these gracefully:
try {
const result = await VLMWorkerBridge.shared.process(/* ... */)
} catch (err) {
const msg = (err as Error).message
if (msg.includes('memory access out of bounds') || msg.includes('RuntimeError')) {
// Recoverable — the next frame will usually work
console.warn('WASM memory crash, will retry')
}
}
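The try/catch pattern above can be wrapped in a bounded retry helper. This is a hypothetical utility, not SDK API; it treats the two error substrings from the snippet above as recoverable:

```typescript
// Hypothetical retry wrapper: re-run a VLM call a bounded number of times
// when the failure looks like a recoverable WASM memory crash.
const RECOVERABLE_ERRORS = ['memory access out of bounds', 'RuntimeError']

async function processWithRetry<T>(run: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastErr: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await run()
    } catch (err) {
      const msg = (err as Error).message
      // Non-WASM errors are real bugs: surface them immediately
      if (!RECOVERABLE_ERRORS.some((s) => msg.includes(s))) throw err
      lastErr = err
      console.warn(`WASM crash on attempt ${attempt}/${maxAttempts}, retrying...`)
    }
  }
  throw lastErr
}
```

Usage: `const result = await processWithRetry(() => VLMWorkerBridge.shared.process(pixels, w, h, prompt, { maxTokens: 60 }))`.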
- Use 256x256 capture dimensions — larger images dramatically increase encoding time
- Use VLMWorkerBridge instead of direct VLM — it runs in a Web Worker and won’t freeze the UI
- Limit maxTokens for descriptions (30-60 tokens for quick descriptions, 80+ for detailed analysis)
- Liquid LFM2-VL 450M is the best starting point — smallest multimodal model, fast and memory-efficient (~500MB)
- Handle WASM crashes — memory access out of bounds is recoverable; display a retry message