Early Beta — The Web SDK is in early beta. APIs may change between releases.

Overview

The VLM (Vision Language Model) module enables multimodal inference — you can feed an image and a text prompt to get descriptions, answers, or analysis. It uses llama.cpp’s mtmd (multimodal) backend compiled to WebAssembly and runs inference in a dedicated Web Worker to keep the UI responsive.

Package Imports

VLM classes come from @runanywhere/web-llamacpp, while model management is in @runanywhere/web:
import { ModelManager, ModelCategory, RunAnywhere, LLMFramework } from '@runanywhere/web'
import {
  VLMWorkerBridge,
  VideoCapture,
  LlamaCPP,
  startVLMWorkerRuntime,
} from '@runanywhere/web-llamacpp'

Worker Setup

VLM inference runs in a dedicated Web Worker for responsiveness. You need to set up the worker bridge during SDK initialization:
runanywhere.ts
import { RunAnywhere, SDKEnvironment } from '@runanywhere/web'
import { LlamaCPP, VLMWorkerBridge } from '@runanywhere/web-llamacpp'

// Vite bundles the worker as a standalone JS chunk and returns its URL
// @ts-ignore — Vite-specific ?worker&url query
import vlmWorkerUrl from './workers/vlm-worker?worker&url'

await RunAnywhere.initialize({ environment: SDKEnvironment.Development, debug: true })
await LlamaCPP.register()

// Wire up VLM worker
VLMWorkerBridge.shared.workerUrl = vlmWorkerUrl
RunAnywhere.setVLMLoader({
  get isInitialized() {
    return VLMWorkerBridge.shared.isInitialized
  },
  init: () => VLMWorkerBridge.shared.init(),
  loadModel: (params) => VLMWorkerBridge.shared.loadModel(params),
  unloadModel: () => VLMWorkerBridge.shared.unloadModel(),
})
workers/vlm-worker.ts
import { startVLMWorkerRuntime } from '@runanywhere/web-llamacpp'

startVLMWorkerRuntime()
The ?worker&url import syntax is Vite-specific and not recognized by TypeScript, requiring a @ts-ignore directive. For other bundlers, you may need to configure the worker URL differently.
“non-JavaScript MIME type” error for VLM worker: If you see Failed to load module script: The server responded with a non-JavaScript MIME type of "text/html", the worker URL is resolving to your SPA’s index.html instead of the actual JavaScript file. This typically happens when:
  1. Your server’s catch-all route intercepts .js file requests
  2. The worker file isn’t included in the production build output
  3. worker: { format: 'es' } is missing from your Vite config
Ensure static .js files are served before the SPA catch-all route. See Installation troubleshooting.
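If the error traces back to point 3, a minimal Vite config enabling ES-module workers looks like the sketch below (merge it into your existing config; framework plugins are omitted here):

```typescript
// vite.config.ts — minimal sketch; add your framework plugins as needed
import { defineConfig } from 'vite'

export default defineConfig({
  worker: {
    // Emit the VLM worker as an ES-module chunk so the ?worker&url
    // import resolves to a real .js file in the production build
    format: 'es',
  },
})
```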

Basic Usage

Use VLMWorkerBridge for the best user experience — inference runs in a Web Worker so the UI stays responsive:
import { ModelManager, ModelCategory } from '@runanywhere/web'
import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'

// Ensure VLM model is downloaded and loaded
await ModelManager.downloadModel('lfm2-vl-450m-q4_0')
await ModelManager.loadModel('lfm2-vl-450m-q4_0')

// Capture a frame from the camera
const camera = new VideoCapture({ facingMode: 'environment' })
await camera.start()

// IMPORTANT: Wait for video to be ready before capturing
await new Promise<void>((resolve) => {
  const video = camera.videoElement
  if (video.videoWidth > 0) {
    resolve()
  } else {
    video.addEventListener('loadedmetadata', () => resolve(), { once: true })
  }
})

const frame = camera.captureFrame(256) // downscale to 256px max

// Process the frame
if (frame && VLMWorkerBridge.shared.isModelLoaded) {
  const result = await VLMWorkerBridge.shared.process(
    frame.rgbPixels,
    frame.width,
    frame.height,
    'Describe what you see briefly.',
    { maxTokens: 60, temperature: 0.7 }
  )
  console.log(result.text)
}

camera.stop()
Always wait for camera readiness before calling captureFrame(). The video stream takes time to initialize after camera.start() resolves. If you call captureFrame() before the video has valid dimensions, you’ll get Error: Failed to execute 'getImageData' on 'CanvasRenderingContext2D': The source width is 0. Wait for the loadedmetadata event or check videoElement.videoWidth > 0.

API Reference

const VLMWorkerBridge: {
  readonly shared: VLMWorkerBridge // singleton

  /** Set the worker script URL before calling init() */
  workerUrl: string

  init(wasmJsUrl?: string): Promise<void>
  loadModel(params: VLMLoadModelParams): Promise<void>

  process(
    rgbPixels: Uint8Array,
    width: number,
    height: number,
    prompt: string,
    options?: { maxTokens?: number; temperature?: number }
  ): Promise<{ text: string }>

  cancel(): void
  unloadModel(): Promise<void>
  terminate(): void

  readonly isInitialized: boolean
  readonly isModelLoaded: boolean
}
VLMWorkerBridge.process() does NOT support systemPrompt in its options. The options only accept maxTokens and temperature. To include system-level instructions, prepend them to the prompt parameter directly:
// WRONG — systemPrompt is silently ignored:
await VLMWorkerBridge.shared.process(pixels, w, h, 'Describe this.', {
  systemPrompt: 'You are a helpful assistant.', // ❌ Not supported
  maxTokens: 60,
})

// CORRECT — include instructions in the prompt:
const prompt = 'You are a helpful assistant. Describe what you see in this image.'
await VLMWorkerBridge.shared.process(pixels, w, h, prompt, { maxTokens: 60 })
The standalone VLMGenerationOptions type (below) does include systemPrompt, but that is for the native VLM API, not for VLMWorkerBridge.process() which uses a simplified options type.

Types

enum VLMImageFormat {
  FilePath = 0,
  RGBPixels = 1,
  Base64 = 2,
}

enum VLMModelFamily {
  Auto = 0,
  Qwen2VL = 1,
  SmolVLM = 2,
  LLaVA = 3,
  Custom = 99,
}

interface VLMImage {
  format: VLMImageFormat
  filePath?: string
  pixelData?: Uint8Array
  base64Data?: string
  width?: number
  height?: number
}

interface VLMGenerationOptions {
  maxTokens?: number // default: 512
  temperature?: number // default: 0.7
  topP?: number // default: 0.9
  systemPrompt?: string
  modelFamily?: VLMModelFamily
  streaming?: boolean
}

interface VLMGenerationResult {
  text: string
  promptTokens: number
  imageTokens: number
  completionTokens: number
  totalTokens: number
  timeToFirstTokenMs: number
  imageEncodeTimeMs: number
  totalTimeMs: number
  tokensPerSecond: number
  hardwareUsed: HardwareAcceleration
}
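As a quick illustration, a frame captured as raw RGB pixels maps onto VLMImage like this. The enum and interface are re-declared locally so the snippet stands alone; in application code, import them from the SDK instead:

```typescript
// Local re-declaration of the SDK shapes above, so this sketch is self-contained
enum VLMImageFormat {
  FilePath = 0,
  RGBPixels = 1,
  Base64 = 2,
}

interface VLMImage {
  format: VLMImageFormat
  pixelData?: Uint8Array
  width?: number
  height?: number
}

// A 256x256 frame of tightly packed RGB (3 bytes per pixel, no alpha)
const image: VLMImage = {
  format: VLMImageFormat.RGBPixels,
  pixelData: new Uint8Array(256 * 256 * 3),
  width: 256,
  height: 256,
}
```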

Camera Integration

VideoCapture

The VideoCapture class is in @runanywhere/web-llamacpp:
import { VideoCapture } from '@runanywhere/web-llamacpp'

const camera = new VideoCapture({
  facingMode: 'environment', // or 'user' for selfie camera
})

await camera.start()

// Add the video preview to the DOM
document.getElementById('preview')!.appendChild(camera.videoElement)

// Capture a frame (downscaled to 256px max dimension)
const frame = camera.captureFrame(256)
// frame.rgbPixels: Uint8Array (RGB, no alpha)
// frame.width, frame.height: actual dimensions

// Check state
console.log('Capturing:', camera.isCapturing)

camera.stop()
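Because captureFrame() returns tightly packed RGB with no alpha channel, canvas APIs such as ImageData cannot consume the buffer directly. A small conversion helper (a sketch, not part of the SDK) pads each pixel with an opaque alpha byte:

```typescript
// Convert tightly packed RGB (3 bytes/pixel) to RGBA (4 bytes/pixel)
// by appending an opaque alpha byte to every pixel
function rgbToRgba(rgb: Uint8Array): Uint8ClampedArray {
  const pixelCount = rgb.length / 3
  const rgba = new Uint8ClampedArray(pixelCount * 4)
  for (let i = 0; i < pixelCount; i++) {
    rgba[i * 4] = rgb[i * 3] // R
    rgba[i * 4 + 1] = rgb[i * 3 + 1] // G
    rgba[i * 4 + 2] = rgb[i * 3 + 2] // B
    rgba[i * 4 + 3] = 255 // opaque alpha
  }
  return rgba
}
```

The result can be wrapped as new ImageData(rgbToRgba(frame.rgbPixels), frame.width, frame.height) and drawn to a canvas with ctx.putImageData(...), e.g. to show the user exactly which frame was sent to the model.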

Waiting for Camera Readiness

The camera stream takes a moment to initialize after start() resolves. Always guard against zero-dimension frames:
const camera = new VideoCapture({ facingMode: 'user' })
await camera.start()

// Option 1: Wait for loadedmetadata event
await new Promise<void>((resolve) => {
  const video = camera.videoElement
  if (video.videoWidth > 0 && video.videoHeight > 0) {
    resolve()
    return
  }
  video.addEventListener('loadedmetadata', () => resolve(), { once: true })
})

// Option 2: Guard in captureFrame calls
const frame = camera.captureFrame(256)
if (!frame || frame.width === 0 || frame.height === 0) {
  console.warn('Camera not ready yet, skipping frame')
  return
}
Calling captureFrame() before the video stream is fully initialized causes: Failed to execute 'getImageData' on 'CanvasRenderingContext2D': The source width is 0. This commonly happens in React components that call captureFrame() immediately after start() without waiting for video dimensions to be valid.
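The readiness pattern above can be factored into a reusable helper. This is a hypothetical utility, not an SDK export — it types the element structurally so it accepts any object exposing videoWidth/videoHeight and addEventListener, and adds a timeout so a camera that never initializes does not hang forever:

```typescript
// Minimal structural type covering what the helper needs from <video>
type VideoLike = {
  videoWidth: number
  videoHeight: number
  addEventListener(type: string, cb: () => void, opts?: { once: boolean }): void
}

// Resolve once the video reports valid dimensions; reject after timeoutMs
function waitForVideoReady(video: VideoLike, timeoutMs = 5000): Promise<void> {
  return new Promise((resolve, reject) => {
    if (video.videoWidth > 0 && video.videoHeight > 0) {
      resolve()
      return
    }
    const timer = setTimeout(
      () => reject(new Error('Camera not ready within timeout')),
      timeoutMs
    )
    video.addEventListener(
      'loadedmetadata',
      () => {
        clearTimeout(timer)
        resolve()
      },
      { once: true }
    )
  })
}
```

With this in place, await waitForVideoReady(camera.videoElement) before the first captureFrame() call replaces the inline promise dance.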

Examples

Live Camera Streaming

import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'

const camera = new VideoCapture({ facingMode: 'environment' })
await camera.start()

let processing = false

// Describe every 2.5 seconds — skip if still processing
const interval = setInterval(async () => {
  if (processing || !VLMWorkerBridge.shared.isModelLoaded) return
  processing = true

  const frame = camera.captureFrame(256)
  if (frame) {
    try {
      const result = await VLMWorkerBridge.shared.process(
        frame.rgbPixels,
        frame.width,
        frame.height,
        'Describe what you see in one sentence.',
        { maxTokens: 30 }
      )
      document.getElementById('description')!.textContent = result.text
    } catch (err) {
      const msg = (err as Error).message
      if (msg.includes('memory access out of bounds') || msg.includes('RuntimeError')) {
        // WASM memory crash — skip frame and retry next tick
        console.warn('VLM WASM crash, retrying next frame...')
      } else {
        throw err
      }
    }
  }

  processing = false
}, 2500)

// Stop
clearInterval(interval)
camera.stop()

React Component

VisionChat.tsx
import { useState, useCallback, useRef, useEffect } from 'react'
import { ModelCategory } from '@runanywhere/web'
import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'

export function VisionChat() {
  const [description, setDescription] = useState('')
  const [isProcessing, setIsProcessing] = useState(false)
  const cameraRef = useRef<VideoCapture | null>(null)
  const videoRef = useRef<HTMLDivElement>(null)

  useEffect(() => {
    const camera = new VideoCapture({ facingMode: 'environment' })
    cameraRef.current = camera
    camera.start().then(() => {
      if (videoRef.current) {
        const el = camera.videoElement
        el.style.width = '100%'
        el.style.borderRadius = '12px'
        videoRef.current.appendChild(el)
      }
    })
    return () => camera.stop()
  }, [])

  const handleDescribe = useCallback(async () => {
    const frame = cameraRef.current?.captureFrame(256)
    if (!frame || !VLMWorkerBridge.shared.isModelLoaded) return

    setIsProcessing(true)
    try {
      const result = await VLMWorkerBridge.shared.process(
        frame.rgbPixels,
        frame.width,
        frame.height,
        'Describe what you see.',
        { maxTokens: 80 }
      )
      setDescription(result.text)
    } catch (err) {
      const msg = (err as Error).message
      if (msg.includes('memory access out of bounds')) {
        setDescription('Recovering from memory error... try again.')
      } else {
        setDescription('Error: ' + msg)
      }
    } finally {
      setIsProcessing(false)
    }
  }, [])

  return (
    <div>
      <div ref={videoRef} />
      <button onClick={handleDescribe} disabled={isProcessing}>
        {isProcessing ? 'Analyzing...' : 'Describe'}
      </button>
      <p>{description}</p>
    </div>
  )
}

Supported Models

Model | Architecture | HuggingFace Repo | Size | Quality
--- | --- | --- | --- | ---
LFM2-VL 450M Q4_0 | Liquid AI | runanywhere/LFM2-VL-450M-GGUF | ~500MB | Good — ultra-compact, fast
LFM2-VL 450M Q8_0 | Liquid AI | runanywhere/LFM2-VL-450M-GGUF | ~600MB | Good — higher precision
SmolVLM 500M Q8_0 | SmolVLM | runanywhere/SmolVLM-500M-Instruct-GGUF | ~500MB | Good
Qwen2-VL 2B Q4_K_M | Qwen2VL | runanywhere/Qwen2-VL-2B-Instruct-GGUF | ~1.5GB | Better
LLaVA 7B Q4_0 | LLaVA | | ~4GB | Best

Registering VLM Models

VLM models require two files: the main model GGUF and a multimodal projector (mmproj) GGUF. Register them using the files array — the first file is the main model, the second is the projector:
import { RunAnywhere, ModelCategory, LLMFramework } from '@runanywhere/web'

RunAnywhere.registerModels([
  {
    id: 'lfm2-vl-450m-q4_0',
    name: 'LFM2-VL 450M Q4_0',
    repo: 'runanywhere/LFM2-VL-450M-GGUF',
    files: ['LFM2-VL-450M-Q4_0.gguf', 'mmproj-LFM2-VL-450M-Q8_0.gguf'],
    framework: LLMFramework.LlamaCpp,
    modality: ModelCategory.Multimodal,
    memoryRequirement: 500_000_000,
  },
])

WASM Memory Crash Handling

VLM image encoding is computationally expensive in WASM and can occasionally trigger memory access out of bounds errors. Always wrap VLM calls in try/catch and handle these gracefully:
try {
  const result = await VLMWorkerBridge.shared.process(/* ... */)
} catch (err) {
  const msg = (err as Error).message
  if (msg.includes('memory access out of bounds') || msg.includes('RuntimeError')) {
    // Recoverable — the next frame will usually work
    console.warn('WASM memory crash, will retry')
  }
}
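The catch-and-retry pattern can be wrapped in a small helper. This is a sketch of a hypothetical utility (withWasmRetry is not an SDK export): it retries only errors that look like WASM memory faults and rethrows everything else immediately:

```typescript
// Retry an async operation when it fails with a recoverable WASM
// memory fault; any other error is rethrown on the spot
async function withWasmRetry<T>(op: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await op()
    } catch (err) {
      const msg = (err as Error).message ?? ''
      if (msg.includes('memory access out of bounds') || msg.includes('RuntimeError')) {
        lastError = err // recoverable — try again
        continue
      }
      throw err // unrelated error — surface immediately
    }
  }
  throw lastError
}
```

Usage would look like await withWasmRetry(() => VLMWorkerBridge.shared.process(/* ... */)): recoverable crashes are retried up to three times before the last error surfaces.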

Performance Tips

  • Use a 256px max capture dimension (captureFrame(256)) — larger images dramatically increase encoding time
  • Use VLMWorkerBridge instead of direct VLM — it runs in a Web Worker and won’t freeze the UI
  • Limit maxTokens for descriptions (30-60 tokens for quick descriptions, 80+ for detailed analysis)
  • Liquid LFM2-VL 450M is the best starting point — smallest multimodal model, fast and memory-efficient (~500MB)
  • Handle WASM crashes: memory access out of bounds is recoverable; display a retry message