Early Beta — The Web SDK is in early beta. APIs may change between releases.

Overview

The VLM (Vision Language Model) module enables multimodal inference — you can feed an image and a text prompt to get descriptions, answers, or analysis. It uses llama.cpp’s mtmd (multimodal) backend compiled to WebAssembly and runs inference in a dedicated Web Worker to keep the UI responsive.

Package Imports

VLM classes come from @runanywhere/web-llamacpp, while model management is in @runanywhere/web:
import { ModelManager, ModelCategory, RunAnywhere, LLMFramework } from '@runanywhere/web'
import {
  VLMWorkerBridge,
  VideoCapture,
  LlamaCPP,
  startVLMWorkerRuntime,
} from '@runanywhere/web-llamacpp'

Worker Setup

VLM inference runs in a dedicated Web Worker for responsiveness. You need to set up the worker bridge during SDK initialization:
runanywhere.ts
import { RunAnywhere, SDKEnvironment } from '@runanywhere/web'
import { LlamaCPP, VLMWorkerBridge } from '@runanywhere/web-llamacpp'

// Vite bundles the worker as a standalone JS chunk and returns its URL
// @ts-ignore — Vite-specific ?worker&url query
import vlmWorkerUrl from './workers/vlm-worker?worker&url'

await RunAnywhere.initialize({ environment: SDKEnvironment.Development, debug: true })
await LlamaCPP.register()

// Wire up VLM worker
VLMWorkerBridge.shared.workerUrl = vlmWorkerUrl
RunAnywhere.setVLMLoader({
  get isInitialized() {
    return VLMWorkerBridge.shared.isInitialized
  },
  init: () => VLMWorkerBridge.shared.init(),
  loadModel: (params) => VLMWorkerBridge.shared.loadModel(params),
  unloadModel: () => VLMWorkerBridge.shared.unloadModel(),
})
workers/vlm-worker.ts
import { startVLMWorkerRuntime } from '@runanywhere/web-llamacpp'

startVLMWorkerRuntime()
The ?worker&url import syntax is Vite-specific and not recognized by TypeScript, requiring a @ts-ignore directive. For other bundlers, you may need to configure the worker URL differently.
“non-JavaScript MIME type” error for VLM worker: If you see Failed to load module script: The server responded with a non-JavaScript MIME type of "text/html", the worker URL is resolving to your SPA’s index.html instead of the actual JavaScript file. This typically happens when:
  1. Your server’s catch-all route intercepts .js file requests
  2. The worker file isn’t included in the production build output
  3. worker: { format: 'es' } is missing from your Vite config
Ensure static .js files are served before the SPA catch-all route. See Installation troubleshooting.
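If the error traces back to point 3, a minimal Vite config enabling ES-module workers looks like the sketch below (merge it into your existing config; framework plugins are omitted here):

```typescript
// vite.config.ts — minimal sketch; add your framework plugins as needed
import { defineConfig } from 'vite'

export default defineConfig({
  worker: {
    // Emit the VLM worker as an ES-module chunk so the ?worker&url
    // import resolves to a real .js file in the production build
    format: 'es',
  },
})
```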

Basic Usage

Use VLMWorkerBridge for the best user experience — inference runs in a Web Worker so the UI stays responsive:
import { ModelManager, ModelCategory } from '@runanywhere/web'
import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'

// Ensure VLM model is downloaded and loaded
await ModelManager.downloadModel('lfm2-vl-450m-q4_0')
await ModelManager.loadModel('lfm2-vl-450m-q4_0')

// Capture a frame from the camera
const camera = new VideoCapture({ facingMode: 'environment' })
await camera.start()

// IMPORTANT: Wait for video to be ready before capturing
await new Promise<void>((resolve) => {
  const video = camera.videoElement
  if (video.videoWidth > 0) {
    resolve()
  } else {
    video.addEventListener('loadedmetadata', () => resolve(), { once: true })
  }
})

const frame = camera.captureFrame(256) // downscale to 256px max

// Process the frame
if (frame && VLMWorkerBridge.shared.isModelLoaded) {
  const result = await VLMWorkerBridge.shared.process(
    frame.rgbPixels,
    frame.width,
    frame.height,
    'Describe what you see briefly.',
    { maxTokens: 60, temperature: 0.7 }
  )
  console.log(result.text)
}

camera.stop()
Always wait for camera readiness before calling captureFrame(). The video stream takes time to initialize after camera.start() resolves. If you call captureFrame() before the video has valid dimensions, you’ll get Error: Failed to execute 'getImageData' on 'CanvasRenderingContext2D': The source width is 0. Wait for the loadedmetadata event or check videoElement.videoWidth > 0.

API Reference

const VLMWorkerBridge: {
  readonly shared: VLMWorkerBridge // singleton

  /** Set the worker script URL before calling init() */
  workerUrl: string

  init(wasmJsUrl?: string): Promise<void>
  loadModel(params: VLMLoadModelParams): Promise<void>

  process(
    rgbPixels: Uint8Array,
    width: number,
    height: number,
    prompt: string,
    options?: { maxTokens?: number; temperature?: number }
  ): Promise<{ text: string }>

  cancel(): void
  unloadModel(): Promise<void>
  terminate(): void

  readonly isInitialized: boolean
  readonly isModelLoaded: boolean
}
VLMWorkerBridge.process() does NOT support systemPrompt in its options. The options only accept maxTokens and temperature. To include system-level instructions, prepend them to the prompt parameter directly:
// WRONG — systemPrompt is silently ignored:
await VLMWorkerBridge.shared.process(pixels, w, h, 'Describe this.', {
  systemPrompt: 'You are a helpful assistant.', // ❌ Not supported
  maxTokens: 60,
})

// CORRECT — include instructions in the prompt:
const prompt = 'You are a helpful assistant. Describe what you see in this image.'
await VLMWorkerBridge.shared.process(pixels, w, h, prompt, { maxTokens: 60 })
The standalone VLMGenerationOptions type (below) does include systemPrompt, but that is for the native VLM API, not for VLMWorkerBridge.process() which uses a simplified options type.

Types

enum VLMImageFormat {
  FilePath = 0,
  RGBPixels = 1,
  Base64 = 2,
}

enum VLMModelFamily {
  Auto = 0,
  Qwen2VL = 1,
  SmolVLM = 2,
  LLaVA = 3,
  Custom = 99,
}

interface VLMImage {
  format: VLMImageFormat
  filePath?: string
  pixelData?: Uint8Array
  base64Data?: string
  width?: number
  height?: number
}

interface VLMGenerationOptions {
  maxTokens?: number // default: 512
  temperature?: number // default: 0.7
  topP?: number // default: 0.9
  systemPrompt?: string
  modelFamily?: VLMModelFamily
  streaming?: boolean
}

interface VLMGenerationResult {
  text: string
  promptTokens: number
  imageTokens: number
  completionTokens: number
  totalTokens: number
  timeToFirstTokenMs: number
  imageEncodeTimeMs: number
  totalTimeMs: number
  tokensPerSecond: number
  hardwareUsed: HardwareAcceleration
}
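As a quick illustration, a frame captured as raw RGB pixels maps onto VLMImage like this. The enum and interface are re-declared locally so the snippet stands alone; in application code, import them from the SDK instead:

```typescript
// Local re-declaration of the SDK shapes above, so this sketch is self-contained
enum VLMImageFormat {
  FilePath = 0,
  RGBPixels = 1,
  Base64 = 2,
}

interface VLMImage {
  format: VLMImageFormat
  pixelData?: Uint8Array
  width?: number
  height?: number
}

// A 256x256 frame of tightly packed RGB (3 bytes per pixel, no alpha)
const image: VLMImage = {
  format: VLMImageFormat.RGBPixels,
  pixelData: new Uint8Array(256 * 256 * 3),
  width: 256,
  height: 256,
}
```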

Camera Integration

VideoCapture

The VideoCapture class is in @runanywhere/web-llamacpp:
import { VideoCapture } from '@runanywhere/web-llamacpp'

const camera = new VideoCapture({
  facingMode: 'environment', // or 'user' for selfie camera
})

await camera.start()

// Add the video preview to the DOM
document.getElementById('preview')!.appendChild(camera.videoElement)

// Capture a frame (downscaled to 256px max dimension)
const frame = camera.captureFrame(256)
// frame.rgbPixels: Uint8Array (RGB, no alpha)
// frame.width, frame.height: actual dimensions

// Check state
console.log('Capturing:', camera.isCapturing)

camera.stop()
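Because captureFrame() returns tightly packed RGB with no alpha channel, canvas APIs such as ImageData cannot consume the buffer directly. A small conversion helper (a sketch, not part of the SDK) pads each pixel with an opaque alpha byte:

```typescript
// Convert tightly packed RGB (3 bytes/pixel) to RGBA (4 bytes/pixel)
// by appending an opaque alpha byte to every pixel
function rgbToRgba(rgb: Uint8Array): Uint8ClampedArray {
  const pixelCount = rgb.length / 3
  const rgba = new Uint8ClampedArray(pixelCount * 4)
  for (let i = 0; i < pixelCount; i++) {
    rgba[i * 4] = rgb[i * 3] // R
    rgba[i * 4 + 1] = rgb[i * 3 + 1] // G
    rgba[i * 4 + 2] = rgb[i * 3 + 2] // B
    rgba[i * 4 + 3] = 255 // opaque alpha
  }
  return rgba
}
```

The result can be wrapped as new ImageData(rgbToRgba(frame.rgbPixels), frame.width, frame.height) and drawn to a canvas with ctx.putImageData(...), e.g. to show the user exactly which frame was sent to the model.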

Waiting for Camera Readiness

The camera stream takes a moment to initialize after start() resolves. Always guard against zero-dimension frames:
const camera = new VideoCapture({ facingMode: 'user' })
await camera.start()

// Option 1: Wait for loadedmetadata event
await new Promise<void>((resolve) => {
  const video = camera.videoElement
  if (video.videoWidth > 0 && video.videoHeight > 0) {
    resolve()
    return
  }
  video.addEventListener('loadedmetadata', () => resolve(), { once: true })
})

// Option 2: Guard in captureFrame calls
const frame = camera.captureFrame(256)
if (!frame || frame.width === 0 || frame.height === 0) {
  console.warn('Camera not ready yet, skipping frame')
  return
}
Calling captureFrame() before the video stream is fully initialized causes: Failed to execute 'getImageData' on 'CanvasRenderingContext2D': The source width is 0. This commonly happens in React components that call captureFrame() immediately after start() without waiting for video dimensions to be valid.
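The readiness pattern above can be factored into a reusable helper. This is a hypothetical utility, not an SDK export — it types the element structurally so it accepts any object exposing videoWidth/videoHeight and addEventListener, and adds a timeout so a camera that never initializes does not hang forever:

```typescript
// Minimal structural type covering what the helper needs from <video>
type VideoLike = {
  videoWidth: number
  videoHeight: number
  addEventListener(type: string, cb: () => void, opts?: { once: boolean }): void
}

// Resolve once the video reports valid dimensions; reject after timeoutMs
function waitForVideoReady(video: VideoLike, timeoutMs = 5000): Promise<void> {
  return new Promise((resolve, reject) => {
    if (video.videoWidth > 0 && video.videoHeight > 0) {
      resolve()
      return
    }
    const timer = setTimeout(
      () => reject(new Error('Camera not ready within timeout')),
      timeoutMs
    )
    video.addEventListener(
      'loadedmetadata',
      () => {
        clearTimeout(timer)
        resolve()
      },
      { once: true }
    )
  })
}
```

With this in place, await waitForVideoReady(camera.videoElement) before the first captureFrame() call replaces the inline promise dance.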

Examples

Live Camera Streaming

import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'

const camera = new VideoCapture({ facingMode: 'environment' })
await camera.start()

let processing = false

// Describe every 2.5 seconds — skip if still processing
const interval = setInterval(async () => {
  if (processing || !VLMWorkerBridge.shared.isModelLoaded) return
  processing = true

  const frame = camera.captureFrame(256)
  if (frame) {
    try {
      const result = await VLMWorkerBridge.shared.process(
        frame.rgbPixels,
        frame.width,
        frame.height,
        'Describe what you see in one sentence.',
        { maxTokens: 30 }
      )
      document.getElementById('description')!.textContent = result.text
    } catch (err) {
      const msg = (err as Error).message
      if (msg.includes('memory access out of bounds') || msg.includes('RuntimeError')) {
        // WASM memory crash — skip frame and retry next tick
        console.warn('VLM WASM crash, retrying next frame...')
      } else {
        throw err
      }
    }
  }

  processing = false
}, 2500)

// Stop
clearInterval(interval)
camera.stop()

React Component

VisionChat.tsx
import { useState, useCallback, useRef, useEffect } from 'react'
import { ModelCategory } from '@runanywhere/web'
import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'

export function VisionChat() {
  const [description, setDescription] = useState('')
  const [isProcessing, setIsProcessing] = useState(false)
  const cameraRef = useRef<VideoCapture | null>(null)
  const videoRef = useRef<HTMLDivElement>(null)

  useEffect(() => {
    const camera = new VideoCapture({ facingMode: 'environment' })
    cameraRef.current = camera
    camera.start().then(() => {
      if (videoRef.current) {
        const el = camera.videoElement
        el.style.width = '100%'
        el.style.borderRadius = '12px'
        videoRef.current.appendChild(el)
      }
    })
    return () => camera.stop()
  }, [])

  const handleDescribe = useCallback(async () => {
    const frame = cameraRef.current?.captureFrame(256)
    if (!frame || !VLMWorkerBridge.shared.isModelLoaded) return

    setIsProcessing(true)
    try {
      const result = await VLMWorkerBridge.shared.process(
        frame.rgbPixels,
        frame.width,
        frame.height,
        'Describe what you see.',
        { maxTokens: 80 }
      )
      setDescription(result.text)
    } catch (err) {
      const msg = (err as Error).message
      if (msg.includes('memory access out of bounds')) {
        setDescription('Recovering from memory error... try again.')
      } else {
        setDescription('Error: ' + msg)
      }
    } finally {
      setIsProcessing(false)
    }
  }, [])

  return (
    <div>
      <div ref={videoRef} />
      <button onClick={handleDescribe} disabled={isProcessing}>
        {isProcessing ? 'Analyzing...' : 'Describe'}
      </button>
      <p>{description}</p>
    </div>
  )
}

Supported Models

Model | Architecture | HuggingFace Repo | Size | Quality
--- | --- | --- | --- | ---
LFM2-VL 450M Q4_0 | Liquid AI | runanywhere/LFM2-VL-450M-GGUF | ~500MB | Good — ultra-compact, fast
LFM2-VL 450M Q8_0 | Liquid AI | runanywhere/LFM2-VL-450M-GGUF | ~600MB | Good — higher precision
SmolVLM 500M Q8_0 | SmolVLM | runanywhere/SmolVLM-500M-Instruct-GGUF | ~500MB | Good
Qwen2-VL 2B Q4_K_M | Qwen2VL | runanywhere/Qwen2-VL-2B-Instruct-GGUF | ~1.5GB | Better
LLaVA 7B Q4_0 | LLaVA | | ~4GB | Best

Registering VLM Models

VLM models require two files: the main model GGUF and a multimodal projector (mmproj) GGUF. Register them using the files array — the first file is the main model, the second is the projector:
import { RunAnywhere, ModelCategory, LLMFramework } from '@runanywhere/web'

RunAnywhere.registerModels([
  {
    id: 'lfm2-vl-450m-q4_0',
    name: 'LFM2-VL 450M Q4_0',
    repo: 'runanywhere/LFM2-VL-450M-GGUF',
    files: ['LFM2-VL-450M-Q4_0.gguf', 'mmproj-LFM2-VL-450M-Q8_0.gguf'],
    framework: LLMFramework.LlamaCpp,
    modality: ModelCategory.Multimodal,
    memoryRequirement: 500_000_000,
  },
])

WASM Memory Crash Handling

VLM image encoding is computationally expensive in WASM and can occasionally trigger memory access out of bounds errors. Always wrap VLM calls in try/catch and handle these gracefully:
try {
  const result = await VLMWorkerBridge.shared.process(/* ... */)
} catch (err) {
  const msg = (err as Error).message
  if (msg.includes('memory access out of bounds') || msg.includes('RuntimeError')) {
    // Recoverable — the next frame will usually work
    console.warn('WASM memory crash, will retry')
  }
}
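The catch-and-retry pattern can be wrapped in a small helper. This is a sketch of a hypothetical utility (withWasmRetry is not an SDK export): it retries only errors that look like WASM memory faults and rethrows everything else immediately:

```typescript
// Retry an async operation when it fails with a recoverable WASM
// memory fault; any other error is rethrown on the spot
async function withWasmRetry<T>(op: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await op()
    } catch (err) {
      const msg = (err as Error).message ?? ''
      if (msg.includes('memory access out of bounds') || msg.includes('RuntimeError')) {
        lastError = err // recoverable — try again
        continue
      }
      throw err // unrelated error — surface immediately
    }
  }
  throw lastError
}
```

Usage would look like await withWasmRetry(() => VLMWorkerBridge.shared.process(/* ... */)): recoverable crashes are retried up to three times before the last error surfaces.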

Performance Tips

  • Use a 256px max capture dimension (captureFrame(256)) — larger images dramatically increase encoding time
  • Use VLMWorkerBridge instead of direct VLM — it runs in a Web Worker and won’t freeze the UI
  • Limit maxTokens for descriptions (30-60 tokens for quick descriptions, 80+ for detailed analysis)
  • Liquid LFM2-VL 450M is the best starting point — smallest multimodal model, fast and memory-efficient (~500MB)
  • Handle WASM crashes: memory access out of bounds is recoverable; display a retry message