# Vision Language Models

> Multimodal image + text inference in the browser

<Note>**Early Beta** -- The Web SDK is in early beta. APIs may change between releases.</Note>

## Overview

The VLM (Vision Language Model) module enables multimodal inference -- you can feed an image and a text prompt to get descriptions, answers, or analysis. It uses llama.cpp's mtmd (multimodal) backend compiled to WebAssembly and runs inference in a dedicated **Web Worker** to keep the UI responsive.

```mermaid theme={null}
graph LR
    A(Camera / Image) --> B(VideoCapture)
    B --> C(VLMWorkerBridge)
    C --> D(Web Worker + WASM)
    D --> E(Text Output)

    style A fill:#334155,color:#fff,stroke:#334155
    style B fill:#475569,color:#fff,stroke:#475569
    style C fill:#ff6900,color:#fff,stroke:#ff6900
    style D fill:#fb2c36,color:#fff,stroke:#fb2c36
    style E fill:#334155,color:#fff,stroke:#334155
```

## Package Imports

VLM classes come from `@runanywhere/web-llamacpp`, while model management is in `@runanywhere/web`:

```typescript theme={null}
import { ModelManager, ModelCategory, RunAnywhere, LLMFramework } from '@runanywhere/web'
import {
  VLMWorkerBridge,
  VideoCapture,
  LlamaCPP,
  startVLMWorkerRuntime,
} from '@runanywhere/web-llamacpp'
```

## Worker Setup

VLM inference runs in a dedicated Web Worker for responsiveness. You need to set up the worker bridge during SDK initialization:

```typescript runanywhere.ts theme={null}
import { RunAnywhere, SDKEnvironment } from '@runanywhere/web'
import { LlamaCPP, VLMWorkerBridge } from '@runanywhere/web-llamacpp'

// Vite bundles the worker as a standalone JS chunk and returns its URL
// @ts-ignore — Vite-specific ?worker&url query
import vlmWorkerUrl from './workers/vlm-worker?worker&url'

await RunAnywhere.initialize({ environment: SDKEnvironment.Development, debug: true })
await LlamaCPP.register()

// Wire up VLM worker
VLMWorkerBridge.shared.workerUrl = vlmWorkerUrl
RunAnywhere.setVLMLoader({
  get isInitialized() {
    return VLMWorkerBridge.shared.isInitialized
  },
  init: () => VLMWorkerBridge.shared.init(),
  loadModel: (params) => VLMWorkerBridge.shared.loadModel(params),
  unloadModel: () => VLMWorkerBridge.shared.unloadModel(),
})
```

```typescript workers/vlm-worker.ts theme={null}
import { startVLMWorkerRuntime } from '@runanywhere/web-llamacpp'

startVLMWorkerRuntime()
```

<Warning>
  The `?worker&url` import syntax is Vite-specific and not recognized by TypeScript, requiring a
  `@ts-ignore` directive. For other bundlers, you may need to configure the worker URL differently.
</Warning>

<Warning>
  **"non-JavaScript MIME type" error for VLM worker:** If you see
  `Failed to load module script: The server responded with a non-JavaScript MIME type of "text/html"`,
  the worker URL is resolving to your SPA's `index.html` instead of the actual JavaScript file.
  This typically happens when:

  1. Your server's catch-all route intercepts `.js` file requests
  2. The worker file isn't included in the production build output
  3. `worker: { format: 'es' }` is missing from your Vite config

  Ensure static `.js` files are served before the SPA catch-all route. See
  [Installation troubleshooting](/web/installation#vlm-worker-fails-with-non-javascript-mime-type).
</Warning>
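
For the third cause in the list above, a minimal `vite.config.ts` sketch might look like the following. It sets only the `worker.format` option named in the warning. A real config would normally carry plugins and other settings as well, and wrapping the object in Vite's `defineConfig` adds type checking.

```typescript
// vite.config.ts
// Minimal sketch: only the worker option from the warning above is shown.
// Merge this into your existing config rather than replacing it.
export default {
  worker: {
    // Emit workers as ES modules so the VLM worker chunk loads as a module script
    format: 'es',
  },
}
```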

## Basic Usage

### Worker-Based VLM (Recommended)

Use `VLMWorkerBridge` for the best user experience -- inference runs in a Web Worker so the UI stays responsive:

```typescript theme={null}
import { ModelManager } from '@runanywhere/web'
import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'

// Ensure VLM model is downloaded and loaded
await ModelManager.downloadModel('lfm2-vl-450m-q4_0')
await ModelManager.loadModel('lfm2-vl-450m-q4_0')

// Capture a frame from the camera
const camera = new VideoCapture({ facingMode: 'environment' })
await camera.start()

// IMPORTANT: Wait for video to be ready before capturing
await new Promise<void>((resolve) => {
  const video = camera.videoElement
  if (video.videoWidth > 0) {
    resolve()
  } else {
    video.addEventListener('loadedmetadata', () => resolve(), { once: true })
  }
})

const frame = camera.captureFrame(256) // downscale to 256px max

// Process the frame
if (frame && VLMWorkerBridge.shared.isModelLoaded) {
  const result = await VLMWorkerBridge.shared.process(
    frame.rgbPixels,
    frame.width,
    frame.height,
    'Describe what you see briefly.',
    { maxTokens: 60, temperature: 0.7 }
  )
  console.log(result.text)
}

camera.stop()
```

<Warning>
  **Always wait for camera readiness before calling `captureFrame()`.** The video stream takes time
  to initialize after `camera.start()` resolves. If you call `captureFrame()` before the video has
  valid dimensions, you'll get
  `Error: Failed to execute 'getImageData' on 'CanvasRenderingContext2D': The source width is 0.`
  Wait for the `loadedmetadata` event or check `videoElement.videoWidth > 0`.
</Warning>

## API Reference

### VLMWorkerBridge (Off-Thread, Recommended)

```typescript theme={null}
const VLMWorkerBridge: {
  readonly shared: VLMWorkerBridge // singleton

  /** Set the worker script URL before calling init() */
  workerUrl: string

  init(wasmJsUrl?: string): Promise<void>
  loadModel(params: VLMLoadModelParams): Promise<void>

  process(
    rgbPixels: Uint8Array,
    width: number,
    height: number,
    prompt: string,
    options?: { maxTokens?: number; temperature?: number }
  ): Promise<{ text: string }>

  cancel(): void
  unloadModel(): Promise<void>
  terminate(): void

  readonly isInitialized: boolean
  readonly isModelLoaded: boolean
}
```

<Warning>
  **`VLMWorkerBridge.process()` does NOT support `systemPrompt` in its options.** The options only
  accept `maxTokens` and `temperature`. To include system-level instructions, prepend them to the
  `prompt` parameter directly:

  ```typescript theme={null}
  // WRONG — systemPrompt is silently ignored:
  await VLMWorkerBridge.shared.process(pixels, w, h, 'Describe this.', {
    systemPrompt: 'You are a helpful assistant.', // ❌ Not supported
    maxTokens: 60,
  })

  // CORRECT — include instructions in the prompt:
  const prompt = 'You are a helpful assistant. Describe what you see in this image.'
  await VLMWorkerBridge.shared.process(pixels, w, h, prompt, { maxTokens: 60 })
  ```

  The standalone `VLMGenerationOptions` type (below) does include `systemPrompt`, but that type
  belongs to the native VLM API. `VLMWorkerBridge.process()` uses a simplified options type.
</Warning>
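
One way to apply this workaround consistently across call sites is a small helper that merges instructions into the prompt. `buildVLMPrompt` is a hypothetical name, not an SDK function, and the blank-line separator is an assumption. The SDK does not prescribe a format.

```typescript
// Hypothetical helper (not part of the SDK): merges system-level instructions
// into the single prompt string that VLMWorkerBridge.process() accepts.
function buildVLMPrompt(systemInstructions: string, userPrompt: string): string {
  const system = systemInstructions.trim()
  const user = userPrompt.trim()
  // With no system instructions, pass the user prompt through unchanged.
  if (system.length === 0) return user
  // Assumption: a blank line is enough to keep the two parts visually distinct.
  return `${system}\n\n${user}`
}

const prompt = buildVLMPrompt(
  'You are a helpful assistant.',
  'Describe what you see in this image.'
)
// Pass `prompt` as the fourth argument to VLMWorkerBridge.shared.process().
```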

### Types

```typescript theme={null}
enum VLMImageFormat {
  FilePath = 0,
  RGBPixels = 1,
  Base64 = 2,
}

enum VLMModelFamily {
  Auto = 0,
  Qwen2VL = 1,
  SmolVLM = 2,
  LLaVA = 3,
  Custom = 99,
}

interface VLMImage {
  format: VLMImageFormat
  filePath?: string
  pixelData?: Uint8Array
  base64Data?: string
  width?: number
  height?: number
}

interface VLMGenerationOptions {
  maxTokens?: number // default: 512
  temperature?: number // default: 0.7
  topP?: number // default: 0.9
  systemPrompt?: string
  modelFamily?: VLMModelFamily
  streaming?: boolean
}

interface VLMGenerationResult {
  text: string
  promptTokens: number
  imageTokens: number
  completionTokens: number
  totalTokens: number
  timeToFirstTokenMs: number
  imageEncodeTimeMs: number
  totalTimeMs: number
  tokensPerSecond: number
  hardwareUsed: HardwareAcceleration
}
```

## Camera Integration

### VideoCapture

The `VideoCapture` class is in `@runanywhere/web-llamacpp`:

```typescript theme={null}
import { VideoCapture } from '@runanywhere/web-llamacpp'

const camera = new VideoCapture({
  facingMode: 'environment', // or 'user' for selfie camera
})

await camera.start()

// Add the video preview to the DOM
document.getElementById('preview')!.appendChild(camera.videoElement)

// Capture a frame (downscaled to 256px max dimension)
const frame = camera.captureFrame(256)
// frame.rgbPixels: Uint8Array (RGB, no alpha)
// frame.width, frame.height: actual dimensions

// Check state
console.log('Capturing:', camera.isCapturing)

camera.stop()
```
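
If the pixels come from somewhere other than `VideoCapture` (for example, a canvas `ImageData`, which is RGBA), the alpha channel has to be dropped first. The sketch below shows one way to do that. `rgbaToRgb` is a hypothetical helper, and it assumes `process()` expects tightly packed, row-major RGB bytes like the `rgbPixels` that `captureFrame()` returns.

```typescript
// Hypothetical helper (not part of the SDK): drops the alpha channel from
// RGBA pixel data, producing tightly packed RGB bytes.
function rgbaToRgb(
  rgba: Uint8ClampedArray | Uint8Array,
  width: number,
  height: number
): Uint8Array {
  const pixelCount = width * height
  if (rgba.length !== pixelCount * 4) {
    throw new Error(`Expected ${pixelCount * 4} RGBA bytes, got ${rgba.length}`)
  }
  const rgb = new Uint8Array(pixelCount * 3)
  for (let i = 0; i < pixelCount; i++) {
    rgb[i * 3] = rgba[i * 4] // R
    rgb[i * 3 + 1] = rgba[i * 4 + 1] // G
    rgb[i * 3 + 2] = rgba[i * 4 + 2] // B (the alpha byte at i * 4 + 3 is skipped)
  }
  return rgb
}

// Two pixels: opaque red, then half-transparent blue.
const rgba = new Uint8ClampedArray([255, 0, 0, 255, 0, 0, 255, 128])
const rgb = rgbaToRgb(rgba, 2, 1)
console.log(Array.from(rgb)) // [255, 0, 0, 0, 0, 255]
```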

### Waiting for Camera Readiness

The camera stream takes a moment to initialize after `start()` resolves. Always guard against zero-dimension frames:

```typescript theme={null}
const camera = new VideoCapture({ facingMode: 'user' })
await camera.start()

// Option 1: Wait for loadedmetadata event
await new Promise<void>((resolve) => {
  const video = camera.videoElement
  if (video.videoWidth > 0 && video.videoHeight > 0) {
    resolve()
    return
  }
  video.addEventListener('loadedmetadata', () => resolve(), { once: true })
})

// Option 2: Guard each captureFrame call inside your capture function
function captureIfReady() {
  const frame = camera.captureFrame(256)
  if (!frame || frame.width === 0 || frame.height === 0) {
    console.warn('Camera not ready yet, skipping frame')
    return null
  }
  return frame
}
```

<Warning>
  Calling `captureFrame()` before the video stream is fully initialized causes:
  `Failed to execute 'getImageData' on 'CanvasRenderingContext2D': The source width is 0.`
  This commonly happens in React components that call `captureFrame()` immediately after `start()`
  without waiting for video dimensions to be valid.
</Warning>

## Examples

### Live Camera Streaming

```typescript theme={null}
import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'

const camera = new VideoCapture({ facingMode: 'environment' })
await camera.start()

let processing = false

// Describe every 2.5 seconds; skip a tick if the previous one is still processing
const interval = setInterval(async () => {
  if (processing || !VLMWorkerBridge.shared.isModelLoaded) return
  processing = true

  try {
    // captureFrame() can throw while the video has no dimensions, so keep it inside the try
    const frame = camera.captureFrame(256)
    if (!frame) return

    const result = await VLMWorkerBridge.shared.process(
      frame.rgbPixels,
      frame.width,
      frame.height,
      'Describe what you see in one sentence.',
      { maxTokens: 30 }
    )
    document.getElementById('description')!.textContent = result.text
  } catch (err) {
    const msg = (err as Error).message
    if (msg.includes('memory access out of bounds') || msg.includes('RuntimeError')) {
      // WASM memory crash: skip this frame and retry on the next tick
      console.warn('VLM WASM crash, retrying next frame...')
    } else {
      // Rethrowing from an async timer callback would only surface as an unhandled rejection
      console.error('VLM processing failed:', err)
    }
  } finally {
    // Always release the flag, even on early return or error, so later ticks can run
    processing = false
  }
}, 2500)

// Later, when you want to stop streaming:
clearInterval(interval)
camera.stop()
```

### React Component

```tsx VisionChat.tsx theme={null}
import { useState, useCallback, useRef, useEffect } from 'react'
import { VLMWorkerBridge, VideoCapture } from '@runanywhere/web-llamacpp'

export function VisionChat() {
  const [description, setDescription] = useState('')
  const [isProcessing, setIsProcessing] = useState(false)
  const cameraRef = useRef<VideoCapture | null>(null)
  const videoRef = useRef<HTMLDivElement>(null)

  useEffect(() => {
    const camera = new VideoCapture({ facingMode: 'environment' })
    cameraRef.current = camera
    camera.start().then(() => {
      if (videoRef.current) {
        const el = camera.videoElement
        el.style.width = '100%'
        el.style.borderRadius = '12px'
        videoRef.current.appendChild(el)
      }
    })
    return () => camera.stop()
  }, [])

  const handleDescribe = useCallback(async () => {
    if (!VLMWorkerBridge.shared.isModelLoaded) return

    setIsProcessing(true)
    try {
      // captureFrame() throws if the video has no dimensions yet, so call it inside the try
      const frame = cameraRef.current?.captureFrame(256)
      if (!frame) return

      const result = await VLMWorkerBridge.shared.process(
        frame.rgbPixels,
        frame.width,
        frame.height,
        'Describe what you see.',
        { maxTokens: 80 }
      )
      setDescription(result.text)
    } catch (err) {
      const msg = (err as Error).message
      if (msg.includes('memory access out of bounds')) {
        setDescription('Recovering from memory error... try again.')
      } else {
        setDescription('Error: ' + msg)
      }
    } finally {
      setIsProcessing(false)
    }
  }, [])

  return (
    <div>
      <div ref={videoRef} />
      <button onClick={handleDescribe} disabled={isProcessing}>
        {isProcessing ? 'Analyzing...' : 'Describe'}
      </button>
      <p>{description}</p>
    </div>
  )
}
```

## Supported Models

| Model                  | Architecture | HuggingFace Repo                         | Size    | Quality                     |
| ---------------------- | ------------ | ---------------------------------------- | ------- | --------------------------- |
| **LFM2-VL 450M Q4\_0** | Liquid AI    | `runanywhere/LFM2-VL-450M-GGUF`          | \~500MB | Good -- ultra-compact, fast |
| **LFM2-VL 450M Q8\_0** | Liquid AI    | `runanywhere/LFM2-VL-450M-GGUF`          | \~600MB | Good -- higher precision    |
| SmolVLM 500M Q8\_0     | SmolVLM      | `runanywhere/SmolVLM-500M-Instruct-GGUF` | \~500MB | Good                        |
| Qwen2-VL 2B Q4\_K\_M   | Qwen2VL      | `runanywhere/Qwen2-VL-2B-Instruct-GGUF`  | \~1.5GB | Better                      |
| LLaVA 7B Q4\_0         | LLaVA        | --                                       | \~4GB   | Best                        |

### Registering VLM Models

VLM models require two files: the main model GGUF and a multimodal projector (`mmproj`) GGUF. Register them using the `files` array -- the first file is the main model, the second is the projector:

```typescript theme={null}
import { RunAnywhere, ModelCategory, LLMFramework } from '@runanywhere/web'

RunAnywhere.registerModels([
  {
    id: 'lfm2-vl-450m-q4_0',
    name: 'LFM2-VL 450M Q4_0',
    repo: 'runanywhere/LFM2-VL-450M-GGUF',
    files: ['LFM2-VL-450M-Q4_0.gguf', 'mmproj-LFM2-VL-450M-Q8_0.gguf'],
    framework: LLMFramework.LlamaCpp,
    modality: ModelCategory.Multimodal,
    memoryRequirement: 500_000_000,
  },
])
```
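
Because the order of `files` matters, a small check before registration can catch a swapped pair early. The sketch below is illustrative only. `splitVLMFiles` is a hypothetical helper, and it assumes projector filenames contain `mmproj`, as in the example above. Whether the SDK does its own validation is not covered here.

```typescript
// Hypothetical helper (not part of the SDK): checks that a VLM `files` array
// has exactly two entries with the main model first and the projector second.
interface VLMFilePair {
  model: string
  mmproj: string
}

function splitVLMFiles(files: readonly string[]): VLMFilePair {
  if (files.length !== 2) {
    throw new Error(`Expected 2 files (model, mmproj), got ${files.length}`)
  }
  const [model, mmproj] = files
  // Assumption: projector files follow the "mmproj" naming convention.
  if (model.toLowerCase().includes('mmproj')) {
    throw new Error('The first file looks like a projector; the main model must come first')
  }
  return { model, mmproj }
}

const pair = splitVLMFiles(['LFM2-VL-450M-Q4_0.gguf', 'mmproj-LFM2-VL-450M-Q8_0.gguf'])
console.log(pair.model, pair.mmproj)
```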

## WASM Memory Crash Handling

VLM image encoding is computationally expensive in WASM and can occasionally trigger `memory access out of bounds` errors. Always wrap VLM calls in try/catch and handle these gracefully:

```typescript theme={null}
try {
  const result = await VLMWorkerBridge.shared.process(/* ... */)
} catch (err) {
  const msg = (err as Error).message
  if (msg.includes('memory access out of bounds') || msg.includes('RuntimeError')) {
    // Recoverable: the next frame will usually work
    console.warn('WASM memory crash, will retry')
  } else {
    // Anything else is unexpected, so don't swallow it
    throw err
  }
}
```
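
Since the same string checks appear in several places, they can be factored into one predicate. This is a sketch, not SDK API. `isRecoverableWasmError` is a hypothetical name. Checking the error's `name` as well as its `message` is an added assumption, so that a `WebAssembly.RuntimeError` matches regardless of its message text.

```typescript
// Hypothetical helper (not part of the SDK): classifies errors that the
// section above treats as recoverable WASM memory crashes.
function isRecoverableWasmError(err: unknown): boolean {
  // Combine name and message so both "RuntimeError" as an error name and
  // "RuntimeError" inside a message string are detected.
  const text = err instanceof Error ? `${err.name}: ${err.message}` : String(err)
  return text.includes('memory access out of bounds') || text.includes('RuntimeError')
}

console.log(isRecoverableWasmError(new Error('memory access out of bounds'))) // true
console.log(isRecoverableWasmError(new Error('Model not loaded'))) // false
```

In a `catch` block, this replaces the inline `msg.includes(...)` checks: warn and retry when it returns `true`, rethrow or report otherwise.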

## Performance Tips

* **Capture at 256px max dimension** (`captureFrame(256)`) -- larger images dramatically increase encoding time
* **Use `VLMWorkerBridge`** instead of running VLM inference on the main thread -- it runs in a Web Worker and won't freeze the UI
* **Limit `maxTokens`** (30-60 tokens for quick descriptions, 80+ for detailed analysis)
* **Liquid LFM2-VL 450M** is the best starting point -- the smallest model in the table above, fast and memory-efficient (\~500MB)
* **Handle WASM crashes** -- `memory access out of bounds` is recoverable; display a retry message
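
To see why the capture size matters, the sketch below estimates the frame dimensions and RGB buffer size for a given max dimension. `fitWithin` is a hypothetical helper. It assumes `captureFrame()` scales the longer side down to the limit and preserves aspect ratio, which the SDK may implement differently.

```typescript
// Hypothetical helper (not part of the SDK): estimates the downscaled frame
// size and RGB byte count for a given max dimension.
function fitWithin(
  width: number,
  height: number,
  maxDim: number
): { width: number; height: number; rgbBytes: number } {
  // Never upscale: cap the scale factor at 1.
  const scale = Math.min(1, maxDim / Math.max(width, height))
  const w = Math.max(1, Math.round(width * scale))
  const h = Math.max(1, Math.round(height * scale))
  // 3 bytes per pixel (RGB, no alpha)
  return { width: w, height: h, rgbBytes: w * h * 3 }
}

console.log(fitWithin(1280, 720, 256)) // { width: 256, height: 144, rgbBytes: 110592 }
console.log(fitWithin(1280, 720, 512)) // { width: 512, height: 288, rgbBytes: 442368 }
```

Doubling the max dimension quadruples the pixel count under these assumptions, which fits the advice to keep captures small.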

## Related

<CardGroup cols={2}>
  <Card title="LLM Generation" icon="brain" href="/web/llm/generate">
    Text-only generation
  </Card>

  <Card title="Tool Calling" icon="wrench" href="/web/tool-calling">
    Function calling with LLMs
  </Card>

  <Card title="Best Practices" icon="star" href="/web/best-practices">
    Performance optimization
  </Card>
</CardGroup>
