Multimodal image + text inference on-device with Swift
The VLM (Vision Language Model) module enables multimodal inference — feed an image and a text prompt to get descriptions, answers, or visual analysis. Models run entirely on-device using llama.cpp with multimodal projector support.
```swift
import RunAnywhere

// 1. Load a VLM model (requires full ModelDescriptor)
let models = try await RunAnywhere.availableModels()
let vlmModel = models.first(where: { $0.id == "smolvlm-256m-instruct" })!
try await RunAnywhere.loadVLMModel(model: vlmModel)

// 2. Create VLMImage from platform image
#if os(iOS)
let image = VLMImage(image: uiImage)
#elseif os(macOS)
let image = VLMImage(rgbPixels: pixelData, width: 256, height: 256)
#endif

// 3. Process with streaming
let result = try await RunAnywhere.processImageStream(
    image: image,
    prompt: "Describe what you see.",
    maxTokens: 128
)
var fullResponse = ""
for try await token in result.stream {
    fullResponse += token
}
print(fullResponse)
```
VLM loading requires a full ModelDescriptor retrieved from availableModels(), not just a model ID string. This differs from LLM loading, which accepts a plain string ID.
```swift
let models = try await RunAnywhere.availableModels()
guard let vlmModel = models.first(where: { $0.id == "smolvlm-256m-instruct" }) else {
    print("Model not found — register it first")
    return
}
try await RunAnywhere.loadVLMModel(model: vlmModel)

// Verify the model is loaded
if await RunAnywhere.isVLMModelLoaded {
    print("VLM ready")
}
```
```swift
var response = ""
do {
    let models = try await RunAnywhere.availableModels()
    guard let vlmModel = models.first(where: { $0.id == "smolvlm-256m-instruct" }) else {
        print("Model not registered — call registerMultiFileModel first")
        return
    }
    try await RunAnywhere.loadVLMModel(model: vlmModel)
    let result = try await RunAnywhere.processImageStream(
        image: image,
        prompt: "Describe this",
        maxTokens: 128
    )
    for try await token in result.stream {
        response += token
    }
} catch let error as SDKError {
    switch error.code {
    case .modelNotFound:
        print("VLM model not found — ensure both GGUF files are downloaded")
    case .notInitialized:
        print("Load a VLM model before calling processImageStream")
    case .processingFailed:
        print("Image processing failed: \(error.message)")
    case .cancelled:
        print("Generation was cancelled")
    default:
        print("VLM error: \(error)")
    }
}
```
VLM models always require two GGUF files — the language model and the multimodal projector (mmproj). Use registerMultiFileModel instead of registerModel. Missing the projector file will cause loading failures.
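A registration call might look like the sketch below. The parameter labels and the file URLs are illustrative assumptions, not the SDK's documented signature — check the registration API reference for the exact shape:

```swift
// Sketch only: parameter labels and URLs are hypothetical, not the
// SDK's documented signature. The key point is that BOTH the language
// model GGUF and the mmproj GGUF must be registered together.
try await RunAnywhere.registerMultiFileModel(
    id: "smolvlm-256m-instruct",
    files: [
        "https://example.com/smolvlm-256m-instruct-q8.gguf",    // language model (hypothetical URL)
        "https://example.com/mmproj-smolvlm-256m.gguf"          // multimodal projector (hypothetical URL)
    ]
)
```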
Retrieve ModelDescriptor before loading
Unlike LLM loading, which accepts a plain string ID, VLM loading requires a full ModelDescriptor from availableModels(). Always fetch and filter the descriptor before calling loadVLMModel.
Handle platform differences
Use #if os(iOS) and #elseif os(macOS) for VLMImage construction. iOS uses UIImage directly, while macOS requires raw RGB pixel data with explicit dimensions.
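On macOS you need packed RGB bytes to feed the VLMImage(rgbPixels:width:height:) initializer shown in the quick start. A sketch of extracting those bytes from a CGImage via CoreGraphics (the helper name and the alpha-dropping approach are illustrative, not part of the SDK):

```swift
import CoreGraphics

/// Illustrative helper: render a CGImage into an RGBA buffer, then
/// drop the alpha channel to get tightly packed RGB bytes.
func rgbPixels(from image: CGImage) -> (pixels: [UInt8], width: Int, height: Int)? {
    let width = image.width
    let height = image.height
    var rgba = [UInt8](repeating: 0, count: width * height * 4)
    let drawn = rgba.withUnsafeMutableBytes { buffer -> Bool in
        guard let ctx = CGContext(
            data: buffer.baseAddress,
            width: width,
            height: height,
            bitsPerComponent: 8,
            bytesPerRow: width * 4,
            space: CGColorSpaceCreateDeviceRGB(),
            bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
        ) else { return false }
        ctx.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
        return true
    }
    guard drawn else { return nil }

    // Keep R, G, B; skip every fourth (alpha) byte.
    var rgb = [UInt8]()
    rgb.reserveCapacity(width * height * 3)
    for i in stride(from: 0, to: rgba.count, by: 4) {
        rgb.append(contentsOf: rgba[i..<(i + 3)])
    }
    return (rgb, width, height)
}
```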
Limit maxTokens for quick descriptions
Use 30–64 tokens for one-line descriptions and 128–256 for detailed analysis. Larger values increase latency on mobile devices.
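These ranges can be captured in a small helper so call sites don't hard-code magic numbers. The enum and the midpoint values below are illustrative app-side choices, not part of the SDK:

```swift
/// Illustrative detail levels for VLM responses (app code, not SDK API).
enum DescriptionDetail {
    case oneLine   // short caption
    case detailed  // full visual analysis
}

/// Map a detail level to a maxTokens budget, using midpoints of the
/// 30–64 and 128–256 ranges recommended above.
func tokenBudget(for detail: DescriptionDetail) -> Int {
    switch detail {
    case .oneLine:
        return 48
    case .detailed:
        return 192
    }
}
```

Pass the result as the maxTokens argument to processImageStream.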
Cancel long generations
Always provide a cancel button in your UI. Call cancelVLMGeneration() to immediately halt token generation and free resources.
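Wiring this into SwiftUI might look like the sketch below. Only cancelVLMGeneration() comes from the SDK (whether it is async is an assumption here); the view and state handling are illustrative app code:

```swift
import SwiftUI
import RunAnywhere

struct VLMResponseView: View {
    @State private var response = ""

    var body: some View {
        VStack {
            Text(response)
            Button("Cancel") {
                // Halt token generation immediately and free resources.
                // (Assumed async; drop the Task/await if the API is synchronous.)
                Task { await RunAnywhere.cancelVLMGeneration() }
            }
        }
    }
}
```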
Downscale images before processing
Large images increase encoding time significantly. Resize to 256–512px on the longest edge before creating a VLMImage for optimal latency.
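The resize target can be computed with a small pure function before any platform-specific drawing. This helper is illustrative app code, not part of the SDK:

```swift
/// Compute dimensions so the longest edge is at most `maxEdge`,
/// preserving aspect ratio. Pure arithmetic — actual resizing
/// (e.g. UIGraphicsImageRenderer on iOS) happens separately.
func downscaledSize(width: Int, height: Int, maxEdge: Int = 512) -> (width: Int, height: Int) {
    let longest = max(width, height)
    // Never upscale: small images pass through unchanged.
    guard longest > maxEdge else { return (width, height) }
    let scale = Double(maxEdge) / Double(longest)
    return (Int((Double(width) * scale).rounded()),
            Int((Double(height) * scale).rounded()))
}
```

Resize to the returned dimensions, then build the VLMImage from the downscaled pixels.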