The VLM (Vision Language Model) module enables multimodal inference — feed an image and a text prompt to get descriptions, answers, or visual analysis. Models run entirely on-device using llama.cpp with multimodal projector support.

Overview

VLM handles:
  • Image encoding — Platform-aware constructors for iOS (UIImage) and macOS (raw RGB pixels)
  • Model loading — Multi-file GGUF models with multimodal projector
  • Streaming generation — Token-by-token output via AsyncSequence
  • Cancellation — Interrupt long-running generations at any time

Basic Usage

import RunAnywhere

// 1. Load a VLM model (requires full ModelDescriptor)
let models = try await RunAnywhere.availableModels()
let vlmModel = models.first(where: { $0.id == "smolvlm-256m-instruct" })!
try await RunAnywhere.loadVLMModel(model: vlmModel)

// 2. Create VLMImage from platform image
#if os(iOS)
let image = VLMImage(image: uiImage)
#elseif os(macOS)
let image = VLMImage(rgbPixels: pixelData, width: 256, height: 256)
#endif

// 3. Process with streaming
let result = try await RunAnywhere.processImageStream(
    image: image,
    prompt: "Describe what you see.",
    maxTokens: 128
)

var fullResponse = ""
for try await token in result.stream {
    fullResponse += token
}
print(fullResponse)
VLM loading requires a full ModelDescriptor retrieved from availableModels(), not just a model ID string. This is different from LLM loading which accepts a plain string ID.

Setup

Register a VLM Model

VLM models require two GGUF files: the main language model and a multimodal projector (mmproj). Use registerMultiFileModel to register both:
import RunAnywhere

RunAnywhere.registerMultiFileModel(
    id: "smolvlm-256m-instruct",
    name: "SmolVLM 256M Instruct",
    files: [
        ModelFileDescriptor(
            url: URL(string: "https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF/resolve/main/SmolVLM-256M-Instruct-Q8_0.gguf")!,
            filename: "SmolVLM-256M-Instruct-Q8_0.gguf"
        ),
        ModelFileDescriptor(
            url: URL(string: "https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-256M-Instruct-f16.gguf")!,
            filename: "mmproj-SmolVLM-256M-Instruct-f16.gguf"
        ),
    ],
    framework: .llamaCpp,
    modality: .multimodal,
    memoryRequirement: 365_000_000
)
The first file is the main model GGUF, and the second is the multimodal projector. Both are required for VLM inference.

Load the Model

let models = try await RunAnywhere.availableModels()

guard let vlmModel = models.first(where: { $0.id == "smolvlm-256m-instruct" }) else {
    print("Model not found — register it first")
    return
}

try await RunAnywhere.loadVLMModel(model: vlmModel)

// Verify the model is loaded
if await RunAnywhere.isVLMModelLoaded {
    print("VLM ready")
}

API Reference

VLMImage

Platform-conditional image wrapper for VLM input.
| Platform | Constructor | Parameters |
| --- | --- | --- |
| iOS | VLMImage(image:) | UIImage — any UIImage instance |
| macOS | VLMImage(rgbPixels:width:height:) | Data — raw RGB pixel bytes, Int width, Int height |

Model Operations

| Method | Description |
| --- | --- |
| RunAnywhere.availableModels() async throws -> [ModelDescriptor] | Lists all registered models |
| RunAnywhere.loadVLMModel(model: ModelDescriptor) async throws | Loads a VLM model from its descriptor |
| RunAnywhere.isVLMModelLoaded: Bool (async) | Whether a VLM model is currently loaded |
| RunAnywhere.cancelVLMGeneration() async | Cancels any in-progress VLM generation |
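Cancellation usually pairs a generation task with the cancel call. A minimal sketch of that pattern — the GenerationController wrapper is illustrative, not part of the SDK:

```swift
import RunAnywhere

// Illustrative pattern (not an SDK type): run generation inside a Task so
// the UI stays responsive, and expose a cancel() that forwards to the
// SDK's cancelVLMGeneration().
final class GenerationController {
    private var task: Task<String, Error>?

    func start(image: VLMImage, prompt: String) {
        task = Task {
            let result = try await RunAnywhere.processImageStream(
                image: image, prompt: prompt, maxTokens: 128
            )
            var text = ""
            for try await token in result.stream {
                text += token
            }
            return text
        }
    }

    func cancel() async {
        // Ask the SDK to stop producing tokens, then cancel the local task.
        await RunAnywhere.cancelVLMGeneration()
        task?.cancel()
    }
}
```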

processImageStream

let result = try await RunAnywhere.processImageStream(
    image: VLMImage,
    prompt: String,
    maxTokens: Int
)
| Parameter | Type | Description |
| --- | --- | --- |
| image | VLMImage | The image to analyze |
| prompt | String | The text prompt describing what to analyze |
| maxTokens | Int | Maximum number of tokens to generate |
Returns a result object with a .stream property — an AsyncSequence of String tokens.
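Because .stream is an AsyncSequence, standard iteration patterns apply, including breaking out early. A sketch — the 200-character cutoff is an arbitrary example, not an SDK feature:

```swift
import RunAnywhere

// Sketch: consume the token stream but stop early once enough text has
// arrived, then tell the SDK to stop generating in the background.
func shortDescription(of image: VLMImage) async throws -> String {
    let result = try await RunAnywhere.processImageStream(
        image: image,
        prompt: "Describe this image in one sentence.",
        maxTokens: 64
    )

    var text = ""
    for try await token in result.stream {
        text += token
        // Arbitrary example cutoff, not an SDK feature.
        if text.count >= 200 { break }
    }
    await RunAnywhere.cancelVLMGeneration()
    return text
}
```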

registerMultiFileModel

RunAnywhere.registerMultiFileModel(
    id: String,
    name: String,
    files: [ModelFileDescriptor],
    framework: ModelFramework,
    modality: ModelModality,
    memoryRequirement: Int
)
| Parameter | Type | Description |
| --- | --- | --- |
| id | String | Unique model identifier |
| name | String | Human-readable model name |
| files | [ModelFileDescriptor] | Array of model files (main model + mmproj) |
| framework | ModelFramework | Must be .llamaCpp for VLM |
| modality | ModelModality | Must be .multimodal for VLM |
| memoryRequirement | Int | Estimated memory usage in bytes |
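Since the files array is order-sensitive (main model first, then the projector), a small wrapper can make that invariant explicit. This helper is illustrative, not part of the SDK:

```swift
import Foundation
import RunAnywhere

// Illustrative helper (not an SDK API): builds the order-sensitive files
// array so the main model GGUF always precedes the multimodal projector.
func vlmFileDescriptors(mainModel: URL, projector: URL) -> [ModelFileDescriptor] {
    [
        ModelFileDescriptor(url: mainModel, filename: mainModel.lastPathComponent),
        ModelFileDescriptor(url: projector, filename: projector.lastPathComponent),
    ]
}
```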

Platform-Specific Image Handling

VLM image creation differs between iOS and macOS. Use conditional compilation to handle both:
import RunAnywhere

#if os(iOS)
import UIKit

func createVLMImage(from uiImage: UIImage) -> VLMImage {
    return VLMImage(image: uiImage)
}
#elseif os(macOS)
import AppKit

func createVLMImage(from nsImage: NSImage) -> VLMImage? {
    guard let tiffData = nsImage.tiffRepresentation,
          let bitmap = NSBitmapImageRep(data: tiffData) else {
        return nil
    }

    let width = bitmap.pixelsWide
    let height = bitmap.pixelsHigh
    var rgbData = Data(capacity: width * height * 3)

    for y in 0..<height {
        for x in 0..<width {
            // Fall back to black when a pixel cannot be read, so the buffer
            // always holds exactly width * height * 3 bytes.
            let color = bitmap.colorAt(x: x, y: y)?.usingColorSpace(.deviceRGB)
            rgbData.append(UInt8((color?.redComponent ?? 0) * 255))
            rgbData.append(UInt8((color?.greenComponent ?? 0) * 255))
            rgbData.append(UInt8((color?.blueComponent ?? 0) * 255))
        }
    }

    return VLMImage(rgbPixels: rgbData, width: width, height: height)
}
#endif

Examples

Complete SwiftUI App with PhotosPicker

import SwiftUI
import PhotosUI
import RunAnywhere

@MainActor
@Observable
class VLMViewModel {
    var selectedImage: UIImage?
    var prompt = "Describe this image in detail."
    var response = ""
    var isLoading = false
    var isModelLoaded = false
    var loadingStatus = "Not loaded"
    var maxTokens = 128

    func loadModel() async {
        loadingStatus = "Loading model..."
        do {
            let models = try await RunAnywhere.availableModels()
            guard let vlmModel = models.first(where: { $0.id == "smolvlm-256m-instruct" }) else {
                loadingStatus = "Model not registered"
                return
            }
            try await RunAnywhere.loadVLMModel(model: vlmModel)
            isModelLoaded = await RunAnywhere.isVLMModelLoaded
            loadingStatus = isModelLoaded ? "Ready" : "Failed to load"
        } catch {
            loadingStatus = "Error: \(error.localizedDescription)"
        }
    }

    func analyzeImage() async {
        guard let uiImage = selectedImage else { return }

        isLoading = true
        response = ""

        do {
            // This example targets iOS (UIImage, PhotosPicker), so the
            // VLMImage can be created directly. See Platform-Specific
            // Image Handling above for converting an NSImage on macOS.
            let image = VLMImage(image: uiImage)

            let result = try await RunAnywhere.processImageStream(
                image: image,
                prompt: prompt,
                maxTokens: maxTokens
            )

            for try await token in result.stream {
                await MainActor.run {
                    response += token
                }
            }
        } catch {
            response = "Error: \(error.localizedDescription)"
        }

        isLoading = false
    }

    func cancel() async {
        await RunAnywhere.cancelVLMGeneration()
        isLoading = false
    }
}

struct VLMDemoView: View {
    @State private var viewModel = VLMViewModel()
    @State private var selectedItem: PhotosPickerItem?

    var body: some View {
        NavigationStack {
            ScrollView {
                VStack(spacing: 20) {
                    // Model status
                    HStack {
                        Circle()
                            .fill(viewModel.isModelLoaded ? .green : .orange)
                            .frame(width: 8, height: 8)
                        Text(viewModel.loadingStatus)
                            .font(.caption)
                            .foregroundStyle(.secondary)
                        Spacer()
                    }

                    // Image picker
                    PhotosPicker(selection: $selectedItem, matching: .images) {
                        Group {
                            if let image = viewModel.selectedImage {
                                Image(uiImage: image)
                                    .resizable()
                                    .scaledToFit()
                                    .frame(maxHeight: 300)
                                    .clipShape(RoundedRectangle(cornerRadius: 12))
                            } else {
                                RoundedRectangle(cornerRadius: 12)
                                    .fill(.quaternary)
                                    .frame(height: 200)
                                    .overlay {
                                        VStack(spacing: 8) {
                                            Image(systemName: "photo.badge.plus")
                                                .font(.largeTitle)
                                            Text("Select an image")
                                        }
                                        .foregroundStyle(.secondary)
                                    }
                            }
                        }
                    }

                    // Prompt input
                    TextField("What do you want to know?", text: $viewModel.prompt, axis: .vertical)
                        .textFieldStyle(.roundedBorder)
                        .lineLimit(2...4)

                    // Action buttons
                    HStack {
                        Button("Analyze") {
                            Task { await viewModel.analyzeImage() }
                        }
                        .buttonStyle(.borderedProminent)
                        .disabled(!viewModel.isModelLoaded || viewModel.selectedImage == nil || viewModel.isLoading)

                        if viewModel.isLoading {
                            Button("Cancel", role: .destructive) {
                                Task { await viewModel.cancel() }
                            }
                            .buttonStyle(.bordered)
                        }
                    }

                    // Response
                    if !viewModel.response.isEmpty {
                        VStack(alignment: .leading, spacing: 8) {
                            Text("Response")
                                .font(.headline)
                            Text(viewModel.response)
                                .textSelection(.enabled)
                                .padding()
                                .frame(maxWidth: .infinity, alignment: .leading)
                                .background(.ultraThinMaterial)
                                .clipShape(RoundedRectangle(cornerRadius: 8))
                        }
                    }

                    if viewModel.isLoading {
                        ProgressView("Analyzing image...")
                    }
                }
                .padding()
            }
            .navigationTitle("Vision AI")
            .task { await viewModel.loadModel() }
            .onChange(of: selectedItem) { _, newValue in
                Task {
                    if let data = try? await newValue?.loadTransferable(type: Data.self),
                       let image = UIImage(data: data) {
                        viewModel.selectedImage = image
                    }
                }
            }
        }
    }
}

Batch Image Analysis

func analyzeMultipleImages(_ images: [UIImage], prompt: String) async throws -> [String] {
    var results: [String] = []

    for uiImage in images {
        let image = VLMImage(image: uiImage)
        let result = try await RunAnywhere.processImageStream(
            image: image,
            prompt: prompt,
            maxTokens: 64
        )

        var text = ""
        for try await token in result.stream {
            text += token
        }
        results.append(text)
    }

    return results
}

Error Handling

do {
    let models = try await RunAnywhere.availableModels()
    guard let vlmModel = models.first(where: { $0.id == "smolvlm-256m-instruct" }) else {
        print("Model not registered — call registerMultiFileModel first")
        return
    }

    try await RunAnywhere.loadVLMModel(model: vlmModel)
    let result = try await RunAnywhere.processImageStream(
        image: image,
        prompt: "Describe this",
        maxTokens: 128
    )

    for try await token in result.stream {
        response += token
    }
} catch let error as SDKError {
    switch error.code {
    case .modelNotFound:
        print("VLM model not found — ensure both GGUF files are downloaded")
    case .notInitialized:
        print("Load a VLM model before calling processImageStream")
    case .processingFailed:
        print("Image processing failed: \(error.message)")
    case .cancelled:
        print("Generation was cancelled")
    default:
        print("VLM error: \(error)")
    }
}

Best Practices

  • Register both files — VLM models always require two GGUF files: the language model and the multimodal projector (mmproj). Use registerMultiFileModel instead of registerModel; a missing projector file causes loading to fail.
  • Load by descriptor — unlike LLM loading, which accepts a plain string ID, VLM loading requires a full ModelDescriptor from availableModels(). Always fetch and filter the descriptor before calling loadVLMModel.
  • Handle platforms — use #if os(iOS) and #elseif os(macOS) for VLMImage construction. iOS takes a UIImage directly, while macOS requires raw RGB pixel data with explicit dimensions.
  • Budget tokens — use 30–64 tokens for one-line descriptions and 128–256 for detailed analysis. Larger values increase latency on mobile devices.
  • Support cancellation — always provide a cancel button in your UI. Call cancelVLMGeneration() to immediately halt token generation and free resources.
  • Resize inputs — large images increase encoding time significantly. Resize to 256–512px on the longest edge before creating a VLMImage for optimal latency.
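For the resizing tip, a minimal iOS sketch using UIGraphicsImageRenderer; the 512-pixel target is the upper end of the suggested range:

```swift
import UIKit

// Downscale so the longest edge is at most maxEdge, preserving aspect
// ratio; returns the original image if it is already small enough.
func downscaled(_ image: UIImage, maxEdge: CGFloat = 512) -> UIImage {
    let longest = max(image.size.width, image.size.height)
    guard longest > maxEdge else { return image }

    let scale = maxEdge / longest
    let newSize = CGSize(width: image.size.width * scale,
                         height: image.size.height * scale)

    let format = UIGraphicsImageRendererFormat()
    format.scale = 1  // render at 1x so size is in pixels, not screen points
    return UIGraphicsImageRenderer(size: newSize, format: format).image { _ in
        image.draw(in: CGRect(origin: .zero, size: newSize))
    }
}
```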

Supported Models

| Model | Architecture | Size | Quality |
| --- | --- | --- | --- |
| SmolVLM 256M Instruct | SmolVLM | ~365MB | Good — ultra-compact, fast |
| SmolVLM 500M Instruct | SmolVLM | ~500MB | Better |
| Qwen2-VL 2B Instruct | Qwen2VL | ~1.5GB | Best |