Multimodal image + text inference on-device with Swift
The VLM (Vision Language Model) module enables multimodal inference — feed an image and a text prompt to get descriptions, answers, or visual analysis. Models run entirely on-device using llama.cpp with multimodal projector support.
```swift
import RunAnywhere

// 1. Load a VLM model (requires full ModelDescriptor)
let models = try await RunAnywhere.availableModels()
let vlmModel = models.first(where: { $0.id == "smolvlm-256m-instruct" })!
try await RunAnywhere.loadVLMModel(model: vlmModel)

// 2. Create VLMImage from platform image
#if os(iOS)
let image = VLMImage(image: uiImage)
#elseif os(macOS)
let image = VLMImage(rgbPixels: pixelData, width: 256, height: 256)
#endif

// 3. Process with streaming
let result = try await RunAnywhere.processImageStream(
    image: image,
    prompt: "Describe what you see.",
    maxTokens: 128
)
var fullResponse = ""
for try await token in result.stream {
    fullResponse += token
}
print(fullResponse)
```
VLM loading requires a full ModelDescriptor retrieved from availableModels(), not just a model ID string. This differs from LLM loading, which accepts a plain string ID.
```swift
let models = try await RunAnywhere.availableModels()
guard let vlmModel = models.first(where: { $0.id == "smolvlm-256m-instruct" }) else {
    print("Model not found — register it first")
    return
}
try await RunAnywhere.loadVLMModel(model: vlmModel)

// Verify the model is loaded
if await RunAnywhere.isVLMModelLoaded {
    print("VLM ready")
}
```
```swift
var response = ""
do {
    let models = try await RunAnywhere.availableModels()
    guard let vlmModel = models.first(where: { $0.id == "smolvlm-256m-instruct" }) else {
        print("Model not registered — call registerMultiFileModel first")
        return
    }
    try await RunAnywhere.loadVLMModel(model: vlmModel)
    let result = try await RunAnywhere.processImageStream(
        image: image,
        prompt: "Describe this",
        maxTokens: 128
    )
    for try await token in result.stream {
        response += token
    }
} catch let error as SDKError {
    switch error.code {
    case .modelNotFound:
        print("VLM model not found — ensure both GGUF files are downloaded")
    case .notInitialized:
        print("Load a VLM model before calling processImageStream")
    case .processingFailed:
        print("Image processing failed: \(error.message)")
    case .cancelled:
        print("Generation was cancelled")
    default:
        print("VLM error: \(error)")
    }
}
```
VLM models always require two GGUF files — the language model and the multimodal projector (mmproj). Use registerMultiFileModel instead of registerModel. Missing the projector file will cause loading failures.
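A registration call might look like the sketch below. The parameter labels and the file URLs are illustrative assumptions, not the SDK's documented signature — check the registration API reference for the exact shape:

```swift
// Sketch only: parameter labels and URLs are hypothetical, not the
// SDK's documented signature. The key point is that BOTH the language
// model GGUF and the mmproj GGUF must be registered together.
try await RunAnywhere.registerMultiFileModel(
    id: "smolvlm-256m-instruct",
    files: [
        "https://example.com/smolvlm-256m-instruct-q8.gguf",    // language model (hypothetical URL)
        "https://example.com/mmproj-smolvlm-256m.gguf"          // multimodal projector (hypothetical URL)
    ]
)
```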
Retrieve ModelDescriptor before loading
Unlike LLM loading, which accepts a plain string ID, VLM loading requires a full ModelDescriptor from availableModels(). Always fetch and filter the descriptor before calling loadVLMModel.
Handle platform differences
Use #if os(iOS) and #elseif os(macOS) for VLMImage construction. iOS uses UIImage directly, while macOS requires raw RGB pixel data with explicit dimensions.
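On macOS you need packed RGB bytes to feed the VLMImage(rgbPixels:width:height:) initializer shown in the quick start. A sketch of extracting those bytes from a CGImage via CoreGraphics (the helper name and the alpha-dropping approach are illustrative, not part of the SDK):

```swift
import CoreGraphics

/// Illustrative helper: render a CGImage into an RGBA buffer, then
/// drop the alpha channel to get tightly packed RGB bytes.
func rgbPixels(from image: CGImage) -> (pixels: [UInt8], width: Int, height: Int)? {
    let width = image.width
    let height = image.height
    var rgba = [UInt8](repeating: 0, count: width * height * 4)
    let drawn = rgba.withUnsafeMutableBytes { buffer -> Bool in
        guard let ctx = CGContext(
            data: buffer.baseAddress,
            width: width,
            height: height,
            bitsPerComponent: 8,
            bytesPerRow: width * 4,
            space: CGColorSpaceCreateDeviceRGB(),
            bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
        ) else { return false }
        ctx.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
        return true
    }
    guard drawn else { return nil }

    // Keep R, G, B; skip every fourth (alpha) byte.
    var rgb = [UInt8]()
    rgb.reserveCapacity(width * height * 3)
    for i in stride(from: 0, to: rgba.count, by: 4) {
        rgb.append(contentsOf: rgba[i..<(i + 3)])
    }
    return (rgb, width, height)
}
```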
Limit maxTokens for quick descriptions
Use 30–64 tokens for one-line descriptions and 128–256 for detailed analysis. Larger values increase latency on mobile devices.
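These ranges can be captured in a small helper so call sites don't hard-code magic numbers. The enum and the midpoint values below are illustrative app-side choices, not part of the SDK:

```swift
/// Illustrative detail levels for VLM responses (app code, not SDK API).
enum DescriptionDetail {
    case oneLine   // short caption
    case detailed  // full visual analysis
}

/// Map a detail level to a maxTokens budget, using midpoints of the
/// 30–64 and 128–256 ranges recommended above.
func tokenBudget(for detail: DescriptionDetail) -> Int {
    switch detail {
    case .oneLine:
        return 48
    case .detailed:
        return 192
    }
}
```

Pass the result as the maxTokens argument to processImageStream.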
Cancel long generations
Always provide a cancel button in your UI. Call cancelVLMGeneration() to immediately halt token generation and free resources.
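Wiring this into SwiftUI might look like the sketch below. Only cancelVLMGeneration() comes from the SDK (whether it is async is an assumption here); the view and state handling are illustrative app code:

```swift
import SwiftUI
import RunAnywhere

struct VLMResponseView: View {
    @State private var response = ""

    var body: some View {
        VStack {
            Text(response)
            Button("Cancel") {
                // Halt token generation immediately and free resources.
                // (Assumed async; drop the Task/await if the API is synchronous.)
                Task { await RunAnywhere.cancelVLMGeneration() }
            }
        }
    }
}
```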
Downscale images before processing
Large images increase encoding time significantly. Resize to 256–512px on the longest edge before creating a VLMImage for optimal latency.
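The resize target can be computed with a small pure function before any platform-specific drawing. This helper is illustrative app code, not part of the SDK:

```swift
/// Compute dimensions so the longest edge is at most `maxEdge`,
/// preserving aspect ratio. Pure arithmetic — actual resizing
/// (e.g. UIGraphicsImageRenderer on iOS) happens separately.
func downscaledSize(width: Int, height: Int, maxEdge: Int = 512) -> (width: Int, height: Int) {
    let longest = max(width, height)
    // Never upscale: small images pass through unchanged.
    guard longest > maxEdge else { return (width, height) }
    let scale = Double(maxEdge) / Double(longest)
    return (Int((Double(width) * scale).rounded()),
            Int((Double(height) * scale).rounded()))
}
```

Resize to the returned dimensions, then build the VLMImage from the downscaled pixels.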