Overview

The VLM (Vision Language Model) module enables on-device multimodal inference — feed an image and a text prompt to get descriptions, answers, or analysis. VLM runs entirely on the device using llama.cpp with no data leaving the phone.

Package Imports

import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.VLM.VLMGenerationOptions
import com.runanywhere.sdk.public.extensions.VLM.VLMImage
import com.runanywhere.sdk.public.extensions.cancelVLMGeneration
import com.runanywhere.sdk.public.extensions.processImageStream

Model registration types come from the core SDK:

import com.runanywhere.sdk.public.ModelFileDescriptor
import com.runanywhere.sdk.public.InferenceFramework
import com.runanywhere.sdk.public.ModelCategory

Basic Usage

val image = VLMImage.fromFilePath("/data/user/0/com.example/cache/photo.jpg")
val options = VLMGenerationOptions(maxTokens = 300)

var description = ""
RunAnywhere.processImageStream(image, "Describe this image.", options)
    .collect { token ->
        description += token
        updateUI(description)
    }
processImageStream returns a Flow<String> where each emission is a single token, enabling real-time streaming display.
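Appending each emission with += rebuilds the String on every token; for long generations a StringBuilder avoids that reallocation. A pure-Kotlin sketch with the token stream simulated as a plain list (no SDK calls):

```kotlin
// Accumulate streamed tokens without per-token String reallocation.
fun accumulateTokens(tokens: List<String>): String {
    val sb = StringBuilder()
    for (token in tokens) {
        sb.append(token) // amortized O(1), unlike String +=
    }
    return sb.toString()
}
```

In a real collector, call sb.append(token) inside collect { } and push sb.toString() to the UI on each emission.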

Model Setup

Register a VLM Model

VLM models require two files: the main model GGUF and a multimodal projector (mmproj) GGUF. Register them using registerMultiFileModel:
RunAnywhere.registerMultiFileModel(
    id = "smolvlm-256m-instruct",
    name = "SmolVLM 256M Instruct",
    files = listOf(
        ModelFileDescriptor(
            url = "https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF/resolve/main/SmolVLM-256M-Instruct-Q8_0.gguf",
            filename = "SmolVLM-256M-Instruct-Q8_0.gguf"
        ),
        ModelFileDescriptor(
            url = "https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-256M-Instruct-f16.gguf",
            filename = "mmproj-SmolVLM-256M-Instruct-f16.gguf"
        ),
    ),
    framework = InferenceFramework.LLAMA_CPP,
    modality = ModelCategory.MULTIMODAL,
    memoryRequirement = 365_000_000L
)
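The files list is order-sensitive: the main model GGUF must come first, then the mmproj GGUF. A small guard can catch a swapped list before registration; isValidVlmFileOrder is an illustrative helper, not an SDK API, and relies on the "mmproj-" filename convention used by ggml-org repositories:

```kotlin
// Hypothetical sanity check: main model GGUF first, mmproj GGUF second.
// Relies on the "mmproj-" filename prefix convention; adjust for other repos.
fun isValidVlmFileOrder(filenames: List<String>): Boolean =
    filenames.size == 2 &&
        !filenames[0].startsWith("mmproj-") &&
        filenames[1].startsWith("mmproj-")
```

For the registration above, isValidVlmFileOrder(listOf("SmolVLM-256M-Instruct-Q8_0.gguf", "mmproj-SmolVLM-256M-Instruct-f16.gguf")) returns true; the reversed order returns false.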

Load and Unload

RunAnywhere.loadVLMModel("smolvlm-256m-instruct")

// Check model state
if (RunAnywhere.isVLMModelLoaded) {
    // Ready for inference
}

// Unload when done
try {
    RunAnywhere.unloadVLMModel()
} catch (e: Exception) {
    // Safe to ignore — model may already be unloaded
}
unloadVLMModel() can throw if the model is not currently loaded. Always wrap it in a try/catch block.

API Reference

VLMImage

| Factory Method | Parameters | Description |
| --- | --- | --- |
| VLMImage.fromFilePath(path) | path: String | Creates a VLM image from an on-disk file path |
VLMImage.fromFilePath() requires a file system path, not a content URI. If you’re using Android’s photo picker or ACTION_OPEN_DOCUMENT, you must first copy the image to a temporary file. See the image picker example below.
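A quick guard before calling fromFilePath can catch a content URI passed by mistake; requiresTempCopy is an illustrative helper, not part of the SDK:

```kotlin
// Illustrative helper: content URIs must be copied to a temp file first;
// plain filesystem paths can go straight to VLMImage.fromFilePath().
fun requiresTempCopy(source: String): Boolean =
    source.startsWith("content://")
```

For example, requiresTempCopy("content://media/external/images/42") is true, while requiresTempCopy("/data/user/0/com.example/cache/photo.jpg") is false.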

VLMGenerationOptions

data class VLMGenerationOptions(
    val maxTokens: Int = 512
)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxTokens | Int | 512 | Maximum number of tokens to generate |
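One way to pick maxTokens is by task type; the budgets below are illustrative defaults for a small VLM, not SDK recommendations:

```kotlin
enum class VlmTask { CAPTION, SHORT_ANSWER, DETAILED_ANALYSIS }

// Illustrative token budgets; tune per model and use case.
fun tokenBudget(task: VlmTask): Int = when (task) {
    VlmTask.CAPTION -> 60            // one or two sentences
    VlmTask.SHORT_ANSWER -> 150      // a short paragraph
    VlmTask.DETAILED_ANALYSIS -> 400 // multi-paragraph description
}
```

Usage: VLMGenerationOptions(maxTokens = tokenBudget(VlmTask.CAPTION)).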

processImageStream

fun RunAnywhere.processImageStream(
    image: VLMImage,
    prompt: String,
    options: VLMGenerationOptions
): Flow<String>
Returns a Flow<String> that emits one token at a time. Collect the flow to build up the full response incrementally.
| Parameter | Type | Description |
| --- | --- | --- |
| image | VLMImage | Image to analyze |
| prompt | String | Text prompt for the model |
| options | VLMGenerationOptions | Generation configuration |

cancelVLMGeneration

fun RunAnywhere.cancelVLMGeneration()
Cancels the current VLM generation. Safe to call even if no generation is in progress.

Model State

| Property / Method | Type | Description |
| --- | --- | --- |
| RunAnywhere.isVLMModelLoaded | Boolean (property) | Whether a VLM model is currently loaded |
| RunAnywhere.loadVLMModel(id) | suspend | Loads a registered VLM model by ID |
| RunAnywhere.unloadVLMModel() | suspend | Unloads the current VLM model (can throw) |
isVLMModelLoaded is a property, not a suspend function. Unlike other model checks in the SDK (e.g., isLLMModelLoaded()), you access it directly without calling it.
// Correct
if (RunAnywhere.isVLMModelLoaded) { ... }

// Incorrect — will not compile
if (RunAnywhere.isVLMModelLoaded()) { ... }

Model Registration

fun RunAnywhere.registerMultiFileModel(
    id: String,
    name: String,
    files: List<ModelFileDescriptor>,
    framework: InferenceFramework,
    modality: ModelCategory,
    memoryRequirement: Long
)
| Parameter | Type | Description |
| --- | --- | --- |
| id | String | Unique model identifier |
| name | String | Human-readable model name |
| files | List<ModelFileDescriptor> | Model GGUF + mmproj GGUF (order matters) |
| framework | InferenceFramework | Must be InferenceFramework.LLAMA_CPP for VLM |
| modality | ModelCategory | Must be ModelCategory.MULTIMODAL for VLM |
| memoryRequirement | Long | Estimated memory usage in bytes |
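memoryRequirement is an estimate you supply. One rough heuristic (an assumption, not the SDK's formula) is the summed GGUF file sizes plus ~20% for KV cache and runtime buffers, which roughly matches the 365 MB figure used for SmolVLM 256M above:

```kotlin
// Rough heuristic: sum of GGUF file sizes plus 20% runtime overhead.
// Integer arithmetic avoids floating-point rounding.
fun estimateMemoryRequirement(fileSizesBytes: List<Long>): Long =
    fileSizesBytes.sum() * 12 / 10
```

For a ~300 MB model plus a ~100 MB mmproj, this yields a 480 MB estimate; when in doubt, round up rather than down.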

Examples

Jetpack Compose with Image Picker

A complete example showing image selection, VLM processing, and streaming token display:
@Composable
fun VLMScreen() {
    val context = LocalContext.current
    var selectedImageUri by remember { mutableStateOf<Uri?>(null) }
    var description by remember { mutableStateOf("") }
    var isProcessing by remember { mutableStateOf(false) }
    var prompt by remember { mutableStateOf("Describe this image in detail.") }
    val scope = rememberCoroutineScope()

    val imagePicker = rememberLauncherForActivityResult(
        contract = ActivityResultContracts.PickVisualMedia()
    ) { uri -> selectedImageUri = uri }

    Column(
        modifier = Modifier
            .fillMaxSize()
            .padding(16.dp),
        verticalArrangement = Arrangement.spacedBy(12.dp)
    ) {
        selectedImageUri?.let { uri ->
            AsyncImage(
                model = uri,
                contentDescription = "Selected image",
                modifier = Modifier
                    .fillMaxWidth()
                    .height(250.dp)
                    .clip(RoundedCornerShape(12.dp)),
                contentScale = ContentScale.Crop
            )
        }

        OutlinedTextField(
            value = prompt,
            onValueChange = { prompt = it },
            label = { Text("Prompt") },
            modifier = Modifier.fillMaxWidth()
        )

        Row(horizontalArrangement = Arrangement.spacedBy(8.dp)) {
            Button(onClick = {
                imagePicker.launch(PickVisualMediaRequest(ActivityResultContracts.PickVisualMedia.ImageOnly))
            }) {
                Text("Pick Image")
            }

            Button(
                onClick = {
                    val uri = selectedImageUri ?: return@Button
                    scope.launch {
                        isProcessing = true
                        description = ""

                        val tempFile = saveUriToTempFile(context, uri)
                        if (tempFile != null) {
                            val vlmImage = VLMImage.fromFilePath(tempFile.absolutePath)
                            val options = VLMGenerationOptions(maxTokens = 300)

                            RunAnywhere.processImageStream(vlmImage, prompt, options)
                                .collect { token -> description += token }
                        } else {
                            description = "Error: Could not load image."
                        }

                        isProcessing = false
                    }
                },
                enabled = selectedImageUri != null && !isProcessing
                    && RunAnywhere.isVLMModelLoaded
            ) {
                Text(if (isProcessing) "Analyzing..." else "Describe")
            }

            if (isProcessing) {
                Button(onClick = { RunAnywhere.cancelVLMGeneration() }) {
                    Text("Cancel")
                }
            }
        }

        if (description.isNotEmpty()) {
            Text(
                text = description,
                modifier = Modifier
                    .fillMaxWidth()
                    .verticalScroll(rememberScrollState())
            )
        }
    }
}

private fun saveUriToTempFile(context: Context, uri: Uri): File? {
    return try {
        val inputStream = context.contentResolver.openInputStream(uri) ?: return null
        val tempFile = File.createTempFile("vlm_", ".jpg", context.cacheDir)
        // use {} closes both streams even if copyTo throws
        inputStream.use { input ->
            tempFile.outputStream().use { output -> input.copyTo(output) }
        }
        tempFile
    } catch (e: Exception) {
        null
    }
}
The saveUriToTempFile helper is essential when using Android’s photo picker. Content URIs from PickVisualMedia cannot be passed directly to VLMImage.fromFilePath() — you must first write the image to a temporary JPEG file on disk.

Batch Image Description

suspend fun describeImages(paths: List<String>): List<String> {
    val options = VLMGenerationOptions(maxTokens = 100)

    return paths.map { path ->
        var result = ""
        val image = VLMImage.fromFilePath(path)
        RunAnywhere.processImageStream(image, "Describe briefly.", options)
            .collect { token -> result += token }
        result
    }
}

Error Handling

try {
    RunAnywhere.processImageStream(vlmImage, prompt, options)
        .collect { token -> description += token }
} catch (e: Exception) {
    when {
        e.message?.contains("model not loaded") == true -> {
            // Load the model first
            RunAnywhere.loadVLMModel("smolvlm-256m-instruct")
        }
        e.message?.contains("file not found") == true -> {
            // Image path is invalid
            showError("Could not find image file")
        }
        else -> {
            showError("VLM error: ${e.message}")
        }
    }
}

Supported Models

| Model | Architecture | Size | Memory | Quality |
| --- | --- | --- | --- | --- |
| SmolVLM 256M Q8_0 | SmolVLM | ~300MB | 365MB | Good — ultra-compact, fastest |
| SmolVLM 500M Q8_0 | SmolVLM | ~500MB | 550MB | Good |
| Qwen2-VL 2B Q4_K_M | Qwen2VL | ~1.5GB | 1.8GB | Better |
| LLaVA 7B Q4_0 | LLaVA | ~4GB | 5GB | Best |

Performance Tips

  • Start with SmolVLM 256M — smallest multimodal model, fast and memory-efficient
  • Limit maxTokens — use 50-100 for quick descriptions, 300+ for detailed analysis
  • Cancel long generations — call cancelVLMGeneration() if the user navigates away
  • Unload when idle — VLM models consume significant memory; unload if the user switches to a different feature
  • Pre-copy images — save content URIs to temp files before starting VLM inference to avoid blocking on I/O during generation