Overview

The VLM (Vision Language Model) module enables on-device multimodal inference — feed an image and a text prompt to get descriptions, answers, or analysis. VLM runs entirely on the device using llama.cpp with no data leaving the phone.

Package Imports

import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.VLM.VLMGenerationOptions
import com.runanywhere.sdk.public.extensions.VLM.VLMImage
import com.runanywhere.sdk.public.extensions.cancelVLMGeneration
import com.runanywhere.sdk.public.extensions.processImageStream

Model registration types come from the core SDK:

import com.runanywhere.sdk.public.ModelFileDescriptor
import com.runanywhere.sdk.public.InferenceFramework
import com.runanywhere.sdk.public.ModelCategory

Basic Usage

val image = VLMImage.fromFilePath("/data/user/0/com.example/cache/photo.jpg")
val options = VLMGenerationOptions(maxTokens = 300)

var description = ""
RunAnywhere.processImageStream(image, "Describe this image.", options)
    .collect { token ->
        description += token
        updateUI(description)
    }
processImageStream returns a Flow<String> where each emission is a single token, enabling real-time streaming display.
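Appending each emission with += rebuilds the String on every token; for long generations a StringBuilder avoids that reallocation. A pure-Kotlin sketch with the token stream simulated as a plain list (no SDK calls):

```kotlin
// Accumulate streamed tokens without per-token String reallocation.
fun accumulateTokens(tokens: List<String>): String {
    val sb = StringBuilder()
    for (token in tokens) {
        sb.append(token) // amortized O(1), unlike String +=
    }
    return sb.toString()
}
```

In a real collector, call sb.append(token) inside collect { } and push sb.toString() to the UI on each emission.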

Model Setup

Register a VLM Model

VLM models require two files: the main model GGUF and a multimodal projector (mmproj) GGUF. Register them using registerMultiFileModel:
RunAnywhere.registerMultiFileModel(
    id = "smolvlm-256m-instruct",
    name = "SmolVLM 256M Instruct",
    files = listOf(
        ModelFileDescriptor(
            url = "https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF/resolve/main/SmolVLM-256M-Instruct-Q8_0.gguf",
            filename = "SmolVLM-256M-Instruct-Q8_0.gguf"
        ),
        ModelFileDescriptor(
            url = "https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-256M-Instruct-f16.gguf",
            filename = "mmproj-SmolVLM-256M-Instruct-f16.gguf"
        ),
    ),
    framework = InferenceFramework.LLAMA_CPP,
    modality = ModelCategory.MULTIMODAL,
    memoryRequirement = 365_000_000L
)
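The files list is order-sensitive: the main model GGUF must come first, then the mmproj GGUF. A small guard can catch a swapped list before registration; isValidVlmFileOrder is an illustrative helper, not an SDK API, and relies on the "mmproj-" filename convention used by ggml-org repositories:

```kotlin
// Hypothetical sanity check: main model GGUF first, mmproj GGUF second.
// Relies on the "mmproj-" filename prefix convention; adjust for other repos.
fun isValidVlmFileOrder(filenames: List<String>): Boolean =
    filenames.size == 2 &&
        !filenames[0].startsWith("mmproj-") &&
        filenames[1].startsWith("mmproj-")
```

For the registration above, isValidVlmFileOrder(listOf("SmolVLM-256M-Instruct-Q8_0.gguf", "mmproj-SmolVLM-256M-Instruct-f16.gguf")) returns true; the reversed order returns false.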

Load and Unload

RunAnywhere.loadVLMModel("smolvlm-256m-instruct")

// Check model state
if (RunAnywhere.isVLMModelLoaded) {
    // Ready for inference
}

// Unload when done
try {
    RunAnywhere.unloadVLMModel()
} catch (e: Exception) {
    // Safe to ignore — model may already be unloaded
}
unloadVLMModel() can throw if the model is not currently loaded. Always wrap it in a try/catch block.

API Reference

VLMImage

| Factory Method | Parameters | Description |
| --- | --- | --- |
| VLMImage.fromFilePath(path) | path: String | Creates a VLM image from an on-disk file path |
VLMImage.fromFilePath() requires a file system path, not a content URI. If you’re using Android’s photo picker or ACTION_OPEN_DOCUMENT, you must first copy the image to a temporary file. See the image picker example below.
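A quick guard before calling fromFilePath can catch a content URI passed by mistake; requiresTempCopy is an illustrative helper, not part of the SDK:

```kotlin
// Illustrative helper: content URIs must be copied to a temp file first;
// plain filesystem paths can go straight to VLMImage.fromFilePath().
fun requiresTempCopy(source: String): Boolean =
    source.startsWith("content://")
```

For example, requiresTempCopy("content://media/external/images/42") is true, while requiresTempCopy("/data/user/0/com.example/cache/photo.jpg") is false.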

VLMGenerationOptions

data class VLMGenerationOptions(
    val maxTokens: Int = 512
)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxTokens | Int | 512 | Maximum number of tokens to generate |
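One way to pick maxTokens is by task type; the budgets below are illustrative defaults for a small VLM, not SDK recommendations:

```kotlin
enum class VlmTask { CAPTION, SHORT_ANSWER, DETAILED_ANALYSIS }

// Illustrative token budgets; tune per model and use case.
fun tokenBudget(task: VlmTask): Int = when (task) {
    VlmTask.CAPTION -> 60            // one or two sentences
    VlmTask.SHORT_ANSWER -> 150      // a short paragraph
    VlmTask.DETAILED_ANALYSIS -> 400 // multi-paragraph description
}
```

Usage: VLMGenerationOptions(maxTokens = tokenBudget(VlmTask.CAPTION)).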

processImageStream

fun RunAnywhere.processImageStream(
    image: VLMImage,
    prompt: String,
    options: VLMGenerationOptions
): Flow<String>
Returns a Flow<String> that emits one token at a time. Collect the flow to build up the full response incrementally.
| Parameter | Type | Description |
| --- | --- | --- |
| image | VLMImage | Image to analyze |
| prompt | String | Text prompt for the model |
| options | VLMGenerationOptions | Generation configuration |

cancelVLMGeneration

fun RunAnywhere.cancelVLMGeneration()
Cancels the current VLM generation. Safe to call even if no generation is in progress.

Model State

| Property / Method | Type | Description |
| --- | --- | --- |
| RunAnywhere.isVLMModelLoaded | Boolean (property) | Whether a VLM model is currently loaded |
| RunAnywhere.loadVLMModel(id) | suspend | Loads a registered VLM model by ID |
| RunAnywhere.unloadVLMModel() | suspend | Unloads the current VLM model (can throw) |
isVLMModelLoaded is a property, not a suspend function. Unlike other model checks in the SDK (e.g., isLLMModelLoaded()), you access it directly without calling it.
// Correct
if (RunAnywhere.isVLMModelLoaded) { ... }

// Incorrect — will not compile
if (RunAnywhere.isVLMModelLoaded()) { ... }

Model Registration

fun RunAnywhere.registerMultiFileModel(
    id: String,
    name: String,
    files: List<ModelFileDescriptor>,
    framework: InferenceFramework,
    modality: ModelCategory,
    memoryRequirement: Long
)
| Parameter | Type | Description |
| --- | --- | --- |
| id | String | Unique model identifier |
| name | String | Human-readable model name |
| files | List<ModelFileDescriptor> | Model GGUF + mmproj GGUF (order matters) |
| framework | InferenceFramework | Must be InferenceFramework.LLAMA_CPP for VLM |
| modality | ModelCategory | Must be ModelCategory.MULTIMODAL for VLM |
| memoryRequirement | Long | Estimated memory usage in bytes |
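memoryRequirement is an estimate you supply. One rough heuristic (an assumption, not the SDK's formula) is the summed GGUF file sizes plus ~20% for KV cache and runtime buffers, which roughly matches the 365 MB figure used for SmolVLM 256M above:

```kotlin
// Rough heuristic: sum of GGUF file sizes plus 20% runtime overhead.
// Integer arithmetic avoids floating-point rounding.
fun estimateMemoryRequirement(fileSizesBytes: List<Long>): Long =
    fileSizesBytes.sum() * 12 / 10
```

For a ~300 MB model plus a ~100 MB mmproj, this yields a 480 MB estimate; when in doubt, round up rather than down.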

Examples

Jetpack Compose with Image Picker

A complete example showing image selection, VLM processing, and streaming token display:
@Composable
fun VLMScreen() {
    val context = LocalContext.current
    var selectedImageUri by remember { mutableStateOf<Uri?>(null) }
    var description by remember { mutableStateOf("") }
    var isProcessing by remember { mutableStateOf(false) }
    var prompt by remember { mutableStateOf("Describe this image in detail.") }
    val scope = rememberCoroutineScope()

    val imagePicker = rememberLauncherForActivityResult(
        contract = ActivityResultContracts.PickVisualMedia()
    ) { uri -> selectedImageUri = uri }

    Column(
        modifier = Modifier
            .fillMaxSize()
            .padding(16.dp),
        verticalArrangement = Arrangement.spacedBy(12.dp)
    ) {
        selectedImageUri?.let { uri ->
            AsyncImage(
                model = uri,
                contentDescription = "Selected image",
                modifier = Modifier
                    .fillMaxWidth()
                    .height(250.dp)
                    .clip(RoundedCornerShape(12.dp)),
                contentScale = ContentScale.Crop
            )
        }

        OutlinedTextField(
            value = prompt,
            onValueChange = { prompt = it },
            label = { Text("Prompt") },
            modifier = Modifier.fillMaxWidth()
        )

        Row(horizontalArrangement = Arrangement.spacedBy(8.dp)) {
            Button(onClick = {
                imagePicker.launch(PickVisualMediaRequest(ActivityResultContracts.PickVisualMedia.ImageOnly))
            }) {
                Text("Pick Image")
            }

            Button(
                onClick = {
                    val uri = selectedImageUri ?: return@Button
                    scope.launch {
                        isProcessing = true
                        description = ""

                        val tempFile = saveUriToTempFile(context, uri)
                        if (tempFile != null) {
                            val vlmImage = VLMImage.fromFilePath(tempFile.absolutePath)
                            val options = VLMGenerationOptions(maxTokens = 300)

                            RunAnywhere.processImageStream(vlmImage, prompt, options)
                                .collect { token -> description += token }
                        } else {
                            description = "Error: Could not load image."
                        }

                        isProcessing = false
                    }
                },
                enabled = selectedImageUri != null && !isProcessing
                    && RunAnywhere.isVLMModelLoaded
            ) {
                Text(if (isProcessing) "Analyzing..." else "Describe")
            }

            if (isProcessing) {
                Button(onClick = { RunAnywhere.cancelVLMGeneration() }) {
                    Text("Cancel")
                }
            }
        }

        if (description.isNotEmpty()) {
            Text(
                text = description,
                modifier = Modifier
                    .fillMaxWidth()
                    .verticalScroll(rememberScrollState())
            )
        }
    }
}

private fun saveUriToTempFile(context: Context, uri: Uri): File? {
    return try {
        val inputStream = context.contentResolver.openInputStream(uri) ?: return null
        val tempFile = File.createTempFile("vlm_", ".jpg", context.cacheDir)
        // use {} closes both streams even if copyTo throws
        inputStream.use { input ->
            tempFile.outputStream().use { output -> input.copyTo(output) }
        }
        tempFile
    } catch (e: Exception) {
        null
    }
}
The saveUriToTempFile helper is essential when using Android’s photo picker. Content URIs from PickVisualMedia cannot be passed directly to VLMImage.fromFilePath() — you must first write the image to a temporary JPEG file on disk.

Batch Image Description

suspend fun describeImages(paths: List<String>): List<String> {
    val options = VLMGenerationOptions(maxTokens = 100)

    return paths.map { path ->
        var result = ""
        val image = VLMImage.fromFilePath(path)
        RunAnywhere.processImageStream(image, "Describe briefly.", options)
            .collect { token -> result += token }
        result
    }
}

Error Handling

try {
    RunAnywhere.processImageStream(vlmImage, prompt, options)
        .collect { token -> description += token }
} catch (e: Exception) {
    when {
        e.message?.contains("model not loaded") == true -> {
            // Load the model first
            RunAnywhere.loadVLMModel("smolvlm-256m-instruct")
        }
        e.message?.contains("file not found") == true -> {
            // Image path is invalid
            showError("Could not find image file")
        }
        else -> {
            showError("VLM error: ${e.message}")
        }
    }
}

Supported Models

| Model | Architecture | Size | Memory | Quality |
| --- | --- | --- | --- | --- |
| SmolVLM 256M Q8_0 | SmolVLM | ~300MB | 365MB | Good — ultra-compact, fastest |
| SmolVLM 500M Q8_0 | SmolVLM | ~500MB | 550MB | Good |
| Qwen2-VL 2B Q4_K_M | Qwen2VL | ~1.5GB | 1.8GB | Better |
| LLaVA 7B Q4_0 | LLaVA | ~4GB | 5GB | Best |

Performance Tips

  • Start with SmolVLM 256M — smallest multimodal model, fast and memory-efficient
  • Limit maxTokens — use 50-100 for quick descriptions, 300+ for detailed analysis
  • Cancel long generations — call cancelVLMGeneration() if the user navigates away
  • Unload when idle — VLM models consume significant memory; unload if the user switches to a different feature
  • Pre-copy images — save content URIs to temp files before starting VLM inference to avoid blocking on I/O during generation