Overview
The VLM (Vision Language Model) module enables on-device multimodal inference — feed an image and a text prompt to get descriptions, answers, or analysis. VLM runs entirely on the device using llama.cpp with no data leaving the phone.
Package Imports
```kotlin
import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.VLM.VLMGenerationOptions
import com.runanywhere.sdk.public.extensions.VLM.VLMImage
import com.runanywhere.sdk.public.extensions.cancelVLMGeneration
import com.runanywhere.sdk.public.extensions.processImageStream
```
Model registration types come from the core SDK:
```kotlin
import com.runanywhere.sdk.public.ModelFileDescriptor
import com.runanywhere.sdk.public.InferenceFramework
import com.runanywhere.sdk.public.ModelCategory
```
Basic Usage
```kotlin
val image = VLMImage.fromFilePath("/data/user/0/com.example/cache/photo.jpg")
val options = VLMGenerationOptions(maxTokens = 300)

var description = ""
RunAnywhere.processImageStream(image, "Describe this image.", options)
    .collect { token ->
        description += token
        updateUI(description)
    }
```
processImageStream returns a Flow<String> where each emission is a single token, enabling real-time streaming display.
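The token-at-a-time contract can be exercised without the SDK. The sketch below collects a simulated Flow<String> exactly the way you would collect processImageStream; the token values and the fakeTokenStream helper are invented for illustration:

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flowOf
import kotlinx.coroutines.runBlocking

// Simulated token stream standing in for processImageStream's Flow<String>.
fun fakeTokenStream(): Flow<String> =
    flowOf("A ", "red ", "bicycle ", "leaning ", "against ", "a ", "wall.")

fun main() = runBlocking {
    val sb = StringBuilder()
    fakeTokenStream().collect { token ->
        sb.append(token)            // accumulate tokens as they arrive
        // updateUI(sb.toString())  // in a real app, refresh the UI per token
    }
    println(sb.toString())  // → "A red bicycle leaning against a wall."
}
```

Because each emission is a partial string rather than the full response so far, appending into a single buffer (here a StringBuilder) is the natural collection pattern.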
Model Setup
Register a VLM Model
VLM models require two files: the main model GGUF and a multimodal projector (mmproj) GGUF. Register them using registerMultiFileModel:
```kotlin
RunAnywhere.registerMultiFileModel(
    id = "smolvlm-256m-instruct",
    name = "SmolVLM 256M Instruct",
    files = listOf(
        ModelFileDescriptor(
            url = "https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF/resolve/main/SmolVLM-256M-Instruct-Q8_0.gguf",
            filename = "SmolVLM-256M-Instruct-Q8_0.gguf"
        ),
        ModelFileDescriptor(
            url = "https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-256M-Instruct-f16.gguf",
            filename = "mmproj-SmolVLM-256M-Instruct-f16.gguf"
        ),
    ),
    framework = InferenceFramework.LLAMA_CPP,
    modality = ModelCategory.MULTIMODAL,
    memoryRequirement = 365_000_000L
)
```
Load and Unload
```kotlin
RunAnywhere.loadVLMModel("smolvlm-256m-instruct")

// Check model state
if (RunAnywhere.isVLMModelLoaded) {
    // Ready for inference
}

// Unload when done
try {
    RunAnywhere.unloadVLMModel()
} catch (e: Exception) {
    // Safe to ignore — model may already be unloaded
}
```
unloadVLMModel() can throw if the model is not currently loaded. Always wrap it in a try/catch block.
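Since unloadVLMModel() throws when nothing is loaded, a small idempotent wrapper makes teardown safe to call from any lifecycle callback. This is an illustrative helper built on the SDK calls shown above, not part of the SDK itself (the function names ensureVLMLoaded and unloadVLMQuietly are invented):

```kotlin
// Illustrative helpers: load only if needed, unload without throwing.
suspend fun ensureVLMLoaded(modelId: String) {
    if (!RunAnywhere.isVLMModelLoaded) {
        RunAnywhere.loadVLMModel(modelId)
    }
}

suspend fun unloadVLMQuietly() {
    if (!RunAnywhere.isVLMModelLoaded) return
    try {
        RunAnywhere.unloadVLMModel()
    } catch (e: Exception) {
        // Already unloaded, or unloading raced with another caller; safe to ignore.
    }
}
```

Calling unloadVLMQuietly() from onStop or a ViewModel's onCleared then needs no per-call-site try/catch.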
API Reference
VLMImage
| Factory Method | Parameters | Description |
|---|---|---|
| VLMImage.fromFilePath(path) | path: String | Creates a VLM image from an on-disk file path |

VLMImage.fromFilePath() requires a file system path, not a content URI. If you're using Android's photo picker or ACTION_OPEN_DOCUMENT, you must first copy the image to a temporary file. See the image picker example below.
VLMGenerationOptions
```kotlin
data class VLMGenerationOptions(
    val maxTokens: Int = 512
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| maxTokens | Int | 512 | Maximum number of tokens to generate |
processImageStream
```kotlin
fun RunAnywhere.processImageStream(
    image: VLMImage,
    prompt: String,
    options: VLMGenerationOptions
): Flow<String>
```
Returns a Flow<String> that emits one token at a time. Collect the flow to build up the full response incrementally.
| Parameter | Type | Description |
|---|---|---|
| image | VLMImage | Image to analyze |
| prompt | String | Text prompt for the model |
| options | VLMGenerationOptions | Generation configuration |
cancelVLMGeneration
fun RunAnywhere.cancelVLMGeneration()
Cancels the current VLM generation. Safe to call even if no generation is in progress.
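Because collection of the flow and the native generation are separate concerns, a common pattern is to keep a handle to the collecting coroutine and cancel both together when the user leaves the screen. A sketch under that assumption (startDescribing and stopDescribing are invented names, not SDK API):

```kotlin
// Illustrative pattern: pair cancelVLMGeneration() with coroutine cancellation.
var generationJob: Job? = null

fun startDescribing(scope: CoroutineScope, image: VLMImage, prompt: String) {
    generationJob = scope.launch {
        RunAnywhere.processImageStream(image, prompt, VLMGenerationOptions(maxTokens = 300))
            .collect { token -> /* append token to UI state */ }
    }
}

fun stopDescribing() {
    RunAnywhere.cancelVLMGeneration()  // stop the native generation
    generationJob?.cancel()            // stop collecting the flow
    generationJob = null
}
```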
Model State
| Property / Method | Type | Description |
|---|---|---|
| RunAnywhere.isVLMModelLoaded | Boolean (property) | Whether a VLM model is currently loaded |
| RunAnywhere.loadVLMModel(id) | suspend | Loads a registered VLM model by ID |
| RunAnywhere.unloadVLMModel() | suspend | Unloads the current VLM model (can throw) |
isVLMModelLoaded is a property, not a function. Unlike other model checks in the SDK (e.g., isLLMModelLoaded()), you access it directly without calling it:
```kotlin
// Correct
if (RunAnywhere.isVLMModelLoaded) { ... }

// Incorrect — will not compile
if (RunAnywhere.isVLMModelLoaded()) { ... }
```
Model Registration
```kotlin
fun RunAnywhere.registerMultiFileModel(
    id: String,
    name: String,
    files: List<ModelFileDescriptor>,
    framework: InferenceFramework,
    modality: ModelCategory,
    memoryRequirement: Long
)
```
| Parameter | Type | Description |
|---|---|---|
| id | String | Unique model identifier |
| name | String | Human-readable model name |
| files | List<ModelFileDescriptor> | Model GGUF + mmproj GGUF (order matters) |
| framework | InferenceFramework | Must be InferenceFramework.LLAMA_CPP for VLM |
| modality | ModelCategory | Must be ModelCategory.MULTIMODAL for VLM |
| memoryRequirement | Long | Estimated memory usage in bytes |
Examples
Jetpack Compose with Image Picker
A complete example showing image selection, VLM processing, and streaming token display:
```kotlin
@Composable
fun VLMScreen() {
    val context = LocalContext.current
    var selectedImageUri by remember { mutableStateOf<Uri?>(null) }
    var description by remember { mutableStateOf("") }
    var isProcessing by remember { mutableStateOf(false) }
    var prompt by remember { mutableStateOf("Describe this image in detail.") }
    val scope = rememberCoroutineScope()

    val imagePicker = rememberLauncherForActivityResult(
        contract = ActivityResultContracts.PickVisualMedia()
    ) { uri -> selectedImageUri = uri }

    Column(
        modifier = Modifier
            .fillMaxSize()
            .padding(16.dp),
        verticalArrangement = Arrangement.spacedBy(12.dp)
    ) {
        selectedImageUri?.let { uri ->
            AsyncImage(
                model = uri,
                contentDescription = "Selected image",
                modifier = Modifier
                    .fillMaxWidth()
                    .height(250.dp)
                    .clip(RoundedCornerShape(12.dp)),
                contentScale = ContentScale.Crop
            )
        }

        OutlinedTextField(
            value = prompt,
            onValueChange = { prompt = it },
            label = { Text("Prompt") },
            modifier = Modifier.fillMaxWidth()
        )

        Row(horizontalArrangement = Arrangement.spacedBy(8.dp)) {
            Button(onClick = {
                imagePicker.launch(
                    PickVisualMediaRequest(ActivityResultContracts.PickVisualMedia.ImageOnly)
                )
            }) {
                Text("Pick Image")
            }

            Button(
                onClick = {
                    val uri = selectedImageUri ?: return@Button
                    scope.launch {
                        isProcessing = true
                        description = ""
                        val tempFile = saveUriToTempFile(context, uri)
                        if (tempFile != null) {
                            val vlmImage = VLMImage.fromFilePath(tempFile.absolutePath)
                            val options = VLMGenerationOptions(maxTokens = 300)
                            RunAnywhere.processImageStream(vlmImage, prompt, options)
                                .collect { token -> description += token }
                        } else {
                            description = "Error: Could not load image."
                        }
                        isProcessing = false
                    }
                },
                enabled = selectedImageUri != null && !isProcessing &&
                    RunAnywhere.isVLMModelLoaded
            ) {
                Text(if (isProcessing) "Analyzing..." else "Describe")
            }

            if (isProcessing) {
                Button(onClick = { RunAnywhere.cancelVLMGeneration() }) {
                    Text("Cancel")
                }
            }
        }

        if (description.isNotEmpty()) {
            Text(
                text = description,
                modifier = Modifier
                    .fillMaxWidth()
                    .verticalScroll(rememberScrollState())
            )
        }
    }
}
```
```kotlin
private fun saveUriToTempFile(context: Context, uri: Uri): File? {
    return try {
        val tempFile = File.createTempFile("vlm_", ".jpg", context.cacheDir)
        // use {} closes both streams even if copying throws.
        context.contentResolver.openInputStream(uri)?.use { input ->
            tempFile.outputStream().use { output -> input.copyTo(output) }
        } ?: return null
        tempFile
    } catch (e: Exception) {
        null
    }
}
```
The saveUriToTempFile helper is essential when using Android's photo picker. Content URIs from PickVisualMedia cannot be passed directly to VLMImage.fromFilePath() — you must first write the image to a temporary JPEG file on disk.
Batch Image Description
```kotlin
suspend fun describeImages(paths: List<String>): List<String> {
    val options = VLMGenerationOptions(maxTokens = 100)
    return paths.map { path ->
        var result = ""
        val image = VLMImage.fromFilePath(path)
        RunAnywhere.processImageStream(image, "Describe briefly.", options)
            .collect { token -> result += token }
        result
    }
}
```
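If any image in the batch may be missing or unreadable, you can isolate failures per item so the rest of the batch still completes. A sketch building on the same SDK calls (describeImagesSafely is an invented name):

```kotlin
suspend fun describeImagesSafely(paths: List<String>): List<Result<String>> {
    val options = VLMGenerationOptions(maxTokens = 100)
    return paths.map { path ->
        // runCatching captures a per-image failure instead of aborting the batch.
        runCatching {
            val sb = StringBuilder()
            val image = VLMImage.fromFilePath(path)
            RunAnywhere.processImageStream(image, "Describe briefly.", options)
                .collect { token -> sb.append(token) }
            sb.toString()
        }
    }
}
```

Callers can then fold the Result list into successes and failures without losing track of which path failed.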
Error Handling
```kotlin
try {
    RunAnywhere.processImageStream(vlmImage, prompt, options)
        .collect { token -> description += token }
} catch (e: Exception) {
    when {
        e.message?.contains("model not loaded") == true -> {
            // Load the model first
            RunAnywhere.loadVLMModel("smolvlm-256m-instruct")
        }
        e.message?.contains("file not found") == true -> {
            // Image path is invalid
            showError("Could not find image file")
        }
        else -> {
            showError("VLM error: ${e.message}")
        }
    }
}
```
Supported Models
| Model | Architecture | Size | Memory | Quality |
|---|---|---|---|---|
| SmolVLM 256M Q8_0 | SmolVLM | ~300MB | 365MB | Good — ultra-compact, fastest |
| SmolVLM 500M Q8_0 | SmolVLM | ~500MB | 550MB | Good |
| Qwen2-VL 2B Q4_K_M | Qwen2VL | ~1.5GB | 1.8GB | Better |
| LLaVA 7B Q4_0 | LLaVA | ~4GB | 5GB | Best |
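Which row of the table fits a given device can be decided mechanically from the Memory column. The sketch below hard-codes those figures (the model ids and the VlmModelChoice type are invented for illustration) and picks the largest model that fits a memory budget:

```kotlin
// Hypothetical (id, requiredBytes) pairs mirroring the table above.
data class VlmModelChoice(val id: String, val requiredBytes: Long)

val knownModels = listOf(
    VlmModelChoice("smolvlm-256m", 365_000_000L),
    VlmModelChoice("smolvlm-500m", 550_000_000L),
    VlmModelChoice("qwen2-vl-2b", 1_800_000_000L),
    VlmModelChoice("llava-7b", 5_000_000_000L),
)

// Largest model whose requirement fits within the budget, or null if none fits.
fun pickLargestFitting(budgetBytes: Long): VlmModelChoice? =
    knownModels.filter { it.requiredBytes <= budgetBytes }
        .maxByOrNull { it.requiredBytes }

fun main() {
    println(pickLargestFitting(600_000_000L)?.id)    // → smolvlm-500m
    println(pickLargestFitting(2_000_000_000L)?.id)  // → qwen2-vl-2b
    println(pickLargestFitting(100_000_000L)?.id)    // → null
}
```

On Android, a reasonable budget input is a fraction of ActivityManager.MemoryInfo.availMem rather than total device RAM.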
Best Practices
- Start with SmolVLM 256M — smallest multimodal model, fast and memory-efficient
- Limit maxTokens — use 50-100 for quick descriptions, 300+ for detailed analysis
- Cancel long generations — call cancelVLMGeneration() if the user navigates away
- Unload when idle — VLM models consume significant memory; unload if the user switches to a different feature
- Pre-copy images — save content URIs to temp files before starting VLM inference to avoid blocking on I/O during generation