The VLM (Vision Language Model) module enables on-device multimodal inference — feed an image and a text prompt to get descriptions, answers, or analysis. VLM runs entirely on the device using llama.cpp with no data leaving the phone.
```kotlin
RunAnywhere.loadVLMModel("smolvlm-256m-instruct")

// Check model state
if (RunAnywhere.isVLMModelLoaded) {
    // Ready for inference
}

// Unload when done
try {
    RunAnywhere.unloadVLMModel()
} catch (e: Exception) {
    // Safe to ignore — model may already be unloaded
}
```
unloadVLMModel() can throw if the model is not currently loaded. Always wrap it in a try/catch
block.
VLMImage.fromFilePath() requires a file system path, not a content URI. If you’re using
Android’s photo picker or ACTION_OPEN_DOCUMENT, you must first copy the image to a temporary
file. See the image picker example below.
isVLMModelLoaded is a property, not a suspend function. Unlike other model checks in the SDK (e.g., isLLMModelLoaded()), you access it directly without calling it.
```kotlin
// Correct
if (RunAnywhere.isVLMModelLoaded) { ... }

// Incorrect — will not compile
if (RunAnywhere.isVLMModelLoaded()) { ... }
```
A complete example showing image selection, VLM processing, and streaming token display:
```kotlin
@Composable
fun VLMScreen() {
    val context = LocalContext.current
    var selectedImageUri by remember { mutableStateOf<Uri?>(null) }
    var description by remember { mutableStateOf("") }
    var isProcessing by remember { mutableStateOf(false) }
    var prompt by remember { mutableStateOf("Describe this image in detail.") }
    val scope = rememberCoroutineScope()

    val imagePicker = rememberLauncherForActivityResult(
        contract = ActivityResultContracts.PickVisualMedia()
    ) { uri ->
        selectedImageUri = uri
    }

    Column(
        modifier = Modifier
            .fillMaxSize()
            .padding(16.dp),
        verticalArrangement = Arrangement.spacedBy(12.dp)
    ) {
        selectedImageUri?.let { uri ->
            AsyncImage(
                model = uri,
                contentDescription = "Selected image",
                modifier = Modifier
                    .fillMaxWidth()
                    .height(250.dp)
                    .clip(RoundedCornerShape(12.dp)),
                contentScale = ContentScale.Crop
            )
        }

        OutlinedTextField(
            value = prompt,
            onValueChange = { prompt = it },
            label = { Text("Prompt") },
            modifier = Modifier.fillMaxWidth()
        )

        Row(horizontalArrangement = Arrangement.spacedBy(8.dp)) {
            Button(onClick = {
                imagePicker.launch(
                    PickVisualMediaRequest(ActivityResultContracts.PickVisualMedia.ImageOnly)
                )
            }) {
                Text("Pick Image")
            }

            Button(
                onClick = {
                    val uri = selectedImageUri ?: return@Button
                    scope.launch {
                        isProcessing = true
                        description = ""
                        val tempFile = saveUriToTempFile(context, uri)
                        if (tempFile != null) {
                            val vlmImage = VLMImage.fromFilePath(tempFile.absolutePath)
                            val options = VLMGenerationOptions(maxTokens = 300)
                            RunAnywhere.processImageStream(vlmImage, prompt, options)
                                .collect { token -> description += token }
                        } else {
                            description = "Error: Could not load image."
                        }
                        isProcessing = false
                    }
                },
                enabled = selectedImageUri != null && !isProcessing && RunAnywhere.isVLMModelLoaded
            ) {
                Text(if (isProcessing) "Analyzing..." else "Describe")
            }

            if (isProcessing) {
                Button(onClick = { RunAnywhere.cancelVLMGeneration() }) {
                    Text("Cancel")
                }
            }
        }

        if (description.isNotEmpty()) {
            Text(
                text = description,
                modifier = Modifier
                    .fillMaxWidth()
                    .verticalScroll(rememberScrollState())
            )
        }
    }
}

private fun saveUriToTempFile(context: Context, uri: Uri): File? {
    return try {
        val inputStream = context.contentResolver.openInputStream(uri) ?: return null
        val tempFile = File.createTempFile("vlm_", ".jpg", context.cacheDir)
        tempFile.outputStream().use { output -> inputStream.copyTo(output) }
        inputStream.close()
        tempFile
    } catch (e: Exception) {
        null
    }
}
```
The saveUriToTempFile helper is essential when using Android’s photo picker. Content URIs from
PickVisualMedia cannot be passed directly to VLMImage.fromFilePath() — you must first write
the image to a temporary JPEG file on disk.
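The heart of that helper is an ordinary stream-to-temp-file copy. A minimal plain-JVM sketch of just that step (without the Android `Context` dependency; on Android the stream would come from `contentResolver.openInputStream(uri)` and the directory from `context.cacheDir`):

```kotlin
import java.io.File
import java.io.InputStream

// Plain-JVM sketch of the copy step inside saveUriToTempFile:
// write any InputStream out to a temp JPEG file and return it.
fun copyStreamToTempFile(input: InputStream, cacheDir: File): File {
    val tempFile = File.createTempFile("vlm_", ".jpg", cacheDir)
    input.use { ins ->
        tempFile.outputStream().use { out -> ins.copyTo(out) }
    }
    return tempFile
}
```

Because this version takes a plain `InputStream`, it is also straightforward to unit-test off-device with an in-memory stream.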
- Start with SmolVLM 256M — the smallest multimodal model, fast and memory-efficient
- Limit maxTokens — use 50–100 for quick descriptions, 300+ for detailed analysis
- Cancel long generations — call cancelVLMGeneration() if the user navigates away
- Unload when idle — VLM models consume significant memory; unload if the user switches to a different feature
- Pre-copy images — save content URIs to temp files before starting VLM inference to avoid blocking on I/O during generation
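The "unload when idle" tip can be implemented with a small SDK-independent policy object. This is an illustrative sketch (the `IdleUnloadPolicy` name and threshold are not part of the SDK); the caller would invoke `RunAnywhere.unloadVLMModel()` when `shouldUnload()` returns true:

```kotlin
// Hypothetical idle tracker: record each VLM use, and unload the model
// once no use has occurred for idleMillis. Timestamps are passed in
// explicitly so the policy is trivial to unit-test.
class IdleUnloadPolicy(private val idleMillis: Long) {
    private var lastUsedAt: Long = 0

    // Call whenever VLM inference starts or completes.
    fun recordUse(nowMillis: Long) { lastUsedAt = nowMillis }

    // True once the model has been idle long enough to be worth unloading.
    fun shouldUnload(nowMillis: Long): Boolean =
        lastUsedAt != 0L && nowMillis - lastUsedAt >= idleMillis
}
```

In practice the check would run on a timer or when the user navigates away from the VLM feature, passing `System.currentTimeMillis()` for the timestamps.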