
Overview

The RAG (Retrieval-Augmented Generation) module enables fully on-device document Q&A. Ingest documents, generate embeddings, perform vector similarity search, and generate grounded answers — all without any data leaving the device.

Dependencies

Add the RAG module to your build:
dependencies {
    implementation("com.runanywhere.sdk:runanywhere-kotlin:0.1.5")
    implementation("com.runanywhere.sdk:runanywhere-core-rag:0.1.5")
}

Package Imports

import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.RAG.RAGConfiguration
import com.runanywhere.sdk.public.extensions.RAG.RAGQueryOptions
import com.runanywhere.sdk.public.extensions.RAG.RAGResult
import com.runanywhere.sdk.public.extensions.RAG.RAGSearchResult
import com.runanywhere.sdk.public.extensions.ragCreatePipeline
import com.runanywhere.sdk.public.extensions.ragDestroyPipeline
import com.runanywhere.sdk.public.extensions.ragIngest
import com.runanywhere.sdk.public.extensions.ragQuery
import com.runanywhere.sdk.public.extensions.ragClearDocuments
import com.runanywhere.sdk.public.extensions.ragDocumentCount

Basic Usage

// 1. Create a RAG pipeline
RunAnywhere.ragCreatePipeline(RAGConfiguration(
    embeddingModelPath = "/path/to/all-MiniLM-L6-v2.onnx",
    llmModelPath = "/path/to/qwen-0.5b.gguf",
    embeddingDimension = 384,
    topK = 3
))

// 2. Ingest documents
RunAnywhere.ragIngest("RunAnywhere is a privacy-first on-device AI platform.")
RunAnywhere.ragIngest("The SDK supports LLM, STT, TTS, VAD, VLM, and RAG.")
RunAnywhere.ragIngest("LoRA adapters can be hot-swapped at runtime.")

// 3. Query (options is optional — pass it only when you need non-default settings)
val result = RunAnywhere.ragQuery(
    question = "What AI features does RunAnywhere support?"
)

println("Answer: ${result.answer}")
println("Sources: ${result.retrievedChunks.size} chunks")

// 4. Cleanup
RunAnywhere.ragDestroyPipeline()

Pipeline Lifecycle

Create Pipeline

suspend fun RunAnywhere.ragCreatePipeline(config: RAGConfiguration)
Creates a RAG pipeline with an embedding model and LLM. The pipeline loads the embedding model immediately and initializes the vector index.

Destroy Pipeline

suspend fun RunAnywhere.ragDestroyPipeline()
Releases all resources (models, vector index, document store).
Always call ragDestroyPipeline() when you're done. A RAG pipeline holds both an embedding model and an LLM (plus the vector index), which can consume significant memory.

Document Ingestion

Ingest Text

suspend fun RunAnywhere.ragIngest(text: String, metadataJson: String? = null)
Ingests a text document into the pipeline. The text is automatically chunked, embedded, and indexed.
// Simple ingestion
RunAnywhere.ragIngest("Your document text here...")

// With metadata for filtering/attribution
RunAnywhere.ragIngest(
    text = "Patient records indicate...",
    metadataJson = """{"source": "medical-docs", "page": 42, "date": "2025-01-15"}"""
)
| Parameter | Type | Description |
|---|---|---|
| text | String | Document text to ingest |
| metadataJson | String? | Optional JSON metadata attached to chunks |
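Because metadataJson must be valid JSON, values taken from user input should be escaped rather than interpolated directly into the template string. A minimal illustrative sketch (the escapeJson and metadataFor helpers are not part of the SDK; in production prefer a real JSON library such as kotlinx.serialization):

```kotlin
// Illustrative helper: escape a string for safe embedding in a JSON value.
// Not part of the SDK; shown only to make the metadata example robust.
fun escapeJson(value: String): String =
    value.replace("\\", "\\\\").replace("\"", "\\\"")

// Build a metadata JSON string from untrusted inputs.
fun metadataFor(source: String, page: Int): String =
    """{"source": "${escapeJson(source)}", "page": $page}"""

fun main() {
    // Quotes in the source name survive as escaped JSON.
    println(metadataFor("medical \"docs\"", 42))
}
```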

Clear Documents

suspend fun RunAnywhere.ragClearDocuments()
Removes all indexed documents from the pipeline.

Document Count

val RunAnywhere.ragDocumentCount: Int
Returns the number of indexed documents (synchronous property).

Querying

ragQuery

suspend fun RunAnywhere.ragQuery(
    question: String,
    options: RAGQueryOptions? = null
): RAGResult
Queries the pipeline: embeds the question, retrieves relevant chunks via vector search, builds a context, and generates a grounded answer using the LLM.
val result = RunAnywhere.ragQuery(
    question = "How does the privacy architecture work?",
    options = RAGQueryOptions(
        question = "How does the privacy architecture work?",
        maxTokens = 512,
        temperature = 0.7f,
        topK = 40
    )
)

// Access the answer
println(result.answer)

// Inspect retrieved sources
result.retrievedChunks.forEach { chunk ->
    println("Score: ${chunk.similarityScore} | ${chunk.text.take(100)}...")
}

// Check timing
println("Retrieval: ${result.retrievalTimeMs}ms")
println("Generation: ${result.generationTimeMs}ms")
println("Total: ${result.totalTimeMs}ms")
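Conceptually, the retrieval step ranks candidate chunks by similarity score, drops those below similarityThreshold, and keeps the top K. A simplified sketch of that selection logic (ScoredChunk and selectTopK are illustrative names, not SDK types):

```kotlin
// Illustrative sketch of retrieval selection: filter by threshold,
// rank by score descending, keep the top K chunks.
data class ScoredChunk(val text: String, val score: Float)

fun selectTopK(
    candidates: List<ScoredChunk>,
    topK: Int,
    threshold: Float
): List<ScoredChunk> =
    candidates
        .filter { it.score >= threshold }
        .sortedByDescending { it.score }
        .take(topK)

fun main() {
    val chunks = listOf(
        ScoredChunk("a", 0.9f),
        ScoredChunk("b", 0.1f),  // below the 0.15 default threshold
        ScoredChunk("c", 0.6f)
    )
    println(selectTopK(chunks, topK = 2, threshold = 0.15f).map { it.text })
}
```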

API Reference

RAGConfiguration

data class RAGConfiguration(
    val embeddingModelPath: String,
    val llmModelPath: String,
    val embeddingDimension: Int = 384,
    val topK: Int = 3,
    val similarityThreshold: Float = 0.15f,
    val maxContextTokens: Int = 2048,
    val chunkSize: Int = 512,
    val chunkOverlap: Int = 50,
    val promptTemplate: String? = null,
    val embeddingConfigJson: String? = null,
    val llmConfigJson: String? = null
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| embeddingModelPath | String | (required) | Path to ONNX embedding model |
| llmModelPath | String | (required) | Path to GGUF LLM model |
| embeddingDimension | Int | 384 | Embedding vector dimension |
| topK | Int | 3 | Number of chunks to retrieve |
| similarityThreshold | Float | 0.15 | Minimum cosine similarity (0.0-1.0) |
| maxContextTokens | Int | 2048 | Max context tokens for LLM |
| chunkSize | Int | 512 | Tokens per chunk |
| chunkOverlap | Int | 50 | Overlap tokens between chunks |
| promptTemplate | String? | null | Custom prompt with {context} and {query} placeholders |
| embeddingConfigJson | String? | null | Optional embedding model config |
| llmConfigJson | String? | null | Optional LLM config |
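chunkSize and chunkOverlap control how ingested text is split: each chunk starts chunkSize − chunkOverlap units after the previous one, so adjacent chunks share an overlapping region. The SDK chunks by tokens; the character-based sketch below only illustrates the overlap arithmetic:

```kotlin
// Illustrative sliding-window chunker over characters (the SDK works on
// tokens). Each chunk starts (chunkSize - chunkOverlap) characters after
// the previous one, so neighbors share chunkOverlap characters.
fun chunk(text: String, chunkSize: Int, chunkOverlap: Int): List<String> {
    require(chunkOverlap < chunkSize) { "overlap must be smaller than chunk size" }
    val step = chunkSize - chunkOverlap
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        chunks += text.substring(start, minOf(start + chunkSize, text.length))
        if (start + chunkSize >= text.length) break
        start += step
    }
    return chunks
}

fun main() {
    // 10-char chunks with 2-char overlap over a 20-char string.
    println(chunk("abcdefghijklmnopqrst", chunkSize = 10, chunkOverlap = 2))
}
```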

RAGQueryOptions

data class RAGQueryOptions(
    val question: String,
    val systemPrompt: String? = null,
    val maxTokens: Int = 512,
    val temperature: Float = 0.7f,
    val topP: Float = 0.9f,
    val topK: Int = 40
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| question | String | (required) | The question to answer |
| systemPrompt | String? | null | Override system prompt |
| maxTokens | Int | 512 | Max tokens to generate |
| temperature | Float | 0.7 | Sampling temperature |
| topP | Float | 0.9 | Nucleus sampling |
| topK | Int | 40 | Top-k sampling |

RAGResult

data class RAGResult(
    val answer: String,
    val retrievedChunks: List<RAGSearchResult>,
    val contextUsed: String? = null,
    val retrievalTimeMs: Double,
    val generationTimeMs: Double,
    val totalTimeMs: Double
)

RAGSearchResult

data class RAGSearchResult(
    val chunkId: String,
    val text: String,
    val similarityScore: Float,
    val metadataJson: String? = null
)
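similarityScore is a cosine similarity between the query embedding and the chunk embedding (see similarityThreshold in RAGConfiguration). As a refresher, cosine similarity is the dot product of two vectors divided by the product of their norms; a minimal sketch:

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors: dot product divided by
// the product of the vector norms. Values near 1.0 mean the vectors point
// in nearly the same direction; orthogonal vectors score 0.0.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "dimension mismatch" }
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

fun main() {
    val q = floatArrayOf(1f, 0f)
    println(cosineSimilarity(q, floatArrayOf(1f, 0f)))  // identical direction
    println(cosineSimilarity(q, floatArrayOf(0f, 1f)))  // orthogonal
}
```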

Examples

Document Q&A with Sources

data class Source(val text: String, val score: Float, val metadata: String?)

class DocumentQAViewModel : ViewModel() {
    private val _docCount = MutableStateFlow(0)
    private val _answer = MutableStateFlow("")
    private val _sources = MutableStateFlow<List<Source>>(emptyList())
    private val _timing = MutableStateFlow("")
    private var isPipelineReady = false

    fun setupPipeline(embeddingPath: String, llmPath: String) {
        viewModelScope.launch {
            RunAnywhere.ragCreatePipeline(RAGConfiguration(
                embeddingModelPath = embeddingPath,
                llmModelPath = llmPath,
                topK = 5,
                chunkSize = 256,
                chunkOverlap = 30
            ))
            isPipelineReady = true
        }
    }

    fun ingestDocument(text: String, source: String) {
        viewModelScope.launch {
            RunAnywhere.ragIngest(
                text = text,
                metadataJson = """{"source": "$source"}"""
            )
            _docCount.value = RunAnywhere.ragDocumentCount
        }
    }

    fun askQuestion(question: String) {
        viewModelScope.launch {
            val result = RunAnywhere.ragQuery(question)

            _answer.value = result.answer
            _sources.value = result.retrievedChunks.map { chunk ->
                Source(
                    text = chunk.text.take(200),
                    score = chunk.similarityScore,
                    metadata = chunk.metadataJson
                )
            }
            _timing.value = "Retrieved in ${result.retrievalTimeMs.toInt()}ms, " +
                "generated in ${result.generationTimeMs.toInt()}ms"
        }
    }

    override fun onCleared() {
        super.onCleared()
        // viewModelScope is already cancelled by the time onCleared() runs,
        // so launch cleanup in an independent scope.
        if (isPipelineReady) {
            CoroutineScope(Dispatchers.Default).launch {
                RunAnywhere.ragDestroyPipeline()
            }
        }
    }
}

Custom Prompt Template

RunAnywhere.ragCreatePipeline(RAGConfiguration(
    embeddingModelPath = embeddingPath,
    llmModelPath = llmPath,
    promptTemplate = """You are a helpful assistant. Answer the question based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {query}

Answer:"""
))
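At query time the {context} placeholder is filled with the retrieved chunks and {query} with the user's question. A sketch of that substitution (illustrative only; renderPrompt is not SDK code, and the SDK may assemble the prompt differently):

```kotlin
// Illustrative rendering of a RAG prompt template: retrieved chunks are
// joined into {context}, and the user's question fills {query}.
fun renderPrompt(template: String, chunks: List<String>, query: String): String =
    template
        .replace("{context}", chunks.joinToString("\n\n"))
        .replace("{query}", query)

fun main() {
    val template = "Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    println(renderPrompt(template, listOf("Chunk A", "Chunk B"), "What is X?"))
}
```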

Error Handling

try {
    val result = RunAnywhere.ragQuery("What is RunAnywhere?")
    println(result.answer)
} catch (e: Exception) {
    when {
        e.message?.contains("pipeline") == true ->
            println("Create a pipeline first with ragCreatePipeline()")
        e.message?.contains("no documents") == true ->
            println("Ingest documents before querying")
        else ->
            println("RAG error: ${e.message}")
    }
}

Performance Tips

  • Choose the right embedding model — all-MiniLM-L6-v2 (384 dim) is a good default balance of speed and quality
  • Tune chunk size — smaller chunks (256) for precise retrieval, larger (512+) for broader context
  • Use chunk overlap — 10-20% overlap prevents losing context at chunk boundaries
  • Adjust similarity threshold — lower threshold (0.1) for more results, higher (0.5) for stricter relevance
  • Keep topK at 3-5 — too many chunks dilute context quality
  • Destroy the pipeline when done — RAG pipelines hold both an embedding model and a vector index in memory

Related

  • LLM Generation — text generation API
  • LoRA Adapters — fine-tune model behavior
  • Model Management — register and download models
  • Best Practices — performance optimization