The RAG (Retrieval-Augmented Generation) module enables fully on-device document Q&A. Ingest documents, generate embeddings, perform vector similarity search, and generate grounded answers — all without any data leaving the device.
```kotlin
// 1. Create a RAG pipeline
RunAnywhere.ragCreatePipeline(RAGConfiguration(
    embeddingModelPath = "/path/to/all-MiniLM-L6-v2.onnx",
    llmModelPath = "/path/to/qwen-0.5b.gguf",
    embeddingDimension = 384,
    topK = 3
))

// 2. Ingest documents
RunAnywhere.ragIngest("RunAnywhere is a privacy-first on-device AI platform.")
RunAnywhere.ragIngest("The SDK supports LLM, STT, TTS, VAD, VLM, and RAG.")
RunAnywhere.ragIngest("LoRA adapters can be hot-swapped at runtime.")

// 3. Query (options is optional — pass it only when you need non-default settings)
val result = RunAnywhere.ragQuery(
    question = "What AI features does RunAnywhere support?"
)
println("Answer: ${result.answer}")
println("Sources: ${result.retrievedChunks.size} chunks")

// 4. Cleanup
RunAnywhere.ragDestroyPipeline()
```
Releases all resources (models, vector index, document store).
Always call ragDestroyPipeline() when you're done to free memory. A RAG pipeline loads an embedding model and also holds a reference to an LLM, which together can consume significant memory.
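One way to guarantee cleanup is to wrap the pipeline's lifetime in a try/finally. This sketch uses the calls shown above; the model paths and the helper function name are placeholders, not part of the SDK:

```kotlin
// Sketch: tie pipeline lifetime to a scope so resources are always
// released, even if ingestion or a query throws.
suspend fun answerFromDocs(docs: List<String>, question: String): String {
    RunAnywhere.ragCreatePipeline(RAGConfiguration(
        embeddingModelPath = "/path/to/all-MiniLM-L6-v2.onnx",
        llmModelPath = "/path/to/qwen-0.5b.gguf"
    ))
    return try {
        docs.forEach { RunAnywhere.ragIngest(it) }
        RunAnywhere.ragQuery(question).answer
    } finally {
        RunAnywhere.ragDestroyPipeline()  // always free models and index
    }
}
```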
```kotlin
suspend fun RunAnywhere.ragIngest(text: String, metadataJson: String? = null)
```
Ingests a text document into the pipeline. The text is automatically chunked, embedded, and indexed.
```kotlin
// Simple ingestion
RunAnywhere.ragIngest("Your document text here...")

// With metadata for filtering/attribution
RunAnywhere.ragIngest(
    text = "Patient records indicate...",
    metadataJson = """{"source": "medical-docs", "page": 42, "date": "2025-01-15"}"""
)
```
```kotlin
suspend fun RunAnywhere.ragQuery(
    question: String,
    options: RAGQueryOptions? = null
): RAGResult
```
Queries the pipeline: embeds the question, retrieves relevant chunks via vector search, builds a context, and generates a grounded answer using the LLM.
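The retrieval step can be pictured as scoring every stored chunk embedding against the question embedding and keeping the best matches. This is an illustrative sketch of that idea in plain Kotlin, not the SDK's internal code; `retrieve`, `cosine`, and the defaults mirror the configuration names but are assumptions:

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors of equal length.
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (sqrt(na) * sqrt(nb))
}

// Score all chunks, drop those below the similarity threshold,
// and return the topK highest-scoring (text, score) pairs.
fun retrieve(
    query: FloatArray,
    chunks: List<Pair<String, FloatArray>>,  // (text, embedding)
    topK: Int = 3,
    threshold: Float = 0.15f
): List<Pair<String, Float>> =
    chunks.map { (text, emb) -> text to cosine(query, emb) }
        .filter { it.second >= threshold }
        .sortedByDescending { it.second }
        .take(topK)
```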
```kotlin
val result = RunAnywhere.ragQuery(
    question = "How does the privacy architecture work?",
    options = RAGQueryOptions(
        question = "How does the privacy architecture work?",
        maxTokens = 512,
        temperature = 0.7f,
        topK = 40
    )
)

// Access the answer
println(result.answer)

// Inspect retrieved sources
result.retrievedChunks.forEach { chunk ->
    println("Score: ${chunk.similarityScore} — ${chunk.text.take(100)}...")
}

// Check timing
println("Retrieval: ${result.retrievalTimeMs}ms")
println("Generation: ${result.generationTimeMs}ms")
println("Total: ${result.totalTimeMs}ms")
```
```kotlin
data class RAGConfiguration(
    val embeddingModelPath: String,
    val llmModelPath: String,
    val embeddingDimension: Int = 384,
    val topK: Int = 3,
    val similarityThreshold: Float = 0.15f,
    val maxContextTokens: Int = 2048,
    val chunkSize: Int = 512,
    val chunkOverlap: Int = 50,
    val promptTemplate: String? = null,
    val embeddingConfigJson: String? = null,
    val llmConfigJson: String? = null
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `embeddingModelPath` | String | — | Path to ONNX embedding model |
| `llmModelPath` | String | — | Path to GGUF LLM model |
| `embeddingDimension` | Int | 384 | Embedding vector dimension |
| `topK` | Int | 3 | Number of chunks to retrieve |
| `similarityThreshold` | Float | 0.15 | Minimum cosine similarity (0.0–1.0) |
| `maxContextTokens` | Int | 2048 | Max context tokens for LLM |
| `chunkSize` | Int | 512 | Tokens per chunk |
| `chunkOverlap` | Int | 50 | Overlap tokens between chunks |
| `promptTemplate` | String? | null | Custom prompt with {context} and {query} placeholders |
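How `chunkSize` and `chunkOverlap` interact can be shown with a sliding window that advances by `chunkSize - chunkOverlap` tokens, so consecutive chunks share `chunkOverlap` tokens. This is a plain-Kotlin sketch of the idea, not the SDK's chunker; the `chunk` function is hypothetical:

```kotlin
// Sketch: sliding-window chunking where consecutive chunks
// overlap by chunkOverlap tokens.
fun chunk(tokens: List<String>, chunkSize: Int = 512, chunkOverlap: Int = 50): List<List<String>> {
    require(chunkOverlap < chunkSize) { "overlap must be smaller than chunk size" }
    val step = chunkSize - chunkOverlap
    val chunks = mutableListOf<List<String>>()
    var start = 0
    while (start < tokens.size) {
        chunks.add(tokens.subList(start, minOf(start + chunkSize, tokens.size)))
        if (start + chunkSize >= tokens.size) break  // last chunk reached the end
        start += step
    }
    return chunks
}
```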
```kotlin
data class RAGQueryOptions(
    val question: String,
    val systemPrompt: String? = null,
    val maxTokens: Int = 512,
    val temperature: Float = 0.7f,
    val topP: Float = 0.9f,
    val topK: Int = 40
)
```
```kotlin
data class RAGResult(
    val answer: String,
    val retrievedChunks: List<RAGSearchResult>,
    val contextUsed: String? = null,
    val retrievalTimeMs: Double,
    val generationTimeMs: Double,
    val totalTimeMs: Double
)
```
```kotlin
RunAnywhere.ragCreatePipeline(RAGConfiguration(
    embeddingModelPath = embeddingPath,
    llmModelPath = llmPath,
    promptTemplate = """You are a helpful assistant. Answer the question based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {query}

Answer:"""
))
```
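Before generation, the retrieved chunks are joined into a context string and substituted into the template's placeholders. Conceptually (a sketch, not SDK internals; `fillTemplate` is a hypothetical helper):

```kotlin
// Sketch: how {context} and {query} placeholders in a promptTemplate
// are filled before the prompt is sent to the LLM.
fun fillTemplate(template: String, retrieved: List<String>, question: String): String {
    val context = retrieved.joinToString("\n\n")  // concatenate retrieved chunks
    return template.replace("{context}", context).replace("{query}", question)
}
```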
```kotlin
try {
    val result = RunAnywhere.ragQuery("What is RunAnywhere?")
    println(result.answer)
} catch (e: Exception) {
    when {
        e.message?.contains("pipeline") == true ->
            println("Create a pipeline first with ragCreatePipeline()")
        e.message?.contains("no documents") == true ->
            println("Ingest documents before querying")
        else -> println("RAG error: ${e.message}")
    }
}
```
- **Choose the right embedding model**: all-MiniLM-L6-v2 (384 dim) is a good default balance of speed and quality
- **Tune chunk size**: smaller chunks (256) for precise retrieval, larger (512+) for broader context
- **Use chunk overlap**: 10–20% overlap prevents losing context at chunk boundaries
- **Adjust similarity threshold**: lower threshold (0.1) for more results, higher (0.5) for stricter relevance
- **topK = 3–5 is usually optimal**: too many chunks dilute context quality
- **Destroy the pipeline when done**: RAG pipelines hold both an embedding model and a vector index in memory