The RAG (Retrieval-Augmented Generation) module enables fully on-device document Q&A. Ingest documents, generate embeddings, perform vector similarity search, and generate grounded answers — all without any data leaving the device.
```kotlin
// 1. Create a RAG pipeline
RunAnywhere.ragCreatePipeline(RAGConfiguration(
    embeddingModelPath = "/path/to/all-MiniLM-L6-v2.onnx",
    llmModelPath = "/path/to/qwen-0.5b.gguf",
    embeddingDimension = 384,
    topK = 3
))

// 2. Ingest documents
RunAnywhere.ragIngest("RunAnywhere is a privacy-first on-device AI platform.")
RunAnywhere.ragIngest("The SDK supports LLM, STT, TTS, VAD, VLM, and RAG.")
RunAnywhere.ragIngest("LoRA adapters can be hot-swapped at runtime.")

// 3. Query (options is optional — pass it only when you need non-default settings)
val result = RunAnywhere.ragQuery(
    question = "What AI features does RunAnywhere support?"
)
println("Answer: ${result.answer}")
println("Sources: ${result.retrievedChunks.size} chunks")

// 4. Cleanup
RunAnywhere.ragDestroyPipeline()
```
Releases all resources (models, vector index, document store).
Always call ragDestroyPipeline() when you're done to free memory. A RAG pipeline loads an embedding model and also holds a reference to an LLM, which together can consume significant memory.
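One way to guarantee cleanup is to wrap the pipeline's lifetime in a try/finally. This sketch uses the calls shown above; the model paths and the helper function name are placeholders, not part of the SDK:

```kotlin
// Sketch: tie pipeline lifetime to a scope so resources are always
// released, even if ingestion or a query throws.
suspend fun answerFromDocs(docs: List<String>, question: String): String {
    RunAnywhere.ragCreatePipeline(RAGConfiguration(
        embeddingModelPath = "/path/to/all-MiniLM-L6-v2.onnx",
        llmModelPath = "/path/to/qwen-0.5b.gguf"
    ))
    return try {
        docs.forEach { RunAnywhere.ragIngest(it) }
        RunAnywhere.ragQuery(question).answer
    } finally {
        RunAnywhere.ragDestroyPipeline()  // always free models and index
    }
}
```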
```kotlin
suspend fun RunAnywhere.ragIngest(text: String, metadataJson: String? = null)
```
Ingests a text document into the pipeline. The text is automatically chunked, embedded, and indexed.
```kotlin
// Simple ingestion
RunAnywhere.ragIngest("Your document text here...")

// With metadata for filtering/attribution
RunAnywhere.ragIngest(
    text = "Patient records indicate...",
    metadataJson = """{"source": "medical-docs", "page": 42, "date": "2025-01-15"}"""
)
```
```kotlin
suspend fun RunAnywhere.ragQuery(
    question: String,
    options: RAGQueryOptions? = null
): RAGResult
```
Queries the pipeline: embeds the question, retrieves relevant chunks via vector search, builds a context, and generates a grounded answer using the LLM.
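The retrieval step can be pictured as scoring every stored chunk embedding against the question embedding and keeping the best matches. This is an illustrative sketch of that idea in plain Kotlin, not the SDK's internal code; `retrieve`, `cosine`, and the defaults mirror the configuration names but are assumptions:

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors of equal length.
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (sqrt(na) * sqrt(nb))
}

// Score all chunks, drop those below the similarity threshold,
// and return the topK highest-scoring (text, score) pairs.
fun retrieve(
    query: FloatArray,
    chunks: List<Pair<String, FloatArray>>,  // (text, embedding)
    topK: Int = 3,
    threshold: Float = 0.15f
): List<Pair<String, Float>> =
    chunks.map { (text, emb) -> text to cosine(query, emb) }
        .filter { it.second >= threshold }
        .sortedByDescending { it.second }
        .take(topK)
```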
```kotlin
val result = RunAnywhere.ragQuery(
    question = "How does the privacy architecture work?",
    options = RAGQueryOptions(
        question = "How does the privacy architecture work?",
        maxTokens = 512,
        temperature = 0.7f,
        topK = 40
    )
)

// Access the answer
println(result.answer)

// Inspect retrieved sources
result.retrievedChunks.forEach { chunk ->
    println("Score: ${chunk.similarityScore} — ${chunk.text.take(100)}...")
}

// Check timing
println("Retrieval: ${result.retrievalTimeMs}ms")
println("Generation: ${result.generationTimeMs}ms")
println("Total: ${result.totalTimeMs}ms")
```
```kotlin
data class RAGConfiguration(
    val embeddingModelPath: String,
    val llmModelPath: String,
    val embeddingDimension: Int = 384,
    val topK: Int = 3,
    val similarityThreshold: Float = 0.15f,
    val maxContextTokens: Int = 2048,
    val chunkSize: Int = 512,
    val chunkOverlap: Int = 50,
    val promptTemplate: String? = null,
    val embeddingConfigJson: String? = null,
    val llmConfigJson: String? = null
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `embeddingModelPath` | String | — | Path to ONNX embedding model |
| `llmModelPath` | String | — | Path to GGUF LLM model |
| `embeddingDimension` | Int | 384 | Embedding vector dimension |
| `topK` | Int | 3 | Number of chunks to retrieve |
| `similarityThreshold` | Float | 0.15 | Minimum cosine similarity (0.0–1.0) |
| `maxContextTokens` | Int | 2048 | Max context tokens for LLM |
| `chunkSize` | Int | 512 | Tokens per chunk |
| `chunkOverlap` | Int | 50 | Overlap tokens between chunks |
| `promptTemplate` | String? | null | Custom prompt with {context} and {query} placeholders |
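How `chunkSize` and `chunkOverlap` interact can be shown with a sliding window that advances by `chunkSize - chunkOverlap` tokens, so consecutive chunks share `chunkOverlap` tokens. This is a plain-Kotlin sketch of the idea, not the SDK's chunker; the `chunk` function is hypothetical:

```kotlin
// Sketch: sliding-window chunking where consecutive chunks
// overlap by chunkOverlap tokens.
fun chunk(tokens: List<String>, chunkSize: Int = 512, chunkOverlap: Int = 50): List<List<String>> {
    require(chunkOverlap < chunkSize) { "overlap must be smaller than chunk size" }
    val step = chunkSize - chunkOverlap
    val chunks = mutableListOf<List<String>>()
    var start = 0
    while (start < tokens.size) {
        chunks.add(tokens.subList(start, minOf(start + chunkSize, tokens.size)))
        if (start + chunkSize >= tokens.size) break  // last chunk reached the end
        start += step
    }
    return chunks
}
```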
```kotlin
data class RAGQueryOptions(
    val question: String,
    val systemPrompt: String? = null,
    val maxTokens: Int = 512,
    val temperature: Float = 0.7f,
    val topP: Float = 0.9f,
    val topK: Int = 40
)
```
```kotlin
data class RAGResult(
    val answer: String,
    val retrievedChunks: List<RAGSearchResult>,
    val contextUsed: String? = null,
    val retrievalTimeMs: Double,
    val generationTimeMs: Double,
    val totalTimeMs: Double
)
```
```kotlin
RunAnywhere.ragCreatePipeline(RAGConfiguration(
    embeddingModelPath = embeddingPath,
    llmModelPath = llmPath,
    promptTemplate = """You are a helpful assistant. Answer the question based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {query}

Answer:"""
))
```
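Before generation, the retrieved chunks are joined into a context string and substituted into the template's placeholders. Conceptually (a sketch, not SDK internals; `fillTemplate` is a hypothetical helper):

```kotlin
// Sketch: how {context} and {query} placeholders in a promptTemplate
// are filled before the prompt is sent to the LLM.
fun fillTemplate(template: String, retrieved: List<String>, question: String): String {
    val context = retrieved.joinToString("\n\n")  // concatenate retrieved chunks
    return template.replace("{context}", context).replace("{query}", question)
}
```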
```kotlin
try {
    val result = RunAnywhere.ragQuery("What is RunAnywhere?")
    println(result.answer)
} catch (e: Exception) {
    when {
        e.message?.contains("pipeline") == true ->
            println("Create a pipeline first with ragCreatePipeline()")
        e.message?.contains("no documents") == true ->
            println("Ingest documents before querying")
        else -> println("RAG error: ${e.message}")
    }
}
```
- **Choose the right embedding model**: all-MiniLM-L6-v2 (384 dim) is a good default balance of speed and quality
- **Tune chunk size**: smaller chunks (256) for precise retrieval, larger (512+) for broader context
- **Use chunk overlap**: 10–20% overlap prevents losing context at chunk boundaries
- **Adjust similarity threshold**: lower threshold (0.1) for more results, higher (0.5) for stricter relevance
- **topK = 3–5 is usually optimal**: too many chunks dilute context quality
- **Destroy the pipeline when done**: RAG pipelines hold both an embedding model and a vector index in memory