# RAG Pipeline

> On-device retrieval-augmented generation with vector search

## Overview

The RAG (Retrieval-Augmented Generation) module enables fully on-device document Q\&A. Ingest documents, generate embeddings, perform vector similarity search, and generate grounded answers — all without any data leaving the device.

```mermaid theme={null}
graph LR
    A(Documents) --> B(Ingest + Chunk)
    B --> C(Embed + Index)
    D(Question) --> E(Embed Query)
    E --> F(Vector Search)
    C --> F
    F --> G(Retrieve Chunks)
    G --> H(LLM Generate)
    H --> I(Grounded Answer)

    style A fill:#334155,color:#fff,stroke:#334155
    style D fill:#334155,color:#fff,stroke:#334155
    style F fill:#ff6900,color:#fff,stroke:#ff6900
    style H fill:#fb2c36,color:#fff,stroke:#fb2c36
    style I fill:#334155,color:#fff,stroke:#334155
```

## Dependencies

Add the RAG module to your build:

```kotlin theme={null}
dependencies {
    implementation("com.runanywhere.sdk:runanywhere-kotlin:0.1.5")
    implementation("com.runanywhere.sdk:runanywhere-core-rag:0.1.5")
}
```

## Package Imports

```kotlin theme={null}
import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.RAG.RAGConfiguration
import com.runanywhere.sdk.public.extensions.RAG.RAGQueryOptions
import com.runanywhere.sdk.public.extensions.RAG.RAGResult
import com.runanywhere.sdk.public.extensions.RAG.RAGSearchResult
import com.runanywhere.sdk.public.extensions.ragCreatePipeline
import com.runanywhere.sdk.public.extensions.ragDestroyPipeline
import com.runanywhere.sdk.public.extensions.ragIngest
import com.runanywhere.sdk.public.extensions.ragQuery
import com.runanywhere.sdk.public.extensions.ragClearDocuments
import com.runanywhere.sdk.public.extensions.ragDocumentCount
```

## Basic Usage

```kotlin theme={null}
// 1. Create a RAG pipeline
RunAnywhere.ragCreatePipeline(RAGConfiguration(
    embeddingModelPath = "/path/to/all-MiniLM-L6-v2.onnx",
    llmModelPath = "/path/to/qwen-0.5b.gguf",
    embeddingDimension = 384,
    topK = 3
))

// 2. Ingest documents
RunAnywhere.ragIngest("RunAnywhere is a privacy-first on-device AI platform.")
RunAnywhere.ragIngest("The SDK supports LLM, STT, TTS, VAD, VLM, and RAG.")
RunAnywhere.ragIngest("LoRA adapters can be hot-swapped at runtime.")

// 3. Query (options is optional — pass it only when you need non-default settings)
val result = RunAnywhere.ragQuery(
    question = "What AI features does RunAnywhere support?"
)

println("Answer: ${result.answer}")
println("Sources: ${result.retrievedChunks.size} chunks")

// 4. Cleanup
RunAnywhere.ragDestroyPipeline()
```

## Pipeline Lifecycle

### Create Pipeline

```kotlin theme={null}
suspend fun RunAnywhere.ragCreatePipeline(config: RAGConfiguration)
```

Creates a RAG pipeline with an embedding model and LLM. The pipeline loads the embedding model immediately and initializes the vector index.

### Destroy Pipeline

```kotlin theme={null}
suspend fun RunAnywhere.ragDestroyPipeline()
```

Releases all resources (models, vector index, document store).

<Warning>
  Always call `ragDestroyPipeline()` when you're done to free memory. A RAG pipeline loads an
  embedding model and may also reference an LLM, both of which can consume significant memory.
</Warning>

## Document Ingestion

### Ingest Text

```kotlin theme={null}
suspend fun RunAnywhere.ragIngest(text: String, metadataJson: String? = null)
```

Ingests a text document into the pipeline. The text is automatically chunked, embedded, and indexed.

```kotlin theme={null}
// Simple ingestion
RunAnywhere.ragIngest("Your document text here...")

// With metadata for filtering/attribution
RunAnywhere.ragIngest(
    text = "Patient records indicate...",
    metadataJson = """{"source": "medical-docs", "page": 42, "date": "2025-01-15"}"""
)
```
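Chunking happens inside `ragIngest`, so you never call it yourself. To build intuition for how `chunkSize` and `chunkOverlap` interact, here is a minimal sketch of a sliding-window chunker. It is not the SDK's implementation: `chunkWithOverlap` is a hypothetical helper, and it counts whitespace-separated words as a stand-in for tokens, so real chunk boundaries will differ.

```kotlin
// Illustrative only: the SDK chunks text internally during ragIngest().
// Words stand in for tokens here; a real tokenizer will count differently.
fun chunkWithOverlap(text: String, chunkSize: Int, chunkOverlap: Int): List<String> {
    require(chunkSize > 0) { "chunkSize must be positive" }
    require(chunkOverlap in 0 until chunkSize) { "chunkOverlap must be in [0, chunkSize)" }

    val words = text.split(Regex("\\s+")).filter { it.isNotEmpty() }
    if (words.isEmpty()) return emptyList()

    // Each window starts (chunkSize - chunkOverlap) words after the previous one,
    // so consecutive chunks share chunkOverlap words.
    val step = chunkSize - chunkOverlap
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < words.size) {
        val end = minOf(start + chunkSize, words.size)
        chunks += words.subList(start, end).joinToString(" ")
        if (end == words.size) break // the last window reached the end of the text
        start += step
    }
    return chunks
}
```

With `chunkSize = 4` and `chunkOverlap = 1`, the text `"a b c d e f g"` yields two chunks, `"a b c d"` and `"d e f g"`, which share the word `d`.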

| Parameter      | Type      | Description                               |
| -------------- | --------- | ----------------------------------------- |
| `text`         | `String`  | Document text to ingest                   |
| `metadataJson` | `String?` | Optional JSON metadata attached to chunks |

### Clear Documents

```kotlin theme={null}
suspend fun RunAnywhere.ragClearDocuments()
```

Removes all indexed documents from the pipeline.

### Document Count

```kotlin theme={null}
val RunAnywhere.ragDocumentCount: Int
```

Returns the number of indexed documents (synchronous property).

## Querying

### ragQuery

```kotlin theme={null}
suspend fun RunAnywhere.ragQuery(
    question: String,
    options: RAGQueryOptions? = null
): RAGResult
```

Queries the pipeline: embeds the question, retrieves relevant chunks via vector search, builds a context, and generates a grounded answer using the LLM.

```kotlin theme={null}
val result = RunAnywhere.ragQuery(
    question = "How does the privacy architecture work?",
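    options = RAGQueryOptions(
        question = "How does the privacy architecture work?", // required field of RAGQueryOptions
        maxTokens = 512,
        temperature = 0.7f,
        topK = 40 // top-k sampling for generation, not the retrieval topK in RAGConfiguration
    )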
)

// Access the answer
println(result.answer)

// Inspect retrieved sources
result.retrievedChunks.forEach { chunk ->
    println("Score: ${chunk.similarityScore} — ${chunk.text.take(100)}...")
}

// Check timing
println("Retrieval: ${result.retrievalTimeMs}ms")
println("Generation: ${result.generationTimeMs}ms")
println("Total: ${result.totalTimeMs}ms")
```
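The retrieval step can be pictured as a cosine-similarity ranking over the indexed chunk embeddings, filtered by `similarityThreshold` and cut off at `topK`. The sketch below is a brute-force illustration under that assumption. `IndexedChunk`, `cosineSimilarity`, and `topKChunks` are hypothetical names, and the SDK's actual vector index may be organized differently.

```kotlin
import kotlin.math.sqrt

// Hypothetical in-memory entry; not an SDK type.
class IndexedChunk(val id: String, val embedding: FloatArray)

// Cosine similarity between two vectors of equal dimension.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Embedding dimensions must match" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if (normA == 0.0 || normB == 0.0) return 0f // zero vector: treat as no similarity
    return (dot / (sqrt(normA) * sqrt(normB))).toFloat()
}

// Scores every chunk, drops those below the threshold, and keeps the best topK.
fun topKChunks(
    query: FloatArray,
    index: List<IndexedChunk>,
    topK: Int,
    similarityThreshold: Float
): List<Pair<String, Float>> =
    index.map { it.id to cosineSimilarity(query, it.embedding) }
        .filter { it.second >= similarityThreshold }
        .sortedByDescending { it.second }
        .take(topK)
```

A chunk that scores below the threshold is never returned, even when fewer than `topK` chunks remain, which is one reason `retrievedChunks` can be shorter than the configured `topK`.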

## API Reference

### RAGConfiguration

```kotlin theme={null}
data class RAGConfiguration(
    val embeddingModelPath: String,
    val llmModelPath: String,
    val embeddingDimension: Int = 384,
    val topK: Int = 3,
    val similarityThreshold: Float = 0.15f,
    val maxContextTokens: Int = 2048,
    val chunkSize: Int = 512,
    val chunkOverlap: Int = 50,
    val promptTemplate: String? = null,
    val embeddingConfigJson: String? = null,
    val llmConfigJson: String? = null
)
```

| Parameter             | Type      | Default | Description                                               |
| --------------------- | --------- | ------- | --------------------------------------------------------- |
| `embeddingModelPath`  | `String`  | —       | Path to ONNX embedding model                              |
| `llmModelPath`        | `String`  | —       | Path to GGUF LLM model                                    |
| `embeddingDimension`  | `Int`     | `384`   | Embedding vector dimension                                |
| `topK`                | `Int`     | `3`     | Number of chunks to retrieve                              |
| `similarityThreshold` | `Float`   | `0.15`  | Minimum cosine similarity (0.0-1.0)                       |
| `maxContextTokens`    | `Int`     | `2048`  | Max context tokens for LLM                                |
| `chunkSize`           | `Int`     | `512`   | Tokens per chunk                                          |
| `chunkOverlap`        | `Int`     | `50`    | Overlap tokens between chunks                             |
| `promptTemplate`      | `String?` | `null`  | Custom prompt with `{context}` and `{query}` placeholders |
| `embeddingConfigJson` | `String?` | `null`  | Optional embedding model config                           |
| `llmConfigJson`       | `String?` | `null`  | Optional LLM config                                       |

### RAGQueryOptions

```kotlin theme={null}
data class RAGQueryOptions(
    val question: String,
    val systemPrompt: String? = null,
    val maxTokens: Int = 512,
    val temperature: Float = 0.7f,
    val topP: Float = 0.9f,
    val topK: Int = 40
)
```

| Parameter      | Type      | Default | Description            |
| -------------- | --------- | ------- | ---------------------- |
| `question`     | `String`  | —       | The question to answer |
| `systemPrompt` | `String?` | `null`  | Override system prompt |
| `maxTokens`    | `Int`     | `512`   | Max tokens to generate |
| `temperature`  | `Float`   | `0.7`   | Sampling temperature   |
| `topP`         | `Float`   | `0.9`   | Nucleus sampling       |
| `topK`         | `Int`     | `40`    | Top-k sampling         |

### RAGResult

```kotlin theme={null}
data class RAGResult(
    val answer: String,
    val retrievedChunks: List<RAGSearchResult>,
    val contextUsed: String? = null,
    val retrievalTimeMs: Double,
    val generationTimeMs: Double,
    val totalTimeMs: Double
)
```

### RAGSearchResult

```kotlin theme={null}
data class RAGSearchResult(
    val chunkId: String,
    val text: String,
    val similarityScore: Float,
    val metadataJson: String? = null
)
```
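`maxContextTokens` caps how much retrieved text reaches the LLM, and `contextUsed` reports what was actually sent. How the SDK packs chunks into that budget is not documented here. The sketch below shows one plausible greedy strategy, purely as an assumption: take chunks in descending score order and skip any that would exceed the budget. `ScoredChunk`, `estimateTokens`, and `buildContext` are hypothetical, and tokens are approximated by word count.

```kotlin
// Hypothetical stand-in for a retrieved chunk; mirrors only the fields used here.
data class ScoredChunk(val text: String, val similarityScore: Float)

// Rough token estimate: whitespace-separated words. A real tokenizer will differ.
fun estimateTokens(text: String): Int =
    text.split(Regex("\\s+")).count { it.isNotEmpty() }

// Greedily packs the highest-scoring chunks into the token budget.
fun buildContext(chunks: List<ScoredChunk>, maxContextTokens: Int): String {
    val selected = mutableListOf<String>()
    var used = 0
    for (chunk in chunks.sortedByDescending { it.similarityScore }) {
        val cost = estimateTokens(chunk.text)
        if (used + cost > maxContextTokens) continue // would overflow the budget, skip it
        selected += chunk.text
        used += cost
    }
    return selected.joinToString("\n\n")
}
```

Under this strategy, a large `chunkSize` combined with a small `maxContextTokens` means only one or two chunks fit, regardless of `topK`.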

## Examples

### Document Q\&A with Sources

```kotlin theme={null}
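// Imports assumed for this example (AndroidX Lifecycle and kotlinx.coroutines)
import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.launch

// Illustrative UI model for a retrieved chunk
data class Source(val text: String, val score: Float, val metadata: String?)

class DocumentQAViewModel : ViewModel() {
    private var isPipelineReady = false

    // UI state backing fields
    private val _docCount = MutableStateFlow(0)
    private val _answer = MutableStateFlow("")
    private val _sources = MutableStateFlow<List<Source>>(emptyList())
    private val _timing = MutableStateFlow("")

    fun setupPipeline(embeddingPath: String, llmPath: String) {
        viewModelScope.launch {
            RunAnywhere.ragCreatePipeline(RAGConfiguration(
                embeddingModelPath = embeddingPath,
                llmModelPath = llmPath,
                topK = 5,
                chunkSize = 256,
                chunkOverlap = 30
            ))
            isPipelineReady = true
        }
    }

    fun ingestDocument(text: String, source: String) {
        viewModelScope.launch {
            RunAnywhere.ragIngest(
                text = text,
                // Assumes `source` contains no quotes or backslashes
                metadataJson = """{"source": "$source"}"""
            )
            _docCount.value = RunAnywhere.ragDocumentCount
        }
    }

    fun askQuestion(question: String) {
        viewModelScope.launch {
            val result = RunAnywhere.ragQuery(question)

            _answer.value = result.answer
            _sources.value = result.retrievedChunks.map { chunk ->
                Source(
                    text = chunk.text.take(200),
                    score = chunk.similarityScore,
                    metadata = chunk.metadataJson
                )
            }
            _timing.value = "Retrieved in ${result.retrievalTimeMs.toInt()}ms, " +
                "generated in ${result.generationTimeMs.toInt()}ms"
        }
    }

    override fun onCleared() {
        super.onCleared()
        // viewModelScope is already cancelled when onCleared() runs, so a launch on it
        // would never execute. Release the pipeline on a scope that outlives the ViewModel.
        if (isPipelineReady) {
            CoroutineScope(Dispatchers.Default).launch {
                RunAnywhere.ragDestroyPipeline()
            }
        }
    }
}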
```
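The `ingestDocument` example builds `metadataJson` by string interpolation, which yields invalid JSON if `source` contains a quote or backslash. A minimal sketch of a safer approach follows. `jsonString` and `buildMetadataJson` are hypothetical helpers, not SDK functions, and they handle only flat string-to-string metadata.

```kotlin
// Hypothetical helper, not part of the SDK: quotes and escapes a string as a JSON value.
fun jsonString(value: String): String {
    val sb = StringBuilder("\"")
    for (ch in value) {
        when (ch) {
            '"' -> sb.append("\\\"")
            '\\' -> sb.append("\\\\")
            '\n' -> sb.append("\\n")
            '\r' -> sb.append("\\r")
            '\t' -> sb.append("\\t")
            // Remaining control characters use the \uXXXX form
            else -> if (ch < ' ') sb.append("\\u%04x".format(ch.code)) else sb.append(ch)
        }
    }
    return sb.append('"').toString()
}

// Builds a flat JSON object from string key/value pairs.
fun buildMetadataJson(fields: Map<String, String>): String =
    fields.entries.joinToString(prefix = "{", postfix = "}", separator = ",") { (key, value) ->
        "${jsonString(key)}:${jsonString(value)}"
    }
```

The result can be passed as `metadataJson = buildMetadataJson(mapOf("source" to source))`. On Android, `org.json.JSONObject` or a serialization library does the same job with less code.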

### Custom Prompt Template

```kotlin theme={null}
RunAnywhere.ragCreatePipeline(RAGConfiguration(
    embeddingModelPath = embeddingPath,
    llmModelPath = llmPath,
    promptTemplate = """You are a helpful assistant. Answer the question based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {query}

Answer:"""
))
```
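Because a template is only useful if both placeholders are present, it can help to check it before passing it to `RAGConfiguration`. The sketch below previews how `{context}` and `{query}` might be filled in. `renderPrompt` is a hypothetical helper for local testing, and the SDK's own substitution logic may differ.

```kotlin
// Illustrative only: previews a prompt template locally. The SDK fills the
// placeholders itself when a query runs.
fun renderPrompt(template: String, context: String, query: String): String {
    require("{context}" in template) { "Template is missing the {context} placeholder" }
    require("{query}" in template) { "Template is missing the {query} placeholder" }

    // Single pass over the template, so placeholder-like text inside the
    // retrieved context or the question is left untouched.
    val placeholder = Regex("\\{(context|query)\\}")
    return placeholder.replace(template) { match ->
        if (match.groupValues[1] == "context") context else query
    }
}
```

Rendering the template with a sample context and question is a quick way to confirm the final prompt reads as intended before loading any models.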

## Error Handling

```kotlin theme={null}
try {
    val result = RunAnywhere.ragQuery("What is RunAnywhere?")
    println(result.answer)
} catch (e: Exception) {
    when {
        e.message?.contains("pipeline") == true ->
            println("Create a pipeline first with ragCreatePipeline()")
        e.message?.contains("no documents") == true ->
            println("Ingest documents before querying")
        else ->
            println("RAG error: ${e.message}")
    }
}
```

## Performance Tips

<Tip>
  * **Choose the right embedding model**: `all-MiniLM-L6-v2` (384 dim) is a good default balance of
    speed and quality
  * **Tune chunk size**: smaller chunks (256) for precise retrieval, larger (512+) for broader
    context
  * **Use chunk overlap**: 10-20% overlap prevents losing context at chunk boundaries
  * **Adjust similarity threshold**: lower threshold (0.1) for more results, higher (0.5) for
    stricter relevance
  * **topK = 3-5** is usually optimal: too many chunks dilute context quality
  * **Destroy pipeline when done**: RAG pipelines hold both an embedding model and vector index in
    memory
</Tip>

## Related

<CardGroup cols={2}>
  <Card title="LLM Generation" icon="brain" href="/kotlin/llm/generate">
    Text generation API
  </Card>

  <Card title="LoRA Adapters" icon="layer-group" href="/kotlin/lora">
    Fine-tune model behavior
  </Card>

  <Card title="Model Management" icon="download" href="/kotlin/configuration">
    Register and download models
  </Card>

  <Card title="Best Practices" icon="star" href="/kotlin/best-practices">
    Performance optimization
  </Card>
</CardGroup>