Overview

The generate() method provides full control over text generation with customizable options and detailed performance metrics. Use this for production applications where you need fine-grained control.

Basic Usage

import { RunAnywhere } from '@runanywhere/core'

const result = await RunAnywhere.generate('Explain quantum computing in simple terms', {
  maxTokens: 200,
  temperature: 0.7,
})

console.log('Response:', result.text)
console.log('Tokens used:', result.tokensUsed)
console.log('Speed:', result.performanceMetrics.tokensPerSecond, 'tok/s')
console.log('Latency:', result.latencyMs, 'ms')

API Reference

RunAnywhere.generate(
  prompt: string,
  options?: GenerationOptions
): Promise<GenerationResult>

Parameters

interface GenerationOptions {
  /** Maximum tokens to generate (default: 256) */
  maxTokens?: number

  /** Sampling temperature 0.0–2.0 (default: 0.7) */
  temperature?: number

  /** Top-p nucleus sampling (default: 0.95) */
  topP?: number

  /** Stop generation at these sequences */
  stopSequences?: string[]

  /** System prompt to define AI behavior */
  systemPrompt?: string

  /** Preferred execution target */
  preferredExecutionTarget?: ExecutionTarget

  /** Preferred inference framework */
  preferredFramework?: LLMFramework
}
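
To make the documented defaults concrete, here is a minimal sketch of how the effective sampling values for a call could be resolved. `SamplingOptions` and `resolveSampling` are hypothetical names introduced for illustration; the SDK applies its defaults internally, and only the default values themselves come from the comments above.

```typescript
// Local copy of the sampling fields from GenerationOptions,
// so this sketch runs without the SDK installed.
interface SamplingOptions {
  maxTokens?: number
  temperature?: number
  topP?: number
}

// Defaults as stated in the interface comments above.
const DOCUMENTED_DEFAULTS: Required<SamplingOptions> = {
  maxTokens: 256,
  temperature: 0.7,
  topP: 0.95,
}

// Hypothetical helper: fills in any omitted field with its documented default.
function resolveSampling(options: SamplingOptions = {}): Required<SamplingOptions> {
  return {
    maxTokens: options.maxTokens ?? DOCUMENTED_DEFAULTS.maxTokens,
    temperature: options.temperature ?? DOCUMENTED_DEFAULTS.temperature,
    topP: options.topP ?? DOCUMENTED_DEFAULTS.topP,
  }
}

console.log(resolveSampling({ temperature: 0.2 }))
// → { maxTokens: 256, temperature: 0.2, topP: 0.95 }
```

Using `??` rather than `||` keeps an explicit `temperature: 0` from being replaced by the default.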

Returns

interface GenerationResult {
  /** Generated text (thinking content removed if extracted) */
  text: string

  /** Extracted thinking/reasoning content (if model supports it) */
  thinkingContent?: string

  /** Total tokens used (prompt + response) */
  tokensUsed: number

  /** Number of tokens in the response */
  responseTokens: number

  /** Model ID that was used */
  modelUsed: string

  /** Total latency in milliseconds */
  latencyMs: number

  /** Execution target (onDevice/cloud/hybrid) */
  executionTarget: ExecutionTarget

  /** Framework used for inference */
  framework?: LLMFramework

  /** Hardware acceleration used */
  hardwareUsed: HardwareAcceleration

  /** Memory used during generation (bytes) */
  memoryUsed: number

  /** Detailed performance metrics */
  performanceMetrics: PerformanceMetrics
}

interface PerformanceMetrics {
  /** Time to first token in ms */
  timeToFirstTokenMs?: number

  /** Tokens generated per second */
  tokensPerSecond?: number

  /** Total inference time in ms */
  inferenceTimeMs: number
}
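
Because `tokensUsed` covers prompt plus response, the prompt size can be derived from the result. The sketch below also falls back to computing throughput when `tokensPerSecond` is absent. `ResultCounts` and `summarizeUsage` are hypothetical names, and the fallback formula (response tokens divided by inference time) is an assumption about how the metric is defined, not something the source confirms.

```typescript
// Structural subset of GenerationResult, declared locally so the sketch is self-contained.
interface ResultCounts {
  tokensUsed: number // prompt + response
  responseTokens: number
  performanceMetrics: { inferenceTimeMs: number; tokensPerSecond?: number }
}

interface UsageSummary {
  promptTokens: number
  responseTokens: number
  tokensPerSecond: number | undefined
}

// Hypothetical helper: derives prompt tokens and a best-effort throughput figure.
function summarizeUsage(result: ResultCounts): UsageSummary {
  const promptTokens = result.tokensUsed - result.responseTokens
  const seconds = result.performanceMetrics.inferenceTimeMs / 1000
  // Assumption: when the SDK omits tokensPerSecond, approximate it from
  // response tokens and total inference time.
  const tokensPerSecond =
    result.performanceMetrics.tokensPerSecond ??
    (seconds > 0 ? result.responseTokens / seconds : undefined)
  return { promptTokens, responseTokens: result.responseTokens, tokensPerSecond }
}

console.log(
  summarizeUsage({
    tokensUsed: 120,
    responseTokens: 100,
    performanceMetrics: { inferenceTimeMs: 4000 },
  }),
)
// → { promptTokens: 20, responseTokens: 100, tokensPerSecond: 25 }
```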

Generation Options

Temperature

Controls randomness in the output. Lower values make output more focused and deterministic; higher values make it more varied.

// Creative writing - higher temperature
const creative = await RunAnywhere.generate('Write a poem about the ocean', {
  temperature: 1.2,
  maxTokens: 150,
})

// Factual response - lower temperature
const factual = await RunAnywhere.generate('What is the boiling point of water?', {
  temperature: 0.1,
  maxTokens: 50,
})

// Balanced (default)
const balanced = await RunAnywhere.generate('Explain machine learning', {
  temperature: 0.7,
  maxTokens: 200,
})

| Temperature | Use Case |
|-------------|----------|
| 0.0–0.3 | Factual, deterministic responses |
| 0.4–0.7 | Balanced, general-purpose |
| 0.8–1.2 | Creative, varied outputs |
| 1.3–2.0 | Very creative, experimental |
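
The temperature ranges above can be encoded as a small lookup, which is handy for validating user-supplied values before calling generate(). This is a sketch only: `temperatureBand` is a hypothetical helper, and assigning the gaps between ranges (for example 0.35) to the next band up is an assumption.

```typescript
type TemperatureBand = 'factual' | 'balanced' | 'creative' | 'experimental'

// Hypothetical helper: maps a temperature to the use-case band it falls in.
// Values outside the documented 0.0–2.0 range are rejected.
function temperatureBand(temperature: number): TemperatureBand {
  if (!Number.isFinite(temperature) || temperature < 0 || temperature > 2) {
    throw new RangeError(`temperature must be between 0.0 and 2.0, got ${temperature}`)
  }
  if (temperature <= 0.3) return 'factual'
  if (temperature <= 0.7) return 'balanced'
  if (temperature <= 1.2) return 'creative'
  return 'experimental'
}

console.log(temperatureBand(0.1)) // → factual
console.log(temperatureBand(1.2)) // → creative
```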

Max Tokens

Caps the number of tokens in the generated response (default: 256).

// Short answer
const short = await RunAnywhere.generate('What is 2+2?', { maxTokens: 10 })

// Detailed explanation
const detailed = await RunAnywhere.generate('Explain how computers work', { maxTokens: 500 })
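
A response that stops exactly at the limit was probably cut off mid-thought. One way to detect this is to compare `responseTokens` with the `maxTokens` that was requested. This is a heuristic sketch: `hitTokenLimit` is a hypothetical helper, and it assumes `responseTokens` counts the same tokens that `maxTokens` limits.

```typescript
// Structural subset of GenerationResult, declared locally for a self-contained example.
interface LengthInfo {
  responseTokens: number
}

// Hypothetical helper: true when the response used up the whole token budget,
// which suggests the output may have been truncated.
function hitTokenLimit(result: LengthInfo, maxTokens: number): boolean {
  return result.responseTokens >= maxTokens
}

console.log(hitTokenLimit({ responseTokens: 10 }, 10)) // → true
console.log(hitTokenLimit({ responseTokens: 4 }, 10)) // → false
```

When the check returns true, a caller might retry with a larger `maxTokens` or tell the user the answer is incomplete.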

Stop Sequences

Stops generation as soon as the output contains any of the listed sequences.

const result = await RunAnywhere.generate('List 3 fruits:', {
  maxTokens: 100,
  stopSequences: ['4.', '\n\n'], // Stop at 4th item or double newline
})
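
To illustrate the effect of stop sequences, the sketch below cuts a string at the earliest matching sequence. It is not the SDK's implementation: `truncateAtStop` is a hypothetical helper, and excluding the stop sequence itself from the returned text is an assumption.

```typescript
// Hypothetical helper: returns the text up to (not including) the earliest stop sequence.
function truncateAtStop(text: string, stopSequences: string[]): string {
  let cut = text.length
  for (const stop of stopSequences) {
    if (stop.length === 0) continue // an empty sequence would match everywhere
    const index = text.indexOf(stop)
    if (index !== -1 && index < cut) cut = index
  }
  return text.slice(0, cut)
}

const rawOutput = '1. Apple\n2. Banana\n3. Cherry\n4. Date'
console.log(JSON.stringify(truncateAtStop(rawOutput, ['4.', '\n\n'])))
// → "1. Apple\n2. Banana\n3. Cherry\n"
```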

System Prompts

Define the AI’s behavior and persona.

const result = await RunAnywhere.generate('What is the best programming language?', {
  maxTokens: 200,
  systemPrompt: 'You are a helpful coding assistant. Be concise and practical.',
})

See System Prompts for more details.

Examples

Full Example with Metrics

async function generateWithMetrics(prompt: string) {
  const result = await RunAnywhere.generate(prompt, {
    maxTokens: 200,
    temperature: 0.7,
  })

  console.log('=== Generation Results ===')
  console.log('Response:', result.text)
  console.log('')
  console.log('=== Metrics ===')
  console.log('Tokens used:', result.tokensUsed)
  console.log('Response tokens:', result.responseTokens)
  console.log('Total latency:', result.latencyMs, 'ms')
  console.log('TTFT:', result.performanceMetrics.timeToFirstTokenMs, 'ms')
  console.log('Speed:', result.performanceMetrics.tokensPerSecond?.toFixed(1), 'tok/s')
  console.log('Hardware:', result.hardwareUsed)
  console.log('Memory used:', (result.memoryUsed / 1024 / 1024).toFixed(1), 'MB')

  return result
}

React Hook

useGenerate.ts
import { useState, useCallback } from 'react'
import { RunAnywhere, type GenerationOptions, type GenerationResult } from '@runanywhere/core'

export function useGenerate() {
  const [result, setResult] = useState<GenerationResult | null>(null)
  const [isLoading, setIsLoading] = useState(false)
  const [error, setError] = useState<Error | null>(null)

  const generate = useCallback(async (prompt: string, options?: GenerationOptions) => {
    setIsLoading(true)
    setError(null)

    try {
      const res = await RunAnywhere.generate(prompt, options)
      setResult(res)
      return res
    } catch (err) {
      const e = err instanceof Error ? err : new Error('Generation failed')
      setError(e)
      throw e
    } finally {
      setIsLoading(false)
    }
  }, [])

  const reset = useCallback(() => {
    setResult(null)
    setError(null)
  }, [])

  return { generate, result, isLoading, error, reset }
}

Thinking Models

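Some models emit “thinking” (reasoning) output before their final answer. When a model is registered with supportsThinking: true, that output is returned in result.thinkingContent and removed from result.text:

// Assumed import path for the llama.cpp backend package
import { LlamaCPP } from '@runanywhere/llamacpp'

// Add a model with thinking support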
await LlamaCPP.addModel({
  id: 'qwq-32b',
  name: 'QwQ 32B',
  url: 'https://huggingface.co/.../qwq-32b-q4_k_m.gguf',
  memoryRequirement: 20_000_000_000,
  supportsThinking: true, // Enable thinking extraction
})

// Generate with thinking
const result = await RunAnywhere.generate('Solve this step by step: What is 15% of 240?', {
  maxTokens: 500,
})

console.log('Thinking:', result.thinkingContent)
// "Let me calculate 15% of 240. First, I'll convert 15% to a decimal..."

console.log('Answer:', result.text)
// "15% of 240 is 36."
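
Conceptually, thinking extraction separates the reasoning from the final answer. The sketch below shows one way this could work, assuming the model wraps its reasoning in `<think>...</think>` tags. Both the tag format and the `splitThinking` helper are assumptions for illustration; the source does not specify how the SDK detects thinking content.

```typescript
interface SplitOutput {
  text: string
  thinkingContent?: string
}

// Hypothetical helper: pulls the first <think>...</think> block out of raw model output.
function splitThinking(raw: string): SplitOutput {
  const match = raw.match(/<think>([\s\S]*?)<\/think>/)
  if (!match) return { text: raw.trim() }
  const thinkingContent = (match[1] ?? '').trim()
  // Remove the whole tagged block, leaving only the answer text.
  const text = raw.replace(match[0], '').trim()
  return { text, thinkingContent }
}

const sample = '<think>15% is 0.15, and 0.15 * 240 = 36.</think>\n15% of 240 is 36.'
console.log(splitThinking(sample))
// → { text: '15% of 240 is 36.', thinkingContent: '15% is 0.15, and 0.15 * 240 = 36.' }
```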

Cancellation

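Cancel an in-flight generation with RunAnywhere.cancelGeneration(). The pending generate() call then rejects with a generationCancelled error:

// Assumes isSDKError and SDKErrorCode are exported by '@runanywhere/core'
import { RunAnywhere, isSDKError, SDKErrorCode } from '@runanywhere/core'

// Start generation without awaiting it, so it can be cancelled
const promise = RunAnywhere.generate('Write a long story...', { maxTokens: 1000 })

// Cancel after 2 seconds
setTimeout(() => {
  RunAnywhere.cancelGeneration()
}, 2000)

try {
  const result = await promise
  console.log('Response:', result.text)
} catch (error) {
  if (isSDKError(error) && error.code === SDKErrorCode.generationCancelled) {
    console.log('Generation was cancelled')
  } else {
    throw error // Surface unrelated failures instead of swallowing them
  }
}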
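
The snippet above schedules cancelGeneration() after a fixed delay, so the timer still fires if generation finishes sooner. A small wrapper can clear the timer once the task settles. This is a sketch under stated assumptions: `withCancelTimeout` is a hypothetical helper that takes the pending promise and a cancel callback as arguments, so it does not depend on the SDK.

```typescript
// Hypothetical helper: invokes `cancel` if `task` has not settled within `timeoutMs`,
// and clears the timer as soon as the task resolves or rejects.
async function withCancelTimeout<T>(
  task: Promise<T>,
  cancel: () => void,
  timeoutMs: number,
): Promise<T> {
  const timer = setTimeout(cancel, timeoutMs)
  try {
    return await task
  } finally {
    clearTimeout(timer) // Prevents a stale timer from cancelling a later generation
  }
}

// Demo with a stand-in task that resolves immediately; cancel is never called.
withCancelTimeout(Promise.resolve('done'), () => console.log('cancelled'), 50).then((value) =>
  console.log(value),
)
```

With the SDK, the call would look like `withCancelTimeout(RunAnywhere.generate(prompt, options), () => RunAnywhere.cancelGeneration(), 2000)`. A cancelled generation still rejects, so the caller handles the error as in the example above.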

Related

Chat: Simple one-liner interface
Streaming: Real-time token streaming
System Prompts: Control AI behavior
Best Practices: Optimization tips