Overview

The generate() method provides full control over text generation with customizable options and detailed performance metrics. Use this for production applications where you need fine-grained control.

Basic Usage

import { RunAnywhere } from '@runanywhere/core'

const result = await RunAnywhere.generate('Explain quantum computing in simple terms', {
  maxTokens: 200,
  temperature: 0.7,
})

console.log('Response:', result.text)
console.log('Tokens used:', result.tokensUsed)
console.log('Speed:', result.performanceMetrics.tokensPerSecond, 'tok/s')
console.log('Latency:', result.latencyMs, 'ms')

API Reference

await RunAnywhere.generate(
  prompt: string,
  options?: GenerationOptions
): Promise<GenerationResult>

Parameters

interface GenerationOptions {
  /** Maximum tokens to generate (default: 256) */
  maxTokens?: number

  /** Sampling temperature 0.0–2.0 (default: 0.7) */
  temperature?: number

  /** Top-p nucleus sampling (default: 0.95) */
  topP?: number

  /** Stop generation at these sequences */
  stopSequences?: string[]

  /** System prompt to define AI behavior */
  systemPrompt?: string

  /** Preferred execution target */
  preferredExecutionTarget?: ExecutionTarget

  /** Preferred inference framework */
  preferredFramework?: LLMFramework
}
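
Most options can be combined in a single call. A minimal sketch (the specific ExecutionTarget and LLMFramework member names are assumptions for illustration; check the SDK's enum definitions):

import { RunAnywhere, ExecutionTarget, LLMFramework } from '@runanywhere/core'

const result = await RunAnywhere.generate('Summarize the plot of Hamlet.', {
  maxTokens: 128,
  temperature: 0.5,
  topP: 0.9,
  stopSequences: ['\n\n'],
  systemPrompt: 'You are a concise literary assistant.',
  preferredExecutionTarget: ExecutionTarget.onDevice, // assumed member name
  preferredFramework: LLMFramework.llamaCPP, // assumed member name
})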

Returns

interface GenerationResult {
  /** Generated text (thinking content removed if extracted) */
  text: string

  /** Extracted thinking/reasoning content (if model supports it) */
  thinkingContent?: string

  /** Total tokens used (prompt + response) */
  tokensUsed: number

  /** Number of tokens in the response */
  responseTokens: number

  /** Model ID that was used */
  modelUsed: string

  /** Total latency in milliseconds */
  latencyMs: number

  /** Execution target (onDevice/cloud/hybrid) */
  executionTarget: ExecutionTarget

  /** Framework used for inference */
  framework?: LLMFramework

  /** Hardware acceleration used */
  hardwareUsed: HardwareAcceleration

  /** Memory used during generation (bytes) */
  memoryUsed: number

  /** Detailed performance metrics */
  performanceMetrics: PerformanceMetrics
}

interface PerformanceMetrics {
  /** Time to first token in ms */
  timeToFirstTokenMs?: number

  /** Tokens generated per second */
  tokensPerSecond?: number

  /** Total inference time in ms */
  inferenceTimeMs: number
}
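
If tokensPerSecond is absent, an equivalent figure can be derived from the other fields. A sketch, assuming throughput counts response tokens over the inference window:

import { GenerationResult } from '@runanywhere/core'

// Prefer the reported metric; otherwise derive it from responseTokens and inferenceTimeMs.
function effectiveTokensPerSecond(result: GenerationResult): number | undefined {
  const { tokensPerSecond, inferenceTimeMs } = result.performanceMetrics
  if (tokensPerSecond !== undefined) return tokensPerSecond
  if (inferenceTimeMs <= 0) return undefined
  return result.responseTokens / (inferenceTimeMs / 1000)
}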

Generation Options

Temperature

Controls randomness in the output. Lower values make output more focused and deterministic; higher values make it more varied and creative.

// Creative writing - higher temperature
const creative = await RunAnywhere.generate('Write a poem about the ocean', {
  temperature: 1.2,
  maxTokens: 150,
})

// Factual response - lower temperature
const factual = await RunAnywhere.generate('What is the boiling point of water?', {
  temperature: 0.1,
  maxTokens: 50,
})

// Balanced (default)
const balanced = await RunAnywhere.generate('Explain machine learning', {
  temperature: 0.7,
  maxTokens: 200,
})

Temperature   Use Case
0.0–0.3       Factual, deterministic responses
0.4–0.7       Balanced, general-purpose
0.8–1.2       Creative, varied outputs
1.3–2.0       Very creative, experimental
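
These ranges can be captured as named presets (the preset names are illustrative, not part of the SDK):

import { RunAnywhere } from '@runanywhere/core'

// Illustrative presets matching the table above.
const TEMPERATURE = {
  factual: 0.2,      // 0.0–0.3
  balanced: 0.7,     // 0.4–0.7 (SDK default)
  creative: 1.0,     // 0.8–1.2
  experimental: 1.5, // 1.3–2.0
} as const

const answer = await RunAnywhere.generate('Name the capital of France.', {
  temperature: TEMPERATURE.factual,
  maxTokens: 20,
})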

Max Tokens

Limits the length of the generated response.

// Short answer
const short = await RunAnywhere.generate('What is 2+2?', { maxTokens: 10 })

// Detailed explanation
const detailed = await RunAnywhere.generate('Explain how computers work', { maxTokens: 500 })

Stop Sequences

Stop generation when specific sequences are encountered.

const result = await RunAnywhere.generate('List 3 fruits:', {
  maxTokens: 100,
  stopSequences: ['4.', '\n\n'], // Stop at 4th item or double newline
})

System Prompts

Define the AI’s behavior and persona.

const result = await RunAnywhere.generate('What is the best programming language?', {
  maxTokens: 200,
  systemPrompt: 'You are a helpful coding assistant. Be concise and practical.',
})

See System Prompts for more details.

Examples

Full Example with Metrics

async function generateWithMetrics(prompt: string) {
  const result = await RunAnywhere.generate(prompt, {
    maxTokens: 200,
    temperature: 0.7,
  })

  console.log('=== Generation Results ===')
  console.log('Response:', result.text)
  console.log('')
  console.log('=== Metrics ===')
  console.log('Tokens used:', result.tokensUsed)
  console.log('Response tokens:', result.responseTokens)
  console.log('Total latency:', result.latencyMs, 'ms')
  console.log('TTFT:', result.performanceMetrics.timeToFirstTokenMs, 'ms')
  console.log('Speed:', result.performanceMetrics.tokensPerSecond?.toFixed(1), 'tok/s')
  console.log('Hardware:', result.hardwareUsed)
  console.log('Memory used:', (result.memoryUsed / 1024 / 1024).toFixed(1), 'MB')

  return result
}

React Hook

useGenerate.ts
import { useState, useCallback } from 'react'
import { RunAnywhere, GenerationOptions, GenerationResult } from '@runanywhere/core'

export function useGenerate() {
  const [result, setResult] = useState<GenerationResult | null>(null)
  const [isLoading, setIsLoading] = useState(false)
  const [error, setError] = useState<Error | null>(null)

  const generate = useCallback(async (prompt: string, options?: GenerationOptions) => {
    setIsLoading(true)
    setError(null)

    try {
      const res = await RunAnywhere.generate(prompt, options)
      setResult(res)
      return res
    } catch (err) {
      const e = err instanceof Error ? err : new Error('Generation failed')
      setError(e)
      throw e
    } finally {
      setIsLoading(false)
    }
  }, [])

  const reset = useCallback(() => {
    setResult(null)
    setError(null)
  }, [])

  return { generate, result, isLoading, error, reset }
}
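
A component can call the hook directly. A minimal sketch (the component and prompt are illustrative):

GenerateDemo.tsx
import React from 'react'
import { useGenerate } from './useGenerate'

export function GenerateDemo() {
  const { generate, result, isLoading, error } = useGenerate()

  // Errors are surfaced via the hook's error state, so the rejection is swallowed here.
  const onGenerate = () =>
    generate('Explain machine learning', { maxTokens: 200 }).catch(() => {})

  return (
    <div>
      <button disabled={isLoading} onClick={onGenerate}>
        {isLoading ? 'Generating…' : 'Generate'}
      </button>
      {error && <p>Error: {error.message}</p>}
      {result && <p>{result.text}</p>}
    </div>
  )
}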

Thinking Models

Some models support “thinking” or reasoning before responding:

// Assumption: LlamaCPP is exported from the core package; adjust the import to your setup.
import { LlamaCPP } from '@runanywhere/core'

// Add a model with thinking support
await LlamaCPP.addModel({
  id: 'qwq-32b',
  name: 'QwQ 32B',
  url: 'https://huggingface.co/.../qwq-32b-q4_k_m.gguf',
  memoryRequirement: 20_000_000_000,
  supportsThinking: true, // Enable thinking extraction
})

// Generate with thinking
const result = await RunAnywhere.generate('Solve this step by step: What is 15% of 240?', {
  maxTokens: 500,
})

console.log('Thinking:', result.thinkingContent)
// "Let me calculate 15% of 240. First, I'll convert 15% to a decimal..."

console.log('Answer:', result.text)
// "15% of 240 is 36."

Cancellation

Cancel an ongoing generation:

// isSDKError and SDKErrorCode are assumed to be exported from the core package.
import { RunAnywhere, isSDKError, SDKErrorCode } from '@runanywhere/core'

// Start generation
const promise = RunAnywhere.generate('Write a long story...', { maxTokens: 1000 })

// Cancel after 2 seconds
setTimeout(() => {
  RunAnywhere.cancelGeneration()
}, 2000)

try {
  const result = await promise
} catch (error) {
  if (isSDKError(error) && error.code === SDKErrorCode.generationCancelled) {
    console.log('Generation was cancelled')
  }
}
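
The same pattern extends to a timeout guard. A sketch, assuming cancelGeneration() rejects the pending promise with generationCancelled as shown above:

import { RunAnywhere, GenerationResult, isSDKError, SDKErrorCode } from '@runanywhere/core'

// Cancel the generation if it runs longer than timeoutMs; resolves to null on timeout.
async function generateWithTimeout(
  prompt: string,
  timeoutMs: number,
): Promise<GenerationResult | null> {
  const timer = setTimeout(() => RunAnywhere.cancelGeneration(), timeoutMs)
  try {
    return await RunAnywhere.generate(prompt, { maxTokens: 1000 })
  } catch (error) {
    if (isSDKError(error) && error.code === SDKErrorCode.generationCancelled) {
      return null
    }
    throw error
  } finally {
    clearTimeout(timer)
  }
}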