Early Beta — The Web SDK is in early beta. APIs may change between releases.

Overview

The generate() method gives you full control over text generation, with configurable sampling options and detailed performance metrics. Use it in production applications that need fine-grained control over generation behavior.

Basic Usage

import { TextGeneration } from '@runanywhere/web'

const result = await TextGeneration.generate('Explain quantum computing in simple terms', {
  maxTokens: 200,
  temperature: 0.7,
})

console.log('Response:', result.text)
console.log('Tokens used:', result.tokensUsed)
console.log('Speed:', result.tokensPerSecond.toFixed(1), 'tok/s')
console.log('Latency:', result.latencyMs, 'ms')

API Reference

await TextGeneration.generate(
  prompt: string,
  options?: LLMGenerationOptions
): Promise<LLMGenerationResult>
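
Since options is optional, the simplest call passes only a prompt and relies on the defaults listed below:

const result = await TextGeneration.generate('Hello!')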

Parameters

interface LLMGenerationOptions {
  /** Maximum tokens to generate (default: 256) */
  maxTokens?: number

  /** Sampling temperature 0.0-2.0 (default: 0.7) */
  temperature?: number

  /** Top-p nucleus sampling (default: 0.95) */
  topP?: number

  /** Top-k sampling */
  topK?: number

  /** Stop generation at these sequences */
  stopSequences?: string[]

  /** System prompt to define AI behavior */
  systemPrompt?: string

  /** Enable streaming mode */
  streamingEnabled?: boolean
}
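
For reference, here is a single request that exercises most of these options together. The values are illustrative for a summary task, not recommendations:

const result = await TextGeneration.generate('Summarize the plot of Hamlet', {
  maxTokens: 300,            // cap the response length
  temperature: 0.5,          // moderately focused sampling
  topP: 0.9,                 // nucleus sampling cutoff
  topK: 40,                  // sample from the 40 most likely tokens
  stopSequences: ['\n\n'],   // stop at the first blank line
  systemPrompt: 'You are a concise literary assistant.',
})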

Returns

interface LLMGenerationResult {
  /** Generated text */
  text: string

  /** Extracted thinking/reasoning content (if model supports it) */
  thinkingContent?: string

  /** Number of input tokens */
  inputTokens: number

  /** Total tokens used (prompt + response) */
  tokensUsed: number

  /** Model ID that was used */
  modelUsed: string

  /** Total latency in milliseconds */
  latencyMs: number

  /** Framework used for inference */
  framework: LLMFramework

  /** Hardware acceleration used */
  hardwareUsed: HardwareAcceleration

  /** Tokens generated per second */
  tokensPerSecond: number

  /** Time to first token in ms */
  timeToFirstTokenMs?: number

  /** Thinking tokens count */
  thinkingTokens: number

  /** Response tokens count */
  responseTokens: number
}
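
Beyond the generated text, the result also reports which model, inference framework, and hardware served the request. A quick way to inspect those fields:

const result = await TextGeneration.generate('Hello!')
console.log('Model:', result.modelUsed)
console.log('Framework:', result.framework)
console.log('Hardware:', result.hardwareUsed)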

Generation Options

Temperature

Controls randomness in the output. Lower values make output more focused and deterministic; higher values produce more varied, creative output.

// Creative writing - higher temperature
const creative = await TextGeneration.generate('Write a poem about the ocean', {
  temperature: 1.2,
  maxTokens: 150,
})

// Factual response - lower temperature
const factual = await TextGeneration.generate('What is the boiling point of water?', {
  temperature: 0.1,
  maxTokens: 50,
})

// Balanced (default)
const balanced = await TextGeneration.generate('Explain machine learning', {
  temperature: 0.7,
  maxTokens: 200,
})

Temperature    Use Case
0.0-0.3        Factual, deterministic responses
0.4-0.7        Balanced, general-purpose
0.8-1.2        Creative, varied outputs
1.3-2.0        Very creative, experimental
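
If one call site serves several kinds of tasks, a small preset map keyed to these ranges keeps the temperature choice explicit. A minimal sketch; the preset names and values are illustrative:

const TEMPERATURE_PRESETS = {
  factual: 0.2,
  balanced: 0.7,
  creative: 1.0,
} as const

async function generateFor(task: keyof typeof TEMPERATURE_PRESETS, prompt: string) {
  return TextGeneration.generate(prompt, { temperature: TEMPERATURE_PRESETS[task] })
}

// e.g. await generateFor('factual', 'What is the boiling point of water?')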

Max Tokens

Caps the number of tokens in the generated response. Generation may end sooner if the model finishes its answer or hits a stop sequence.

// Short answer
const short = await TextGeneration.generate('What is 2+2?', { maxTokens: 10 })

// Detailed explanation
const detailed = await TextGeneration.generate('Explain how computers work', { maxTokens: 500 })

Stop Sequences

Stop generation when any of the given sequences is encountered.

const result = await TextGeneration.generate('List 3 fruits:', {
  maxTokens: 100,
  // Halts before a fourth item ('4.') or at the first blank line
  stopSequences: ['4.', '\n\n'],
})

System Prompts

Define the AI’s behavior and persona.

const result = await TextGeneration.generate('What is the best programming language?', {
  maxTokens: 200,
  systemPrompt: 'You are a helpful coding assistant. Be concise and practical.',
})
See System Prompts for more details.

Examples

Full Example with Metrics

async function generateWithMetrics(prompt: string) {
  const result = await TextGeneration.generate(prompt, {
    maxTokens: 200,
    temperature: 0.7,
  })

  console.log('=== Generation Results ===')
  console.log('Response:', result.text)
  console.log('')
  console.log('=== Metrics ===')
  console.log('Input tokens:', result.inputTokens)
  console.log('Response tokens:', result.responseTokens)
  console.log('Total latency:', result.latencyMs, 'ms')
  console.log('TTFT:', result.timeToFirstTokenMs, 'ms')
  console.log('Speed:', result.tokensPerSecond.toFixed(1), 'tok/s')
  console.log('Hardware:', result.hardwareUsed)

  return result
}

Thinking Models

Some models support “thinking” or reasoning before responding:

const result = await TextGeneration.generate('Solve this step by step: What is 15% of 240?', {
  maxTokens: 500,
})

if (result.thinkingContent) {
  console.log('Thinking:', result.thinkingContent)
  // "Let me calculate 15% of 240. First, I'll convert 15% to a decimal..."
}

console.log('Answer:', result.text)
// "15% of 240 is 36."

console.log('Thinking tokens:', result.thinkingTokens)
console.log('Response tokens:', result.responseTokens)

Cancellation

Cancel an ongoing generation:

import { TextGeneration, SDKError, SDKErrorCode } from '@runanywhere/web'
// (assumes SDKError and SDKErrorCode are exported from the package root)

// Start generation
const promise = TextGeneration.generate('Write a long story...', { maxTokens: 1000 })

// Cancel after 2 seconds
setTimeout(() => {
  TextGeneration.cancel()
}, 2000)

try {
  const result = await promise
  console.log('Completed:', result.text)
} catch (err) {
  if (err instanceof SDKError && err.code === SDKErrorCode.GenerationCancelled) {
    console.log('Generation was cancelled')
  }
}
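
A reusable variant of this pattern is a deadline wrapper that cancels generation after a time budget and returns null instead of throwing. A sketch built only on the calls shown above; the function name and null-on-cancel behavior are illustrative choices:

async function generateWithDeadline(prompt: string, ms: number) {
  const timer = setTimeout(() => TextGeneration.cancel(), ms)
  try {
    return await TextGeneration.generate(prompt, { maxTokens: 1000 })
  } catch (err) {
    if (err instanceof SDKError && err.code === SDKErrorCode.GenerationCancelled) {
      return null // deadline hit; generation was cancelled
    }
    throw err
  } finally {
    clearTimeout(timer) // avoid cancelling a later, unrelated generation
  }
}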