The generate() method provides full control over text generation with customizable options and detailed performance metrics. Use this for production applications where you need fine-grained control.

    interface GenerationResult {
      /** Generated text (thinking content removed if extracted) */
      text: string
      /** Extracted thinking/reasoning content (if model supports it) */
      thinkingContent?: string
      /** Total tokens used (prompt + response) */
      tokensUsed: number
      /** Number of tokens in the response */
      responseTokens: number
      /** Model ID that was used */
      modelUsed: string
      /** Total latency in milliseconds */
      latencyMs: number
      /** Execution target (onDevice/cloud/hybrid) */
      executionTarget: ExecutionTarget
      /** Framework used for inference */
      framework?: LLMFramework
      /** Hardware acceleration used */
      hardwareUsed: HardwareAcceleration
      /** Memory used during generation (bytes) */
      memoryUsed: number
      /** Detailed performance metrics */
      performanceMetrics: PerformanceMetrics
    }

    interface PerformanceMetrics {
      /** Time to first token in ms */
      timeToFirstTokenMs?: number
      /** Tokens generated per second */
      tokensPerSecond?: number
      /** Total inference time in ms */
      inferenceTimeMs: number
    }


    const result = await RunAnywhere.generate('What is the best programming language?', {
      maxTokens: 200,
      systemPrompt: 'You are a helpful coding assistant. Be concise and practical.',
    })

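The metrics on a returned result can be condensed into a single log line. The following is a minimal sketch, not part of the SDK: `summarizeGeneration` and `GenerationSummaryInput` are hypothetical names, and the input type is a local structural subset of the `GenerationResult` fields documented above so the snippet compiles on its own. It assumes that when `tokensPerSecond` is absent, a reasonable estimate is `responseTokens` divided by the inference time in seconds.

```typescript
// Local structural subset of the documented GenerationResult fields,
// declared here so the sketch compiles without importing the SDK.
interface GenerationSummaryInput {
  responseTokens: number
  tokensUsed: number
  latencyMs: number
  performanceMetrics: {
    timeToFirstTokenMs?: number
    tokensPerSecond?: number
    inferenceTimeMs: number
  }
}

// Hypothetical helper (not an SDK function): formats key metrics as one line.
function summarizeGeneration(result: GenerationSummaryInput): string {
  const m = result.performanceMetrics

  // Prefer the reported throughput; otherwise estimate it from the
  // response token count and inference time (assumption, see lead-in).
  const tokensPerSecond =
    m.tokensPerSecond ??
    (m.inferenceTimeMs > 0
      ? result.responseTokens / (m.inferenceTimeMs / 1000)
      : undefined)

  const parts = [
    `tokens: ${result.responseTokens}/${result.tokensUsed}`,
    `latency: ${result.latencyMs} ms`,
  ]
  // Optional metrics are only included when a value is available.
  if (m.timeToFirstTokenMs !== undefined) {
    parts.push(`ttft: ${m.timeToFirstTokenMs} ms`)
  }
  if (tokensPerSecond !== undefined) {
    parts.push(`speed: ${tokensPerSecond.toFixed(1)} tok/s`)
  }
  return parts.join(', ')
}
```

Because the input type only names a subset of fields, a full `GenerationResult` should be accepted as-is, for example `console.log(summarizeGeneration(result))`.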
Some models support “thinking” or reasoning before responding:

    // Add a model with thinking support
    await LlamaCPP.addModel({
      id: 'qwq-32b',
      name: 'QwQ 32B',
      url: 'https://huggingface.co/.../qwq-32b-q4_k_m.gguf',
      memoryRequirement: 20_000_000_000,
      supportsThinking: true, // Enable thinking extraction
    })

    // Generate with thinking
    const result = await RunAnywhere.generate('Solve this step by step: What is 15% of 240?', {
      maxTokens: 500,
    })

    console.log('Thinking:', result.thinkingContent)
    // "Let me calculate 15% of 240. First, I'll convert 15% to a decimal..."

    console.log('Answer:', result.text)
    // "15% of 240 is 36."
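
Since `thinkingContent` is optional, display code should handle results that carry no reasoning. The sketch below shows one way to do that. `formatAnswer` and `ThinkingResult` are hypothetical names, not SDK exports, and the type is a local subset of the documented `GenerationResult` fields.

```typescript
// Local structural subset of the documented GenerationResult fields.
interface ThinkingResult {
  text: string
  thinkingContent?: string
}

// Hypothetical helper (not an SDK function): returns the answer text,
// prefixed with the model's reasoning when it exists and is requested.
function formatAnswer(result: ThinkingResult, showThinking: boolean): string {
  // Treat missing or whitespace-only reasoning as absent.
  const thinking = result.thinkingContent?.trim()
  if (showThinking && thinking) {
    return `Reasoning:\n${thinking}\n\nAnswer:\n${result.text}`
  }
  return result.text
}
```

Keeping the reasoning behind a flag lets an app show it in a debug view while end users see only `result.text`.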