Overview

The STT (speech-to-text) API exposes options that control transcription behavior: language selection, punctuation, speaker diarization, word-level timestamps, and audio sample rate.

Options Reference

interface STTOptions {
  /** Language code for transcription */
  language?: string

  /** Enable punctuation in output */
  punctuation?: boolean

  /** Enable speaker diarization (multi-speaker) */
  diarization?: boolean

  /** Enable word-level timestamps */
  wordTimestamps?: boolean

  /** Audio sample rate (default: 16000) */
  sampleRate?: number
}

Language Support

Specify the language code to improve accuracy:
// English
const english = await RunAnywhere.transcribeFile(path, { language: 'en' })

// Spanish
const spanish = await RunAnywhere.transcribeFile(path, { language: 'es' })

// French
const french = await RunAnywhere.transcribeFile(path, { language: 'fr' })

// Auto-detect (slower)
const auto = await RunAnywhere.transcribeFile(path)

Supported Languages

Code  Language    Code  Language
en    English     ja    Japanese
es    Spanish     ko    Korean
fr    French      pt    Portuguese
de    German      ru    Russian
it    Italian     zh    Chinese
nl    Dutch       ar    Arabic
pl    Polish      hi    Hindi
Language-specific models (e.g., whisper-tiny.en) only support that language but are more accurate and faster.
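
In practice the language often comes from a device locale such as en-US, which has to be reduced to one of the codes in the table above before it is passed as language. The following is a minimal sketch of that step. It assumes the table lists every supported code, and resolveLanguage is a hypothetical helper, not part of the RunAnywhere SDK.

```typescript
// Codes copied from the Supported Languages table (assumed to be complete).
const SUPPORTED_LANGUAGES = [
  'en', 'es', 'fr', 'de', 'it', 'nl', 'pl',
  'ja', 'ko', 'pt', 'ru', 'zh', 'ar', 'hi',
] as const

type SupportedLanguage = (typeof SUPPORTED_LANGUAGES)[number]

// Hypothetical helper: reduces a locale tag such as "en-US" or "pt_BR" to a
// supported language code. Returns undefined when the language is not in the
// list, so the caller can omit `language` and fall back to auto-detection.
function resolveLanguage(locale: string): SupportedLanguage | undefined {
  const primary = locale.trim().toLowerCase().split(/[-_]/)[0]
  return SUPPORTED_LANGUAGES.find((code) => code === primary)
}
```

When the helper returns a code, pass it as { language: code }. When it returns undefined, leave the option out so that auto-detection is used.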

Punctuation

Add punctuation to transcription output:
// Without punctuation (default for some models)
const noPunct = await RunAnywhere.transcribeFile(path, {
  language: 'en',
  punctuation: false,
})
// "hello how are you today"

// With punctuation
const withPunct = await RunAnywhere.transcribeFile(path, {
  language: 'en',
  punctuation: true,
})
// "Hello, how are you today?"

Word Timestamps

Get timing information for each word:
const result = await RunAnywhere.transcribeFile(audioPath, {
  language: 'en',
  wordTimestamps: true,
})

console.log('Transcription:', result.text)

// Each segment carries its own start and end time (in seconds)
for (const segment of result.segments) {
  console.log(`[${segment.startTime.toFixed(2)}s - ${segment.endTime.toFixed(2)}s] ${segment.text}`)
}

Use Cases

  • Subtitles/Captions: Sync text with video
  • Karaoke: Highlight words as they’re spoken
  • Search: Jump to specific moments in audio
  • Accessibility: Show words as they’re spoken
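
As an illustration of the search use case, the sketch below finds the moments at which a phrase is spoken. The TimedSegment shape is an assumption based on the fields used in the example above (startTime, endTime, text), and findMoments is a hypothetical helper rather than an SDK function.

```typescript
// Assumed segment shape, based on the fields used in the examples in this guide.
interface TimedSegment {
  startTime: number // seconds
  endTime: number // seconds
  text: string
}

// Hypothetical helper: returns the start time (in seconds) of every segment
// whose text contains the query, ignoring case. An empty query matches nothing.
function findMoments(segments: TimedSegment[], query: string): number[] {
  const needle = query.trim().toLowerCase()
  if (needle === '') return []
  return segments
    .filter((segment) => segment.text.toLowerCase().includes(needle))
    .map((segment) => segment.startTime)
}
```

The returned start times can be handed to an audio player's seek function to jump to each match.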

Example: Subtitle Generator

interface Subtitle {
  start: number
  end: number
  text: string
}

async function generateSubtitles(audioPath: string): Promise<Subtitle[]> {
  const result = await RunAnywhere.transcribeFile(audioPath, {
    language: 'en',
    wordTimestamps: true,
  })

  return result.segments.map((segment) => ({
    start: segment.startTime,
    end: segment.endTime,
    text: segment.text.trim(),
  }))
}

// Convert to SRT format
function toSRT(subtitles: Subtitle[]): string {
  return subtitles
    .map((sub, i) => {
      const start = formatSRTTime(sub.start)
      const end = formatSRTTime(sub.end)
      return `${i + 1}\n${start} --> ${end}\n${sub.text}\n`
    })
    .join('\n')
}

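function formatSRTTime(seconds: number): string {
  // Work in whole milliseconds: (seconds % 1) * 1000 can land just below the
  // intended value (e.g. 1.001 would yield 0 ms) because of floating-point error.
  const totalMs = Math.round(seconds * 1000)
  const h = Math.floor(totalMs / 3_600_000)
  const m = Math.floor((totalMs % 3_600_000) / 60_000)
  const s = Math.floor((totalMs % 60_000) / 1000)
  const ms = totalMs % 1000
  const pad = (n: number, width = 2) => n.toString().padStart(width, '0')
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`
}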

Speaker Diarization

Identify different speakers in the audio:
const result = await RunAnywhere.transcribeFile(audioPath, {
  language: 'en',
  diarization: true,
})

// Segments include speaker IDs
for (const segment of result.segments) {
  console.log(`[Speaker ${segment.speakerId}]: ${segment.text}`)
}
Speaker diarization is computationally expensive and may not be available on all models. Check model documentation for support.
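
Diarized output is often easier to read when consecutive segments from the same speaker are merged into a single turn. The sketch below shows one way to do that. The SpeakerSegment shape is an assumption (this guide does not specify the type of speakerId), and groupBySpeaker is a hypothetical helper.

```typescript
// Assumed segment shape when diarization is enabled. The type of speakerId
// is not specified in this guide, so both string and number are accepted.
interface SpeakerSegment {
  speakerId: string | number
  startTime: number // seconds
  endTime: number // seconds
  text: string
}

// Hypothetical helper: merges consecutive segments from the same speaker
// into one turn. Input segments are copied, not mutated.
function groupBySpeaker(segments: SpeakerSegment[]): SpeakerSegment[] {
  const turns: SpeakerSegment[] = []
  for (const segment of segments) {
    const last = turns[turns.length - 1]
    if (last !== undefined && last.speakerId === segment.speakerId) {
      // Same speaker continues: extend the current turn.
      last.endTime = segment.endTime
      last.text = `${last.text} ${segment.text.trim()}`
    } else {
      // Speaker changed (or first segment): start a new turn.
      turns.push({ ...segment, text: segment.text.trim() })
    }
  }
  return turns
}
```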

Sample Rate

Specify the audio sample rate if different from the default:
// Standard 16kHz (recommended for STT)
const standard = await RunAnywhere.transcribeBuffer(samples, 16000)

// Higher quality 44.1kHz (will be downsampled internally)
const highQuality = await RunAnywhere.transcribeBuffer(samples, 44100, {
  sampleRate: 44100,
})
For best results, record audio at 16kHz mono. Higher sample rates will be downsampled, which adds processing overhead.
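
If the audio cannot be recorded at 16kHz, it can be downsampled once ahead of time instead of on every transcription call. The sketch below uses plain linear interpolation and assumes samples is a Float32Array of mono PCM. resampleLinear is a hypothetical helper, and it applies no low-pass filter, so some aliasing is possible. A dedicated resampling library is the better choice for production audio.

```typescript
// Hypothetical helper: naive linear-interpolation resampler for mono PCM.
// No anti-aliasing filter is applied; this is a sketch, not a production resampler.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate <= 0 || toRate <= 0) {
    throw new RangeError('Sample rates must be positive')
  }
  if (fromRate === toRate || input.length === 0) {
    return input.slice()
  }

  const outputLength = Math.floor((input.length * toRate) / fromRate)
  const output = new Float32Array(outputLength)
  const step = fromRate / toRate // input samples per output sample

  for (let i = 0; i < outputLength; i++) {
    const position = i * step
    const left = Math.floor(position)
    const right = Math.min(left + 1, input.length - 1)
    const fraction = position - left
    // Blend the two nearest input samples.
    const a = input[left] ?? 0
    const b = input[right] ?? 0
    output[i] = a * (1 - fraction) + b * fraction
  }
  return output
}
```

The result could then be passed to transcribeBuffer with a sample rate of 16000, as in the first example above.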

Model Loading Options

Configure model loading:
// Load STT model with specific type
await RunAnywhere.loadSTTModel(modelPath, 'whisper')

// Check if model is loaded
const isLoaded = await RunAnywhere.isSTTModelLoaded()

// Unload when done
await RunAnywhere.unloadSTTModel()

Combining Options

// Full-featured transcription
const result = await RunAnywhere.transcribeFile(audioPath, {
  language: 'en',
  punctuation: true,
  wordTimestamps: true,
})

// Access all result data
console.log('Text:', result.text)
console.log('Language:', result.language)
console.log('Confidence:', result.confidence)
console.log('Duration:', result.duration, 'seconds')
console.log('Segments:', result.segments.length)

// Alternatives (if available)
if (result.alternatives.length > 0) {
  console.log('Alternatives:')
  result.alternatives.forEach((alt, i) => {
    console.log(`  ${i + 1}. ${alt.text} (confidence: ${alt.confidence})`)
  })
}

Performance vs Accuracy

Option               Impact on Speed   Impact on Accuracy
language specified   Faster            Better
wordTimestamps       Slower            Same
diarization          Much slower       Same
punctuation          Minimal           Same
For best performance, always specify the language option rather than relying on auto-detection.