Speech-to-Text (STT) converts what callers say into text for your AI to understand. Burki Voice AI supports multiple STT providers—choose based on your needs for speed, language support, or enterprise features.

Provider Comparison

| Provider | Provider Key | Models | Best For | Notes |
|---|---|---|---|---|
| Deepgram | deepgram | nova-3, nova-2, nova, enhanced, base, flux-general-* | Fast phone calls, English keyterms, Flux conversational STT | Default STT provider |
| ElevenLabs | elevenlabs | scribe_v2_realtime | Multi-language realtime recognition | Uses elevenlabs_config |
| Azure Speech | azure | standard, enhanced, neural | Enterprise speech workloads | Available when Azure STT dependencies/config are present |
| Uplift | uplift | default, scribe, scribe-mini | Urdu/South Asian language workflows | Defaults to ur, µ-law, 8 kHz when unset |
| Speechmatics | speechmatics | enhanced, standard | Broad language coverage and hosted STT | Uses speechmatics_config |
| Telnyx STT | telnyx | deepgram/nova-3, deepgram/flux, Google/Azure/Telnyx Whisper hosted models | Telnyx-hosted transcription with carrier key reuse | Defaults to deepgram/nova-3 |
| Soniox | soniox | stt-rt-v4, stt-rt-v3 | Realtime STT with Soniox models | Uses soniox_config |

⚡ Deepgram

Ultra-Low Latency: ~100ms response time, optimized for phone calls. Nova-3 keyterms for English, Nova-2 for multi-language.

🎙️ ElevenLabs Scribe v2

Multi-Language Excellence: Vendor-reported low latency, broad language support, and advanced VAD-based speech detection.

☁️ Azure Speech

Enterprise Scale: 100+ languages, Microsoft ecosystem integration, phrase lists for term boosting, custom speech models.

🌐 Speechmatics

Hosted STT: Enhanced and standard models with provider-specific config.

📡 Telnyx STT

Carrier-Hosted Models: Telnyx-hosted Deepgram, Google, Azure, and Whisper model routes.

⚙️ Soniox / Uplift

Specialized Realtime STT: Soniox realtime models and Uplift Scribe models.

Deepgram

Deepgram is the default STT provider, optimized for speed and phone call quality.

Models

| Model | Features | Keywords | Keyterms | Best For |
|---|---|---|---|---|
| Nova-3 | Latest, keyterms support | No | Yes | English calls, best accuracy |
| Nova-2 | Keywords support | Yes | No | Multi-language, reliable |
| Nova | Keywords support | Yes | No | Balanced performance |
| Enhanced | Keywords support | Yes | No | Legacy support |
| Base | Keywords support | Yes | No | Basic transcription |
| Flux (flux-general-en, flux-general-es, flux-general-multi) | Conversational realtime STT path | Provider-dependent | Provider-dependent | Fast turn-taking and mid-call Flux swaps |
Recommended: Use Nova-3 for English calls (supports keyterms) or Nova-2 for other languages (supports keywords).

Configuration

{
  "stt_settings": {
    "provider": "deepgram",
    "model": "nova-3",
    "language": "en-US"
  }
}
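
The model recommendation above can be expressed as a small helper. This is a minimal sketch rather than part of Burki's API: `build_deepgram_stt_settings` is a hypothetical name, and treating any language code whose first segment is `en` as English is an assumption.

```python
import json


def build_deepgram_stt_settings(language: str) -> dict:
    """Build an stt_settings payload for Deepgram (hypothetical helper).

    Picks Nova-3 for English (keyterms) and Nova-2 otherwise (keywords),
    following the recommendation on this page.
    """
    # Assumption: English codes look like "en", "en-US", "en-GB", ...
    is_english = language.lower().split("-")[0] == "en"
    model = "nova-3" if is_english else "nova-2"
    return {
        "stt_settings": {
            "provider": "deepgram",
            "model": model,
            "language": language,
        }
    }


# Example: a Spanish-language assistant falls back to Nova-2.
print(json.dumps(build_deepgram_stt_settings("es-ES"), indent=2))
```

The printed payload has the same shape as the configuration shown above, so it can be pasted into assistant settings after review.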

ElevenLabs Scribe v2

ElevenLabs Scribe v2 Realtime provides low-latency speech recognition with broad multi-language support and advanced voice activity detection (VAD).
Key Features:
  • Vendor-reported low-latency realtime recognition; actual accuracy depends on language, audio quality, and model configuration
  • 90+ languages supported
  • Advanced VAD-based commit strategy
  • Word-level timestamps support
  • Automatic language detection
Setup:
  1. Sign up at ElevenLabs
  2. Get your API key from the dashboard
  3. Configure in assistant settings
Configuration:
{
  "stt_settings": {
    "provider": "elevenlabs",
    "model": "scribe_v2_realtime",
    "language": "en",
    "elevenlabs_config": {
      "commit_strategy": "vad",
      "vad_threshold": 0.4,
      "vad_silence_threshold_secs": 1.5
    }
  }
}

📖 Full ElevenLabs Documentation

See the complete ElevenLabs Scribe v2 guide for VAD settings, language options, and best practices.

Azure Speech

Azure Speech provides managed speech recognition with broad language support and Microsoft ecosystem integration.
Key Features:
  • 100+ languages and regional variants
  • Phrase lists for domain-specific term boosting
  • Custom speech models for specialized vocabulary
  • Speaker diarization support
Setup:
  1. Create an Azure Speech resource using the Azure AI Speech quickstart
  2. Get your subscription key and region
  3. Configure in assistant settings
Configuration:
{
  "stt_settings": {
    "provider": "azure",
    "model": "standard",
    "language": "en-US",
    "azure_config": {
      "subscription_key": "your_key",
      "region": "eastus"
    }
  }
}

📖 Full Azure Documentation

See the complete Azure Speech STT guide for models, languages, configuration options, and best practices.

Additional Supported Providers

These providers are wired in the backend STT factory and can be selected by stt_settings.provider.

Uplift

{
  "stt_settings": {
    "provider": "uplift",
    "model": "scribe",
    "language": "ur"
  }
}
Supported models: default, scribe, scribe-mini.

Speechmatics

{
  "stt_settings": {
    "provider": "speechmatics",
    "model": "enhanced",
    "language": "en"
  }
}
Supported models: enhanced, standard. Provider-specific settings live under speechmatics_config.

Telnyx STT

{
  "stt_settings": {
    "provider": "telnyx",
    "model": "deepgram/nova-3",
    "language": "en-US"
  }
}
Supported hosted model routes include deepgram/nova-3, deepgram/nova-2, deepgram/flux, Google, Azure, and Telnyx Whisper variants. Telnyx STT uses the organization’s Telnyx API key or managed carrier key.

Soniox

{
  "stt_settings": {
    "provider": "soniox",
    "model": "stt-rt-v4",
    "language": "en"
  }
}
Supported models: stt-rt-v4, stt-rt-v3. Provider-specific settings live under soniox_config.
OpenAI Whisper and Assembly appear in older enum/model mapping code but are not registered in the active STT factory. Do not configure them as live realtime STT providers unless the backend factory is updated.
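
A pre-flight check along these lines can catch an unregistered provider or a mismatched model before a call starts. It is only a sketch: `check_stt_settings` is a hypothetical helper, the model sets are copied from this page and may lag behind the backend factory, and Telnyx is treated as open-ended because its route list here is not exhaustive.

```python
# Provider keys and model names copied from this page; not read from the backend.
SUPPORTED_MODELS = {
    "deepgram": {
        "nova-3", "nova-2", "nova", "enhanced", "base",
        "flux-general-en", "flux-general-es", "flux-general-multi",
    },
    "elevenlabs": {"scribe_v2_realtime"},
    "azure": {"standard", "enhanced", "neural"},
    "uplift": {"default", "scribe", "scribe-mini"},
    "speechmatics": {"enhanced", "standard"},
    "soniox": {"stt-rt-v4", "stt-rt-v3"},
}
# Assumption: Telnyx model routes are open-ended, so only the provider key is checked.
OPEN_ENDED_PROVIDERS = {"telnyx"}


def check_stt_settings(stt_settings: dict) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    provider = stt_settings.get("provider")
    model = stt_settings.get("model")
    if provider in OPEN_ENDED_PROVIDERS:
        return []
    if provider not in SUPPORTED_MODELS:
        return [f"unsupported provider: {provider!r}"]
    if model is not None and model not in SUPPORTED_MODELS[provider]:
        return [f"model {model!r} is not listed for provider {provider!r}"]
    return []


print(check_stt_settings({"provider": "soniox", "model": "stt-rt-v4"}))
print(check_stt_settings({"provider": "uplift", "model": "nova-3"}))
```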

Key Settings

  • Provider: Choose from deepgram, elevenlabs, azure, uplift, speechmatics, telnyx, or soniox
  • Model: Choose based on your provider (nova-3, flux-general-en, scribe_v2_realtime, standard, deepgram/nova-3, stt-rt-v4, etc.)
  • Language: Select from common options or enter a custom language code
  • Custom Language: Enter any supported language code (e.g., fr-FR, es-ES)
The timing settings below (endpointing, min silence duration, utterance end timeout, and VAD events) control how the STT provider detects when someone has finished speaking. Getting these right is crucial for natural conversation flow.

Endpointing (Silence Threshold)

What it does: How long the provider waits after detecting silence before considering speech to have ended.
Technical Details:
  • Measured in: Milliseconds
  • Default: 10ms (minimal endpointing for real-time applications)
  • Range: 10ms - 2000ms (recommended)
  • Config Path: stt_settings.endpointing.silence_threshold
Real Example:
  • 10ms: Very responsive (default) - might cut off slow speakers
  • 500ms: “I need help with…” → 0.5s silence → Provider says “speech ended”
  • 1000ms: More patient (good for people who pause while thinking)
When to Adjust:
  • Lower (10-100ms): For fast talkers or quick interactions (default)
  • Higher (500-1000ms): For elderly callers or complex topics
  • Much higher (1500ms+): For people with speech difficulties
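
As an illustration of the config path above, the following sketch sets `endpointing.silence_threshold` on an existing settings dict. `with_silence_threshold` is a hypothetical helper, and rejecting values outside the recommended 10-2000 ms range is a choice made for this example, not documented backend behavior.

```python
def with_silence_threshold(stt_settings: dict, silence_threshold_ms: int) -> dict:
    """Return a copy of stt_settings with endpointing.silence_threshold set."""
    # Recommended range from this page; enforcing it strictly is this sketch's choice.
    if not 10 <= silence_threshold_ms <= 2000:
        raise ValueError("silence_threshold should be within 10-2000 ms")
    updated = dict(stt_settings)
    # Copy the nested dict so the caller's settings are not mutated.
    endpointing = dict(updated.get("endpointing", {}))
    endpointing["silence_threshold"] = silence_threshold_ms
    updated["endpointing"] = endpointing
    return updated


base = {"provider": "deepgram", "model": "nova-3", "language": "en-US"}
patient = with_silence_threshold(base, 500)
print(patient["endpointing"])
```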

Min Silence Duration

What it does: Internal timeout for utterance processing when the provider doesn’t send speech_final. This value is not sent to the provider API.
Technical Details:
  • Measured in: Milliseconds
  • Default: 1500ms
  • Range: 500ms - 5000ms (recommended)
  • Config Path: stt_settings.endpointing.min_silence_duration
  • Used for: Call handler utterance timeout logic when speech_final is missing
Real Example:
  • 1500ms: Wait 1.5s for speech_final, then process accumulated utterance (default)
  • 1000ms: Quicker timeout for responsive conversation
  • 2500ms: More patience for complex responses or noisy environments
When to Adjust:
  • Lower (500-1000ms): For quick, responsive interactions
  • Higher (2000-3000ms): For environments with background noise where speech_final may be unreliable
  • Match with conversation style: Shorter for rapid-fire Q&A, longer for detailed discussions
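
The timeout behavior described above can be pictured with a small buffer that flushes either on `speech_final` or once `min_silence_duration` has elapsed since the last fragment. This is a simplified sketch under stated assumptions, not the actual call handler: `UtteranceBuffer` and its methods are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```python
class UtteranceBuffer:
    """Accumulates transcript fragments until speech_final or a timeout (hypothetical)."""

    def __init__(self, min_silence_duration_ms: int = 1500):
        self.min_silence_duration_ms = min_silence_duration_ms
        self._parts = []
        self._last_fragment_ms = 0

    def add_fragment(self, text: str, now_ms: int, speech_final: bool = False):
        """Store a fragment; return the full utterance if speech_final arrived."""
        self._parts.append(text)
        self._last_fragment_ms = now_ms
        return self._flush() if speech_final else None

    def poll(self, now_ms: int):
        """Return the accumulated utterance once the timeout has elapsed, else None."""
        waited = now_ms - self._last_fragment_ms
        if self._parts and waited >= self.min_silence_duration_ms:
            return self._flush()
        return None

    def _flush(self) -> str:
        utterance = " ".join(self._parts)
        self._parts = []
        return utterance


buf = UtteranceBuffer(min_silence_duration_ms=1500)
buf.add_fragment("I need", now_ms=0)
buf.add_fragment("help", now_ms=400)
print(buf.poll(now_ms=1000))  # still waiting for speech_final
print(buf.poll(now_ms=1900))  # timeout reached, utterance is processed
```

A lower `min_silence_duration` makes `poll` release the utterance sooner, which is the trade-off the list above describes.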

Utterance End Timeout

What it does: Maximum time the provider waits for a complete utterance before sending an UtteranceEnd event.
Technical Details:
  • Measured in: Milliseconds
  • Default: 1000ms
  • Range: 500ms - 5000ms (recommended)
  • Config Path: stt_settings.utterance_end_ms
  • API Parameter: utterance_end_ms
Real Example:
  • 1000ms: If someone starts talking but doesn’t finish within 1 second, provider sends UtteranceEnd (default)
  • 500ms: Quick timeout (might cut off long sentences)
  • 2000ms: Patient timeout (good for complex responses)
When to Adjust:
  • Lower (500-800ms): For short, quick interactions
  • Higher (1500-3000ms): For detailed conversations or forms
  • Consider your use case: Customer service vs. quick orders

VAD Events

What it does: Enables Voice Activity Detection (VAD) events, which improve speech detection and allow the provider to send UtteranceEnd events.
Technical Details:
  • Type: Boolean (true/false)
  • Default: true (enabled)
  • Config Path: stt_settings.vad_events
  • API Parameter: vad_events
Real Example:
  • true: Enhanced speech detection with UtteranceEnd events when speech_final doesn’t work (recommended)
  • false: Basic speech detection only (legacy mode)
When to Enable:
  • Always recommended: Provides better speech detection in noisy environments
  • Essential for: Background noise, poor connections, multiple speakers
  • Backup mechanism: When speech_final doesn’t trigger due to audio issues
Why It Matters: VAD events provide UtteranceEnd signals as a fallback when normal speech detection fails due to background noise or audio quality issues.

🎯 Timing Settings Quick Guide

Real-Time/Fast Conversations (Default):
  • Endpointing: 10ms, Min Silence: 1500ms, Utterance End: 1000ms, VAD Events: true
Balanced Professional:
  • Endpointing: 300ms, Min Silence: 1500ms, Utterance End: 1500ms, VAD Events: true
Patient/Elderly Callers:
  • Endpointing: 800ms, Min Silence: 2500ms, Utterance End: 2000ms, VAD Events: true
Critical: These settings work together with Call Management interruption settings. Endpointing controls provider responsiveness, Min Silence Duration controls internal timeout handling, and both affect conversation flow timing.
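
The three profiles in the quick guide can be kept as named presets and merged into `stt_settings` using the config paths listed earlier. This is a sketch: the preset names (`fast`, `balanced`, `patient`) and the `apply_timing_preset` helper are hypothetical, while the numeric values are copied from the guide.

```python
# Values copied from the quick guide; preset names are made up for this example.
TIMING_PRESETS = {
    "fast": {"silence_threshold": 10, "min_silence_duration": 1500, "utterance_end_ms": 1000},
    "balanced": {"silence_threshold": 300, "min_silence_duration": 1500, "utterance_end_ms": 1500},
    "patient": {"silence_threshold": 800, "min_silence_duration": 2500, "utterance_end_ms": 2000},
}


def apply_timing_preset(stt_settings: dict, preset: str) -> dict:
    """Return a copy of stt_settings with the preset's timing values applied."""
    values = TIMING_PRESETS[preset]  # raises KeyError for an unknown preset
    updated = dict(stt_settings)
    updated["endpointing"] = {
        "silence_threshold": values["silence_threshold"],
        "min_silence_duration": values["min_silence_duration"],
    }
    updated["utterance_end_ms"] = values["utterance_end_ms"]
    updated["vad_events"] = True  # every profile in the guide keeps VAD events on
    return updated


settings = apply_timing_preset({"provider": "deepgram", "model": "nova-3"}, "patient")
print(settings)
```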
Keywords (Deepgram Nova-2, Nova, Enhanced, Base):
  • Boost recognition of specific words
  • Format: word:boost_factor (e.g., Deepgram:2.0, API:1.5)
  • Great for company names, technical terms
Keyterms (Deepgram Nova-3 only, English only):
  • Advanced keyword detection
  • Format: word1, word2, word3
  • More sophisticated than keywords
Phrase Lists (Azure Speech):
  • Boost recognition of specific terms
  • Format: Comma-separated list
  • Works with all Azure models and languages
Use keywords/keyterms/phrase lists for your company name, product names, and industry-specific terms to improve accuracy.
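
To make the formats above concrete, the sketch below picks the term-boosting mechanism for a provider and model and formats the entries accordingly. The helper names are hypothetical, and where these strings go in the assistant configuration is not specified on this page, so the example only produces the formatted values.

```python
from typing import Optional

# Deepgram models listed on this page as supporting keywords.
KEYWORD_MODELS = {"nova-2", "nova", "enhanced", "base"}


def term_boosting_kind(provider: str, model: str) -> Optional[str]:
    """Return which boosting mechanism applies, or None if this page lists none."""
    if provider == "deepgram" and model == "nova-3":
        return "keyterms"
    if provider == "deepgram" and model in KEYWORD_MODELS:
        return "keywords"
    if provider == "azure":
        return "phrase_list"
    return None


def format_keywords(boosts: dict) -> list:
    """Format Deepgram keywords as word:boost_factor entries."""
    return [f"{word}:{boost}" for word, boost in boosts.items()]


def format_terms(terms: list) -> str:
    """Format keyterms or an Azure phrase list as a comma-separated string."""
    return ", ".join(term.strip() for term in terms if term.strip())


print(term_boosting_kind("deepgram", "nova-2"))
print(format_keywords({"Deepgram": 2.0, "API": 1.5}))
print(format_terms(["Burki", " voice agent ", ""]))
```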

Audio Denoising

Burki Voice AI includes RNNoise for real-time audio denoising, which removes background noise before transcription.
When to Enable:
  • Noisy environments (restaurants, offices, outdoors)
  • Poor phone connections
  • Background music or chatter
Trade-offs:
  • Slightly increases latency (~50-100ms)
  • Improves transcription accuracy in noisy conditions

Troubleshooting

Speech Detection Problems:
  • AI misses words: Enable denoising or add keywords/phrase lists for important terms
  • Cuts off callers mid-sentence: Increase endpointing (10ms → 500ms) and utterance end timeout
  • Long awkward pauses: Decrease min silence duration for faster internal processing
  • Interrupts slow speakers: Increase endpointing and min silence duration
  • Misses trailing words: Enable VAD events and increase utterance end timeout
Language & Recognition:
  • Wrong language detected: Set correct language code or use “custom” option
  • Technical terms not recognized: Add them as keywords (with boost factors), keyterms, or phrase lists, depending on your provider and model
  • Company names garbled: Add company/product names to keywords list
Audio Quality:
  • Noisy background: Enable audio denoising and increase VAD turnoff
  • Poor phone connection: Enable denoising and use more conservative timing settings
  • Multiple speakers: Use higher silence thresholds to avoid cross-talk issues
Provider-Specific:
  • Deepgram connection issues: Verify your Deepgram API key in Settings → Provider Keys
  • Azure authentication failed: Verify subscription key and region match your Speech resource in Settings → Provider Keys
Testing Strategy: Record test calls with different timing settings and listen to the conversation flow. What feels natural to you will feel natural to callers.

Best Practices

  • Start with defaults and adjust based on testing
  • Test with real calls in your target environment
  • Use term boosting (keywords/keyterms/phrase lists) for your business-specific terminology
  • Enable denoising if you expect background noise
  • Monitor call quality and adjust timing as needed
  • Choose the right provider based on your primary needs (speed vs. language support)

How STT Works with Call Management

🔗 STT + Call Management = Natural Conversations

STT Settings control when the provider detects speech has ended. Call Management Settings control how your AI responds to that detected speech. Both must work together for natural conversation flow!
The Flow:
  1. STT detects speech using your timing settings (silence threshold, VAD, etc.)
  2. Call Management decides response using interruption and timeout settings
  3. Result: Natural conversation or awkward pauses
Key Relationships:
  • STT min_silence_duration (internal timeout) should be longer than Call Management interruption_cooldown
  • Lower STT endpointing (more responsive) works well with lower Call Management interruption_threshold
  • Higher STT timing settings pair well with patient Call Management idle_timeout
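
The first relationship above is the only one stated as a strict ordering, so it is the only one this sketch checks. `check_timing_relationships` is a hypothetical helper, and it assumes `interruption_cooldown` is expressed in milliseconds like the STT timing values; confirm the unit in the Call Management documentation before relying on the comparison.

```python
def check_timing_relationships(min_silence_duration_ms: int,
                               interruption_cooldown_ms: int) -> list:
    """Return warnings for STT and Call Management timing mismatches.

    Assumption: both values are in milliseconds.
    """
    warnings = []
    if min_silence_duration_ms <= interruption_cooldown_ms:
        warnings.append(
            "min_silence_duration should be longer than interruption_cooldown"
        )
    return warnings


print(check_timing_relationships(1500, 500))
print(check_timing_relationships(1000, 1200))
```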
Next Step: Configure Call Management settings to control conversation flow after STT detects speech.