Speech-to-Text (STT) converts what callers say into text for your AI to understand. Burki Voice AI supports multiple STT providers—choose based on your needs for speed, language support, or enterprise features.
Provider Comparison
| Provider | Provider Key | Models | Best For | Notes |
|---|---|---|---|---|
| Deepgram | deepgram | nova-3, nova-2, nova, enhanced, base, flux-general-* | Fast phone calls, English keyterms, Flux conversational STT | Default STT provider |
| ElevenLabs | elevenlabs | scribe_v2_realtime | Multi-language realtime recognition | Uses elevenlabs_config |
| Azure Speech | azure | standard, enhanced, neural | Enterprise speech workloads | Available when Azure STT dependencies/config are present |
| Uplift | uplift | default, scribe, scribe-mini | Urdu/South Asian language workflows | Defaults to ur, µ-law, 8 kHz when unset |
| Speechmatics | speechmatics | enhanced, standard | Broad language coverage and hosted STT | Uses speechmatics_config |
| Telnyx STT | telnyx | deepgram/nova-3, deepgram/flux, Google/Azure/Telnyx Whisper hosted models | Telnyx-hosted transcription with carrier key reuse | Defaults to deepgram/nova-3 |
| Soniox | soniox | stt-rt-v4, stt-rt-v3 | Realtime STT with Soniox models | Uses soniox_config |
⚡ Deepgram
Ultra-Low Latency: ~100ms response time, optimized for phone calls. Nova-3 keyterms for English, Nova-2 for multi-language.
🎙️ ElevenLabs Scribe v2
Multi-Language Excellence: Vendor-reported low latency, broad language support, and advanced VAD-based speech detection.
☁️ Azure Speech
Enterprise Scale: 100+ languages, Microsoft ecosystem integration, phrase lists for term boosting, custom speech models.
🌐 Speechmatics
Hosted STT: Enhanced and standard models with provider-specific config.
📡 Telnyx STT
Carrier-Hosted Models: Telnyx-hosted Deepgram, Google, Azure, and Whisper model routes.
⚙️ Soniox / Uplift
Specialized Realtime STT: Soniox realtime models and Uplift Scribe models.
Deepgram
Deepgram is the default STT provider, optimized for speed and phone call quality.
Models
| Model | Features | Keywords | Keyterms | Best For |
|---|---|---|---|---|
| Nova-3 | Latest, keyterms support | ❌ | ✅ | English calls, best accuracy |
| Nova-2 | Keywords support | ✅ | ❌ | Multi-language, reliable |
| Nova | Keywords support | ✅ | ❌ | Balanced performance |
| Enhanced | Keywords support | ✅ | ❌ | Legacy support |
| Base | Keywords support | ✅ | ❌ | Basic transcription |
| Flux (flux-general-en, flux-general-es, flux-general-multi) | Conversational realtime STT path | Provider-dependent | Provider-dependent | Fast turn-taking and mid-call Flux swaps |
Recommended: Use Nova-3 for English calls (supports keyterms) or Nova-2 for other languages (supports keywords).
Configuration
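As a minimal sketch of how the model table maps onto settings, the snippet below builds a Deepgram settings dictionary and chooses the term-boosting field by model. The helper `build_deepgram_settings` and the key names `model`, `language`, `keywords`, and `keyterms` are assumptions made for illustration; only `stt_settings.provider` is named on this page, so check the real schema before relying on any of them.

```python
# Term-boosting support per Deepgram model, taken from the model table above.
KEYTERM_MODELS = {"nova-3"}
KEYWORD_MODELS = {"nova-2", "nova", "enhanced", "base"}


def build_deepgram_settings(model: str, language: str, terms: list[str]) -> dict:
    """Return a hypothetical settings dict for the Deepgram provider.

    Key names are illustrative assumptions, not a confirmed schema.
    """
    settings = {"provider": "deepgram", "model": model, "language": language}
    if model in KEYTERM_MODELS:
        settings["keyterms"] = list(terms)
    elif model in KEYWORD_MODELS:
        settings["keywords"] = list(terms)
    # Flux models are listed as "Provider-dependent", so no boosting field is set.
    return settings


english = build_deepgram_settings("nova-3", "en-US", ["Burki", "Deepgram"])
spanish = build_deepgram_settings("nova-2", "es-ES", ["Burki"])
```

Keeping the model-to-feature mapping in one place makes it harder to send keywords to Nova-3 or keyterms to Nova-2 by accident.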
ElevenLabs Scribe v2
ElevenLabs Scribe v2 Configuration
ElevenLabs Scribe v2 Realtime provides low-latency speech recognition (vendor-reported) with broad multi-language support and advanced voice activity detection.
Key Features:
- Vendor-reported low-latency realtime recognition; actual accuracy depends on language, audio quality, and model configuration
- 90+ languages supported
- Advanced VAD-based commit strategy
- Word-level timestamps support
- Automatic language detection
Setup:
- Sign up at ElevenLabs
- Get your API key from the dashboard
- Configure in assistant settings
📖 Full ElevenLabs Documentation
See the complete ElevenLabs Scribe v2 guide for VAD settings, language options, and best practices.
Azure Speech
Azure Speech Configuration
Azure Speech provides managed speech recognition with broad language support and Microsoft ecosystem integration.
Key Features:
- 100+ languages and regional variants
- Phrase lists for domain-specific term boosting
- Custom speech models for specialized vocabulary
- Speaker diarization support
Setup:
- Create an Azure Speech resource using the Azure AI Speech quickstart
- Get your subscription key and region
- Configure in assistant settings
📖 Full Azure Documentation
See the complete Azure Speech STT guide for models, languages, configuration options, and best practices.
Additional Supported Providers
These providers are wired in the backend STT factory and can be selected by stt_settings.provider.
Uplift
Models: default, scribe, scribe-mini.
Speechmatics
Models: enhanced, standard. Provider-specific settings live under speechmatics_config.
Telnyx STT
Models: deepgram/nova-3, deepgram/nova-2, deepgram/flux, Google, Azure, and Telnyx Whisper variants. Telnyx STT uses the organization’s Telnyx API key or managed carrier key.
Soniox
Models: stt-rt-v4, stt-rt-v3. Provider-specific settings live under soniox_config.
OpenAI Whisper and Assembly appear in older enum/model mapping code but are not registered in the active STT factory. Do not configure them as live realtime STT providers unless the backend factory is updated.
Key Settings
Model & Language
- Provider: Choose from deepgram, elevenlabs, azure, uplift, speechmatics, telnyx, or soniox
- Model: Choose based on your provider (nova-3, flux-general-en, scribe_v2_realtime, standard, deepgram/nova-3, stt-rt-v4, etc.)
- Language: Select from common options or enter a custom language code
- Custom Language: Enter any supported language code (e.g., fr-FR, es-ES)
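A provider and model pair can be sanity-checked before it is saved. The sketch below copies the model names from the provider comparison table; the function `check_provider_model` and its return values are hypothetical and are not part of Burki's API. Telnyx is treated as open-ended because this page does not enumerate its Google, Azure, and Whisper routes.

```python
# Models per provider key, copied from the provider comparison table.
# Telnyx also hosts Google, Azure, and Whisper routes that the page does not
# enumerate, so only its named Deepgram routes are listed here.
KNOWN_MODELS = {
    "deepgram": {"nova-3", "nova-2", "nova", "enhanced", "base"},
    "elevenlabs": {"scribe_v2_realtime"},
    "azure": {"standard", "enhanced", "neural"},
    "uplift": {"default", "scribe", "scribe-mini"},
    "speechmatics": {"enhanced", "standard"},
    "telnyx": {"deepgram/nova-3", "deepgram/nova-2", "deepgram/flux"},
    "soniox": {"stt-rt-v4", "stt-rt-v3"},
}
# Providers whose model list on this page is open-ended.
OPEN_ENDED = {"telnyx"}


def check_provider_model(provider: str, model: str) -> str:
    """Return "ok" or "unlisted"; raise ValueError for unknown combinations."""
    if provider not in KNOWN_MODELS:
        raise ValueError(f"unknown STT provider: {provider!r}")
    if model in KNOWN_MODELS[provider]:
        return "ok"
    # Deepgram Flux models share the flux-general- prefix.
    if provider == "deepgram" and model.startswith("flux-general-"):
        return "ok"
    if provider in OPEN_ENDED:
        return "unlisted"
    raise ValueError(f"model {model!r} is not listed for provider {provider!r}")
```

For example, `check_provider_model("deepgram", "flux-general-multi")` passes, while a provider key such as `whisper` raises, which matches the note that OpenAI Whisper is not registered in the active STT factory.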
Advanced Timing Controls
These settings control how the STT provider detects when someone has finished speaking. Getting these right is crucial for natural conversation flow.
Endpointing (Silence Threshold)
What it does: How long the provider waits after detecting silence before considering speech has ended.
Technical Details:
- Measured in: Milliseconds
- Default: 10ms (minimal endpointing for real-time applications)
- Range: 10ms - 2000ms (recommended)
- Config Path: stt_settings.endpointing.silence_threshold
Examples:
- 10ms: Very responsive (default) - might cut off slow speakers
- 500ms: “I need help with…” → 0.5s silence → Provider says “speech ended”
- 1000ms: More patient (good for people who pause while thinking)
When to adjust:
- Lower (10-100ms): For fast talkers or quick interactions (default)
- Higher (500-1000ms): For elderly callers or complex topics
- Much higher (1500ms+): For people with speech difficulties
Min Silence Duration
What it does: Internal timeout for utterance processing when the provider doesn’t send speech_final (this value is not sent to the provider API).
Technical Details:
- Measured in: Milliseconds
- Default: 1500ms
- Range: 500ms - 5000ms (recommended)
- Config Path: stt_settings.endpointing.min_silence_duration
- Used for: Call handler utterance timeout logic when speech_final is missing
Examples:
- 1500ms: Wait 1.5s for speech_final, then process accumulated utterance (default)
- 1000ms: Quicker timeout for responsive conversation
- 2500ms: More patience for complex responses or noisy environments
When to adjust:
- Lower (500-1000ms): For quick, responsive interactions
- Higher (2000-3000ms): For environments with background noise where speech_final may be unreliable
- Match with conversation style: Shorter for rapid-fire Q&A, longer for detailed discussions
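The fallback behaviour described above can be pictured as a small buffer that flushes either when the provider marks a fragment as final or when no new fragment arrives within the timeout. This is a simplified sketch, not the call handler's actual implementation; the class `UtteranceBuffer` and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Optional


class UtteranceBuffer:
    """Accumulates transcript fragments until speech_final or a timeout."""

    def __init__(self, min_silence_duration_ms: int = 1500) -> None:
        self.min_silence_duration_ms = min_silence_duration_ms
        self._parts: list[str] = []
        self._last_update_ms: Optional[int] = None

    def add(self, text: str, now_ms: int, speech_final: bool = False) -> Optional[str]:
        """Store a fragment; flush immediately if the provider marked it final."""
        self._parts.append(text)
        self._last_update_ms = now_ms
        return self._flush() if speech_final else None

    def poll(self, now_ms: int) -> Optional[str]:
        """Flush if no fragment has arrived for min_silence_duration_ms."""
        if self._last_update_ms is None:
            return None
        if now_ms - self._last_update_ms >= self.min_silence_duration_ms:
            return self._flush()
        return None

    def _flush(self) -> str:
        utterance = " ".join(self._parts)
        self._parts = []
        self._last_update_ms = None
        return utterance


buf = UtteranceBuffer(min_silence_duration_ms=1500)
buf.add("I need help", now_ms=0)
buf.add("with my order", now_ms=400)
early = buf.poll(now_ms=1000)  # 600 ms since the last fragment: keep waiting
late = buf.poll(now_ms=1900)   # 1500 ms since the last fragment: flush
```

A lower timeout makes `poll` flush sooner, which shortens pauses but risks splitting one sentence into two utterances when speech_final is delayed.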
Utterance End Timeout
What it does: Maximum time the provider waits for a complete utterance before sending an UtteranceEnd event.
Technical Details:
- Measured in: Milliseconds
- Default: 1000ms
- Range: 500ms - 5000ms (recommended)
- Config Path: stt_settings.utterance_end_ms
- API Parameter: utterance_end_ms
Examples:
- 1000ms: If someone starts talking but doesn’t finish within 1 second, provider sends UtteranceEnd (default)
- 500ms: Quick timeout (might cut off long sentences)
- 2000ms: Patient timeout (good for complex responses)
When to adjust:
- Lower (500-800ms): For short, quick interactions
- Higher (1500-3000ms): For detailed conversations or forms
- Consider your use case: Customer service vs. quick orders
VAD Events
What it does: Enables Voice Activity Detection events for enhanced speech detection and UtteranceEnd events.
Technical Details:
- Type: Boolean (true/false)
- Default: true (enabled)
- Config Path: stt_settings.vad_events
- API Parameter: vad_events
Options:
- true: Enhanced speech detection with UtteranceEnd events when speech_final doesn’t work (recommended)
- false: Basic speech detection only (legacy mode)
When to use:
- Always recommended: Provides better speech detection in noisy environments
- Essential for: Background noise, poor connections, multiple speakers
- Backup mechanism: When speech_final doesn’t trigger due to audio issues
🎯 Timing Settings Quick Guide
Real-Time/Fast Conversations (Default):
- Endpointing: 10ms, Min Silence: 1500ms, Utterance End: 1000ms, VAD Events: true
More patient alternatives, in increasing order of patience:
- Endpointing: 300ms, Min Silence: 1500ms, Utterance End: 1500ms, VAD Events: true
- Endpointing: 800ms, Min Silence: 2500ms, Utterance End: 2000ms, VAD Events: true
Critical: These settings work together with Call Management interruption settings. Endpointing controls provider responsiveness, Min Silence Duration controls internal timeout handling, and both affect conversation flow timing.
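The four timing settings can be assembled and range-checked together. The sketch below assumes the dotted config paths (for example stt_settings.endpointing.silence_threshold) correspond to a nested dictionary; that shape and the helper `build_timing_settings` are assumptions for illustration. The ranges are the recommended ones from the sections above, so rejecting out-of-range values is a deliberate choice of this sketch rather than documented behaviour.

```python
# Recommended ranges in milliseconds, as listed in the sections above.
RANGES_MS = {
    "silence_threshold": (10, 2000),
    "min_silence_duration": (500, 5000),
    "utterance_end_ms": (500, 5000),
}


def build_timing_settings(
    silence_threshold: int = 10,
    min_silence_duration: int = 1500,
    utterance_end_ms: int = 1000,
    vad_events: bool = True,
) -> dict:
    """Return a nested dict mirroring the documented config paths."""
    values = {
        "silence_threshold": silence_threshold,
        "min_silence_duration": min_silence_duration,
        "utterance_end_ms": utterance_end_ms,
    }
    for name, value in values.items():
        low, high = RANGES_MS[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the recommended {low}-{high} ms")
    return {
        "stt_settings": {
            "endpointing": {
                "silence_threshold": silence_threshold,
                "min_silence_duration": min_silence_duration,
            },
            "utterance_end_ms": utterance_end_ms,
            "vad_events": vad_events,
        }
    }


defaults = build_timing_settings()
patient = build_timing_settings(800, 2500, 2000)
```

Calling the function with no arguments reproduces the documented defaults; the `patient` example uses the most conservative values from the quick guide.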
Processing Options
Keywords & Keyterms
Keywords (Deepgram Nova-2, Nova, Enhanced, Base):
- Boost recognition of specific words
- Format: word:boost_factor (e.g., Deepgram:2.0, API:1.5)
- Great for company names, technical terms
Keyterms (Deepgram Nova-3):
- Advanced keyword detection
- Format: word1, word2, word3
- More sophisticated than keywords
Phrase Lists (Azure Speech):
- Boost recognition of specific terms
- Format: Comma-separated list
- Works with all Azure models and languages
Use keywords/keyterms/phrase lists for your company name, product names, and industry-specific terms to improve accuracy.
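The two input formats are easy to mix up, so a small parser can help validate them before saving. This is a sketch under stated assumptions: the helper names are hypothetical, and falling back to a boost of 1.0 when no factor is given is an assumption, not documented behaviour.

```python
def parse_keywords(raw: str, default_boost: float = 1.0) -> list[tuple[str, float]]:
    """Parse comma-separated word:boost_factor pairs into (word, boost) tuples."""
    pairs = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        word, sep, boost = item.rpartition(":")
        if not sep:  # no colon: the whole item is the word
            word, boost = item, ""
        # float() raises ValueError if the boost factor is not numeric.
        pairs.append((word.strip(), float(boost) if boost.strip() else default_boost))
    return pairs


def parse_keyterms(raw: str) -> list[str]:
    """Split a comma-separated keyterm or phrase list, dropping blanks."""
    return [term.strip() for term in raw.split(",") if term.strip()]


keywords = parse_keywords("Deepgram:2.0, API:1.5")
keyterms = parse_keyterms("word1, word2, word3")
```

The same `parse_keyterms` helper fits Azure phrase lists, since both use a plain comma-separated format.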
Audio Denoising
Burki Voice AI includes RNNoise for real-time audio denoising, which removes background noise before transcription.
When to enable:
- Noisy environments (restaurants, offices, outdoors)
- Poor phone connections
- Background music or chatter
Trade-offs:
- Slightly increases latency (~50-100ms)
- Improves transcription accuracy in noisy conditions
Troubleshooting
Common STT Issues & Solutions
Speech Detection Problems:
- AI misses words: Enable denoising or add keywords/phrase lists for important terms
- Cuts off callers mid-sentence: Increase endpointing (10ms → 500ms) and utterance end timeout
- Long awkward pauses: Decrease min silence duration for faster internal processing
- Interrupts slow speakers: Increase endpointing and min silence duration
- Misses trailing words: Enable VAD events and increase utterance end timeout
- Wrong language detected: Set correct language code or use “custom” option
- Technical terms not recognized: Add them as keywords/keyterms/phrase lists with boost factors
- Company names garbled: Add company/product names to keywords list
- Noisy background: Enable audio denoising and increase VAD turnoff
- Poor phone connection: Enable denoising and use more conservative timing settings
- Multiple speakers: Use higher silence thresholds to avoid cross-talk issues
- Deepgram connection issues: Verify your Deepgram API key in Settings → Provider Keys
- Azure authentication failed: Verify subscription key and region match your Speech resource in Settings → Provider Keys
Testing Strategy: Record test calls with different timing settings and listen to the conversation flow. What feels natural to you will feel natural to callers.
Best Practices
- Start with defaults and adjust based on testing
- Test with real calls in your target environment
- Use term boosting (keywords/keyterms/phrase lists) for your business-specific terminology
- Enable denoising if you expect background noise
- Monitor call quality and adjust timing as needed
- Choose the right provider based on your primary needs (speed vs. language support)
How STT Works with Call Management
🔗 STT + Call Management = Natural Conversations
STT Settings control when the provider detects speech has ended. Call Management Settings control how your AI responds to that detected speech. Both must work together for natural conversation flow!
- STT detects speech using your timing settings (silence threshold, VAD, etc.)
- Call Management decides response using interruption and timeout settings
- Result: Natural conversation or awkward pauses
- STT min_silence_duration (internal timeout) should be longer than Call Management interruption_cooldown
- Lower STT endpointing (more responsive) works well with lower Call Management interruption_threshold
- Higher STT timing settings pair well with patient Call Management idle_timeout
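Only the first pairing rule above is a strict comparison, so it is the only one that can be checked mechanically. The sketch below assumes both values are expressed in milliseconds; the unit of interruption_cooldown is not stated on this page, and the function `check_timing_alignment` is hypothetical.

```python
def check_timing_alignment(min_silence_duration_ms: int, interruption_cooldown_ms: int) -> list[str]:
    """Return warnings for the one strict pairing rule listed above.

    Assumes both settings are in milliseconds (not confirmed by this page).
    """
    warnings = []
    if min_silence_duration_ms <= interruption_cooldown_ms:
        warnings.append(
            "min_silence_duration should be longer than interruption_cooldown "
            f"({min_silence_duration_ms} ms <= {interruption_cooldown_ms} ms)"
        )
    return warnings
```

An empty list means the pairing holds; the other two guidelines are qualitative and still need to be judged from test calls.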
Next Step: Configure Call Management settings to control conversation flow after STT detects speech.