Deepgram Models Comparison
| Model | Features | Keywords | Keyterms | Best For |
| --- | --- | --- | --- | --- |
| Nova-3 | Latest, keyterms support | ❌ | ✅ | English calls, best accuracy |
| Nova-2 | Keywords support | ✅ | ❌ | Multi-language, reliable |
| Nova | Keywords support | ✅ | ❌ | Balanced performance |
| Enhanced | Keywords support | ✅ | ❌ | Legacy support |
| Base | Keywords support | ✅ | ❌ | Basic transcription |
Key Settings
Model & Language
- Model: Choose based on your needs (Nova-3 recommended for English)
- Language: Select from common options or enter a custom language code
- Custom Language: Enter any Deepgram-supported language code (e.g., `fr-FR`, `es-ES`)
Advanced Timing Controls
These settings control how Deepgram detects when someone has finished speaking. Getting these right is crucial for natural conversation flow.
Endpointing (Silence Threshold)
What it does: How long Deepgram waits after detecting silence before treating the speech as ended.
Technical Details:
- Measured in: Milliseconds
- Default: 10ms (Deepgram’s minimal endpointing for real-time applications)
- Range: 10ms - 2000ms (recommended)
- Config Path: `stt_settings.endpointing.silence_threshold`
Examples:
- 10ms: Very responsive (default) - might cut off slow speakers
- 500ms: “I need help with…” → 0.5s silence → Deepgram says “speech ended”
- 1000ms: More patient (good for people who pause while thinking)
When to adjust:
- Lower (10-100ms): For fast talkers or quick interactions (default)
- Higher (500-1000ms): For elderly callers or complex topics
- Much higher (1500ms+): For people with speech difficulties
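To make the endpointing value concrete, here is a minimal sketch of passing it to Deepgram's real-time API. The `wss://api.deepgram.com/v1/listen` endpoint and the `endpointing` query parameter are part of Deepgram's streaming API; the 500ms value and the helper function name are just illustrations.

```python
# Sketch: building a Deepgram streaming URL with a custom endpointing value.
from urllib.parse import urlencode

def build_listen_url(endpointing_ms: int, model: str = "nova-3") -> str:
    params = {
        "model": model,
        # Silence (ms) Deepgram waits before treating speech as ended
        "endpointing": endpointing_ms,
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)

url = build_listen_url(500)  # patient setting for slower speakers
```

The same pattern applies to the other API parameters covered below (`utterance_end_ms`, `vad_events`): they are all query parameters on the streaming connection.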
Min Silence Duration
What it does: Internal timeout for utterance processing when Deepgram doesn’t send `speech_final` (not sent to Deepgram API).
Technical Details:
- Measured in: Milliseconds
- Default: 1500ms
- Range: 500ms - 5000ms (recommended)
- Config Path: `stt_settings.endpointing.min_silence_duration`
- Used for: Call handler utterance timeout logic when `speech_final` is missing
Examples:
- 1500ms: Wait 1.5s for `speech_final`, then process accumulated utterance (default)
- 1000ms: Quicker timeout for responsive conversation
- 2500ms: More patience for complex responses or noisy environments
When to adjust:
- Lower (500-1000ms): For quick, responsive interactions
- Higher (2000-3000ms): For environments with background noise where `speech_final` may be unreliable
- Match with conversation style: Shorter for rapid-fire Q&A, longer for detailed discussions
Utterance End Timeout
What it does: Maximum time Deepgram waits for a complete utterance before sending an UtteranceEnd event.
Technical Details:
- Measured in: Milliseconds
- Default: 1000ms
- Range: 500ms - 5000ms (recommended)
- Config Path: `stt_settings.utterance_end_ms`
- API Parameter: `utterance_end_ms`
Examples:
- 1000ms: If someone starts talking but doesn’t finish within 1 second, Deepgram sends UtteranceEnd (default)
- 500ms: Quick timeout (might cut off long sentences)
- 2000ms: Patient timeout (good for complex responses)
When to adjust:
- Lower (500-800ms): For short, quick interactions
- Higher (1500-3000ms): For detailed conversations or forms
- Consider your use case: Customer service vs. quick orders
VAD Events
What it does: Enables Voice Activity Detection events for enhanced speech detection and UtteranceEnd events.
Technical Details:
- Type: Boolean (true/false)
- Default: true (enabled)
- Config Path: `stt_settings.vad_events`
- API Parameter: `vad_events`
Options:
- true: Enhanced speech detection with UtteranceEnd events when `speech_final` doesn’t work (recommended)
- false: Basic speech detection only (legacy mode)
When to use:
- Always recommended: Provides better speech detection in noisy environments
- Essential for: Background noise, poor connections, multiple speakers
- Backup mechanism: When `speech_final` doesn’t trigger due to audio issues
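The backup mechanism above comes down to handling two message types on the streaming connection. The `"Results"` and `"UtteranceEnd"` message types and the `speech_final` flag are part of Deepgram's live-transcription API; the `on_speech_ended` callback is a hypothetical hook into your call handler.

```python
# Sketch: dispatching Deepgram live-transcription messages, treating
# UtteranceEnd as the backup signal when speech_final never arrives.
import json

def handle_message(raw: str, on_speech_ended):
    msg = json.loads(raw)
    if msg.get("type") == "UtteranceEnd":
        # Backup path: VAD-driven event when speech_final didn't fire
        on_speech_ended("utterance_end")
        return None
    if msg.get("type") == "Results":
        transcript = msg["channel"]["alternatives"][0]["transcript"]
        if msg.get("speech_final"):
            on_speech_ended("speech_final")  # normal end-of-speech path
        return transcript
    return None  # ignore other message types (e.g. metadata)
```

Both paths feed the same "speech has ended" signal to the call handler, which is why enabling `vad_events` makes end-of-speech detection more robust in noisy audio.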
🎯 Timing Settings Quick Guide
Real-Time/Fast Conversations (Default):
- Endpointing: 10ms, Min Silence: 1500ms, Utterance End: 1000ms, VAD Events: true
Balanced Conversations:
- Endpointing: 300ms, Min Silence: 1500ms, Utterance End: 1500ms, VAD Events: true
Patient/Slow Speakers:
- Endpointing: 800ms, Min Silence: 2500ms, Utterance End: 2000ms, VAD Events: true
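The timing presets above can be written directly as `stt_settings` fragments. The nested key layout follows the config paths quoted in this guide; the preset names themselves are just illustrative labels.

```python
# Sketch: the quick-guide timing presets as stt_settings fragments.
STT_PRESETS = {
    "fast": {       # real-time / fast conversations (default)
        "endpointing": {"silence_threshold": 10, "min_silence_duration": 1500},
        "utterance_end_ms": 1000,
        "vad_events": True,
    },
    "balanced": {   # moderate pacing
        "endpointing": {"silence_threshold": 300, "min_silence_duration": 1500},
        "utterance_end_ms": 1500,
        "vad_events": True,
    },
    "patient": {    # slow or hesitant speakers
        "endpointing": {"silence_threshold": 800, "min_silence_duration": 2500},
        "utterance_end_ms": 2000,
        "vad_events": True,
    },
}
```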
Processing Options
Keywords & Keyterms
Keywords (Nova-2, Nova, Enhanced, Base):
- Boost recognition of specific words
- Format: `word:boost_factor` (e.g., `Deepgram:2.0, API:1.5`)
- Great for company names, technical terms
Keyterms (Nova-3):
- Advanced keyword detection
- Format: `word1, word2, word3`
- More sophisticated than keywords
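As a minimal sketch, keyword boosts map onto repeated `keywords` query parameters in `word:boost` form on the streaming URL; Nova-3 uses the separate `keyterm` parameter instead. The helper function name is hypothetical.

```python
# Sketch: encoding keyword boosts as repeated Deepgram query parameters.
from urllib.parse import urlencode

def keyword_params(boosts: dict[str, float]) -> str:
    """One `keywords=word:boost` entry per term."""
    return urlencode([("keywords", f"{word}:{boost}") for word, boost in boosts.items()])

qs = keyword_params({"Deepgram": 2.0, "API": 1.5})
# → "keywords=Deepgram%3A2.0&keywords=API%3A1.5"
```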
Audio Denoising
When to Enable:
- Noisy environments (restaurants, offices, outdoors)
- Poor phone connections
- Background music or chatter
Trade-offs:
- Slightly increases latency (~50-100ms)
- Improves transcription accuracy in noisy conditions
Troubleshooting
Common STT Issues & Solutions
Speech Detection Problems:
- AI misses words: Enable denoising or add keywords for important terms
- Cuts off callers mid-sentence: Increase endpointing (10ms → 500ms) and utterance end timeout
- Long awkward pauses: Decrease min silence duration for faster internal processing
- Interrupts slow speakers: Increase endpointing and min silence duration
- Misses trailing words: Enable VAD events and increase utterance end timeout
Recognition Problems:
- Wrong language detected: Set correct language code or use “custom” option
- Technical terms not recognized: Add them as keywords/keyterms with boost factors
- Company names garbled: Add company/product names to keywords list
Environment Problems:
- Noisy background: Enable audio denoising and VAD events
- Poor phone connection: Enable denoising and use more conservative timing settings
- Multiple speakers: Use higher silence thresholds to avoid cross-talk issues
Best Practices
- Start with defaults and adjust based on testing
- Test with real calls in your target environment
- Use keywords for your business-specific terminology
- Enable denoising if you expect background noise
- Monitor call quality and adjust timing as needed
How STT Works with Call Management
🔗 STT + Call Management = Natural Conversations
STT Settings control when Deepgram detects that speech has ended. Call Management Settings control how your AI responds to that detected speech. Both must work together for natural conversation flow!
- STT detects speech using your timing settings (silence threshold, VAD, etc.)
- Call Management decides response using interruption and timeout settings
- Result: Natural conversation or awkward pauses
- STT `min_silence_duration` (internal timeout) should be longer than Call Management `interruption_cooldown`
- Lower STT `endpointing` (more responsive) works well with lower Call Management `interruption_threshold`
- Higher STT timing settings pair well with patient Call Management `idle_timeout`
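The first pairing rule above lends itself to an automated sanity check. The dict layout and key names mirror the config paths quoted in this guide; the validation function itself is a hypothetical sketch, not part of any shipped tooling.

```python
# Sketch: checking that STT and Call Management timing settings are compatible.
def validate_timing(stt_settings: dict, call_management: dict) -> list[str]:
    warnings = []
    min_silence = stt_settings["endpointing"]["min_silence_duration"]
    cooldown = call_management.get("interruption_cooldown", 0)
    if min_silence <= cooldown:
        warnings.append(
            f"min_silence_duration ({min_silence}ms) should be longer than "
            f"interruption_cooldown ({cooldown}ms)"
        )
    return warnings
```

Running a check like this at config-load time catches the mismatch (AI responding before the utterance buffer has flushed) before it shows up as awkward pauses on live calls.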