Deepgram Models Comparison

| Model    | Features                 | Keywords | Keyterms | Best For                     |
|----------|--------------------------|----------|----------|------------------------------|
| Nova-3   | Latest, keyterms support | ✗        | ✓        | English calls, best accuracy |
| Nova-2   | Keywords support         | ✓        | ✗        | Multi-language, reliable     |
| Nova     | Keywords support         | ✓        | ✗        | Balanced performance         |
| Enhanced | Keywords support         | ✓        | ✗        | Legacy support               |
| Base     | Keywords support         | ✓        | ✗        | Basic transcription          |

Key Settings

  • Model: Choose based on your needs (Nova-3 recommended for English)
  • Language: Select from common options or enter a custom language code
  • Custom Language: Enter any Deepgram-supported language code (e.g., fr-FR, es-ES)
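If your deployment keeps these choices in a config file, a minimal sketch might look like the following. The `model` and `language` key names are illustrative assumptions; only the `stt_settings.*` paths documented in the sections below are confirmed by this guide.

```python
# Hypothetical shape of the STT config; `model` and `language` key names
# are illustrative assumptions, not confirmed paths from this guide.
stt_settings = {
    "model": "nova-3",    # recommended for English calls
    "language": "en-US",  # or any Deepgram-supported code, e.g. "fr-FR"
}
```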

Speech Detection & Timing Settings

These settings control how Deepgram detects when someone has finished speaking. Getting these right is crucial for natural conversation flow.

Endpointing (Silence Threshold)

What it does: How long Deepgram waits after detecting silence before considering speech has ended.
Technical Details:
  • Measured in: Milliseconds
  • Default: 10ms (Deepgram’s minimal endpointing for real-time applications)
  • Range: 10ms - 2000ms (recommended)
  • Config Path: stt_settings.endpointing.silence_threshold
Real Example:
  • 10ms: Very responsive (default) - might cut off slow speakers
  • 500ms: “I need help with…” → 0.5s silence → Deepgram says “speech ended”
  • 1000ms: More patient (good for people who pause while thinking)
When to Adjust:
  • Lower (10-100ms): For fast talkers or quick interactions (default)
  • Higher (500-1000ms): For elderly callers or complex topics
  • Much higher (1500ms+): For people with speech difficulties
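
Against the documented config path, a patient-caller setup from the examples above might look like this sketch:

```python
# Documented path: stt_settings.endpointing.silence_threshold (milliseconds)
stt_settings = {
    "endpointing": {
        "silence_threshold": 500,  # 10 = snappy default; 500-1000 = patient
    },
}
```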

Min Silence Duration

What it does: Internal timeout for utterance processing when Deepgram doesn't send speech_final (not sent to the Deepgram API).
Technical Details:
  • Measured in: Milliseconds
  • Default: 1500ms
  • Range: 500ms - 5000ms (recommended)
  • Config Path: stt_settings.endpointing.min_silence_duration
  • Used for: Call handler utterance timeout logic when speech_final is missing
Real Example:
  • 1500ms: Wait 1.5s for speech_final, then process accumulated utterance (default)
  • 1000ms: Quicker timeout for responsive conversation
  • 2500ms: More patience for complex responses or noisy environments
When to Adjust:
  • Lower (500-1000ms): For quick, responsive interactions
  • Higher (2000-3000ms): For environments with background noise where speech_final may be unreliable
  • Match with conversation style: Shorter for rapid-fire Q&A, longer for detailed discussions
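
The call handler's internal logic isn't shown in this guide, so the following is only a sketch of the described fallback, assuming an asyncio-based handler with hypothetical `speech_final_event`, `buffered_transcript`, and `process_utterance` members:

```python
import asyncio

MIN_SILENCE_DURATION_MS = 1500  # stt_settings.endpointing.min_silence_duration

async def flush_utterance_on_timeout(handler):
    """Hypothetical sketch: if no speech_final arrives within the window,
    process whatever transcript has accumulated so far."""
    try:
        await asyncio.wait_for(
            handler.speech_final_event.wait(),  # set when speech_final arrives
            timeout=MIN_SILENCE_DURATION_MS / 1000,
        )
    except asyncio.TimeoutError:
        # speech_final never came; treat the buffered text as a full utterance.
        await handler.process_utterance(handler.buffered_transcript)
```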

Utterance End Timeout

What it does: Maximum time Deepgram waits for a complete utterance before sending an UtteranceEnd event.
Technical Details:
  • Measured in: Milliseconds
  • Default: 1000ms
  • Range: 500ms - 5000ms (recommended)
  • Config Path: stt_settings.utterance_end_ms
  • API Parameter: utterance_end_ms
Real Example:
  • 1000ms: If someone starts talking but doesn’t finish within 1 second, Deepgram sends UtteranceEnd (default)
  • 500ms: Quick timeout (might cut off long sentences)
  • 2000ms: Patient timeout (good for complex responses)
When to Adjust:
  • Lower (500-800ms): For short, quick interactions
  • Higher (1500-3000ms): For detailed conversations or forms
  • Consider your use case: Customer service vs. quick orders
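
As a sketch against the documented paths, a patient configuration for detailed conversations might be:

```python
stt_settings = {
    "utterance_end_ms": 2000,  # patient timeout for complex responses
    "vad_events": True,        # UtteranceEnd events rely on VAD (next section)
}
```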

VAD Events

What it does: Enables Voice Activity Detection events for enhanced speech detection and UtteranceEnd events.
Technical Details:
  • Type: Boolean (true/false)
  • Default: true (enabled)
  • Config Path: stt_settings.vad_events
  • API Parameter: vad_events
Real Example:
  • true: Enhanced speech detection with UtteranceEnd events when speech_final doesn’t work (recommended)
  • false: Basic speech detection only (legacy mode)
When to Enable:
  • Always recommended: Provides better speech detection in noisy environments
  • Essential for: Background noise, poor connections, multiple speakers
  • Backup mechanism: When speech_final doesn’t trigger due to audio issues
Why It Matters: VAD events provide UtteranceEnd signals as a fallback when normal speech detection fails due to background noise or audio quality issues.
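
As a sketch of that fallback: Deepgram's streaming API tags each message with a `type` field (`Results` messages carry `speech_final`; `UtteranceEnd` is the VAD fallback), while the handler methods here are hypothetical:

```python
def on_deepgram_message(handler, message: dict) -> None:
    """Route Deepgram streaming messages; handler methods are hypothetical."""
    if message.get("type") == "Results" and message.get("speech_final"):
        handler.process_utterance()  # normal end-of-speech path
    elif message.get("type") == "UtteranceEnd":
        # VAD fallback: speech_final never arrived (noise, poor audio),
        # but voice activity detection says the utterance is over.
        handler.process_utterance()
```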

🎯 Timing Settings Quick Guide

Real-Time/Fast Conversations (Default):
  • Endpointing: 10ms, Min Silence: 1500ms, Utterance End: 1000ms, VAD Events: true
Balanced Professional:
  • Endpointing: 300ms, Min Silence: 1500ms, Utterance End: 1500ms, VAD Events: true
Patient/Elderly Callers:
  • Endpointing: 800ms, Min Silence: 2500ms, Utterance End: 2000ms, VAD Events: true
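
Expressed against the documented config paths, the three profiles above look like this (the preset names are illustrative):

```python
TIMING_PRESETS = {
    "realtime": {   # default: fast conversations
        "endpointing": {"silence_threshold": 10, "min_silence_duration": 1500},
        "utterance_end_ms": 1000,
        "vad_events": True,
    },
    "balanced": {   # professional, measured pace
        "endpointing": {"silence_threshold": 300, "min_silence_duration": 1500},
        "utterance_end_ms": 1500,
        "vad_events": True,
    },
    "patient": {    # elderly callers, slow speakers
        "endpointing": {"silence_threshold": 800, "min_silence_duration": 2500},
        "utterance_end_ms": 2000,
        "vad_events": True,
    },
}
```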

Keywords & Keyterms

Keywords (Nova-2, Nova, Enhanced, Base):
  • Boost recognition of specific words
  • Format: word:boost_factor (e.g., Deepgram:2.0, API:1.5)
  • Great for company names, technical terms
Keyterms (Nova-3 only, English only):
  • Advanced keyword detection
  • Format: word1, word2, word3
  • More sophisticated than keywords
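
On the wire, both map to Deepgram query parameters: `keywords` takes repeatable `word:boost` pairs, while Nova-3 uses `keyterm` phrases instead. A sketch of building the request URL:

```python
from urllib.parse import urlencode

# Keywords: Nova-2 / Nova / Enhanced / Base, word:boost pairs.
keyword_params = [("model", "nova-2"),
                  ("keywords", "Deepgram:2.0"),
                  ("keywords", "API:1.5")]

# Keyterms: Nova-3 only, English only, plain phrases.
keyterm_params = [("model", "nova-3"),
                  ("keyterm", "Deepgram"),
                  ("keyterm", "API latency")]

url = "wss://api.deepgram.com/v1/listen?" + urlencode(keyword_params)
```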

Audio Denoising

When to Enable:
  • Noisy environments (restaurants, offices, outdoors)
  • Poor phone connections
  • Background music or chatter
Trade-offs:
  • Slightly increases latency (~50-100ms)
  • Improves transcription accuracy in noisy conditions
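
This guide doesn't document a config path for denoising, so the key below is purely hypothetical, shown only to make the trade-off concrete:

```python
stt_settings = {
    "denoising": True,  # HYPOTHETICAL key name: ~50-100ms extra latency,
                        # better accuracy with background noise
}
```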

Troubleshooting

Speech Detection Problems:
  • AI misses words: Enable denoising or add keywords for important terms
  • Cuts off callers mid-sentence: Increase endpointing (10ms → 500ms) and utterance end timeout (see the sketch after this list)
  • Long awkward pauses: Decrease min silence duration for faster internal processing
  • Interrupts slow speakers: Increase endpointing and min silence duration
  • Misses trailing words: Enable VAD events and increase utterance end timeout
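
For instance, the mid-sentence cutoff fix above maps to the documented paths like this:

```python
# "Cuts off callers mid-sentence": raise both timing knobs.
stt_settings["endpointing"]["silence_threshold"] = 500  # was 10 (default)
stt_settings["utterance_end_ms"] = 2000                 # was 1000 (default)
```
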
Language & Recognition:
  • Wrong language detected: Set correct language code or use “custom” option
  • Technical terms not recognized: Add them as keywords/keyterms with boost factors
  • Company names garbled: Add company/product names to keywords list
Audio Quality:
  • Noisy background: Enable audio denoising and make sure VAD events are enabled
  • Poor phone connection: Enable denoising and use more conservative timing settings
  • Multiple speakers: Use higher silence thresholds to avoid cross-talk issues

Best Practices

  • Start with defaults and adjust based on testing
  • Test with real calls in your target environment
  • Use keywords for your business-specific terminology
  • Enable denoising if you expect background noise
  • Monitor call quality and adjust timing as needed

How STT Works with Call Management

🔗 STT + Call Management = Natural Conversations

STT Settings control when Deepgram detects speech has ended. Call Management Settings control how your AI responds to that detected speech. Both must work together for natural conversation flow!
The Flow:
  1. STT detects speech using your timing settings (silence threshold, VAD, etc.)
  2. Call Management decides response using interruption and timeout settings
  3. Result: Natural conversation or awkward pauses
Key Relationships:
  • STT min_silence_duration (internal timeout) should be longer than Call Management interruption_cooldown
  • Lower STT endpointing (more responsive) works well with lower Call Management interruption_threshold
  • Higher STT timing settings pair well with patient Call Management idle_timeout
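
Putting the relationships together, a consistent pairing might look like this sketch; the `call_management` key names come from this guide, but the non-STT values and units are illustrative assumptions:

```python
stt_settings = {
    "endpointing": {
        "silence_threshold": 10,       # responsive STT...
        "min_silence_duration": 1500,  # ...and this must exceed the cooldown
    },
    "utterance_end_ms": 1000,
    "vad_events": True,
}

call_management = {
    "interruption_cooldown": 1000,  # keep below min_silence_duration (1500)
    "interruption_threshold": 2,    # ILLUSTRATIVE value: a low threshold
                                    # pairs well with low endpointing
    "idle_timeout": 30,             # ILLUSTRATIVE value/units: patient idle
                                    # timeout for higher STT timing settings
}
```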