STT Providers & Settings

Deepgram Models Comparison

Model	Features	Keywords	Keyterms	Best For
Nova-3	Latest, keyterms support	❌	✅	English calls, best accuracy
Nova-2	Keywords support	✅	❌	Multi-language, reliable
Nova	Keywords support	✅	❌	Balanced performance
Enhanced	Keywords support	✅	❌	Legacy support
Base	Keywords support	✅	❌	Basic transcription

Key Settings

Model & Language

Model: Choose based on your needs (Nova-3 recommended for English)
Language: Select from common options or enter a custom language code
Custom Language: Enter any Deepgram-supported language code (e.g., fr-FR, es-ES)

Advanced Timing Controls

These settings control how Deepgram detects when someone has finished speaking. Getting these right is crucial for natural conversation flow.

Endpointing (Silence Threshold)

What it does: How long Deepgram waits after detecting silence before considering speech has ended.Technical Details:

Measured in: Milliseconds
Default: 10ms (Deepgram’s minimal endpointing for real-time applications)
Range: 10ms - 2000ms (recommended)
Config Path: stt_settings.endpointing.silence_threshold

Real Example:

10ms: Very responsive (default) - might cut off slow speakers
500ms: “I need help with…” → 0.5s silence → Deepgram says “speech ended”
1000ms: More patient (good for people who pause while thinking)

When to Adjust:

Lower (10-100ms): For fast talkers or quick interactions (default)
Higher (500-1000ms): For elderly callers or complex topics
Much higher (1500ms+): For people with speech difficulties

Min Silence Duration

What it does: Internal timeout for utterance processing when Deepgram doesn’t send speech_final (not sent to Deepgram API).Technical Details:

Measured in: Milliseconds
Default: 1500ms
Range: 500ms - 5000ms (recommended)
Config Path: stt_settings.endpointing.min_silence_duration
Used for: Call handler utterance timeout logic when speech_final is missing

Real Example:

1500ms: Wait 1.5s for speech_final, then process accumulated utterance (default)
1000ms: Quicker timeout for responsive conversation
2500ms: More patience for complex responses or noisy environments

When to Adjust:

Lower (500-1000ms): For quick, responsive interactions
Higher (2000-3000ms): For environments with background noise where speech_final may be unreliable
Match with conversation style: Shorter for rapid-fire Q&A, longer for detailed discussions

Utterance End Timeout

What it does: Maximum time Deepgram waits for a complete utterance before sending UtteranceEnd event.Technical Details:

Measured in: Milliseconds
Default: 1000ms
Range: 500ms - 5000ms (recommended)
Config Path: stt_settings.utterance_end_ms
API Parameter: utterance_end_ms

Real Example:

1000ms: If someone starts talking but doesn’t finish within 1 second, Deepgram sends UtteranceEnd (default)
500ms: Quick timeout (might cut off long sentences)
2000ms: Patient timeout (good for complex responses)

When to Adjust:

Lower (500-800ms): For short, quick interactions
Higher (1500-3000ms): For detailed conversations or forms
Consider your use case: Customer service vs. quick orders

VAD Events

What it does: Enables Voice Activity Detection events for enhanced speech detection and UtteranceEnd events.Technical Details:

Type: Boolean (true/false)
Default: true (enabled)
Config Path: stt_settings.vad_events
API Parameter: vad_events

Real Example:

true: Enhanced speech detection with UtteranceEnd events when speech_final doesn’t work (recommended)
false: Basic speech detection only (legacy mode)

When to Enable:

Always recommended: Provides better speech detection in noisy environments
Essential for: Background noise, poor connections, multiple speakers
Backup mechanism: When speech_final doesn’t trigger due to audio issues

Why It Matters: VAD events provide UtteranceEnd signals as a fallback when normal speech detection fails due to background noise or audio quality issues.

🎯 Timing Settings Quick Guide

Real-Time/Fast Conversations (Default):

Endpointing: 10ms, Min Silence: 1500ms, Utterance End: 1000ms, VAD Events: true

Balanced Professional:

Endpointing: 300ms, Min Silence: 1500ms, Utterance End: 1500ms, VAD Events: true

Patient/Elderly Callers:

Endpointing: 800ms, Min Silence: 2500ms, Utterance End: 2000ms, VAD Events: true

Processing Options

Keywords & Keyterms

Keywords (Nova-2, Nova, Enhanced, Base):

Boost recognition of specific words
Format: word:boost_factor (e.g., Deepgram:2.0, API:1.5)
Great for company names, technical terms

Keyterms (Nova-3 only, English only):

Advanced keyword detection
Format: word1, word2, word3
More sophisticated than keywords

Audio Denoising

When to Enable:

Noisy environments (restaurants, offices, outdoors)
Poor phone connections
Background music or chatter

Trade-offs:

Slightly increases latency (~50-100ms)
Improves transcription accuracy in noisy conditions

Troubleshooting

Common STT Issues & Solutions

Speech Detection Problems:

AI misses words: Enable denoising or add keywords for important terms
Cuts off callers mid-sentence: Increase endpointing (10ms → 500ms) and utterance end timeout
Long awkward pauses: Decrease min silence duration for faster internal processing
Interrupts slow speakers: Increase endpointing and min silence duration
Misses trailing words: Enable VAD events and increase utterance end timeout

Language & Recognition:

Wrong language detected: Set correct language code or use “custom” option
Technical terms not recognized: Add them as keywords/keyterms with boost factors
Company names garbled: Add company/product names to keywords list

Audio Quality:

Noisy background: Enable audio denoising and increase VAD turnoff
Poor phone connection: Enable denoising and use more conservative timing settings
Multiple speakers: Use higher silence thresholds to avoid cross-talk issues

Best Practices

Start with defaults and adjust based on testing
Test with real calls in your target environment
Use keywords for your business-specific terminology
Enable denoising if you expect background noise
Monitor call quality and adjust timing as needed

How STT Works with Call Management

🔗 STT + Call Management = Natural Conversations

STT Settings control when Deepgram detects speech has ended.Call Management Settings control how your AI responds to that detected speech.Both must work together for natural conversation flow!

The Flow:

STT detects speech using your timing settings (silence threshold, VAD, etc.)
Call Management decides response using interruption and timeout settings
Result: Natural conversation or awkward pauses

Key Relationships:

STT min_silence_duration (internal timeout) should be longer than Call Management interruption_cooldown
Lower STT endpointing (more responsive) works well with lower Call Management interruption_threshold
Higher STT timing settings pair well with patient Call Management idle_timeout

Getting Started

Core Concepts

AI Providers

Features

Advanced

Help & Resources

Deepgram Models Comparison

Key Settings

Endpointing (Silence Threshold)

Min Silence Duration

Utterance End Timeout

VAD Events

🎯 Timing Settings Quick Guide

Audio Denoising

Troubleshooting

Best Practices

How STT Works with Call Management

🔗 STT + Call Management = Natural Conversations

Getting Started

Core Concepts

AI Providers

Features

Advanced

Help & Resources

​Deepgram Models Comparison

​Key Settings

​Endpointing (Silence Threshold)

​Min Silence Duration

​Utterance End Timeout

​VAD Events

🎯 Timing Settings Quick Guide

​Audio Denoising

​Troubleshooting

​Best Practices

​How STT Works with Call Management

🔗 STT + Call Management = Natural Conversations

Deepgram Models Comparison

Key Settings

Endpointing (Silence Threshold)

Min Silence Duration

Utterance End Timeout

VAD Events

Audio Denoising

Troubleshooting

Best Practices

How STT Works with Call Management