Voice typing glossary

Definitions for the terminology you'll encounter while evaluating, configuring, or troubleshooting voice typing and AI dictation tools.

Acoustic model
A speech recognition component that maps audio waveforms to phonemes (sound units). Combined with a language model to produce a transcript.
Auto-punctuation
Automatic insertion of commas, periods, and other punctuation by the speech recognition system based on speaking patterns and pauses.
Cloud transcription
Speech recognition performed on a remote server (e.g., Azure Speech, Deepgram). Faster than on-device on weak hardware but requires internet and trust in the provider.
Custom dictionary
A user-managed list of words, names, and acronyms that should be preserved exactly during transcription. Critical for technical or domain-specific vocabulary.
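As a rough illustration, one simple way to honor a custom dictionary is a post-processing pass that restores each listed term's exact spelling and casing wherever the transcript contains it. This is only a sketch under that assumption: the function name `apply_custom_dictionary` is hypothetical, and real tools may instead bias the recognizer itself toward the listed terms.

```python
import re
from typing import Iterable


def apply_custom_dictionary(text: str, terms: Iterable[str]) -> str:
    """Rewrite case-insensitive matches of each term to its canonical form."""
    # Longest terms first, so multi-word entries win over shorter overlaps.
    for term in sorted(terms, key=len, reverse=True):
        # Lookarounds keep the match from firing inside a longer word.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
        )
        # A function replacement inserts the term literally (no escape handling).
        text = pattern.sub(lambda _match, canonical=term: canonical, text)
    return text


print(apply_custom_dictionary(
    "deploy fluidvox on kubernetes", ["FluidVox", "Kubernetes"]
))
```

A pass like this can only fix casing and spacing of words the recognizer already got right; it cannot recover a term that was transcribed as a different word.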
Diarization
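Separating a multi-speaker recording into segments by speaker and labeling who spoke when. Diarization distinguishes speakers (Speaker 1, Speaker 2); attaching real names to them requires separate speaker identification.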
Dictation
Historical term for converting speech to text, often associated with dedicated apps for medical, legal, or court-reporting use. Now overlaps heavily with "voice typing."
End-of-utterance detection
The mechanism by which a speech recognition system determines that the speaker has finished a phrase. Often based on a silence threshold: a pause longer than a set duration ends the utterance.
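A silence-threshold detector can be sketched as follows. This is a minimal illustration, not how any particular product works: it assumes the audio has already been reduced to one energy value per frame, and the function name, the `energy_threshold` value, and the `min_silence_frames` value are all hypothetical.

```python
from typing import Optional, Sequence


def find_end_of_utterance(
    frame_energies: Sequence[float],
    energy_threshold: float = 0.01,   # assumed: below this, a frame is silent
    min_silence_frames: int = 25,     # assumed: this many silent frames ends it
) -> Optional[int]:
    """Return the frame index where the utterance is judged finished, or None."""
    heard_speech = False
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0            # any speech resets the pause counter
        elif heard_speech:
            # Only count silence that follows speech; leading silence is ignored.
            silent_run += 1
            if silent_run >= min_silence_frames:
                return index
    return None
```

A shorter silence window makes dictation feel more responsive but risks cutting the speaker off mid-thought; a longer one does the opposite.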
Filler words
Verbal fillers like "uh," "um," "like," "you know" that AI cleanup typically removes from transcripts.
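To show the idea in its simplest form, the sketch below strips only the unambiguous fillers "uh" and "um" with a regular expression. It is a rule-based stand-in, not the language-model cleanup described in this glossary, and the name `strip_fillers` is hypothetical. Words such as "like" and "you know" are left alone on purpose, because deciding whether they are fillers depends on context.

```python
import re

# Only fillers that are almost never meaningful words on their own.
UNAMBIGUOUS_FILLERS = ("uh", "um")

_FILLER_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in UNAMBIGUOUS_FILLERS) + r")\b,?\s*",
    re.IGNORECASE,
)


def strip_fillers(text: str) -> str:
    """Remove standalone 'uh'/'um' plus an immediately following comma and spaces."""
    cleaned = _FILLER_PATTERN.sub("", text)
    # Collapse any doubled spaces left behind and trim the ends.
    return re.sub(r"\s{2,}", " ", cleaned).strip()


print(strip_fillers("I think um we should uh ship it"))
```

The word boundaries in the pattern keep words that merely start with a filler, such as "umbrella", intact.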
Hands-free toggle
A dictation activation mode where the user toggles dictation on/off rather than holding a key. Useful for sustained dictation sessions and accessibility.
Hotkey
A keyboard shortcut that activates dictation. Common defaults: Fn (macOS), Win+H (Windows), Ctrl+Shift+Space (FluidVox Windows default).
LLM cleanup
Post-processing of a raw transcript by a language model to remove fillers, fix grammar, add punctuation, and apply tone matching.
Language model
In speech recognition, the component that predicts likely word sequences to improve accuracy. In modern AI cleanup, an LLM that polishes the raw transcript.
On-device transcription
Speech recognition performed locally on the user's computer. Privacy-preserving, works offline, may be slower on older hardware.
Parakeet
A family of open-source speech recognition models from NVIDIA, optimized for streaming and on-device deployment.
Per-app tone matching
Automatic adjustment of transcription style based on the active app — casual in Slack, professional in Outlook, technical in VS Code.
Punctuation prediction
A speech recognition feature that infers commas, periods, question marks, and other punctuation from speech patterns.
Speech-to-text (STT)
The general term for converting audio of speech into written text. Synonymous with speech recognition.
Streaming recognition
Speech-to-text that produces partial results as the user speaks, finalized when the utterance ends. Lower perceived latency than batch recognition.
Transcript history
A searchable archive of past dictation sessions, typically stored locally for privacy.
Voice command
A spoken instruction that triggers an action other than text insertion — e.g., "Hey Vox, translate this," "delete that," "select all."
Voice typing
The modern term for live keyboard-replacement dictation that works in any app. Emphasizes the "type anywhere" capability over older single-app dictation tools.
Whisper
OpenAI's family of open-source speech recognition models. The large variants (such as large-v2 and large-v3) are the most accurate; smaller variants (medium, small, base, tiny) trade accuracy for speed and resource use.
Word Error Rate (WER)
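A standard metric for speech recognition accuracy: the number of substituted, deleted, and inserted words in a transcript, divided by the number of words in the reference and expressed as a percentage. Lower is better; modern English models commonly achieve 5–10%, though results depend heavily on audio quality, accent, and vocabulary.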
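WER is conventionally computed as a word-level edit distance between the transcript and the reference, divided by the reference length. The sketch below shows that calculation under simple assumptions: words are split on whitespace and compared case-insensitively, and punctuation is not normalized. The function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the transcript contains many more words than the reference.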

Try FluidVox free for 14 days

Full access, no credit card required. Then $2.99/month or $39 one-time.
