Voice typing terminology is fragmented: speech recognition has its own vocabulary, and modern AI dictation has another. This glossary collects both into a single quick reference for evaluating tools.
Voice typing glossary
Definitions for the terminology you'll encounter while evaluating, configuring, or troubleshooting voice typing and AI dictation tools.
- Acoustic model
- A speech recognition component that maps audio waveforms to phonemes (sound units). Combined with a language model to produce a transcript.
- Auto-punctuation
- Automatic insertion of commas, periods, and other punctuation by the speech recognition system based on speaking patterns and pauses.
- Cloud transcription
- Speech recognition performed on a remote server (e.g., Azure Speech, Deepgram). Often faster than on-device transcription on weak hardware, but requires an internet connection and trust in the provider.
- Custom dictionary
- A user-managed list of words, names, and acronyms that should be preserved exactly during transcription. Critical for technical or domain-specific vocabulary.
- Diarization
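- Splitting a multi-speaker recording by speaker and labeling which speaker spoke each segment (for example, "Speaker 1" and "Speaker 2"), without necessarily identifying speakers by name.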
- Dictation
- Historical term for converting speech to text, often associated with dedicated apps for medical, legal, or court-reporting use. Now overlaps heavily with "voice typing."
- End-of-utterance detection
- The mechanism by which a speech recognition system determines that the speaker has finished a phrase. Often based on a silence threshold.
- Filler words
- Verbal fillers like "uh," "um," "like," "you know" that AI cleanup typically removes from transcripts.
- Hands-free toggle
- A dictation activation mode where the user toggles dictation on/off rather than holding a key. Useful for sustained dictation sessions and accessibility.
- Hotkey
- A keyboard shortcut that activates dictation. Common defaults: Fn (macOS), Win+H (Windows), Ctrl+Shift+Space (FluidVox Windows default).
- LLM cleanup
- Post-processing of a raw transcript by a language model to remove fillers, fix grammar, add punctuation, and apply tone matching.
- Language model
- In speech recognition, the component that predicts likely word sequences to improve accuracy. In modern AI cleanup, an LLM that polishes the raw transcript.
- On-device transcription
- Speech recognition performed locally on the user's computer. Privacy-preserving, works offline, may be slower on older hardware.
- Parakeet
- A family of open-source speech recognition models from NVIDIA, optimized for streaming and on-device deployment.
- Per-app tone matching
- Automatic adjustment of transcription style based on the active app — casual in Slack, professional in Outlook, technical in VS Code.
- Punctuation prediction
- A speech recognition feature that infers commas, periods, question marks, and other punctuation from speech patterns.
- Speech-to-text (STT)
- The general term for converting audio of speech into written text. Synonymous with speech recognition.
- Streaming recognition
- Speech-to-text that produces partial results as the user speaks, finalized when the utterance ends. Lower perceived latency than batch recognition.
- Transcript history
- A searchable archive of past dictation sessions, typically stored locally for privacy.
- Voice command
- A spoken instruction that triggers an action other than text insertion — e.g., "Hey Vox, translate this," "delete that," "select all."
- Voice typing
- The modern term for live keyboard-replacement dictation that works in any app. Emphasizes the "type anywhere" capability over older single-app dictation tools.
- Whisper
- OpenAI's family of open-source speech recognition models. The large variants (such as large-v2 and large-v3) are the most accurate; smaller variants (medium, small, base, tiny) trade accuracy for speed and resource use.
- Word Error Rate (WER)
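- A standard metric for speech recognition accuracy: the number of substituted, deleted, and inserted words in a transcript, divided by the number of words in a reference transcript. Lower is better; modern English models achieve 5–10%, though the figure varies with microphone quality, accent, and vocabulary.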
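To make the WER definition concrete, here is a minimal sketch of how the metric can be computed: a word-level edit distance between a reference and a transcript, divided by the reference length. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not part of any tool described here. The normalization (lowercasing and splitting on whitespace) is also a simplification; real evaluations usually normalize punctuation and numbers as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    # Simplifying assumption: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("the" -> "a") and one deletion ("by")
# against a 7-word reference.
wer = word_error_rate(
    "send the report to alex by friday",
    "send a report to alex friday",
)
print(f"{wer:.1%}")  # prints 28.6%
```

Because insertions count as errors, WER can exceed 100% when a transcript contains many extra words, so it is better read as an error ratio than as a strict percentage of wrong words.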
Frequently asked questions
The most important term to understand is probably WER (Word Error Rate). It's the standard accuracy metric, and knowing what affects it (microphone, accent, vocabulary) helps you set realistic expectations.
Not every tool supports every feature in this glossary. Free built-in tools like Apple Dictation and Windows voice typing (Win+H) lack many of the features defined here, such as custom dictionaries, per-app tone matching, and LLM cleanup. Modern third-party tools like FluidVox include them.
This glossary was last updated in May 2026.
Try FluidVox free for 14 days
Full access, no credit card required. Then $2.99/month or $39 one-time.