Glossary · Updated July 2026

Voice typing glossary: 60+ speech-to-text and dictation terms, explained in plain English

Q: What is it called when speech is turned into text?

The general term is speech-to-text (STT), also called speech recognition or automatic speech recognition (ASR). When it happens live at your cursor it is usually called voice typing or dictation, and when it is applied to recorded audio it is called transcription. These terms all describe the same underlying technology used in different contexts.

Q: What are the techniques of speech-to-text?

Classic systems used a pipeline of an acoustic model (audio to phonemes) plus a language model (phonemes to likely words). Modern systems like Whisper and Parakeet are end-to-end neural networks — a single transformer model that maps audio directly to text. On top of either, tools differ in delivery technique (streaming vs batch recognition) and post-processing (auto-punctuation and LLM cleanup).

Q: What are some common speech-to-text errors?

The most common are substitutions (often homophones like their/there), proper-noun errors on names and jargon, dropped or inserted words, punctuation mistakes, and hallucinations — invented text during silence or background noise. A custom dictionary fixes most proper-noun errors, and voice activity detection prevents most hallucinations; FluidVox includes both.

Q: What are the terms related to speech recognition?

The core vocabulary includes ASR, speech-to-text (STT), text-to-speech (TTS), phoneme, utterance, Word Error Rate (WER), voice activity detection (VAD), diarization, endpointing, latency, streaming vs batch recognition, and on-device vs cloud transcription. This glossary defines all of them, plus the newer AI-dictation terms like LLM cleanup and per-app style matching.

Q: Is speech-to-text generative AI?

Partly. Classic speech recognition is not generative AI — it classifies sounds into words. But modern models like Whisper use the same transformer architecture as generative AI and literally generate the transcript token by token, and the LLM cleanup step that polishes a raw transcript is unambiguously generative AI. So a modern dictation app is best described as speech recognition plus a generative rewriting layer.

Q: What is the difference between speech-to-text and text-to-speech?

They are opposites. Speech-to-text (STT) listens to spoken audio and produces written text — that is what dictation and voice typing apps do. Text-to-speech (TTS) takes written text and synthesizes a spoken voice — that is what screen readers and voice assistants use to talk back to you.

Q: What is a good Word Error Rate?

On public benchmarks, the best open models average around 5-6% WER — NVIDIA's Parakeet-TDT 0.6B v2 topped the Hugging Face Open ASR Leaderboard at roughly 6% average WER as of 2026. Real-world results depend heavily on your microphone, accent, and vocabulary, so 5-10% is a realistic range for clean English speech. With a custom dictionary and LLM cleanup on top, the errors you actually have to fix by hand drop well below that.

Q: What is the difference between dictation and transcription?

Dictation is live: you speak and text appears at your cursor in real time, ready to send or edit. Transcription is retrospective: recorded audio (a meeting, lecture, or voice memo) is converted into a written document afterwards. Some apps only do one of these; FluidVox does both — system-wide dictation plus free on-device file transcription.

Q: What is the difference between on-device and cloud transcription?

On-device transcription runs the speech model locally on your computer, so audio never leaves your machine and it works offline. Cloud transcription sends audio to a remote server, which can offer more compute but requires internet and trust in the provider. FluidVox's Local plan is fully on-device; its Pro plans add an optional cloud engine.

Q: Do free built-in tools support all the features in this glossary?

No. Built-in tools like Apple Dictation, Windows Win+H voice typing, and Google Docs voice typing are free and fine for casual use, but they lack a custom dictionary, replacement rules, LLM cleanup, per-app style matching, and transcript history. Those features are what separate modern third-party dictation apps like FluidVox from the built-ins.

Speech recognition has fifty years of jargon, and AI dictation added a fresh layer on top of it. This glossary defines every term you'll meet while evaluating, configuring, or troubleshooting a voice typing tool — from WER and VAD to LLM cleanup and per-app style matching — grouped by topic so related terms sit together.

Five near-synonyms

Voice typing vs dictation vs transcription vs speech-to-text vs voice recognition

Term	What it usually means	Typical context
Voice typing	Live keyboard-replacement dictation that works in any app	Modern apps (FluidVox, Win+H, Google Docs)
Dictation	Speaking to produce written text; the older, broader word	Apple Dictation, medical/legal workflows
Transcription	Converting recorded audio into a written document	Meetings, lectures, podcasts, subtitles
Speech-to-text (STT)	The underlying technology, regardless of use	Engineering docs, APIs, research
Voice recognition	Strictly: identifying who is speaking — but widely used to mean speech recognition	Security, biometrics; casual product copy

In practice the first four overlap heavily — a good modern app does live voice typing and file transcription with the same speech-to-text engine. If you're new to the category, start with our overview of what voice typing is, then come back here when you hit an unfamiliar term.

Orientation

How the terms fit together

Almost every term below belongs to one stage of the same pipeline. You press a hotkey (push-to-talk or hands-free). Voice activity detection confirms you're speaking. A speech recognition model — on-device or cloud — converts audio to words, often streaming partial results as you talk. Endpointing decides you've finished. An optional LLM cleanup pass removes fillers and applies a style, your custom dictionary corrects names and jargon, and finally text injection places the result at your cursor.

We walk through this whole pipeline step by step in how AI dictation works.
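
To make the stage order concrete, here is a minimal Python sketch of the pipeline described above. Every function is a hypothetical stand-in invented for illustration: the toy "frames" are already words, the cleanup step is a fixed filler list rather than an LLM, and none of this reflects how FluidVox or any real engine is implemented.

```python
def detect_speech(frames):
    """Stand-in for VAD: keep only frames marked as speech (non-None here)."""
    return [f for f in frames if f is not None]


def recognize(frames):
    """Stand-in for the speech model: each toy frame is already a word."""
    return " ".join(frames)


def cleanup(text, fillers=("um", "uh")):
    """Stand-in for LLM cleanup: drop a fixed list of filler words."""
    return " ".join(w for w in text.split() if w.lower() not in fillers)


def apply_dictionary(text, dictionary):
    """Stand-in for the custom dictionary: swap misheard forms for preferred spellings."""
    for heard, preferred in dictionary.items():
        text = text.replace(heard, preferred)
    return text


def dictate(frames, dictionary):
    """Run the stages in the order the paragraph above describes."""
    speech = detect_speech(frames)                  # 1. VAD confirms you're speaking
    raw = recognize(speech)                         # 2. the model converts audio to words
    cleaned = cleanup(raw)                          # 3. optional cleanup pass
    final = apply_dictionary(cleaned, dictionary)   # 4. dictionary corrections
    return final                                    # 5. a real app would inject this at the cursor


print(dictate(["um", "deploy", None, "to", "cooper", "netties"],
              {"cooper netties": "Kubernetes"}))
# deploy to Kubernetes
```

The point of the sketch is only the ordering: silence is dropped before recognition, and corrections run after cleanup so the final text is what reaches your cursor.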

Jump to a section

What's in this glossary

Core concepts — STT, ASR, dictation, wake word…
Models & engines — Whisper, Parakeet, transformers…
Accuracy & errors — WER, hallucination, homophones…
Audio & signal — VAD, phoneme, endpointing…
Dictation workflow — push-to-talk, latency, injection…
AI cleanup & formatting — LLM cleanup, fillers, styles…
Privacy, deployment & platforms — on-device, BYOK, Voice Access…

Section 1

Core concepts

Automatic speech recognition (ASR): The engineering term for technology that converts spoken audio into text. Used interchangeably with "speech recognition" and "speech-to-text" in research papers, API docs, and model names.
Dictation: Speaking with the intent of producing written text. Historically associated with dedicated medical, legal, and court-reporting software; today it overlaps almost completely with "voice typing."
Natural language processing (NLP): The broader AI field concerned with understanding and generating human language. In dictation tools, NLP shows up after recognition — punctuation prediction, filler removal, and LLM cleanup are all NLP.
Speech-to-text (STT): The general term for converting audio of speech into written text — the answer to "what is it called when speech is turned into text?" Synonymous with speech recognition and ASR.
Text-to-speech (TTS): The reverse of STT: synthesizing a spoken voice from written text. Screen readers, audiobook generators, and voice assistants replying out loud all use TTS. Easy to confuse in acronym form; the direction is the whole difference.
Transcription: Producing a written record from recorded audio — meetings, lectures, interviews, voice memos. Contrast with dictation, which is live. FluidVox does both: system-wide dictation plus on-device file transcription, free on every plan.
Voice command: A spoken instruction that triggers an action rather than inserting text — "delete that," "select all," "Hey Vox, translate this." Distinguishes dictation tools from voice-control tools like Windows Voice Access, which are built almost entirely around commands.
Voice recognition: Strictly, identifying who is speaking (speaker recognition, a biometric task) — not what they said. In everyday usage and product marketing it's used loosely to mean speech recognition, which causes endless confusion.
Voice typing: The modern term for live keyboard-replacement dictation that works in any app — you speak, text lands at your cursor. Emphasizes "type anywhere" over older single-app dictation. See our full guide: what is voice typing?
Wake word: A phrase that activates an always-listening assistant — "Hey Siri," "Alexa." Hotkey-based dictation tools deliberately avoid wake words: the microphone is only live while you hold or toggle the key, which is better for both privacy and false-trigger rates.

Section 2

Models & engines

Acoustic model: In classic speech recognition, the component that maps audio waveforms to phonemes (sound units). Combined with a language model to produce a transcript. Modern end-to-end models fold this step into a single network.
End-to-end model: A single neural network that maps audio directly to text, replacing the older acoustic-model-plus-language-model pipeline. Whisper and Parakeet are both end-to-end models — one reason modern accuracy jumped so sharply.
Language model: In classic speech recognition, the component that predicts likely word sequences so the system can choose "recognize speech" over "wreck a nice beach." The term now also covers the large language models (LLMs) used to polish transcripts after recognition.
Parakeet: A family of open-source speech recognition models from NVIDIA, optimized for speed and on-device use. Parakeet-TDT 0.6B v2 (released May 2025 under a commercial-friendly CC-BY-4.0 license) topped the Hugging Face Open ASR Leaderboard with roughly 6% average WER. FluidVox offers Parakeet as one of its on-device engines.
Quantization: Compressing a model's weights to lower numeric precision so it runs faster and fits in less memory, with a small accuracy cost. Quantization is a big part of why billion-parameter speech models now run comfortably on consumer laptops.
Real-time factor (RTF / RTFx): How fast a model transcribes relative to the audio's duration. RTF is processing time divided by audio duration, so an RTF below 1 means faster than real time; RTFx is the inverse convention (audio duration divided by processing time, higher is better), so an RTFx of 100 means one minute of audio transcribes in 0.6 seconds.
Transformer: The neural network architecture (introduced in 2017) behind both modern speech models like Whisper and the LLMs used for cleanup. Its ability to weigh context across a whole utterance is why modern transcription handles ambiguity so much better than older systems.
Whisper: OpenAI's open-source family of speech recognition models, first released in 2022 and trained on huge multilingual audio datasets. As of 2026, large-v3 is the highest-accuracy variant, and large-v3-turbo (October 2024) trims the decoder from 32 layers to 4 for near-v3 accuracy at several times the speed. Smaller variants (medium, small, base, tiny) trade accuracy for resource use. FluidVox runs Whisper models on-device.
whisper.cpp: A popular C/C++ port of Whisper optimized for efficient CPU and Apple Silicon inference. It's the engine behind many Mac dictation and transcription tools, and the free, DIY route we point Linux users to in our Superwhisper comparison, since neither FluidVox nor Superwhisper supports Linux.
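
The real-time factor entry in this section comes down to two divisions, which the following Python sketch spells out. The function names are made up for illustration, and the numbers are the ones from the definition, not a measurement of any particular model.

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time / audio duration (lower is better)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def rtfx(processing_seconds: float, audio_seconds: float) -> float:
    """Inverse convention: audio duration / processing time (higher is better)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def seconds_to_transcribe(audio_seconds: float, rtfx_value: float) -> float:
    """How long a clip takes to transcribe at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


# One minute of audio at an RTFx of 100:
print(seconds_to_transcribe(60, 100))  # 0.6
```

When comparing engines, check which convention a benchmark reports: an RTF of 0.01 and an RTFx of 100 describe the same speed.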

Section 3

Accuracy & errors

Confidence score: A per-word or per-utterance probability the model attaches to its own output. Some tools use low-confidence spans to decide what to send to a cleanup model or flag for review.
Deletion / insertion / substitution: The three error types WER counts: a word the system dropped (deletion), a word it invented (insertion), and a word it swapped for another (substitution). Substitutions — especially homophones — are the most common in practice.
Hallucination: When a model outputs text that was never said — most often whole invented phrases during silence, music, or noise. Whisper is known to hallucinate on near-silent segments; voice activity detection (trimming silence before it reaches the model) is the standard mitigation.
Homophone error: Substituting a word that sounds identical but is spelled differently — their/there/they're, to/too/two. Language-model context resolves most of these; the stragglers are what LLM cleanup and your own proofread catch.
Open ASR Leaderboard: A public Hugging Face benchmark ranking speech recognition models by average WER across multiple English test sets, with speed (RTFx) reported alongside. The closest thing the field has to a neutral scoreboard for comparing engines.
Proper-noun error: Misrecognizing names, brands, acronyms, and domain jargon — the errors users notice most, because "Kubernetes" becoming "cooper netties" is embarrassing in a work message. A custom dictionary is the fix, since no general model knows your colleagues' names.
Word Error Rate (WER): The standard accuracy metric: substitutions + insertions + deletions, divided by the number of words in a reference transcript, expressed as a percentage. Lower is better; the best models score around 5-6% on benchmarks, and 5-10% is realistic for clean English in the real world. We unpack what actually moves this number in voice typing accuracy explained.
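
The WER formula above can be worked through in a few lines of Python. This is a minimal sketch that assumes whitespace tokenization and exact, case-sensitive word matching; real benchmarks normalize text (casing, punctuation, numerals) before scoring, and the function name `wer_counts` is ours, not a library API.

```python
def wer_counts(reference: str, hypothesis: str) -> dict:
    """Count substitutions, deletions, and insertions via word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    # Walk back through the table to label each edit.
    i, j = len(ref), len(hyp)
    subs = dels = ins = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    return {"substitutions": subs, "deletions": dels, "insertions": ins,
            "wer": (subs + dels + ins) / len(ref)}


result = wer_counts("please deploy the service to Kubernetes today",
                    "please deploy service to cooper netties today")
print(result["substitutions"], result["deletions"], result["insertions"])  # 1 1 1
print(round(result["wer"] * 100, 1))  # 42.9
```

Here one proper-noun slip costs two errors (a substitution plus an insertion) and a dropped "the" costs a third, so three errors against seven reference words gives about 43% WER. Short utterances swing the percentage hard, which is one reason benchmark averages use long test sets.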

Section 4

Audio & signal

End-of-utterance detection (endpointing): How a system decides you've finished a phrase, usually via a silence threshold. Tuned too aggressively, it cuts you off mid-thought; tuned too loosely, text takes ages to finalize. Push-to-talk sidesteps the guesswork: releasing the key is the endpoint.
Noise suppression: Filtering background sound (fans, keyboards, café chatter) from the microphone signal before recognition. Done by the OS, the mic hardware, or the app; good suppression measurably lowers WER in noisy rooms.
Phoneme: The smallest unit of sound that distinguishes words — English has roughly 44. Classic recognizers explicitly modeled phonemes; end-to-end models learn equivalent representations internally.
Sample rate: How many times per second audio is measured, in hertz. Speech models typically consume 16 kHz audio — far below music-quality 44.1 kHz, but ample for the frequency range of human speech.
Utterance: A continuous stretch of speech bounded by pauses — the basic unit recognition systems process and finalize. A dictated sentence is usually one utterance; a rambling paragraph may be several.
Voice Activity Detection (VAD): A lightweight detector that classifies each moment of audio as speech or non-speech. VAD trims silence, saves compute, feeds endpointing, and — critically — keeps silent stretches away from big models that would otherwise hallucinate text into them.
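
To show how VAD feeds endpointing, here is a rough Python sketch using a plain energy threshold. This is a deliberate simplification: production VADs are usually small trained models, and the frame size (160 samples, which is 10 ms at 16 kHz), the 0.02 threshold, and the three-frame silence window are all illustrative assumptions rather than values from any real tool.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def vad_flags(samples, frame_size=160, threshold=0.02):
    """Label each full frame True (speech) or False (non-speech) by RMS energy."""
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_rms(frame) >= threshold)
    return flags


def find_endpoint(flags, trailing_silence_frames=3):
    """Return the frame index where the utterance is declared finished, or None."""
    seen_speech = False
    silent_run = 0
    for index, is_speech in enumerate(flags):
        if is_speech:
            seen_speech = True
            silent_run = 0
        elif seen_speech:
            silent_run += 1
            if silent_run >= trailing_silence_frames:
                return index
    return None


silence = [0.0] * 160
speech = [0.1, -0.1] * 80          # 160 samples with RMS of about 0.1
samples = silence * 2 + speech * 3 + silence * 4
flags = vad_flags(samples)
print(flags)                 # [False, False, True, True, True, False, False, False, False]
print(find_endpoint(flags))  # 7
```

Raising `trailing_silence_frames` is the "relaxed" setting from the endpointing entry (fewer cut-offs, slower finalization); lowering it is the "aggressive" one. Only the frames flagged True would be passed on to the recognition model.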

Section 5

Dictation workflow

Auto-learned corrections: A feature where the app notices you repeatedly fixing the same misrecognition after injection and proposes a replacement rule so it never happens again. FluidVox validates candidate corrections with AI before adopting them, so a one-off edit doesn't become a permanent rule.
Batch transcription: Processing a complete recording in one pass rather than streaming as it plays. Because the model sees full context, batch results can be slightly more accurate — it's the mode used for file transcription.
Custom dictionary: A user-managed list of words, names, and acronyms the system should preserve exactly. The single highest-leverage feature for anyone with technical or domain-specific vocabulary — and one that built-in tools like Apple Dictation lack.
File transcription: Importing audio or video files and getting a transcript back — the "MacWhisper" use case. In FluidVox this runs on-device and is free on every plan; Pro plans add cloud file transcription minutes on top.
Hands-free mode: A toggle activation mode: press once to start dictating, press again to stop — no key held down. Better for long-form dictation and essential for accessibility. In FluidVox it's Fn+Space on Mac and Ctrl+Shift+H on Windows by default.
Hotkey: The keyboard shortcut that activates dictation. Common defaults: Fn (Apple Dictation and FluidVox on macOS), Win+H (Windows voice typing), Ctrl+Shift+Space (FluidVox on Windows). Good tools let you rebind it.
Latency: The delay between finishing speaking and final text appearing at your cursor. Streaming recognition plus fast endpointing keeps perceived latency under a second; adding an LLM cleanup pass trades a moment more delay for polished output.
Partial results: The provisional words a streaming recognizer shows while you're still talking, revised as more context arrives and finalized at the end of the utterance. This is why live dictation text sometimes visibly "changes its mind."
Per-app style matching: Automatically applying a different transcription style based on the active app — casual in Slack, professional in Outlook, code-formatted in VS Code. FluidVox ships this across its 6 styles; built-in dictation tools have no equivalent.
Push-to-talk (hold-to-talk): Hold a key while speaking, release to stop — the walkie-talkie model. It gives the system a perfect endpoint signal and guarantees the mic is only ever live while your finger says so. FluidVox's primary mode: hold Fn on Mac, Ctrl+Shift+Space on Windows.
Replacement rule: A find-and-replace applied automatically to every transcript — "j s" → "JS", "fluid vox" → "FluidVox". Created manually or auto-learned from your corrections; the companion to a custom dictionary.
Streaming recognition: Speech-to-text that produces partial results while you speak, finalized when the utterance ends. Lower perceived latency than batch recognition; the standard mode for live voice typing.
Text injection: Placing the finished text at your cursor in whatever app has focus, typically via OS accessibility APIs. This is the capability that separates true system-wide voice typing from apps where dictation only works inside their own window.
Transcript history: A searchable archive of past dictation sessions. Useful for recovering text that missed its target field; privacy-conscious tools (FluidVox included) store it locally rather than in the cloud.
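
The replacement rule entry in this section describes an automatic find-and-replace, which might look something like the Python sketch below. It assumes whole-word, case-insensitive matching with longer rules applied first; those are our design guesses for illustration, not a description of how FluidVox stores or applies its rules.

```python
import re


def apply_rules(text: str, rules: dict[str, str]) -> str:
    """Apply each 'heard form' -> 'preferred form' rule as a whole-word replacement."""
    # Longest rules first, so a more specific rule is not pre-empted by a shorter one.
    for heard in sorted(rules, key=len, reverse=True):
        pattern = rf"\b{re.escape(heard)}\b"
        preferred = rules[heard]
        # A function replacement keeps backslashes in `preferred` literal.
        text = re.sub(pattern, lambda _match, p=preferred: p, text, flags=re.IGNORECASE)
    return text


rules = {"j s": "JS", "fluid vox": "FluidVox"}
print(apply_rules("Fluid vox handles j s snippets", rules))
# FluidVox handles JS snippets
```

The word-boundary anchors matter: without them, a rule for "j s" could fire inside unrelated text, which is the kind of one-off misfire that auto-learned corrections need to be validated against.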

Section 6

AI cleanup & formatting

Auto-punctuation: Inferring commas, periods, and question marks from speaking rhythm and grammar instead of making you say "comma" out loud. Standard in modern engines; older tools like the deprecated Windows Speech Recognition required spoken punctuation.
Disfluency: The false starts, repetitions, and mid-sentence self-corrections natural to speech — "we should, no wait, let's first…". Good AI cleanup resolves these to what you meant, keeping the correction and dropping the abandoned start.
Filler words: Verbal padding like "uh," "um," "like," "you know." A raw verbatim transcript keeps them; cleanup removes them — usually the most immediately visible difference between built-in dictation and an AI dictation app.
LLM cleanup: Post-processing a raw transcript with a large language model to remove fillers, fix grammar, repair disfluencies, and apply a style — the step that turns how you talk into how you write. See how AI dictation works for where it sits in the pipeline.
Transcription style: A preset controlling the tone and formatting of cleaned-up output. FluidVox has six — natural, casual, professional, code, notes, and email — selectable manually or applied automatically per app.
Verbatim vs clean transcript: A verbatim transcript records every um, stutter, and false start — required in legal and research contexts. A clean (or "intelligent verbatim") transcript reads like writing. Voice typing wants clean; know which one a tool produces before relying on it.

Section 7

Privacy, deployment & platforms

Apple Dictation: The dictation built into macOS and iOS (press Fn or the mic key). Free and processed on-device for many languages, but no custom dictionary, no styles, no LLM cleanup. We compare it feature-by-feature in our Apple Dictation alternative guide.
BYOK (bring your own key): Supplying your own API key for a cloud AI service instead of paying the app to proxy it. FluidVox's Local plan works this way for the Vox Agent — you plug in your own Gemini key; the Pro plans include AI usage so no key is needed.
Cloud transcription: Speech recognition performed on a remote server. Offloads compute from your machine and can be faster on weak hardware, but requires internet and trust in the provider's audio handling. FluidVox's Pro plans include a cloud engine covering 46 languages.
Diarization: Labeling who spoke each segment in multi-speaker audio ("Speaker 1:", "Speaker 2:"). Essential for meeting and interview transcription; irrelevant for single-voice dictation, which is why dictation-first apps often skip it.
Google Docs voice typing: The free Tools → Voice typing feature in Google Docs. Solid for drafting inside Docs, but it's browser-bound — it can't type into your email client, chat apps, or anything outside a Docs tab.
Offline mode: Full dictation with no internet connection — only possible when transcription is on-device. A quick litmus test for privacy claims: if an app keeps working in airplane mode, your audio genuinely isn't leaving the machine.
On-device transcription: Speech recognition performed locally on your own hardware. Privacy-preserving and offline-capable; large models may run slower on older machines. FluidVox's Local plan is fully on-device — audio never leaves the device, across 99 languages.
Voice Access: Microsoft's voice-control feature for Windows 11 (22H2 and later), which officially replaced the deprecated Windows Speech Recognition in September 2024. Focused on controlling the PC by voice, with dictation included. Full story in our Windows Speech Recognition alternative guide.
Win+H voice typing: Windows' built-in dictation, launched with the Win+H shortcut. Free and works in any text field, but speech is processed via Microsoft's online services on most PCs, and there's no custom dictionary or style control.

Try FluidVox free for 14 days

Full access, no credit card required. Then $2.99/month or $39 one-time.