How AI dictation works

Modern voice typing combines streaming speech recognition with LLM-powered cleanup. Here's the four-stage pipeline that turns "uh let's schedule a meeting tomorrow at 2 pm" into "Let's schedule a meeting tomorrow at 2 PM."

Stage 1

Audio capture

When you press the dictation hotkey, the app starts recording audio from your microphone. Audio is sampled at 16 kHz or 48 kHz and buffered in short chunks (typically 50–200 milliseconds each). For tools that support pre-connect buffering (FluidVox does this), audio is captured immediately and queued while the speech recognition service connects in the background — the user perceives no startup latency.
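
Here's a minimal sketch of that capture-and-queue pattern using AVAudioEngine. The class and property names (ChunkedRecorder, pendingChunks, connectionReady) are illustrative, not FluidVox's actual implementation:

```swift
import AVFoundation

/// Captures microphone audio in short chunks and queues them until the
/// recognition connection is ready, so no speech is lost during connect.
final class ChunkedRecorder {
    private let engine = AVAudioEngine()
    private var pendingChunks: [AVAudioPCMBuffer] = []  // pre-connect buffer
    var connectionReady = false  // flipped once the recognition socket opens
    var onChunk: ((AVAudioPCMBuffer) -> Void)?

    func start() throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)
        // At 16 kHz, 1,600 frames is ~100 ms of audio. The tap's bufferSize
        // is a hint, so actual chunk sizes may vary slightly.
        input.installTap(onBus: 0, bufferSize: 1600, format: format) { [weak self] buffer, _ in
            guard let self else { return }
            if self.connectionReady {
                self.flushPending()
                self.onChunk?(buffer)
            } else {
                self.pendingChunks.append(buffer)  // queue while connecting
            }
        }
        try engine.start()  // recording begins before any network round trip
    }

    private func flushPending() {
        pendingChunks.forEach { onChunk?($0) }
        pendingChunks.removeAll()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```

A production recorder would also synchronize the queue (the tap fires on an audio thread) and convert buffers to the sample rate the recognizer expects.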

Stage 2

Speech recognition

Audio chunks flow to a speech recognition model. There are three common architectures:

  • Streaming WebSocket models (Deepgram, Azure Speech) — return partial results as you speak, finalize on silence.
  • Batch transformer models (OpenAI Whisper, NVIDIA Parakeet) — process complete utterances in one pass, often used for file transcription. Can run on-device.
  • Hybrid systems — use streaming for live dictation and Whisper for fallback or file transcription.

The output is raw text — typically lowercase, unpunctuated, and still carrying filler words like "uh" and "um."
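
To make the streaming variant concrete, here's a sketch of the send/receive loop using URLSessionWebSocketTask. The endpoint URL and the is_final flag are placeholders; each provider defines its own wire format:

```swift
import Foundation

/// Minimal streaming-recognition client: audio chunks go out over a
/// WebSocket; partial and final transcripts come back as they're ready.
final class StreamingRecognizer {
    private var socket: URLSessionWebSocketTask?
    var onTranscript: ((String, Bool) -> Void)?  // (text, isFinal)

    func connect() {
        let url = URL(string: "wss://api.example.com/v1/listen?sample_rate=16000")!
        socket = URLSession.shared.webSocketTask(with: url)
        socket?.resume()
        receiveLoop()
    }

    func send(chunk: Data) {
        socket?.send(.data(chunk)) { error in
            if let error { print("send failed: \(error)") }
        }
    }

    private func receiveLoop() {
        socket?.receive { [weak self] result in
            guard case .success(let message) = result else { return }  // stop on close/error
            if case .string(let json) = message {
                // Real payloads carry the hypothesis text plus a partial/final
                // flag; parsing is elided because the shape is provider-specific.
                self?.onTranscript?(json, json.contains("\"is_final\":true"))
            }
            self?.receiveLoop()  // keep listening for the next result
        }
    }

    func finish() {
        socket?.cancel(with: .normalClosure, reason: nil)
    }
}
```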

Stage 3

AI cleanup

The raw transcript flows to a language model (often from the Gemini, GPT, or Claude families) with a system prompt that says something like: "Clean this transcribed speech. Remove filler words. Add punctuation. Fix grammar. Apply casual / professional / technical tone for the active app: Slack."

The LLM typically returns polished text in 200–500 ms. This step is what separates modern voice typing from older dictation tools that skipped cleanup.
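
As a sketch, the cleanup call is a single chat-style request. The endpoint, model name, and JSON fields below are placeholders, not any particular provider's API:

```swift
import Foundation

/// Sends the raw transcript to an LLM with a cleanup system prompt and
/// returns the response body. Endpoint and schema are illustrative only.
func cleanTranscript(_ raw: String, tone: String, appName: String) async throws -> String {
    let systemPrompt = """
    Clean this transcribed speech. Remove filler words. Add punctuation. \
    Fix grammar. Apply \(tone) tone for the active app: \(appName).
    """
    var request = URLRequest(url: URL(string: "https://api.example.com/v1/chat")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: [
        "model": "small-fast-model",  // placeholder; chosen for latency, not depth
        "messages": [
            ["role": "system", "content": systemPrompt],
            ["role": "user", "content": raw],
        ],
    ])
    let (data, _) = try await URLSession.shared.data(for: request)
    // Decoding the provider's response schema is elided here.
    return String(decoding: data, as: UTF8.self)
}
```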

Many tools also apply local rules at this stage — custom dictionary substitutions, replacement rules, and personal corrections — to ensure accuracy on user-specific vocabulary.
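
Those local rules are simple to picture: an ordered list of substitutions applied on top of the LLM output. The rules below are made-up examples:

```swift
import Foundation

/// A user-defined substitution, applied after LLM cleanup so personal
/// vocabulary survives even when the model wouldn't know it.
struct ReplacementRule {
    let match: String
    let replacement: String
}

func applyLocalRules(_ text: String, rules: [ReplacementRule]) -> String {
    rules.reduce(text) { partial, rule in
        // Whole-word, case-insensitive matching keeps short rules like
        // "k8s" from firing inside longer words.
        partial.replacingOccurrences(
            of: "\\b\(NSRegularExpression.escapedPattern(for: rule.match))\\b",
            with: rule.replacement,
            options: [.regularExpression, .caseInsensitive]
        )
    }
}

// "send the k8s config to acme corp" → "send the Kubernetes config to Acme Corp"
let rules = [
    ReplacementRule(match: "k8s", replacement: "Kubernetes"),
    ReplacementRule(match: "acme corp", replacement: "Acme Corp"),
]
```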

Stage 4

Text injection

The cleaned text is inserted into the active text field. On macOS, this happens via the AXUIElement Accessibility API. On Windows, via SendInput simulating keyboard input. On iOS and Android, via custom keyboard extensions or accessibility services.

For streaming tools, injection happens progressively — partial results appear as you speak and are finalized when you release the hotkey.
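
On macOS, one common approach is to set the focused element's selected-text attribute, which replaces the selection or inserts at the caret. Here's a sketch of that technique (it assumes the Accessibility permission has already been granted):

```swift
import ApplicationServices

/// Inserts text at the cursor of the frontmost app's focused text field
/// via the Accessibility API. One common technique, sketched; apps
/// without AX support need a synthesized-keystroke fallback.
func injectText(_ text: String) {
    let systemWide = AXUIElementCreateSystemWide()
    var focusedRef: CFTypeRef?
    let result = AXUIElementCopyAttributeValue(
        systemWide, kAXFocusedUIElementAttribute as CFString, &focusedRef)
    guard result == .success,
          let focused = focusedRef,
          CFGetTypeID(focused) == AXUIElementGetTypeID() else {
        return  // no focused element, or Accessibility permission not granted
    }
    let element = focused as! AXUIElement
    // Setting the selected text replaces the current selection, or inserts
    // at the caret when nothing is selected.
    AXUIElementSetAttributeValue(
        element, kAXSelectedTextAttribute as CFString, text as CFString)
}
```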

Where it gets clever

Per-app context awareness

The piece that distinguishes 2026 voice typing from 2020 dictation: the system knows what app you're in. Before sending audio to recognition, the app captures the active app's bundle identifier (macOS) or process name (Windows). That identifier informs:

  • Which transcription style to apply (casual in Slack, professional in Outlook).
  • Whether to use a code-aware preset (technical style in VS Code, Cursor).
  • Which custom dictionary subset to prioritize.

FluidVox stores this mapping as per-app style categories — Messages, Email, Code, Notes, etc. — that you can configure or leave at their defaults.
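
In code, the lookup is just a dictionary keyed by bundle identifier. The mapping below is illustrative, not FluidVox's actual configuration format:

```swift
import AppKit

/// Style categories keyed by the frontmost app's bundle identifier.
enum StyleCategory: String {
    case messages, email, code, notes
}

// Illustrative defaults; a real tool lets users edit this mapping.
let defaultStyles: [String: StyleCategory] = [
    "com.tinyspeck.slackmacgap": .messages,  // Slack
    "com.microsoft.Outlook": .email,
    "com.microsoft.VSCode": .code,
    "com.apple.Notes": .notes,
]

func activeStyle() -> StyleCategory {
    // NSWorkspace reports the frontmost app; fall back to a neutral
    // style for apps the user hasn't categorized.
    let bundleID = NSWorkspace.shared.frontmostApplication?.bundleIdentifier
    return bundleID.flatMap { defaultStyles[$0] } ?? .notes
}
```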

On-device vs cloud trade-offs

Where the pipeline runs

The same four-stage pipeline can run entirely on-device or split across cloud services:

  • Fully on-device: Whisper or Parakeet runs locally for recognition. Local rules + small LLMs (or no LLM cleanup) handle stage 3. Examples: FluidVox Local, Aiko.
  • Hybrid: Streaming recognition via cloud API (Deepgram), LLM cleanup via cloud API (Gemini). Examples: FluidVox Pro, Wispr Flow.
  • Cloud-only: Both stages 2 and 3 run in the cloud. Example: Windows Voice Typing.

The trade-off is privacy vs latency vs hardware load. On-device transcription uses local CPU/GPU and may be slower on older Macs. Cloud is faster on weak hardware but requires an internet connection.
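
Structurally, the split is just a routing decision at stage 2 (and again at stage 3). A sketch, with illustrative type names:

```swift
import Foundation

/// Stage 2 behind a protocol so the pipeline can swap implementations.
protocol Recognizer {
    func transcribe(_ audio: Data) async throws -> String
}

struct LocalModelRecognizer: Recognizer {  // on-device Whisper/Parakeet
    func transcribe(_ audio: Data) async throws -> String {
        ""  // run the local model; elided
    }
}

struct CloudStreamingRecognizer: Recognizer {  // cloud streaming API
    func transcribe(_ audio: Data) async throws -> String {
        ""  // stream over a WebSocket; elided
    }
}

enum PipelineMode { case onDevice, hybrid, cloudOnly }

func recognizer(for mode: PipelineMode) -> Recognizer {
    switch mode {
    case .onDevice:
        return LocalModelRecognizer()
    case .hybrid, .cloudOnly:
        return CloudStreamingRecognizer()
    }
}
```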

Try FluidVox free for 14 days

Full access, no credit card required. Then $2.99/month or $39 one-time.

Start free trial