How AI dictation works

Modern voice typing combines streaming speech recognition with LLM-powered cleanup. Here's the four-stage pipeline that turns "uh let's schedule a meeting tomorrow at 2 pm" into "Let's schedule a meeting tomorrow at 2 PM."

Stage 1

Audio capture

When you press the dictation hotkey, the app starts recording audio from your microphone. Audio is sampled at 16 kHz or 48 kHz and buffered in short chunks (typically 50–200 milliseconds each). For tools that support pre-connect buffering (FluidVox does this), audio is captured immediately and queued while the speech recognition service connects in the background — the user perceives no startup latency.
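
Here's a minimal sketch of that capture-and-queue pattern using AVAudioEngine. The class and property names (ChunkedRecorder, pendingChunks, connectionReady) are illustrative, not FluidVox's actual implementation:

```swift
import AVFoundation

/// Captures microphone audio in short chunks and queues them until the
/// recognition connection is ready, so no speech is lost during connect.
final class ChunkedRecorder {
    private let engine = AVAudioEngine()
    private var pendingChunks: [AVAudioPCMBuffer] = []  // pre-connect buffer
    var connectionReady = false  // flipped once the recognition socket opens
    var onChunk: ((AVAudioPCMBuffer) -> Void)?

    func start() throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)
        // At 16 kHz, 1,600 frames is ~100 ms of audio. The tap's bufferSize
        // is a hint, so actual chunk sizes may vary slightly.
        input.installTap(onBus: 0, bufferSize: 1600, format: format) { [weak self] buffer, _ in
            guard let self else { return }
            if self.connectionReady {
                self.flushPending()
                self.onChunk?(buffer)
            } else {
                self.pendingChunks.append(buffer)  // queue while connecting
            }
        }
        try engine.start()  // recording begins before any network round trip
    }

    private func flushPending() {
        pendingChunks.forEach { onChunk?($0) }
        pendingChunks.removeAll()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```

A production recorder would also synchronize the queue (the tap fires on an audio thread) and convert buffers to the sample rate the recognizer expects.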

Stage 2

Speech recognition

Audio chunks flow to a speech recognition model. There are three common architectures:

  • Streaming WebSocket models (Deepgram, Azure Speech) — return partial results as you speak, finalize on silence.
  • Batch transformer models (OpenAI Whisper, NVIDIA Parakeet) — process complete utterances in one pass, often used for file transcription. Can run on-device.
  • Hybrid systems — use streaming for live dictation and Whisper for fallback or file transcription.

The output is raw text — typically lowercase, unpunctuated, and still carrying filler words like "uh" and "um."
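
To make the streaming variant concrete, here's a sketch of the send/receive loop using URLSessionWebSocketTask. The endpoint URL and the is_final flag are placeholders; each provider defines its own wire format:

```swift
import Foundation

/// Minimal streaming-recognition client: audio chunks go out over a
/// WebSocket; partial and final transcripts come back as they're ready.
final class StreamingRecognizer {
    private var socket: URLSessionWebSocketTask?
    var onTranscript: ((String, Bool) -> Void)?  // (text, isFinal)

    func connect() {
        let url = URL(string: "wss://api.example.com/v1/listen?sample_rate=16000")!
        socket = URLSession.shared.webSocketTask(with: url)
        socket?.resume()
        receiveLoop()
    }

    func send(chunk: Data) {
        socket?.send(.data(chunk)) { error in
            if let error { print("send failed: \(error)") }
        }
    }

    private func receiveLoop() {
        socket?.receive { [weak self] result in
            guard case .success(let message) = result else { return }  // stop on close/error
            if case .string(let json) = message {
                // Real payloads carry the hypothesis text plus a partial/final
                // flag; parsing is elided because the shape is provider-specific.
                self?.onTranscript?(json, json.contains("\"is_final\":true"))
            }
            self?.receiveLoop()  // keep listening for the next result
        }
    }

    func finish() {
        socket?.cancel(with: .normalClosure, reason: nil)
    }
}
```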

Stage 3

AI cleanup

The raw transcript flows to a language model (often from the Gemini, GPT, or Claude families) with a system prompt that says something like: "Clean this transcribed speech. Remove filler words. Add punctuation. Fix grammar. Apply casual / professional / technical tone for the active app: Slack."

The LLM typically returns polished text in 200–500 ms. This step is what separates modern voice typing from older dictation tools that skipped cleanup.
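
As a sketch, the cleanup call is a single chat-style request. The endpoint, model name, and JSON fields below are placeholders, not any particular provider's API:

```swift
import Foundation

/// Sends the raw transcript to an LLM with a cleanup system prompt and
/// returns the response body. Endpoint and schema are illustrative only.
func cleanTranscript(_ raw: String, tone: String, appName: String) async throws -> String {
    let systemPrompt = """
    Clean this transcribed speech. Remove filler words. Add punctuation. \
    Fix grammar. Apply \(tone) tone for the active app: \(appName).
    """
    var request = URLRequest(url: URL(string: "https://api.example.com/v1/chat")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: [
        "model": "small-fast-model",  // placeholder; chosen for latency, not depth
        "messages": [
            ["role": "system", "content": systemPrompt],
            ["role": "user", "content": raw],
        ],
    ])
    let (data, _) = try await URLSession.shared.data(for: request)
    // Decoding the provider's response schema is elided here.
    return String(decoding: data, as: UTF8.self)
}
```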

Many tools also apply local rules at this stage — custom dictionary substitutions, replacement rules, and personal corrections — to ensure accuracy on user-specific vocabulary.
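
Those local rules are simple to picture: an ordered list of substitutions applied on top of the LLM output. The rules below are made-up examples:

```swift
import Foundation

/// A user-defined substitution, applied after LLM cleanup so personal
/// vocabulary survives even when the model wouldn't know it.
struct ReplacementRule {
    let match: String
    let replacement: String
}

func applyLocalRules(_ text: String, rules: [ReplacementRule]) -> String {
    rules.reduce(text) { partial, rule in
        // Whole-word, case-insensitive matching keeps short rules like
        // "k8s" from firing inside longer words.
        partial.replacingOccurrences(
            of: "\\b\(NSRegularExpression.escapedPattern(for: rule.match))\\b",
            with: rule.replacement,
            options: [.regularExpression, .caseInsensitive]
        )
    }
}

// "send the k8s config to acme corp" → "send the Kubernetes config to Acme Corp"
let rules = [
    ReplacementRule(match: "k8s", replacement: "Kubernetes"),
    ReplacementRule(match: "acme corp", replacement: "Acme Corp"),
]
```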

Stage 4

Text injection

The cleaned text is inserted into the active text field. On macOS, this happens via the AXUIElement Accessibility API. On Windows, via SendInput simulating keyboard input. On iOS and Android, via custom keyboard extensions or accessibility services.

For streaming tools, injection happens progressively — partial results appear as you speak and are finalized when you release the hotkey.
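
On macOS, one common approach is to set the focused element's selected-text attribute, which replaces the selection or inserts at the caret. Here's a sketch of that technique (it assumes the Accessibility permission has already been granted):

```swift
import ApplicationServices

/// Inserts text at the cursor of the frontmost app's focused text field
/// via the Accessibility API. One common technique, sketched; apps
/// without AX support need a synthesized-keystroke fallback.
func injectText(_ text: String) {
    let systemWide = AXUIElementCreateSystemWide()
    var focusedRef: CFTypeRef?
    let result = AXUIElementCopyAttributeValue(
        systemWide, kAXFocusedUIElementAttribute as CFString, &focusedRef)
    guard result == .success,
          let focused = focusedRef,
          CFGetTypeID(focused) == AXUIElementGetTypeID() else {
        return  // no focused element, or Accessibility permission not granted
    }
    let element = focused as! AXUIElement
    // Setting the selected text replaces the current selection, or inserts
    // at the caret when nothing is selected.
    AXUIElementSetAttributeValue(
        element, kAXSelectedTextAttribute as CFString, text as CFString)
}
```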

Where it gets clever

Per-app context awareness

The piece that distinguishes 2026 voice typing from 2020 dictation: the system knows what app you're in. Before sending audio to recognition, the app captures the active app's bundle identifier (macOS) or process name (Windows). That identifier informs:

  • Which transcription style to apply (casual in Slack, professional in Outlook).
  • Whether to use a code-aware preset (technical style in VS Code, Cursor).
  • Which custom dictionary subset to prioritize.

FluidVox stores this mapping as per-app style categories — Messages, Email, Code, Notes, etc. — that you can configure or leave at their defaults.
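
In code, the lookup is just a dictionary keyed by bundle identifier. The mapping below is illustrative, not FluidVox's actual configuration format:

```swift
import AppKit

/// Style categories keyed by the frontmost app's bundle identifier.
enum StyleCategory: String {
    case messages, email, code, notes
}

// Illustrative defaults; a real tool lets users edit this mapping.
let defaultStyles: [String: StyleCategory] = [
    "com.tinyspeck.slackmacgap": .messages,  // Slack
    "com.microsoft.Outlook": .email,
    "com.microsoft.VSCode": .code,
    "com.apple.Notes": .notes,
]

func activeStyle() -> StyleCategory {
    // NSWorkspace reports the frontmost app; fall back to a neutral
    // style for apps the user hasn't categorized.
    let bundleID = NSWorkspace.shared.frontmostApplication?.bundleIdentifier
    return bundleID.flatMap { defaultStyles[$0] } ?? .notes
}
```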

On-device vs cloud trade-offs

Where the pipeline runs

The same four-stage pipeline can run entirely on-device or split across cloud services:

  • Fully on-device: Whisper or Parakeet runs locally for recognition. Local rules + small LLMs (or no LLM cleanup) handle stage 3. Examples: FluidVox Local, Aiko.
  • Hybrid: Streaming recognition via cloud API (Deepgram), LLM cleanup via cloud API (Gemini). Examples: FluidVox Pro, Wispr Flow.
  • Cloud-only: Both stages 2 and 3 run in the cloud. Example: Windows Voice Typing.

The trade-off is privacy vs latency vs hardware load. On-device transcription uses local CPU/GPU and may be slower on older Macs. Cloud is faster on weak hardware but requires an internet connection.
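
Structurally, the split is just a routing decision at stage 2 (and again at stage 3). A sketch, with illustrative type names:

```swift
import Foundation

/// Stage 2 behind a protocol so the pipeline can swap implementations.
protocol Recognizer {
    func transcribe(_ audio: Data) async throws -> String
}

struct LocalModelRecognizer: Recognizer {  // on-device Whisper/Parakeet
    func transcribe(_ audio: Data) async throws -> String {
        ""  // run the local model; elided
    }
}

struct CloudStreamingRecognizer: Recognizer {  // cloud streaming API
    func transcribe(_ audio: Data) async throws -> String {
        ""  // stream over a WebSocket; elided
    }
}

enum PipelineMode { case onDevice, hybrid, cloudOnly }

func recognizer(for mode: PipelineMode) -> Recognizer {
    switch mode {
    case .onDevice:
        return LocalModelRecognizer()
    case .hybrid, .cloudOnly:
        return CloudStreamingRecognizer()
    }
}
```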

Try FluidVox free for 14 days

Full access, no credit card required. Then $2.99/month or $39 one-time.

Start free trial