By Denys Medvediev

Explainer

Private speech to text, on-device

Private speech to text means your voice is transcribed on your own device, with nothing uploaded to a server. Local Whisper and Parakeet run fully offline. Cloud dictation, by contrast, sends your audio away to be transcribed.

Last updated: June 2026

Private speech to text is transcription that runs on the user's own device, so the recorded voice never leaves the machine. Local engines like Whisper and Parakeet work fully offline with nothing sent to a server. Cloud dictation services upload audio to be transcribed remotely. For maximum privacy, choose a local, offline tool.

Every dictation tool calls itself private. Most of them are not. The word gets stretched to mean "we encrypt the upload" or "we delete it after 30 days," which still means your voice left your machine, sat on someone's server, and was transcribed by a computer you don't own. That is a privacy policy, not privacy. There is a real, narrow version of the word, and it's worth getting straight before you trust a tool with the email to your lawyer.

The honest definition is simple: private speech to text means the audio is turned into words on your device, and nothing is sent anywhere. No upload, no server, no internet required. That version exists, the local pipeline is free, and it runs on the laptop you already own. The catch, and I'll be straight about it, is that the moment you opt into a cloud mode for better accuracy, that promise changes. I'll draw that line clearly.

Here's the thing that gets buried under marketing. "Private" is not a feeling; it's a yes-or-no question: does the audio leave the device or not? If it leaves, someone other than you can, in principle, hear it. If it doesn't, they can't. Everything else (encryption, retention windows, compliance badges) is damage control for the case where it does leave.

So the real question isn't "is this tool private." It's "does my voice get transcribed on my machine or on theirs." Local Whisper and Parakeet do it on yours, offline, with the model loaded into your own RAM. Cloud dictation does it on theirs. This guide explains what that distinction actually buys you, how to set up the local version in two minutes, and the one honest exception where sending audio to the cloud is a reasonable trade.

What "private" actually means for speech to text

Private speech to text means one specific thing: your recorded voice is converted to text on your own device, and the audio never leaves it. No upload to a server, no round-trip over the internet, no third party in the loop. The transcription happens in your own memory and CPU, the way spell-check happens, and then the audio is gone. That's the whole definition, and most tools that use the word "private" don't meet it.

What usually gets sold as "private" is the cloud version with a better lock on the door. The audio still travels to a vendor's servers to be transcribed; the vendor just promises to encrypt it in transit and delete it on some schedule. That's genuinely better than nothing, and for a lot of people it's fine. But it is not the same as the audio never leaving. A promise to delete is a promise. On-device processing is a fact — there's nothing to delete because nothing was sent. When privacy actually matters — a salary figure, a medical note, a draft you'd never want indexed — the difference between a promise and a fact is the whole game.

The reason on-device transcription is even possible now is that the models got small and the laptops got fast. A few years ago you needed a data center to run good speech recognition, which is why everything went to the cloud. Today an open Whisper model runs locally on a mid-range laptop and Parakeet runs faster than that. The cloud was a workaround for hardware that no longer holds you back. Private speech to text isn't a premium feature you pay extra for — it's the default that became practical, and the rest of this guide is about using it.

Why most cloud dictation isn't private

When you press a key in a cloud dictation tool, this is what happens under the hood: your microphone records a few seconds of audio, that audio file is sent over the internet to a server, a model on that server transcribes it, and the text comes back to your screen. The whole thing can take barely a second, which is exactly why it feels invisible. But your voice — the actual recording, not just the words — made a trip to a machine you don't control and back.

Windows Voice Typing is the clearest example, because most people already have it. Press Windows key + H and a little bar opens that types your speech into whatever field has focus. It works well. It is also a cloud service — Microsoft's online speech recognition — which is why it needs an internet connection and stops working on a plane. Your audio goes to Microsoft's servers to become text. The same is true of most "AI dictation" apps shipping today: the clever part runs on someone else's hardware, and a quiet monthly invoice is the cost of renting it. A local tool shows a small capsule while it listens, and the audio it records never leaves the laptop:

The recording overlay: a small capsule that appears while you speak. With a local engine, the audio it captures is transcribed on-device and never uploaded.

I'm not saying cloud transcription is evil; I'll defend it later for the cases where it earns its keep. I'm saying the marketing word "private" usually describes the lock on the upload, not the absence of an upload. The risk with cloud-only dictation is that the uploads happen silently, and often the first visible sign is the bill. I once watched a team rack up a five-figure cloud-AI charge in a single quarter, mostly from a "smart retry" bug that re-sent the same standup recordings four times over. The CFO opened the dashboard at the quarterly review and the room went very quiet. Nobody had decided to send all that audio to a server. The tool just did, every time, because that's how it worked.

How local speech to text keeps it private

The private version runs entirely on your machine. You press a hotkey, speak, release, and a model that's already loaded into your own RAM turns the audio into text and pastes it at your cursor — no internet, no server, nothing sent. You need a Mac on Apple Silicon or a Windows 10-or-newer PC, a working microphone, and a couple of minutes. The whole local pipeline is free for any signed-in account, with no payment method asked for at sign-up. Here's the sequence.

Step 1 — Install Whisper and sign in.

Download from the download page, install, and create a free account. No card. The whole local transcription pipeline opens right away, offline.

You'll know it worked when the app's tray icon appears and the setup wizard offers to pick a model.

Step 2 — Pick a local transcription path.

The app doesn't choose for you. For private, offline dictation, pick Local Parakeet or Local Whisper — both run on your machine. The third option, Cloud, uploads audio, so leave it off if privacy is the point.

You'll know it worked when a local model finishes downloading and shows as ready.

Step 3 — Confirm your hotkey.

On Windows the default hotkey is Ctrl+Space; on Mac it's Command+Option, held as push-to-talk. On Mac, grant the Accessibility permission when prompted; without it, the paste-at-cursor can't reach other apps.

You'll know it worked when a test recording pastes into any text field.

Step 4 — Pull out your network cable and talk anyway.

This is the privacy test. Turn off Wi-Fi and unplug any Ethernet cable, put your cursor in any text box, hold the hotkey, say a sentence, release. The transcript still appears, because the model ran locally.

You'll know it worked when dictation works with the internet switched off entirely.

The real Whisper desktop app on the settings screen, with the local Transcription and AI panels open.

The slow part is the one-time model download, which obviously needs the internet. After that, the audio never goes online again in local mode. The pull-the-cable test in step four isn't a gimmick — it's the only proof that matters. If dictation keeps working with the network off, the audio is being transcribed on your device, full stop. If it stops, it was going somewhere. That single test cuts through every "private" claim on every marketing page.
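If you'd rather confirm you're really offline before running the test, a few lines of Python can check for you. This is a minimal sketch, not part of the app: the function name is made up for illustration, and the probe target (the public resolver at 1.1.1.1 on port 53) is an assumption you can swap for any host you trust to be up.

```python
import socket


def can_reach_internet(host="1.1.1.1", port=53, timeout=2.0,
                       connect=socket.create_connection):
    """Return True if a TCP connection to a public host succeeds.

    host/port are illustrative assumptions; `connect` is injectable
    so the check can be exercised without touching the network.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False
    conn.close()
    return True
```

Call `can_reach_internet()` after switching the network off. If it returns False and your dictation still produces text, the transcription ran on your machine. The check only tells you the machine is offline; the proof is still the transcript appearing.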

Even the AI cleanup can stay on your machine

Here's the part most people don't realize they should ask about. Raw dictation comes out as a run-on — no punctuation, the occasional "um," sentences that ramble. The fix is an AI pass that tidies the text into something you'd actually keep. And this is exactly where a lot of "private" local tools quietly phone home: they transcribe on-device, then ship the messy transcript off to a cloud model for the cleanup. The audio stayed private; the words didn't.

Whisper handles the cleanup locally too, through Ollama — a free local model runner that sits on your machine at localhost and never touches the internet. Say the activation phrase "Hey whisper" and the text gets enhanced before it lands at your cursor, with the whole round-trip happening inside your laptop. So the chain stays unbroken: your voice becomes text on your device, and that text gets cleaned on your device. Nothing about the sentence — not the audio, not the draft, not the tidied version — ever leaves.

This is the detail I'd check on any tool that calls itself private. It's easy to keep the transcription local and sneak the enhancement into the cloud, because the enhancement is the bit that needs a big model, and big models are tempting to rent. The boring truth is that for everyday dictation, a local model through Ollama is more than enough to fix punctuation and strip filler. You only need a cloud model when you're asking for something genuinely harder, and that's a choice you should make on purpose — not one the tool makes for you in the background.
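To make "runs at localhost" concrete, here is roughly what a local cleanup call can look like. Treat it as a sketch under stated assumptions, not a description of how the app itself talks to Ollama: the prompt wording, the helper names, and the model name `llama3.2` are all illustrative, and you'd substitute whichever model you have pulled. The endpoint is Ollama's default local address, and the guard refuses any URL that isn't loopback, so the text can't go anywhere else by accident.

```python
import json
import urllib.request
from urllib.parse import urlparse

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3.2"  # assumption: replace with a model you have pulled
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_loopback(url):
    """True only when the URL points back at this machine."""
    return urlparse(url).hostname in LOOPBACK_HOSTS


def build_cleanup_request(raw_text, url=OLLAMA_URL, model=MODEL):
    """Build the POST request for a cleanup pass; refuse non-local URLs."""
    if not is_loopback(url):
        raise ValueError("refusing to send text to a non-local address: " + url)
    payload = {
        "model": model,
        "prompt": "Fix punctuation and remove filler words. "
                  "Return only the cleaned text:\n\n" + raw_text,
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def clean_locally(raw_text):
    """Send the text to the local Ollama server and return its reply."""
    req = build_cleanup_request(raw_text)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"].strip()
```

With Ollama running, `clean_locally("okay so um send it tomorrow")` returns the model's tidied version. The point of the loopback check is the same as the point of this section: the enhancement step is where text tends to slip out, so that's where the guard belongs.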

Local or cloud: which mode for a private workflow

For anything you'd call private, start local. If your Mac is Apple Silicon or your PC is from the last few years, the local engines handle everyday dictation without complaint, and the cloud becomes the escape hatch rather than the default. The app makes you pick a path on purpose — it doesn't push a default — so here's how the three differ, with privacy in plain sight:

The choice comes down to where the audio is processed and what you need from the transcription.

  • Local Parakeet: NVIDIA's TDT engine, around 600 MB, and the fastest local option, 5 to 10 times faster than Whisper on CPU. Covers 25 languages in total: English plus 24 other European languages. No translate-to-English. Fully on-device, nothing uploaded. The quick private pick if you speak English or another supported European language.
  • Local Whisper: slower than Parakeet on the same machine, but the multilingual builds cover 99 languages and can translate to English. The English-only builds handle English alone. Also fully on-device. Pick this for Chinese, Japanese, Korean, or any translation work, which Parakeet can't do. The default English model is around 480 MB.
  • Cloud (OpenAI, BYOK): best accuracy and web access, using your own OpenAI key, billed directly by OpenAI. Transcription defaults to gpt-4o-mini-transcribe. This is the one path that uploads your audio: it leaves your machine to reach OpenAI. It's opt-in, part of Whisper Pro, and off unless you turn it on.

The line is clean: the two local paths are private by construction — the audio is transcribed on your device and there's nothing to leak. The cloud path is not, and we don't pretend otherwise. It sends your audio to OpenAI, under your own key, because that's the only way to get OpenAI's accuracy and live web access. If your Mac is M-series or your PC is recent, start with local mode and only reach for cloud when local genuinely leaves you wanting. Cloud is the exception you choose, not the default you inherit.
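The three paths reduce to a short decision rule, sketched here in Python. The function and its parameter names are hypothetical, written only to make the logic above explicit; whether Parakeet supports your language is passed in as a flag because the full language list isn't spelled out here.

```python
def pick_engine(sensitive, needs_translation, parakeet_supports_language,
                local_good_enough=True):
    """Pick a transcription path following the rules described above.

    All parameters are illustrative booleans, not app settings.
    """
    # Cloud is the exception: only for non-sensitive content where
    # the local engines genuinely fall short.
    if not sensitive and not local_good_enough:
        return "cloud"
    # Whisper covers translation and the languages Parakeet lacks.
    if needs_translation or not parakeet_supports_language:
        return "local-whisper"
    # Otherwise take the fastest local option.
    return "local-parakeet"
```

Note that sensitive content never resolves to cloud in this sketch, even when local accuracy disappoints. That mirrors the rule of thumb: cloud is something you choose for the right job, not something a fallback picks for you.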

What actually leaves your machine, in each mode

Let's be concrete about the data, because "private" is meaningless without naming what travels. In local mode, the answer is nothing — not the audio, not the transcript, not the cleaned-up version. The recording is processed in your RAM, the cleanup runs through Ollama on your machine, and the only thing that ever moved was the words, from the model into your text box. You can verify it with the network unplugged.

When the AI cleanup runs, the overlay shows an enhancing state while the local model fixes the run-on into something readable. Here's the kind of transform it does — the raw dictation on top, the cleaned text below — all of it happening on your device when you're in local mode:

The overlay during the AI cleanup pass. In local mode this runs through Ollama on your machine, so the text never leaves.
Raw

okay so send the q3 numbers to marcus before the board call and flag the margin dip um but dont cc the whole finance list

Cleaned

Okay, so send the Q3 numbers to Marcus before the board call and flag the margin dip — but don't cc the whole finance list.

In cloud mode, the honest accounting is different and you should know it before you flip the switch. Your audio is uploaded to OpenAI's transcription endpoint, under your own API key, to be turned into text there. If you also use Cloud AI enhancement, the transcript goes to a GPT model; if you use web search, a query goes out too. None of it routes through Remskill — it's a direct line from your machine to OpenAI on your key — but it does leave your machine, which is the only thing that defines whether something is private. That sentence about the Q3 numbers and Marcus is exactly the kind of thing I'd keep local. A recipe I'm dictating for fun, I genuinely do not care.
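The same accounting fits in a small function, if you like your privacy model executable. This is an illustrative summary of the two modes as described here, with made-up names, and it assumes the enhancement and web-search flags only matter once cloud mode is on.

```python
def data_leaving_device(mode, cloud_enhancement=False, web_search=False):
    """List what leaves the machine in a given mode, per the text above."""
    if mode == "local":
        # Local transcription plus Ollama cleanup: nothing is uploaded.
        return []
    if mode != "cloud":
        raise ValueError(f"unknown mode: {mode!r}")
    # Cloud mode always uploads the recording itself.
    leaving = ["audio to OpenAI's transcription endpoint"]
    if cloud_enhancement:
        leaving.append("transcript to a GPT model")
    if web_search:
        leaving.append("search query")
    return leaving
```

An empty list is the whole privacy claim for local mode. Anything else is a list of things you've agreed to send.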

That same speak-then-clean flow works in every app, so once it's set up you can type faster with your voice across your editor, your email, and your terminal — privately, with nothing leaving the laptop in local mode.

When sending audio to the cloud is a fair trade

I'd be lying if I said local is always the answer. Sometimes the cloud is the right call, and pretending otherwise to push a privacy angle would be the same marketing dishonesty I just spent six sections complaining about. The trade is real: you give up the never-leaves-your-machine guarantee, and you get the best transcription accuracy available plus live web access in the same hotkey.

Reach for cloud mode when the content isn't sensitive and the accuracy is what matters. A podcast transcript, a public blog draft, a grocery list, a hard recording with a thick accent or a noisy room where the local model stumbles: none of that needs to stay on your machine, and OpenAI's models will get it cleaner. You're using your own API key, so the audio goes to OpenAI directly and you pay OpenAI's per-minute rate yourself, with no middleman markup. For non-sensitive work where quality is what you're paying for, that's a sensible trade. The mistake isn't using cloud; it's using cloud by default for everything, including the things you'd never want on a server.

And for the genuinely short stuff, skip the dedicated tool entirely. If you're dictating a 30-word text, Windows key + H or macOS Dictation is free and already installed — though note Windows Voice Typing is itself a cloud service, so it's not the private option, just the convenient one. On Apple Silicon, macOS Dictation can process general text on-device, which makes it the one built-in that's actually private for short snippets. Below the 200-word mark, I'm not going to tell you to install anything. The dedicated tool earns its place when notes get long, when you want offline privacy on Windows, or when you want one hotkey that behaves the same everywhere.

If you're picking a tool mainly for the privacy guarantee, the deeper version of this argument lives in the guide to offline speech to text, which walks through running everything with the network unplugged.

"Private" is the most overused word in this category and the easiest to test: unplug the network and see if it still works. Local Whisper and Parakeet pass that test because the audio never leaves your machine, and the AI cleanup passes it too because Ollama runs right there beside them. Cloud mode fails it on purpose, because it's renting OpenAI's accuracy, and that's a fair trade for the right job. I dictated most of this guide with the Wi-Fi off, which is either a strong product demo or a sign I need to get out more. Both can be true.

Dictate privately, starting now

Pick a local model, unplug the network, and talk. The transcript lands at your cursor — and your voice never left the laptop.

Free local mode for any signed-in account. No card required to start.

Denys Medvediev

I'm the one who reads our support email, most probably by dictating the replies.