By Denys Medvediev

Explainer

The NVIDIA Parakeet model

Parakeet is NVIDIA's open speech-to-text model. The current build, parakeet-tdt-0.6b-v3, is about 600 MB, runs offline, and is 5 to 10 times faster than Whisper on a CPU. Here's what it is and how it stacks up.

Last updated: June 2026

The NVIDIA Parakeet model is an open speech-to-text model built on a FastConformer encoder and a Token-and-Duration Transducer decoder. The current release, parakeet-tdt-0.6b-v3, has about 600 million parameters, transcribes 25 European languages including English, and runs 5 to 10 times faster than Whisper on a CPU. It does not translate to English.

Most people meet the word "Parakeet" expecting a bird and leave with a speech-to-text model. It's NVIDIA's, it's open under a permissive license, and the version that matters for everyday dictation is called parakeet-tdt-0.6b-v3. The "0.6b" is the parameter count — about 600 million. On disk it lands around 600 MB. That's small enough to live on your laptop and never call a server.

I care about this for an unglamorous reason: we ship it. Parakeet is one of the local engines inside our app, Whisper, sitting right next to OpenAI's Whisper models, and the question I get most is "which one, and why is the bird so fast." So this is the straight version: what Parakeet actually is, how its decoder makes it quick, and the exact line where I'd hand you to Whisper instead.

Here's the thing the model-card jargon buries. Parakeet is a transcription model and only a transcription model. It listens to audio and writes down the words, with punctuation and capitalization included. It does not summarize, it does not translate to English, and it does not take hotwords. What it does, it does very fast.

So the useful framing isn't "Parakeet vs Whisper, which wins." It's "what is each one for." Parakeet is the fast English-and-European pick that runs fully offline. Whisper is the 99-language, translate-to-English, fine-control pick that's slower on the same machine. I'll explain the speed, give you the language list straight, and show you how to run Parakeet free, locally, in about two minutes.

What the Parakeet model actually is

Parakeet is a family of automatic speech recognition models released by NVIDIA. The one we ship, and the one most people mean, is parakeet-tdt-0.6b-v3, released in August 2025 under the CC-BY-4.0 license. "0.6b" is 600 million parameters. The download is roughly 600 MB. Inside Whisper it arrives as an ONNX model run through transcribe-rs, our pure-Rust transcription layer, which means no Python runtime and no separate process to babysit.
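The two 600s are easy to mix up, so here's the arithmetic connecting them. This is a sketch under an assumption: the weight precision of the shipped file isn't something this article states, so the widths below are illustrative. A file of roughly 600 MB for 600 million parameters works out to about one byte per weight.

```python
def model_size_mb(parameters: int, bytes_per_weight: float) -> float:
    """Approximate size of the weights alone, in megabytes (10^6 bytes)."""
    return parameters * bytes_per_weight / 1_000_000

PARAMS = 600_000_000  # the "0.6b" in parakeet-tdt-0.6b-v3

# Assumed storage widths, for illustration only.
for label, width in [("32-bit floats", 4), ("16-bit floats", 2), ("8-bit weights", 1)]:
    print(f"{label}: {model_size_mb(PARAMS, width):,.0f} MB")
```

Parameter count and file size are different measurements; they only line up numerically when each weight takes about a byte.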

Its job is narrow and it's honest about it. Parakeet takes audio and returns text with automatic punctuation and capitalization, plus word-level timestamps if you ask. It detects the language on its own — you don't tell it what you're speaking. What it doesn't do is just as important: no translation to English, no custom-vocabulary biasing, no "boost these words" hotword list. It transcribes. That's the whole contract.

The "TDT" in the name is the interesting bit, and it's why the model is quick rather than just small. TDT stands for Token-and-Duration Transducer. The encoder is a FastConformer, which is NVIDIA's efficient take on the Conformer architecture that most modern speech models use. The pairing — fast encoder, clever decoder — is the engineering behind the headline number, and it's worth one section on its own.

How a Token-and-Duration Transducer goes fast

Older transducer models walk through audio one tiny frame at a time and, at each frame, ask "is there a new word piece here, or not." Most of the time the answer is "not" — they emit a blank, shuffle forward one frame, and ask again. That blank-emitting loop is most of the work and most of the wasted time. It's the speech-model equivalent of reading a sentence one pixel at a time.

A Token-and-Duration Transducer changes the question. Instead of only predicting the next token, it predicts the token and how many frames to skip before the next one. When there's a stretch of one long vowel or a pause, the model jumps over it in a single step rather than grinding frame by frame. Fewer decoding steps, same words out. That duration prediction is the trick the "TDT" name is pointing at, and it's where the speed comes from.
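To make the step-count difference concrete, here's a toy comparison in Python. It's a sketch, not NVIDIA's decoder: the "model" is an oracle that already knows where each token starts, every frame holds at most one token, and the skip cap of 4 frames is an assumption picked for illustration.

```python
from typing import List, Optional, Tuple

# One entry per audio frame; None means "no new token starts here".
Frames = List[Optional[str]]

def frame_by_frame_decode(frames: Frames) -> Tuple[List[str], int]:
    """Conventional transducer loop: one step per emitted token,
    plus one blank step per frame just to move forward."""
    tokens: List[str] = []
    steps = 0
    for frame in frames:
        if frame is not None:
            tokens.append(frame)
            steps += 1  # step that emits the token
        steps += 1      # blank step that advances a single frame
    return tokens, steps

def duration_skipping_decode(frames: Frames, max_skip: int = 4) -> Tuple[List[str], int]:
    """TDT-style loop: each step yields a token (or nothing) AND a
    duration, so empty stretches are jumped instead of walked."""
    tokens: List[str] = []
    steps = 0
    t = 0
    while t < len(frames):
        if frames[t] is not None:
            tokens.append(frames[t])
        # Oracle duration: jump to the next frame that holds a token,
        # but never further than max_skip frames in one step.
        skip = 1
        while skip < max_skip and t + skip < len(frames) and frames[t + skip] is None:
            skip += 1
        t += skip
        steps += 1
    return tokens, steps

# 16 frames, 4 word pieces, lots of silence and held sounds in between.
audio = ["he", None, None, None, "llo", None, None, None,
         None, None, "wor", "ld", None, None, None, None]

slow_tokens, slow_steps = frame_by_frame_decode(audio)
fast_tokens, fast_steps = duration_skipping_decode(audio)
print(slow_tokens == fast_tokens, slow_steps, fast_steps)
```

Both loops return the same word pieces; the second one just takes far fewer trips through the decoder to get there. A real model has to predict the durations rather than look them up, so the saving depends on how well it learns them.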

The recording overlay: a small capsule that appears while you speak, so you know Whisper is listening.

From your chair, none of that shows. You hold a hotkey, you talk, you release, and the text lands at your cursor — the overlay capsule above is the only thing you see while it listens. The decoder math is hidden plumbing. But it's why Parakeet finishes a chunk of audio while a comparable Whisper model is still chewing on the blanks, and on a CPU that gap is the difference between "instant" and "wait for it."

Parakeet vs. Whisper, without the marketing

People treat this like a cage match. It isn't. They're two tools with different shapes, and inside our app you can keep both installed and switch per recording. The cleanest way to hold it in your head: Parakeet optimizes for speed and offline simplicity; Whisper optimizes for coverage and control.

Parakeet is faster — 5 to 10 times faster than Whisper on a CPU, by NVIDIA's own framing and our own runs. It covers 25 languages, all European, English among them. It punctuates and capitalizes for free. What it gives up: it can't translate other languages into English, it has no hotword or custom-vocabulary biasing, and it doesn't touch the dozens of non-European languages — Chinese, Japanese, Korean, Arabic, Hindi — that Whisper's multilingual builds handle without blinking.

Whisper, in OpenAI's multilingual builds, reaches 99 languages and will translate any of them to English. It also exposes the knobs Parakeet doesn't: beam size, an initial prompt, hotword biasing for names and jargon. The cost is wall-clock time on the same hardware, and bigger models mean more RAM. So the rule of thumb is plain: if you speak English or another European language and you want it now, Parakeet. If you need translation, a non-European language, or fine control, Whisper. The boring truth is most people who try both end up keeping both.
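That rule of thumb is small enough to write down as a function. A minimal sketch: the pick_engine name and the ISO 639-1 code set are mine for illustration, not anything the app exposes, and the codes are an assumed encoding of Parakeet v3's 25-language list.

```python
# Assumed ISO 639-1 codes for the 25 European languages Parakeet v3 covers.
PARAKEET_LANGUAGES = {
    "bg", "hr", "cs", "da", "nl", "en", "et", "fi", "fr", "de",
    "el", "hu", "it", "lv", "lt", "mt", "pl", "pt", "ro", "sk",
    "sl", "es", "sv", "ru", "uk",
}

def pick_engine(language: str,
                need_translation: bool = False,
                need_hotwords: bool = False) -> str:
    """Return "parakeet" or "whisper" following the rule of thumb above."""
    # Translation to English and hotword biasing are Whisper-only features.
    if need_translation or need_hotwords:
        return "whisper"
    # Anything outside Parakeet's language list also goes to Whisper.
    if language.lower() not in PARAKEET_LANGUAGES:
        return "whisper"
    # Otherwise take the faster engine.
    return "parakeet"

print(pick_engine("en"))                         # plain English dictation
print(pick_engine("ja"))                         # Japanese is not covered
print(pick_engine("uk", need_translation=True))  # Ukrainian, translated
```

The order of the checks is the point: features first, language second, speed last.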

The real numbers: speed, and 25 languages

Start with speed, because it's the reason Parakeet exists in our app at all. NVIDIA's stated figure is 5 to 10 times faster than Whisper on a CPU, and that matches what we see. On the public Open ASR Leaderboard the model posts an inverse real-time factor (RTFx) in the thousands, meaning that on a fat GPU it gets through audio thousands of times faster than the audio plays back. You won't have that GPU. But even on a plain laptop CPU, the duration-skipping decoder keeps a short dictation feeling instant rather than laggy.
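The leaderboard number is a throughput ratio: seconds of audio transcribed per second of compute, which the leaderboard labels RTFx. Here's the arithmetic as a small sketch; the function names and the example timings are hypothetical, not measurements from our app.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by compute time.
    Above 1.0 means faster than real time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds

def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """How long a clip takes to transcribe at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value

# Hypothetical example: a 10-second dictation transcribed in half a second.
print(rtfx(10.0, 0.5))

# Hypothetical example: one hour of audio on hardware running at RTFx 2000.
print(processing_time(3600.0, 2000.0))
```

For dictation, what you feel is the second function: how long you wait after releasing the hotkey.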

Now the language list, stated precisely so you don't get burned. Parakeet v3 handles 25 languages, all European, with English as one of them — so English plus 24 others, not 99. The set runs from the obvious (English, French, German, Spanish, Italian, Portuguese, Dutch, Polish) through the Nordics and the Baltics to Russian and Ukrainian. It auto-detects which one you're speaking. If a model page or a forum tells you Parakeet does 99 languages, it's confusing it with Whisper. It does 25, and it does them quickly.

Two more limits worth saying out loud, because they're the ones people trip on. Parakeet has no translate-to-English mode — it transcribes whatever you said in the language you said it, full stop. And it takes no hotwords, so if your dictation is full of unusual product names or surnames, you can't pre-feed them. Neither is a flaw; they're just the edges of a fast, focused model. (The accuracy on plain English is genuinely good — on the standard clean-speech benchmark it sits under 2% word error rate — but "good" and "tunable for your weird jargon" are different promises.)
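If "word error rate" is a new term: it's the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. A minimal sketch of the standard calculation follows; the normalization here (lowercasing and splitting on whitespace) is a simplifying assumption, and real benchmarks normalize text more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# "standup" heard as "stand up": one substitution plus one insertion.
print(word_error_rate("move the standup to ten", "move the stand up to ten"))
```

Under 2% means fewer than two such errors per hundred reference words.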

Run Parakeet free, locally, in two minutes

You don't need an NVIDIA account, a Python install, or a GPU to try this. You need a Mac on Apple Silicon or a Windows 10-or-newer PC, a working microphone, and a few minutes. The whole local pipeline — Parakeet included — is free for any signed-in account, with no payment method asked for at sign-up. Here's the sequence.

Step 1 — Install Whisper and sign in.

Download from the download page, install, and create a free account. No card. The whole local transcription pipeline opens right away.

You'll know it worked when the app's tray icon appears and the setup wizard offers to pick a model.

Step 2 — Choose Local Parakeet.

The app presents three paths and doesn't choose for you: Cloud, Local Parakeet, Local Whisper. Pick Local Parakeet and let the roughly 600 MB model download once.

You'll know it worked when Parakeet finishes downloading and shows as ready.

Step 3 — Confirm your hotkey.

Windows defaults to Ctrl+Space, Mac to Command+Option held as push-to-talk. On Mac, grant the Accessibility permission when prompted; without it, the paste-at-cursor can't reach other apps.

You'll know it worked when a test recording pastes into any text field.

Step 4 — Put your cursor anywhere and talk.

Click into any text box — an email, a doc, a chat — hold the hotkey, say a sentence, release. Parakeet transcribes it and the text appears where the cursor is.

You'll know it worked when your spoken sentence is sitting in the field as text, a beat after you let go.

The real Whisper desktop app on the settings screen, with the Transcription panel where you pick Parakeet.

The slow part is that one model download. Everything after is the four steps above, and once Parakeet is on disk it never phones home — the audio and the transcription both stay on your machine. If you've ever set up dictation on Windows or on Mac, this is the same flow with a faster engine underneath.

Accuracy, run-ons, and cleaning up the text

Raw dictation from any engine, Parakeet included, tends to come out as a run-on. You say "okay so move the standup to ten file the parakeet draft and ping marco," and what you get back is that same wall of words, more or less as spoken. Parakeet does add its own punctuation and capitalization, which is more than a lot of models do, but it isn't going to strip your "ums" or reshape a rambling thought into a clean line.

That's where an AI pass earns its keep. Say the activation phrase "Hey whisper" and the transcribed text gets enhanced before it lands — filler removed, run-ons split, the spoken mess turned into something you'd actually send. On a local setup that runs through Ollama on your own machine; in cloud mode it's gpt-5-mini by default. Parakeet does the listening, the enhancement does the tidying.

Raw

okay so move the standup to ten file the parakeet draft and ping marco um before lunch

Cleaned

Okay, so move the standup to ten, file the Parakeet draft, and ping Marco before lunch.
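For the curious, a local cleanup pass of this kind can be sketched as one request to Ollama's HTTP API. This is not our app's code: the prompt wording, the function names, and the model name are all hypothetical, and it assumes Ollama is running on its default local port with a model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3.2"  # assumption: swap in whatever model you have pulled

def build_cleanup_payload(raw_text: str, model: str = MODEL) -> dict:
    """Build the JSON body for a single non-streaming generate call."""
    prompt = (
        "Clean up this dictated text. Remove filler words, split run-ons, "
        "fix punctuation and capitalization, and keep the meaning unchanged. "
        "Return only the cleaned text.\n\n" + raw_text
    )
    return {"model": model, "prompt": prompt, "stream": False}

def clean_with_ollama(raw_text: str, url: str = OLLAMA_URL) -> str:
    """Send the raw transcript to a local Ollama server and return its reply."""
    body = json.dumps(build_cleanup_payload(raw_text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        # With "stream": False the server answers with one JSON object
        # whose "response" field holds the generated text.
        return json.loads(response.read())["response"].strip()

# Usage (needs a running Ollama server, so it is left commented out):
# print(clean_with_ollama("okay so move the standup to ten um before lunch"))
```

Nothing in that request leaves your machine, which is the whole reason to run the cleanup locally.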

On accuracy itself, Parakeet is genuinely strong on clean English: under 2% word error rate on the standard benchmark, which is in Whisper's neighborhood, not a step below it. The honest caveat is the one nobody markets: no model fixes bad audio. A $20 USB mic does more for your transcription accuracy than swapping engines ever will. I learned that the dull way, after blaming the model for a week of garbled recordings that turned out to be my laptop's built-in mic picking up the dishwasher.

That same speak-then-clean habit pays off well beyond one app — you can type whole emails and docs with your voice using the one hotkey, so a long paragraph becomes a few spoken sentences instead of something you grind out on the keyboard.

When to pick Whisper instead of Parakeet

I'd be doing you a disservice if I sold Parakeet as the answer to everything. It's the fast pick, not the universal one, and there are clear cases where I'd reach past it for one of the Whisper models — or for the free dictation already on your machine.

Pick Whisper over Parakeet when any of these is true. You need a language outside Parakeet's 25 — Chinese, Japanese, Korean, Arabic, Hindi, anything non-European — because Parakeet simply doesn't cover them. You need translate-to-English, which Parakeet has no mode for. Or you dictate heavy jargon, unusual names, or product terms and want hotword biasing to lock them in, which only Whisper exposes. For any of those, Whisper's multilingual builds and their 99-language reach are the right tool, even though they run slower on the same laptop.

And sometimes the right tool isn't ours at all. If you only ever drop a 20-word note into a text field, your operating system already does that for free: Windows key + H opens Voice Typing wherever your cursor is (it needs internet, so it isn't offline), and on a Mac, Dictation under System Settings → Keyboard types anywhere you can, processed on-device on Apple Silicon. Below the threshold where speed, offline privacy, or a clean AI pass actually matter, use what's free. I'm not going to tell you to install an engine for a one-line reminder.

If you're choosing a setup on an Apple machine specifically, the trade-offs between Parakeet, Whisper, and Apple's own dictation are laid out in the best speech-to-text options for Mac, which walks the same speed-versus-coverage call from the Mac side.

Parakeet is a 600 MB model named after a bird that does one thing — turn European speech into text, fast, on your own machine — and refuses to pretend it does more. I find that restraint oddly reassuring in a year where every tool claims to do everything. I dictated the messy first draft of this explainer with Parakeet running locally, then let the AI pass clean up the run-ons, then switched to a Whisper model for one quoted line in Ukrainian that Parakeet handled fine but I wanted to translate. Two engines, one hotkey, no servers. That's the whole point of having both.

Try Parakeet on your own machine

Hold the hotkey, talk, release. Parakeet transcribes it locally and the text lands at your cursor — in every app you open.

Free local mode for any signed-in account. No card required to start.

Denys Medvediev

I'm the one who reads our support email, most probably by dictating the replies.