Guide
Real-time transcription, explained
Two apps wear the same name and do opposite jobs. Here is how to tell live captioning from near-instant dictation, and pick the one your problem actually needs.
Last updated: June 2026

A real-time transcription app turns spoken words into text the moment you speak, with no upload-and-wait step. Two kinds exist: continuous live captioning that streams a transcript during meetings, and near-instant dictation that transcribes on a hotkey release and pastes at the cursor. Which one you want depends on whether you are watching a conversation or writing something.
A finance team I worked with once built their own "real-time transcription" tool. A contractor wired GPT-4 to every laptop's microphone and let it run. At quarter's end the manager opened the cloud dashboard to a five-figure bill. Most of it was one team transcribing standup recordings four times over, because the "smart retry" logic was too eager. The contractor said they should optimize the prompt. The CFO said something shorter. The phrase "real-time transcription" had meant something none of them agreed on.
That mismatch is the whole problem with this category. Two people say "real-time transcription app" and mean two different machines. One wants the words to scroll up the screen while a colleague talks in a Zoom call. The other wants to hold a key, say a sentence, release, and watch it appear in the email they are already writing. This article sorts out which is which, shows how the fast local version works, and tells you when to use a meeting tool instead. By the end you will know which category fits your problem. Most people pick the wrong one on day one. I know, because most of the support email I read is from people who did just that, and I spent my first month answering each one by hand before I thought to explain the difference up front.
The split matters because the two designs are good at opposite jobs. Live captioning is built to never stop. It follows a meeting for an hour and you read along. Dictation is built to end fast: you talk for fifteen seconds, the text lands, you keep working. A cold sales email is twelve variants of eighty words, about twelve minutes by voice against forty-five minutes by hand. A lecture summary is a ninety-minute recording squeezed into a six-hundred-word note. Same phrase in the search bar, two different tools.
What 'real time' actually means

Two honest definitions of "real time" exist, and the apps that claim it split into two camps.
The first is continuous live captioning. The transcript appears word by word while audio is still playing: a meeting, a lecture, a video. You read the text as it streams. Otter does this during calls, with live captions for Zoom and Google Meet. Maestra advertises real-time transcription and translation in 125+ languages with a free live tier. Windows 11 has Live Captions built in, on-device and offline, across about 21 languages. These watch a stream and narrate it.
The second is near-instant dictation. You hold a hotkey, speak a sentence or a paragraph, release, and the finished text appears where your cursor already is. No streaming caption. A short pause measured in a second or two, then the whole block lands. This is what Whisper by Remskill does. It transcribes on hotkey release and pastes at the cursor; the microphone stays open 500 milliseconds after you let go, to catch the last word people trail off on.
Both are "real time" in the sense that matters to a human: you do not record a file, upload it, and wait. But they solve different problems. Live captioning is a reading tool; you are consuming someone else's speech. Dictation is a writing tool; you are producing your own. Mixing them up is how you end up paying a meeting-notes subscription to answer a one-line email, or fighting a dictation app to caption a webinar it was never built to follow.
A third thing gets lumped in here that is not real time at all: file transcription. You record an interview, upload the audio, and the tool returns a transcript a few minutes later. Tools like Rev and Trint are built more for that kind of work, and it is a different job: editing a finished recording, not capturing speech as it happens. It is worth naming so you can rule it out. If you are waiting on an upload progress bar, you are not using a real-time app, whatever the marketing says.
So the category has a shape once you see it. Reading speech that is happening now: live captions. Writing speech that you are saying now: dictation. Cleaning up a recording from earlier: file transcription. The search term "real time transcription app" collides the first two and pulls in the third by accident. Sorting yourself into the right one is the most useful thing you can do before you install anything.
Press a hotkey, get text at the cursor
Here is the dictation loop, start to finish. You press the hotkey: Ctrl+Space on Windows, or Command+Option held together on macOS, a push-to-talk chord where you keep both keys down while you talk and release either one to stop. You speak. You let go. A small overlay shows the app transcribing, and a second or two later the text is sitting in whatever app you were already in: the email, the document, the chat box, the code comment.
No window to switch to. No "copy from the transcription tab and paste it back." The text arrives at the cursor because that is the entire point. You were writing, and now you are writing faster. The transcribing overlay is what you see in the second or two between releasing the key and the words appearing.
Here is why "real time" feels different here than it does in a caption stream. A caption is something you watch happen to someone else. Dictation is something that happens to your own sentence, fast enough that you do not lose the thread of what you were saying. The 500-millisecond tail buffer exists for that reason. People drop their voice at the end of a sentence, and cutting the mic the instant the key lifts would clip the last word. Small detail. It is the difference between "thanks for organizing the tri" and a complete sentence.
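The tail buffer is easier to see with numbers. This is a minimal sketch, not the app's code: `Frame`, `frames_for_clip` and the 100 ms frame size are names and values I made up for illustration. Only the 500 ms tail comes from the description above.

```python
from dataclasses import dataclass

TAIL_MS = 500  # tail length named in the article


@dataclass
class Frame:
    start_ms: int   # when this chunk of audio began
    samples: bytes  # raw audio for the chunk (empty placeholder here)


def frames_for_clip(frames, press_ms, release_ms, tail_ms=TAIL_MS):
    """Keep frames from key press until tail_ms after key release."""
    cutoff_ms = release_ms + tail_ms
    return [f for f in frames if press_ms <= f.start_ms < cutoff_ms]


# Simulated microphone: one frame every 100 ms for three seconds (assumed size).
mic = [Frame(start_ms=t, samples=b"") for t in range(0, 3000, 100)]

# Key held from 200 ms to 1500 ms.
without_tail = frames_for_clip(mic, 200, 1500, tail_ms=0)
with_tail = frames_for_clip(mic, 200, 1500)

print(len(without_tail), len(with_tail))
```

With the tail, the clip keeps five extra 100 ms frames after the key lifts. That half second is where a trailing last word usually lives.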
It helps to see why the timing lands where it does. When you release the key, the audio you just spoke is already captured in memory. The model runs on that short clip, a sentence or a paragraph, not on a live stream, which is why the result arrives as one finished block instead of scrolling word by word. A short clip is fast to process; that is the trick. A live-caption tool has to keep decoding an open stream and show partial guesses that it revises as more audio arrives. Dictation skips all of that. It waits for you to finish, then transcribes once, in a clean pass.
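A rough way to picture the difference, under heavy assumptions: `fake_decode` below is a stand-in I invented, not a speech model, and it never changes its mind the way a real streaming decoder does. What the sketch does show is how much work each design repeats on the same clip.

```python
def fake_decode(chunks):
    """Stand-in for a speech model: joins the words carried by each chunk."""
    return " ".join(chunks)


def streaming_captions(chunks):
    """Live-caption style: decode the growing stream again after every chunk."""
    partials = []
    for i in range(1, len(chunks) + 1):
        partials.append(fake_decode(chunks[:i]))
    return partials


def dictation(chunks):
    """Dictation style: wait for the clip to end, then decode once."""
    return fake_decode(chunks)


clip = ["thanks", "for", "organizing", "the", "trip"]

partials = streaming_captions(clip)  # one on-screen update per chunk
final = dictation(clip)              # one finished block

print(len(partials), "partial updates vs 1 final block")
```

Both paths end on the same text. The caption path gets there through a run of intermediate updates; the dictation path makes a single pass over a clip that is already complete.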
That design choice is what keeps you in flow. In my experience the thing that breaks dictation is delay: once the gap stretches past a second or two, I find my attention drifts back to the app I was in and I lose the thread of the sentence I was halfway through. That is an opinion from using the loop daily, not a published spec. Short clips plus a fast local engine keep the gap short. It is the gap worth caring about, and it is the reason the loop feels like writing rather than like dictating-and-waiting.
If you want the deeper version of how the whole pipeline fits together, we wrote a separate piece on how Whisper turns a hotkey press into pasted text. This is the short version: press, speak, release, done.
Why Parakeet is the fastest local option
Local transcription used to mean slow. That stopped being true when NVIDIA's Parakeet model showed up. In the Whisper app, Parakeet's own in-app description is "5-10× faster than Whisper on CPU," covering English plus 24 European languages, at about 600 MB on disk. That speed is what makes local dictation feel near-instant instead of near-coffee-break. It is the whole reason the hotkey loop above works without a server in the middle.
You are not locked to one engine. Whisper by Remskill ships two local options. Parakeet supports 25 languages (English plus 24 European ones) but no Asian languages and no translate-to-English. The faster-whisper engine covers more ground: the multilingual builds handle 99 languages with auto-detect, while the .en builds are English-only in exchange for being smaller and quicker. The Whisper models run from a ~140 MB English Base up to a ~3 GB multilingual Large v3, with a ~1.62 GB Large v3 Turbo in between for people who want most of the accuracy at a fraction of the wait.
The app does not pick for you, and that is deliberate. You choose Parakeet if you mostly speak English and want raw speed, or a Whisper model if you need 99-language coverage or translate-to-English. I spent an embarrassing afternoon trying to auto-select the "best" engine for people before I admitted the only person who knows which one is right is the person doing the talking. The trade is real: Parakeet is the fastest and smallest, but it cannot do Chinese, Japanese or Korean, and it cannot translate. The multilingual Whisper builds can do all of that, at the cost of a larger model and a longer wait per clip. Neither is "better" in the abstract; one is better for your specific mouth and your specific languages.
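If it helps to see the trade as a filter, here is a sketch. It is not something the app runs, and it returns candidates rather than a choice, for the reason above. The function name and the four-language sample set are my assumptions: the text says Parakeet covers English plus 24 European languages without listing them, so the set is an illustrative subset, and the function assumes the language you pass is one of the 99 the multilingual builds handle.

```python
# Illustrative subset only (assumption): the full Parakeet list is not given here.
PARAKEET_LANGS_SAMPLE = {"en", "de", "fr", "es"}


def candidate_engines(language, need_translation,
                      parakeet_langs=PARAKEET_LANGS_SAMPLE):
    """Return the local engine families that could handle this request.

    Encodes only the constraints stated in the article:
    - Parakeet: listed languages only, no translate-to-English.
    - Whisper .en builds: English only.
    - Whisper multilingual builds: broad coverage plus translation.
    """
    candidates = []
    if language in parakeet_langs and not need_translation:
        candidates.append("Parakeet")
    if language == "en" and not need_translation:
        candidates.append("Whisper .en build")
    # Assumes `language` is among the languages the multilingual builds cover.
    candidates.append("Whisper multilingual build")
    return candidates


print(candidate_engines("en", need_translation=False))
print(candidate_engines("ja", need_translation=False))
print(candidate_engines("de", need_translation=True))
```

An English speaker gets three candidates and still has to pick between speed and size. A Japanese speaker, or anyone who needs translation, is left with the multilingual build only.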
A cloud path also exists, where you bring your own OpenAI key: transcription via gpt-4o-mini-transcribe or gpt-4o-transcribe, with text cleanup handled by gpt-5-mini by default. Cloud needs internet; the local engines do not. The cloud path is the escape hatch, not the starting point. If a four-year-old laptop runs the local engines fine, and most do, you never need a server in the loop for a paragraph of email.
Sit with that part for a moment. Cloud-only dictation is a privacy disaster. Your boss's salary spreadsheet, the email to your kid's school, the legal brief on the train: none of that should land in a vendor's logs because you wanted to type with your voice. Local mode runs on-device and works offline after the one-time model download; nothing is sent to any server during local transcription. That finance team's five-figure quarter happened because the words left the building. It was avoidable.
If you want the longer argument, here is our case for offline speech to text that never phones home.
Live captions for meetings vs. dictation at your cursor

Pick the tool by what you are doing, not by which one says "real time" loudest.
If you are in a meeting and need the conversation captured as it happens (multiple speakers, an hour long, with a summary after), you want continuous live captioning. That is a reading-and-recording job. Otter, Maestra, Google Meet's built-in captions, Windows 11 Live Captions: they follow a stream and write it down. Windows 11 captions any audio playing on your screen, on-device and offline, but the captions are read-only. It does not type the words into the app you are working in.
That Windows distinction trips a lot of people up. Live Captions reads audio that is playing (a video, a call, a colleague's voice through your speakers) and shows it on screen for you to read. It does not put text into the document you are writing. That is the line between a reading tool and a writing tool: same on-device transcription engine underneath, a different destination for the words. One sends them to a caption bar you read. The other sends them to the cursor you are typing at.
If you are writing an email, a document, a Slack message, a commit note, you want dictation. You are producing the words, not transcribing someone else's. You want them at the cursor, fast, then gone. That is the hotkey loop. A live-caption tool will transcribe you in a sense, but it dumps the text in its own window and leaves you to copy it across, which defeats the speed you came for.
A few concrete cases make the split obvious. A salesperson dictating CRM notes between calls (fifty words, one key press, thirty seconds) is dictation. A team running a weekly planning call that needs a searchable transcript and action items afterward is live captioning. A student turning a ninety-minute lecture into a six-hundred-word summary wants captioning during the lecture and then a tool to compress it. A parent answering a teacher's email while packing lunchboxes wants dictation, because they are writing a reply, not recording the kitchen. The same person can need both in one day. They are still two different tools.
The rule: watching speech → live captions; writing by speech → dictation. A few apps blur the line, but most of the frustration in this category comes from using a meeting tool to write or a writing tool to caption a meeting. Whisper sits in the writing camp: near-instant, cursor-first, push-to-talk. It is the same loop whether you are dictating in Gmail or anywhere else with a text field.
The other real-time transcription apps worth knowing
You do not have to take my word on the category. Here is the honest one-line read on the main players, so you can place each one before you commit.
- Otter covers meeting transcription with live captions for Zoom and Google Meet, apps on iOS, Android and Web, and AI transcription in English, Spanish, French, German, Japanese and Chinese. The free tier caps you at 300 transcription minutes a month.
- Maestra advertises real-time transcription and translation in 125+ languages, plus subtitles and dubbing, with a live transcription tier the company says is free. Built for captions and subtitles, not cursor dictation.
- Notta does real-time audio-and-video to text and reports support for 58 languages with translation. A meeting-and-recording tool, cloud-based.
- Rev and Trint position themselves more around recorded media than cursor dictation. Rev is best known for transcription of meetings and recordings; Trint is widely used in journalism and newsroom workflows for working with recorded interviews. Both are reading-and-editing tools, not a hold-a-key-and-type-into-your-app loop.
Notice the pattern: most of these are meeting-and-recording tools that live in the cloud. That is the whole market for "live transcription apps." The dictation-at-your-cursor camp, the writing tool, is the smaller and quieter category, and it is the one most people searching this term need without knowing the name for it.
To place these side by side on the parts you can verify, not on invented speed or accuracy scores:
| Tool | Platform | Local / Cloud | Works offline | Pricing model | Languages | Best for |
|---|---|---|---|---|---|---|
| Whisper by Remskill | Windows, macOS (Apple Silicon) | Local + optional cloud (BYOK) | Yes, local mode | Free local tier; paid cloud add-on | 99 (Whisper multilingual) / 25 (Parakeet) | Dictation at your cursor |
| Otter | iOS, Android, Web | Cloud | No | Free tier + paid plans | 6 | Meeting live captions |
| Maestra | Web | Cloud | No | Free live tier + paid plans | 125+ | Subtitles, dubbing, captions |
| Notta | Web, mobile | Cloud | No | Free tier + paid plans | 58 (reported) | Meeting and recording notes |
| Windows 11 Live Captions | Windows 11 | Local (on-device) | Yes | Built into the OS | ~21 | On-screen captions to read |
Why this market looks the way it does is worth a sentence. Meetings are where the money is. A company will pay per seat to capture every call, summarize it, and pipe action items into a project tracker. That is a recurring, expensable line item. Personal writing-by-voice is not. So the loud, well-funded half of the category is built for conference rooms, and the half that helps one person answer their email faster gets less marketing oxygen. The phrase "real time transcription app" sits on top of both, which is why people land on a meeting tool when they wanted a typing tool. If you want the wider field laid out by category, we keep a running guide to transcription software across categories.
When to skip Whisper and use a meeting tool
I will say it straight, because the alternative is selling you the wrong thing. If your job is capturing a live meeting (several people talking, an hour of it, a tidy summary at the end), do not use Whisper for that. Use Otter. It is built for this, with live captions for Zoom and Google Meet and apps on iOS, Android and Web, and the free tier gives you 300 minutes a month to test it. For multilingual subtitles or dubbing, Maestra's live tier covers 125+ languages. And if you only need captions of audio already playing on your Windows screen, Windows 11 Live Captions is free, on-device, and already installed. We make a writing tool. When you need a reading tool, those are the better picks, and we would rather you used the right one. (For the side-by-side on the meeting case, we wrote a whole Otter.ai alternative breakdown.)
What it costs
Whisper by Remskill is free for every signed-in user across the entire local pipeline (Parakeet, all the Whisper models, on-device AI cleanup, history, presets, custom hotkeys) with no payment method asked for at signup. The paid tier, Whisper Pro, adds the cloud surface: bring-your-own-key OpenAI transcription and web search. The exact numbers live on the pricing page, and they do not move around with "starting at" footnotes. For context on the others: Otter's free tier stops at 300 minutes a month, with paid plans above that. The point of the free local pipeline is that you can test the whole writing loop, hotkey to speak to paste, before deciding whether cloud is worth a cent to you.
Two people will read this and want two different apps. One of them is about to caption a standup. The other is about to answer thirty emails before the school run, one hotkey press at a time. The only mistake is grabbing the wrong machine because both said "real time" on the box, and then opening a cloud dashboard in three months wondering where the bill came from. Pick by what you are doing. Watching speech, or writing it. Everything else follows from that.
Try the writing loop on your own laptop
Download Whisper, hold the key, say a sentence, watch it land where your cursor already is.
Free across the entire local pipeline. No payment method at signup.



