Tutorial
Audio to subtitles: what works
A subtitle generator from audio turns a sound file into a timed SRT or VTT. Here's the real workflow, the tools that actually export one, and the free offline option that never uploads your audio.
Last updated: June 2026

A subtitle generator from audio takes a sound file, an MP3 or a WAV or a podcast export, and writes a timed subtitle file. Each line of text carries a start and end timestamp. Web tools like VEED, Kapwing, and Descript do this in a browser. The free OpenAI Whisper command-line tool does it offline on your own machine.
I once spent forty minutes captioning a ten-minute podcast clip by hand, pausing every three seconds to type and guess at timestamps. I have a master's degree in software engineering. The math is brutal either way. Captioning by hand runs several times the length of the audio. A modern subtitle generator does the same job in about the length of the file plus a coffee. The catch nobody tells you up front is that the right tool depends on one question. Do you need a timed file you can download, or just the words?
"Subtitle generator" gets used for two different jobs, and the wrong tool costs you an afternoon. The space splits into browser tools that export timed files and offline tools that do the same for free if you'll touch a terminal. This guide covers how the workflow runs, which tools output a real .srt file from audio alone, what SRT and VTT and TXT each mean, and where a dictation app like ours is the wrong pick. By the end you'll know which tool to open for your deliverable. Most of the confusion I read in our support inbox comes from people who picked a typing tool when they needed a subtitle file. A year of those messages is most of why this article exists.
You need timestamps, not just text
A subtitle file is not a transcript. A transcript is words. A subtitle file is words plus timing. Every caption block says "show this line from 00:01:04 to 00:01:07." That timing is the whole job. It lets a video player put the right words on screen at the right second.
Most "voice to text" tools, ours included, hand you words and nothing else. They paste a clean paragraph at your cursor and stop there. A subtitle generator from audio has to do more. It splits the speech into short caption-sized chunks, aligns each chunk to the audio clock, and writes it all out in a strict file format a player can read. If your deliverable is a file you upload to YouTube, a video editor, or a course platform, you need the timestamps. If your deliverable is text in a document, you don't, and you should not pay for a subtitle tool to get it.
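To make "words plus timing" concrete, here is a minimal Python sketch that reads the timing line of one SRT caption block and works out how long the caption stays on screen. The function name `parse_timing` is made up for illustration. The `HH:MM:SS,mmm --> HH:MM:SS,mmm` layout is the usual SRT timing line.

```python
import re

# One SRT timing line looks like "HH:MM:SS,mmm --> HH:MM:SS,mmm".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def parse_timing(line):
    """Return (start_seconds, end_seconds) for one SRT timing line."""
    match = TIMING.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not an SRT timing line: {line!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
    start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
    end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
    return start, end

# The "show this line from 00:01:04 to 00:01:07" example from the text.
start, end = parse_timing("00:01:04,000 --> 00:01:07,000")
print(f"on screen for {end - start} seconds")
```

A plain transcript has nothing for this function to parse, which is the practical difference between the two kinds of output.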
How to generate subtitles from an audio file in three steps

The workflow is the same across almost every tool, web or offline.
1. Upload or point at the audio file. Most tools take MP3, WAV, M4A, and FLAC, no video required. VEED accepts MP3, WAV, podcast recordings, interview audio, and voice memos. If your only source is a video, the tool strips the audio for you.
2. Let it transcribe and time the speech. The tool runs the audio through a speech model, chops the result into caption-length lines, and stamps each one with a start and end time. The hand version eats several times the audio length. The machine version takes about the length of the file.
3. Review and export the file. Read the transcript once (model output is good, not perfect), fix any names it mangled, then export. You pick the format here: SRT, VTT, or plain TXT.
That's the entire loop. The differences between tools come down to price, language coverage, where your audio goes, and whether step three is free.
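The "chops the result into caption-length lines" part of step two can be sketched in a few lines of Python. This is an illustration only, not how any of the tools named here is implemented. It assumes the speech model has already produced per-word timestamps, and the `Cue` class, the `group_words` function, and the 42-character default are all assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    start: float  # seconds from the start of the audio
    end: float
    text: str

def group_words(words, max_chars=42):
    """Group (start, end, text) word tuples into caption-sized cues.

    A new cue starts whenever adding the next word would push the
    caption text past max_chars.
    """
    cues = []
    current = []

    def flush():
        text = " ".join(word for _, _, word in current)
        cues.append(Cue(current[0][0], current[-1][1], text))

    for start, end, word in words:
        candidate = " ".join([w for _, _, w in current] + [word])
        if current and len(candidate) > max_chars:
            flush()
            current = []
        current.append((start, end, word))
    if current:
        flush()
    return cues

# Hypothetical word timings, as a speech model might report them.
words = [
    (0.0, 0.4, "Welcome"), (0.4, 0.7, "back"), (0.7, 0.9, "to"),
    (0.9, 1.3, "the"), (1.3, 1.9, "show"),
]
for cue in group_words(words, max_chars=12):
    print(cue)
```

Real tools likely also break on pauses and sentence boundaries. The sketch only shows why each caption ends up with its own start and end time.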
SRT vs VTT vs TXT: which file do you need
Three formats show up in every export menu, and people grab the wrong one constantly.
- SRT (SubRip) is the default subtitle file. It's a plain text file of numbered blocks, each with a timecode range and a line or two of text. YouTube, most video editors, and nearly every player read it. If you don't know which to pick, pick SRT.
- VTT (WebVTT) is SRT's web cousin. Same idea, slightly different syntax, plus support for styling and positioning. Use VTT when a website or HTML5 video player asks for it by name.
- TXT is the words, no timestamps. This is the format you want when you're writing an article, feeding a summary, or quoting an interview. It is also the only one of the three a plain dictation tool can give you.
My rule of thumb: SRT for video, TXT for documents, VTT when a web platform names it. VEED and Kapwing export all three, and Descript's subtitle export covers SRT and VTT.
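The difference between the three formats is easiest to see by writing the same two captions out in each. The following is a minimal Python sketch. The helper names (`stamp`, `to_srt`, `to_vtt`, `to_txt`) and the sample captions are invented for illustration, and it covers only the basic layout of each format, not VTT styling or positioning.

```python
def stamp(seconds, sep):
    """Format seconds as HH:MM:SS<sep>mmm (SRT uses ',', VTT uses '.')."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"

def to_srt(cues):
    """Numbered blocks, comma before the milliseconds."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{stamp(start, ',')} --> {stamp(end, ',')}\n{text}\n"
        )
    return "\n".join(blocks)

def to_vtt(cues):
    """WEBVTT header, no block numbers needed, dot before the milliseconds."""
    blocks = ["WEBVTT\n"]
    for start, end, text in cues:
        blocks.append(f"{stamp(start, '.')} --> {stamp(end, '.')}\n{text}\n")
    return "\n".join(blocks)

def to_txt(cues):
    """Just the words, timing thrown away."""
    return " ".join(text for _, _, text in cues)

# Sample captions as (start_seconds, end_seconds, text).
cues = [
    (64.0, 67.0, "Welcome back to the show."),
    (67.2, 70.5, "Today we talk captions."),
]
print(to_srt(cues))
print(to_vtt(cues))
print(to_txt(cues))
```

Going from SRT or VTT down to TXT is trivial. Going the other way is not, because the timing was never recorded.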
The tools that turn audio into subtitle files
Here's where each browser tool lands, with the capability claims taken straight from each tool's own page.
- VEED is a web and mobile auto-subtitle generator that transcribes from an audio-only file and lets you download the result as SRT, VTT, or TXT. It's free to start. Downloading the subtitle file and captioning longer videos move you onto a paid tier.
- Kapwing advertises "99% accurate subtitles, generated in seconds." That's Kapwing's own marketing figure, not an independent benchmark. It takes any video or audio file, including MP3, can translate subtitles into 100+ languages, and exports SRT, VTT, and TXT. Free accounts get up to 10 minutes of subtitles and a watermark; Pro removes the watermark.
- Descript generates subtitles in 22+ languages, accepts audio-only files, and exports soft subtitles as SRT or VTT through Publish, then Export, then Subtitles. It runs on a freemium model with a free tier of one media hour a month.
Here is how those three, plus the offline OpenAI Whisper CLI covered in the next section, stack up on the parts you can verify before you commit. No accuracy or speed numbers, because nobody has run them head to head on the same audio:
| Tool | Platform | Local or cloud | Works offline | Pricing model | Languages | Best for |
|---|---|---|---|---|---|---|
| VEED | Web, mobile | Cloud | No | Free start, paid to export | Lists 40+ options, no stated total | A fast browser pass with a download |
| Kapwing | Web | Cloud | No | Free tier (watermark), Pro | Translates into 100+ | Quick captions plus translation |
| Descript | Web | Cloud | No | Freemium, one media hour free | 22+ | Editing audio and captions together |
| OpenAI Whisper CLI | Windows, macOS, Linux | Local | Yes | Free, open source | 99 multilingual, 1 for .en builds | Free, private, no upload |
All three browser tools put your audio on someone else's server. For a marketing clip that's fine. For a recorded client call or anything with a salary figure in it, keep reading.
Those tools share roughly the same UI shape: upload, click generate, pick a format, download. That bar, not ours, is what a subtitle generator from audio looks like.
Free and offline: generating SRT with open-source Whisper

If you'd rather not upload anything, OpenAI's open-source Whisper command-line tool writes subtitle files on your own machine for free. Its --output_format flag accepts txt, vtt, srt, tsv, json, or all, and defaults to all. So one command, whisper interview.mp3 --model turbo, produces an .srt file offline with no account and no upload.
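If you want to run that command from a script, for example over a folder of recordings, a thin Python wrapper is enough. This is a sketch under assumptions: it expects the `whisper` command from the open-source package to be installed and on your PATH, and the function names are made up for the example. The `--model` and `--output_format` flags are the ones described above, and `--output_dir` is the CLI's flag for where the files land. As far as I know, the CLI downloads the model weights the first time a given model is used, so that first run needs a connection even though the audio itself stays local.

```python
import shutil
import subprocess

def build_whisper_command(audio_path, model="turbo",
                          output_format="srt", output_dir="."):
    """Assemble the argument list for the open-source Whisper CLI."""
    return [
        "whisper", audio_path,
        "--model", model,
        "--output_format", output_format,  # txt, vtt, srt, tsv, json, or all
        "--output_dir", output_dir,
    ]

def transcribe(audio_path, **options):
    """Run the CLI if it is installed; fail with a clear message if not."""
    if shutil.which("whisper") is None:
        raise RuntimeError("whisper CLI not found on PATH")
    subprocess.run(build_whisper_command(audio_path, **options), check=True)

# Show the command that would run, without running it.
print(" ".join(build_whisper_command("interview.mp3")))
```

Passing the arguments as a list rather than one shell string means file names with spaces need no quoting.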
To be clear, open-source Whisper is a different project from Whisper by Remskill. It is OpenAI's command-line model that runs on your computer and emits timed subtitle files. It ships six model sizes (tiny, base, small, medium, large, and turbo) with English-only variants for the four smaller ones. The multilingual models cover 99 languages; the .en variants are English only.
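The size and language rules above are easy to get wrong when typing a model name by hand, so here is a small Python sketch that encodes them. The helper `model_name` is hypothetical and not part of Whisper itself. It only builds the string you would pass to `--model`, based on the six sizes and four English-only variants listed in the paragraph above.

```python
# Sizes listed in the text; only the four smaller ones have ".en" variants.
ALL_SIZES = ("tiny", "base", "small", "medium", "large", "turbo")
ENGLISH_ONLY_SIZES = ("tiny", "base", "small", "medium")

def model_name(size, english_only=False):
    """Return the name to pass to --model, e.g. "base.en" or "turbo"."""
    if size not in ALL_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    if english_only:
        if size not in ENGLISH_ONLY_SIZES:
            raise ValueError(f"{size!r} has no English-only variant")
        return f"{size}.en"
    return size

print(model_name("base", english_only=True))
print(model_name("turbo"))
```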
Here's the opinion I'll stand behind: for anything sensitive, the audio should never leave your laptop. A recorded performance review, a doctor's dictated notes, a legal deposition, none of that belongs in a vendor's processing logs just because you needed timestamps.
I once watched a team rack up a five-figure cloud-AI bill in one quarter transcribing standup recordings. The CFO's reaction in the next review wasn't "let's optimize the prompt." It was "why are we sending meeting audio to a server at all." Your laptop already has a CPU and a microphone. For private material, the offline Whisper CLI is the answer, and it costs nothing.
There's a faster local port called whisper.cpp, a plain C/C++ build of Whisper with no dependencies that runs CPU-only under an open license. People report it can write subtitle files too, though I'd point you at the official OpenAI Whisper CLI for the verified .srt path and treat whisper.cpp as the speed upgrade once you're comfortable.
When Whisper by Remskill is the wrong tool for this
Here's the part most product blogs skip. If your job is a downloadable .srt or .vtt file, our app is the wrong tool, and I'd rather tell you now than waste your download.
Whisper by Remskill is dictation-first. You hold a hotkey (Ctrl+Space on Windows, Command+Option on macOS), speak, release, and the transcription pastes at your cursor in whatever app is open. It does not chop speech into caption blocks, it does not align text to an audio clock, and it does not write a timed subtitle file. Feed it an interview and you'll get a clean paragraph, not an SRT. I built the export menu in my head a dozen times and then didn't ship it, because timed captions are their own product and doing them badly helps nobody.
Use the tools above for subtitle files. Reach for our app for the adjacent job: turning your own speech into text the moment you need it. An email, a draft, a caption you'll type into a social post by hand. It runs on two pure-Rust engines, OpenAI Whisper and NVIDIA Parakeet, with no Python and no upload. Different job, different tool. Picking the right one is the whole point of this article.
Before you open anything, answer the question that decides everything: are you shipping a file or shipping words? A file means timestamps, which means a real subtitle generator. VEED or Kapwing for a quick browser pass, the Whisper CLI for free and private. Words mean a transcript, and that's a different tool. I built a dictation app and I'll still send you somewhere else when somewhere else is right. My seven-year-old asked me last week what I make at work, and the honest answer is that I help people stop typing, which she found deeply underwhelming. The afternoon you save is the one I spent captioning that podcast clip by hand, three seconds at a time.
Want the dictation half instead?
If your job is words at the cursor, not a subtitle file, Whisper by Remskill turns your own speech into text the moment you need it, fully offline.
Free local dictation for every signed-in user. For subtitle files, use the tools above.



