Can I generate subtitles from audio only, with no video?

Yes. VEED, Kapwing, Descript, and the open-source Whisper CLI all transcribe directly from an audio file — MP3, WAV, M4A, FLAC. No video track is required; the tool times the speech on its own.

How do I get an SRT file from an MP3 or WAV?

Upload the file to a web tool and choose SRT at export, or run the OpenAI Whisper CLI locally with `--output_format srt`. You can also just leave the default, which produces all formats including .srt.

What's the difference between subtitles, captions, and closed captions?

Subtitles assume you can hear the audio and mostly carry dialogue. Captions and closed captions also describe non-speech sound, like music or a door slamming, for viewers who can't hear it. SRT and VTT files can serve either purpose depending on what you write in them.

How accurate are AI-generated subtitles?

Good, not flawless. Kapwing advertises 99% accuracy as its own marketing figure. In practice, clean single-speaker audio gets close while names, jargon, and crosstalk still need a human pass. Always read the output before you ship it.
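One common way to put a number on "accuracy" is word error rate: the word-level edit distance between a trusted transcript and the tool's output, divided by the number of words in the trusted transcript. The sketch below is only an illustration of that arithmetic. The function name and the sample sentences are made up, and nothing here says Kapwing or any other vendor computes its figure this way.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word dropped by the tool
                            curr[j - 1] + 1,      # word added by the tool
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one mangled name and one piece of jargon split in two.
reference = "please email katarzyna about the kubernetes rollout"
hypothesis = "please email catarina about the cooper netties rollout"
print(f"{word_error_rate(reference, hypothesis):.2f}")  # 0.43
```

Three wrong words out of seven is a 43% error rate on that one sentence, which is why names and jargon are where the human pass earns its keep.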

Can I generate subtitles offline for free?

Yes. The OpenAI Whisper command-line tool runs entirely on your machine and writes .srt and .vtt files at no cost. The multilingual models cover 99 languages; the English-only .en builds cover one.

Can the tool handle multiple speakers?

Most subtitle generators transcribe every voice but don't label who said what unless they offer speaker diarization as a separate feature. If you need "Speaker 1 / Speaker 2" tags, check that the specific tool lists diarization before you commit.

Can I translate the subtitles into another language?

Some tools can. Kapwing translates subtitles into 100+ languages; Descript covers 22+. The open-source Whisper CLI can translate non-English speech into English subtitles but doesn't translate between two non-English languages.

By Denys Medvediev, April 23, 2026

Tutorial

Audio to subtitles: what works

A subtitle generator from audio turns a sound file into a timed SRT or VTT. Here's the real workflow, the tools that actually export one, and the free offline option that never uploads your audio.

Last updated: June 2026

Audio waveforms displayed on a screen, illustrating turning a sound file into a subtitle track

A subtitle generator from audio takes a sound file, an MP3 or a WAV or a podcast export, and writes a timed subtitle file. Each line of text carries a start and end timestamp. Web tools like VEED, Kapwing, and Descript do this in a browser. The free OpenAI Whisper command-line tool does it offline on your own machine.

I once spent forty minutes captioning a ten-minute podcast clip by hand, pausing every three seconds to type and guess at timestamps. I have a master's degree in software engineering. The math is brutal either way. Captioning by hand runs several times the length of the audio. A modern subtitle generator does the same job in about the length of the file plus a coffee. The catch nobody tells you up front is that the right tool depends on one question. Do you need a timed file you can download, or just the words?

"Subtitle generator" gets used for two different jobs, and the wrong tool costs you an afternoon. The space splits into browser tools that export timed files and offline tools that do the same for free if you'll touch a terminal. This guide covers how the workflow runs, which tools output a real .srt file from audio alone, what SRT and VTT and TXT each mean, and where a dictation app like ours is the wrong pick. By the end you'll know which tool to open for your deliverable. Most of the confusion I read in our support inbox comes from people who picked a typing tool when they needed a subtitle file. A year of those messages is most of why this article exists.

You need timestamps, not just text

A subtitle file is not a transcript. A transcript is words. A subtitle file is words plus timing. Every caption block says "show this line from 00:01:04 to 00:01:07." That timing is the whole job. It lets a video player put the right words on screen at the right second.

Most "voice to text" tools, ours included, hand you words and nothing else. They paste a clean paragraph at your cursor and stop there. A subtitle generator from audio has to do more. It splits the speech into short caption-sized chunks, aligns each chunk to the audio clock, and writes it all out in a strict file format a player can read. If your deliverable is a file you upload to YouTube, a video editor, or a course platform, you need the timestamps. If your deliverable is text in a document, you don't, and you should not pay for a subtitle tool to get it.
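The gap between the two deliverables is easiest to see by going from one to the other. Here is a minimal Python sketch that throws away the timing in an SRT file and keeps only the words. `srt_to_text` is a hypothetical helper name and the sample captions are invented; the block layout (index, timing line, text) is the standard SRT shape.

```python
import re

# Matches an SRT timing line such as "00:01:04,000 --> 00:01:07,000".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_to_text(srt: str) -> str:
    """Drop block numbers and timing lines, keep only the spoken words."""
    words = []
    # Caption blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines and lines[0].isdigit():       # the block's running number
            lines = lines[1:]
        if lines and TIMING.match(lines[0]):   # the start --> end line
            lines = lines[1:]
        words.extend(lines)
    return " ".join(words)


sample = """1
00:00:00,000 --> 00:00:02,500
Welcome back to the show.

2
00:00:02,500 --> 00:00:05,000
Today we talk about
subtitles.
"""
print(srt_to_text(sample))
# Welcome back to the show. Today we talk about subtitles.
```

Going this direction is trivial. Going the other way, from words back to timing, is not possible without the audio, which is why a plain transcript can't stand in for a subtitle file.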

How to generate subtitles from an audio file in three steps

Laptop running audio editing software with headphones on a desk in a home workspace

The workflow is the same across almost every tool, web or offline.

1. Upload or point at the audio file. Most tools take MP3, WAV, M4A, and FLAC, no video required. VEED accepts MP3, WAV, podcast recordings, interview audio, and voice memos. If your only source is a video, the tool extracts the audio track for you.

2. Let it transcribe and time the speech. The tool runs the audio through a speech model, chops the result into caption-length lines, and stamps each one with a start and end time. The hand version eats several times the audio length. The machine version takes about the length of the file.

3. Review and export the file. Read the transcript once (model output is good, not perfect), fix any names it mangled, then export. You pick the format here: SRT, VTT, or plain TXT.

That's the entire loop. The differences between tools come down to price, language coverage, where your audio goes, and whether step three is free.
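The "chop into caption-length lines" part of step two can be sketched in a few lines of Python. This is an assumption-heavy illustration, not how any particular tool does it: it presumes the speech model already returned per-word start and end times, the word timings below are invented, and `chunk_words` with its `max_chars` limit is a hypothetical helper. Real tools also weigh pauses, punctuation, and reading speed.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]     # (start seconds, end seconds, word)
Caption = Tuple[float, float, str]  # (start seconds, end seconds, line of text)


def chunk_words(words: List[Word], max_chars: int = 25) -> List[Caption]:
    """Greedily pack timed words into caption lines of at most max_chars."""
    captions: List[Caption] = []
    current: List[Word] = []

    def close() -> None:
        # A caption starts with its first word and ends with its last.
        text = " ".join(w[2] for w in current)
        captions.append((current[0][0], current[-1][1], text))

    for word in words:
        candidate = " ".join(w[2] for w in current + [word])
        if current and len(candidate) > max_chars:
            close()
            current = []
        current.append(word)
    if current:
        close()
    return captions


# Invented word timings, standing in for a speech model's output.
words = [
    (0.0, 0.4, "Welcome"), (0.4, 0.7, "back"), (0.7, 0.9, "to"),
    (0.9, 1.1, "the"), (1.1, 1.6, "show,"), (1.8, 2.2, "today"),
    (2.2, 2.5, "we"), (2.5, 2.9, "cover"), (2.9, 3.6, "subtitles."),
]
for start, end, text in chunk_words(words):
    print(f"{start:.1f}s to {end:.1f}s: {text}")
```

Each caption inherits its start from the first word it contains and its end from the last, which is all "aligning to the audio clock" means at this level.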

SRT vs VTT vs TXT: which file do you need

Three formats show up in every export menu, and people grab the wrong one constantly.

SRT (SubRip) is the default subtitle file. It's a plain text file of numbered blocks, each with a timecode range and a line or two of text. YouTube, most video editors, and nearly every player read it. If you don't know which to pick, pick SRT.
VTT (WebVTT) is SRT's web cousin. Same idea, slightly different syntax, plus support for styling and positioning. Use VTT when a website or HTML5 video player asks for it by name.
TXT is the words, no timestamps. This is the format you want when you're writing an article, feeding a summary, or quoting an interview. It is also the only one of the three a plain dictation tool can give you.

My rule of thumb: SRT for video, TXT for documents, VTT when a web platform names it. VEED and Kapwing export all three; Descript's subtitle export covers SRT and VTT.
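To see how small the differences are, here is a sketch that writes the same two captions in all three formats. The helper names and the sample captions are made up for illustration. The format details it relies on are the standard ones: SRT numbers each block and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header line and uses a dot.

```python
from typing import List, Tuple

Caption = Tuple[float, float, str]  # (start seconds, end seconds, text)


def clock(seconds: float, ms_sep: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{ms_sep}{ms:03d}"


def to_srt(captions: List[Caption]) -> str:
    """Numbered blocks, comma before milliseconds, blank line between blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(captions, start=1):
        blocks.append(f"{i}\n{clock(start, ',')} --> {clock(end, ',')}\n{text}\n")
    return "\n".join(blocks)


def to_vtt(captions: List[Caption]) -> str:
    """WEBVTT header, dot before milliseconds, block numbers left out."""
    blocks = ["WEBVTT\n"]
    for start, end, text in captions:
        blocks.append(f"{clock(start, '.')} --> {clock(end, '.')}\n{text}\n")
    return "\n".join(blocks)


def to_txt(captions: List[Caption]) -> str:
    """Words only, no timing."""
    return " ".join(text for _, _, text in captions) + "\n"


captions = [
    (0.0, 1.6, "Welcome back to the show,"),
    (1.8, 3.6, "today we cover subtitles."),
]
print(to_srt(captions))
print(to_vtt(captions))
print(to_txt(captions))
```

The SRT and VTT outputs carry identical timing; only the header, the numbering, and one punctuation mark differ. The TXT output is the only one that can't be turned back into either of the others.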

The tools that turn audio into subtitle files

Here's where each browser tool lands, with the capability claims taken straight from each tool's own page.

VEED is a web and mobile auto-subtitle generator that transcribes from an audio-only file and lets you download the result as SRT, VTT, or TXT. It's free to start. Downloading the subtitle file and captioning longer videos move you onto a paid tier.
Kapwing advertises "99% accurate subtitles, generated in seconds." That's Kapwing's own marketing figure, not an independent benchmark. It takes any video or audio file, including MP3, can translate subtitles into 100+ languages, and exports SRT, VTT, and TXT. Free accounts get up to 10 minutes of subtitles and a watermark; Pro removes the watermark.
Descript generates subtitles in 22+ languages, accepts audio-only files, and exports soft subtitles as SRT or VTT through Publish, then Export, then Subtitles. It runs on a freemium model with a free tier of one media hour a month.

Here is how those three, plus the offline OpenAI Whisper CLI covered in the next section, stack up on the parts you can verify before you commit. No accuracy or speed numbers, because nobody has run them head to head on the same audio:

Tool	Platform	Local or cloud	Works offline	Pricing model	Languages	Best for
VEED	Web, mobile	Cloud	No	Free start, paid to export	Lists 40+ options, no stated total	A fast browser pass with a download
Kapwing	Web	Cloud	No	Free tier (watermark), Pro	Translates into 100+	Quick captions plus translation
Descript	Web	Cloud	No	Freemium, one media hour free	22+	Editing audio and captions together
OpenAI Whisper CLI	Windows, macOS, Linux	Local	Yes	Free, open source	99 multilingual, 1 for .en builds	Free, private, no upload

All three browser tools put your audio on someone else's server. For a marketing clip that's fine. For a recorded client call or anything with a salary figure in it, keep reading.

Those tools share a UI shape that looks roughly like this:

[ interview-audio.mp3 ]  [ Auto subtitle ]

[ SRT ]  [ VTT ]  [ TXT ]  [ Download ]

Upload, click generate, pick a format, download. That bar, not ours, is what a subtitle generator from audio looks like.

Free and offline: generating SRT with open-source Whisper

Code on a computer screen in dark mode, evoking a command-line subtitle workflow

If you'd rather not upload anything, OpenAI's open-source Whisper command-line tool writes subtitle files on your own machine for free. Its `--output_format` flag accepts txt, vtt, srt, tsv, json, or all, and defaults to all. So one command, `whisper interview.mp3 --model turbo`, produces an .srt file (alongside the other formats) with no account and no upload. The only network traffic is the one-time download of the model weights the first time you use a given model.
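If you have a folder of recordings, wrapping that command in a short script saves retyping it. The Python sketch below only builds and runs the same CLI call. The function names are hypothetical, `--model`, `--output_format`, and `--output_dir` are flags of the OpenAI Whisper CLI, and the output filename (input name with an .srt extension) is an assumption worth checking against what lands in your output folder.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

ALLOWED_FORMATS = {"txt", "vtt", "srt", "tsv", "json", "all"}


def build_whisper_command(audio: str, model: str = "turbo",
                          output_format: str = "srt",
                          output_dir: str = ".") -> List[str]:
    """Build the argument list for the OpenAI Whisper CLI."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    return ["whisper", audio, "--model", model,
            "--output_format", output_format, "--output_dir", output_dir]


def transcribe_to_srt(audio: str, output_dir: str = ".") -> Path:
    """Run Whisper locally and return the path where the .srt is expected."""
    if shutil.which("whisper") is None:
        raise RuntimeError("the whisper command was not found on PATH")
    subprocess.run(build_whisper_command(audio, output_dir=output_dir),
                   check=True)
    # Assumption: the output file is named after the input file's stem.
    return Path(output_dir) / (Path(audio).stem + ".srt")


# Printing the command is safe even on a machine without Whisper installed.
print(" ".join(build_whisper_command("interview.mp3")))
# whisper interview.mp3 --model turbo --output_format srt --output_dir .
```

Passing `srt` explicitly, instead of leaving the default of all, keeps the output folder down to the one file you actually plan to upload.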

Open-source Whisper is a different project from Whisper by Remskill, and worth being clear about. It is OpenAI's command-line model that runs on your computer and emits timed subtitle files. It ships six model sizes (tiny, base, small, medium, large, and turbo) with English-only variants for the four smaller ones. The multilingual models cover 99 languages; the .en variants are English only.

Here's the opinion I'll stand behind: for anything sensitive, the audio should never leave your laptop. A recorded performance review, a doctor's dictated notes, a legal deposition, none of that belongs in a vendor's processing logs just because you needed timestamps.

I once watched a team rack up a five-figure cloud-AI bill in one quarter transcribing standup recordings. The CFO's reaction in the next review wasn't "let's optimize the prompt." It was "why are we sending meeting audio to a server at all." Your laptop already has a CPU and a microphone. For private material, the offline Whisper CLI is the answer, and it costs nothing.

There's a faster local port called whisper.cpp, a plain C/C++ implementation of Whisper with no external dependencies that can run on the CPU alone and is released under the MIT license. People report it can write subtitle files too, though I'd point you at the official OpenAI Whisper CLI for the verified .srt path and treat whisper.cpp as the speed upgrade once you're comfortable.

When Whisper by Remskill is the wrong tool for this

Whisper's overlay in its complete state — it pastes a clean paragraph at your cursor, not a timed subtitle file. The blue widget sits on top of any app.

Here's the part most product blogs skip. If your job is a downloadable .srt or .vtt file, our app is the wrong tool, and I'd rather tell you now than waste your download.

Whisper by Remskill is dictation-first. You hold a hotkey (Ctrl+Space on Windows, Command+Option on macOS), speak, release, and the transcription pastes at your cursor in whatever app is open. It does not chop speech into caption blocks, it does not align text to an audio clock, and it does not write a timed subtitle file. Feed it an interview and you'll get a clean paragraph, not an SRT. I built the export menu in my head a dozen times and then didn't ship it, because timed captions are their own product and doing them badly helps nobody.

Use the tools above for subtitle files. Reach for our app for the adjacent job: turning your own speech into text the moment you need it. An email, a draft, a caption you'll type into a social post by hand. It runs on two pure-Rust engines, OpenAI Whisper and NVIDIA Parakeet, with no Python and no upload. Different job, different tool. Picking the right one is the whole point of this article.

Before you open anything, answer the question that decides everything: are you shipping a file or shipping words? A file means timestamps, which means a real subtitle generator. VEED or Kapwing for a quick browser pass, the Whisper CLI for free and private. Words mean a transcript, and that's a different tool. I built a dictation app and I'll still send you somewhere else when somewhere else is right. My seven-year-old asked me last week what I make at work, and the honest answer is that I help people stop typing, which she found deeply underwhelming. The afternoon you save is the one I spent captioning that podcast clip by hand, three seconds at a time.

Want the dictation half instead?

If your job is words at the cursor, not a subtitle file, Whisper turns your own speech into text the moment you need it, fully offline.

Free local dictation for every signed-in user. For subtitle files, use the tools above.

Denys Medvediev

I'm the one who reads our support email, most probably by dictating the replies.

Keep reading

Voice typing in Word

The voice typing shortcut on every OS

Google voice typing alternative: dictate anywhere
