Blog by Sumana Harihareswara, Changeset founder

22 Dec 2022, 15:10

Speech-to-text with Whisper: How I Use It & Why

Whisper, from OpenAI, is a new open source tool that "approaches human level robustness and accuracy on English speech recognition"; "Moreover, it enables transcription in multiple languages, as well as translation from those languages into English."

This is a really useful (and free!) tool. I have started using it regularly to make transcripts and captions (subtitles), and am writing to share how, and why, and my reflections on the ethics of using it. You can try Whisper using this website where you can upload audio files to transcribe; to run it on your own computer, skip down to "Logistics".

Ways I use it

I have used Whisper several times on English-language audio, and the results were very good (see "Accuracy" below). Whisper has successfully made it into my rotation of tools I reach for frequently without thinking "oh should I bother?"

Whisper does not yet differentiate between speakers ("diarization") in its text output, so it's a little more difficult to read and reuse its transcriptions for interviews and other multi-speaker recordings. But the accuracy and the privacy preservation make Whisper, for me, a game-changer for audio I spoke myself.

My use cases:

Making subtitles: I used ffmpeg to extract the audio from a video of my stand-up comedy and then ran Whisper on the resulting audio file. I was happily surprised to find that, by default, it also emitted a .srt subtitles file and a .vtt file. The .srt file is suitable for manual editing, local viewing alongside downloaded video, and uploading to video platforms to provide greater accessibility for future audiences. (I do edit those subtitles to improve the line lengths, and have asked whether Whisper could do that automatically.)

If I don't already have a copy of the video recording, but it's on one of the major streaming platforms like Vimeo or YouTube, I can use yt-dlp (the more actively maintained fork of youtube-dl) to grab a copy.

My workflow (leaving out the peculiarities of how I set up my virtual environment):

$ yt-dlp "https://www.youtube.com/watch?v=[VIDEO ID]" # Download the YouTube, Vimeo, etc. video

$ ffmpeg -i recording.mp4 -c:a copy -vn audio-to-transcribe.m4a # Use ffmpeg to (nearly instantly) grab the audio track from a video recording

$ whisper audio-to-transcribe.m4a # Run the transcription with the default English-language model; this step emits the text transcript and the subtitle files into the current directory

And then I use GNOME Subtitles, a graphical UI tool, to polish those subtitles and fix the line lengths. Here's an example of the finished subtitles (to go with this 2021 ÖzgürKon recording). Ideally the subtitles then go on the video up on the streaming service, but at least I can also host them for others to download and use locally.
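For the curious: an .srt file is just a sequence of numbered cues, each with a timestamp range and one or more caption lines. Here's a made-up cue (the cue number is invented; the text is from my transcript below) showing the kind of line-length fix I mean, turning one overlong caption line into two balanced ones:

47
00:21:15,200 --> 00:21:21,720
It's like you got to do it like a movie trailer, you know,

becomes:

47
00:21:15,200 --> 00:21:21,720
It's like you got to do it
like a movie trailer, you know,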

Mining old talks for transcripts and reusable text: I believe that, if you give a talk you care about, you should post a transcript, because that makes it more accessible, reusable, and influential. I'm probably going to use yt-dlp to download some conference talks that I delivered years ago and hadn't yet paid someone to transcribe. I can then publish the transcripts on my own site. Moreover, that'll make it easier for me to turn them into blog posts or reuse parts of them for my forthcoming book.
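For a batch like this, I can also collapse the download and audio-extraction steps into one, since yt-dlp can extract audio itself (it calls ffmpeg behind the scenes). A sketch, with a placeholder URL:

$ yt-dlp -x --audio-format m4a -o "%(title)s.%(ext)s" "[TALK URL]" # -x extracts just the audio track
$ whisper *.m4a # Whisper's command line accepts multiple audio files in one run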

Re-enjoying family recordings: My household sometimes records audio conversations where we talk about movies and TV we've just watched. The archive is probably now in the scores of hours. We aren't going to publish them, so we were not about to pay someone to transcribe those many hours. But getting Whisper to chug on it in the background is totally doable, so now we'll have searchable transcripts to enjoy and revisit.
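The background chug is just a shell loop. A minimal sketch, assuming the recordings are .m4a files gathered into one directory (the paths here are hypothetical):

$ for f in family-recordings/*.m4a; do whisper "$f" --output_dir family-transcripts; done # work through the backlog overnight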

Note-taking during a rehearsal: I recently ran several videocalls to rehearse some standup comedy. For a few early ones, if I came up with a good riff spontaneously during the rehearsal, I paused to jot it down. But of course that broke the rhythm and the quality of the performance. Then, I started to record the audio of my own rehearsal performances. Sometimes I did this using the camera app on my phone, sometimes using Audacity or a sound recorder app on my laptop. I then ran the recording through Whisper afterwards to get a mostly accurate transcript. That let me skim through to find places I'd come up with a good joke on the fly, and then I could incorporate those improvements into my notes for the next runthrough.
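"Skimming" can be as simple as grep over the .txt transcript Whisper emits, when I half-remember a riff. For example (the filename and phrase are invented):

$ grep -i -n "movie trailer" rehearsal-2022-12-03.txt # -n prints line numbers so I can jump straight there in my editor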

Speaking a first draft aloud: Sometimes it's easier for me to start drafting a talk or a memo by speaking aloud. But I can't easily type while I talk at normal conversational speed. The workaround I previously used: ask a friend to listen and take notes as I talk, and email me the notes, which I turn into my first draft. Whisper's accuracy makes it possible for me to do this by myself, whether or not someone else is available. And the fact that I run it on my own computer, for free, makes me feel more okay about the privacy of my data -- in case I capture any other stray sound in the recording -- and about cost.
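On my Linux laptop, the low-tech version of this is two commands (arecord ships with ALSA; stop the recording with Ctrl-C):

$ arecord -f cd first-draft.wav # record from the default microphone; -f cd means 16-bit 44.1 kHz stereo
$ whisper first-draft.wav # then edit the resulting first-draft.txt down into prose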

I have not myself tried using Whisper to transcribe and read podcasts or videos I want to consume, to make a transcript to share with a disabled friend, to translate audio from other languages into English or vice versa, or to collect transcriptions of work meetings to help me create minutes and task lists. But those all sound promising.

Accuracy

The first test I successfully ran on Whisper was a song by Josh Millard that is only one minute long. I was pleased to note that it rendered all the profanity accurately -- as opposed to the automatic captions in Google Meet, which censor swearing and, as I recall, the word "porn".

As I mentioned, I also used Whisper to transcribe a comedy routine of mine. Early in that set, I introduced myself. The Whisper transcript got my name nearly right, choosing an alternate transliteration ("Harihareshwara") that would be right in some contexts. This is, in my experience, unprecedented for automated transcription.

While Whisper is running on the command line, it emits the transcript with timecodes as a work in progress. Here's a sample:

[20:53.920 --> 21:04.320] a bug report about his behavior and goes away and fixes it. I'm actually pretty good at
[21:04.320 --> 21:10.000] reporting bugs. That might be my greatest strength as a project manager is actually that I often
[21:10.000 --> 21:15.200] will write a bug report that is so good and so clear and so compelling that you kind of
[21:15.200 --> 21:21.720] feel bad if you don't fix it. It's like you got to do it like a movie trailer, you know,
[21:21.720 --> 21:32.560] in a world where the username includes an emoji. One maintainer has the chance to make
[21:32.560 --> 21:46.680] a difference. I'm the Steven Spielberg of GitHub is what I'm saying. And of course,

As you can observe, the punctuation and capitalization are pretty good, but it doesn't quite put line breaks, paragraph breaks, and quotation marks where I would.

It does make mistakes. In transcribing my standup, Whisper did think "exhaustedly" was "exhaustively", which is, I admit, a much more common word.

And I haven't yet tried it on non-English languages.

Logistics

Whisper is an open source software tool written mostly in the Python programming language. Instructions on how to download, install, and run it are relatively straightforward, if you are comfortable running commands in a terminal. It depends on Python, a few Python libraries, and Rust. In case you want to try Whisper but you don't want to fiddle with installing it on your computer, the machine learning company Replicate is hosting a web-based version of Whisper so you can upload a sound file and get a transcription. But of course then you don't get the privacy benefits of running it entirely on your own machine.
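For me the install boiled down to roughly the following (the pip package name is the one in Whisper's README; the ffmpeg line assumes a Debian/Ubuntu-ish system, so substitute your own package manager):

$ pip install -U openai-whisper # Whisper plus its Python dependencies; may build a Rust-based tokenizer if no prebuilt wheel exists for your platform
$ sudo apt install ffmpeg # Whisper uses ffmpeg to decode most audio formats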

Whisper comes with different sizes of model. Each model is a souped-up dataset with correlations that it uses to figure out what word a sound translates to, and the bigger models help it do the transcription more accurately, but they take more time to use. It's worth fiddling with using different sized models to see whether you can speed things up with acceptable accuracy tradeoffs. On my laptop it was pretty slow if I tried to use the large models (like, taking multiple hours to transcribe a 30-minute talk), and the default (smaller) models were less accurate but still fine for my purposes, and took more like an hour to transcribe a 30-minute talk.
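Trying a different size is just a flag, and the model downloads on first use. The model names below are from the Whisper README (the .en variants are English-only and a little faster on English audio):

$ whisper audio-to-transcribe.m4a --model tiny.en # fastest, roughest
$ whisper audio-to-transcribe.m4a --model small.en # a reasonable speed/accuracy middle ground for me
$ whisper audio-to-transcribe.m4a --model large # most accurate; took hours per half-hour of audio on my laptop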

Whisper does really well at figuring out what words someone said partly by analyzing what words they said, like, 30 seconds later, and then going back to extrapolate from that. For example, if you keep on saying feet-related words, then probably the thing you said 30 seconds ago was the noun "callus" and not the adjective "callous." So you can't run Whisper during the recording of a talk or interview to get live captions/transcripts; it runs after the fact, on already-recorded audio.

I heard from a friend at the Freedom of the Press Foundation about work on Stage Whisper, a web interface to Whisper specifically for use by journalists and newsrooms:

...not all journalists (or others who could benefit from this type of transcription tool) are comfortable with the command line and installing the dependencies required to run Whisper.

Our goal is to package Whisper in an easier to use way so that less technical users can take advantage of this neural net.....

The project is currently in the early stages of development.

Whisper's becoming a building block in applications for lots of uses, and those applications will often be easier to use than Whisper is; see the Whisper discussions on GitHub for some links. And Danny O'Brien predicts:

I expect it will be encoded into hardware at some point very soon, so we will have open hardware that can do the kind of voice to text that you otherwise have to hand over to Google, Amazon, and co.

Ethics

I have some qualms about using Whisper.

Whisper was "trained on 680,000 hours of multilingual and multitask supervised data collected from the web". Collected how? Did the speakers agree to this collection? Does Whisper claim that the legitimacy of its data collection stems from a clause buried in a clickthrough End User License Agreement that does not have any intelligible relationship to genuine human consent? Was copyright infringed? I don't know, and I did try to check.

I know that there are people who make their livings captioning or transcribing audio. I have friends and acquaintances (example) who make money doing so. I predict the easy availability of Whisper will shrink the market for their work. Now that I know how to use Whisper, I am using it to transcribe many pieces of audio that I would not have paid someone to transcribe. And, assuming Whisper performs well even on bad-quality audio, I'm likely to use it in the future instead of paying someone, although that hasn't come up yet.

But:

Whisper is free of cost and the code is open source (the authors do actually accept patches; I landed one). When I use it, it does not send data back home; me using it does not directly help train the next iteration or tell a corporate parent about things people said in private.

Transcripts and captions make audio more accessible to deaf and hard-of-hearing people, people with ADHD, people with audio processing issues, and people who are learning English. And publishing polished transcripts makes search, linking, and citation far easier.

And I'm influenced here by what Danny O'Brien says about Stable Diffusion, a somewhat similar tool. It's worth reading the whole (short) post about relative power, copyright, capabilities, and empowerment, which ends:

Artists should get paid; and they shouldn’t have to pay for the privilege of building on our common heritage. They should be empowered to create amazing works from new tools, just as they did with the camera, the television, the sampler and the VHS recorder, the printer, the photocopier, Photoshop, and the Internet. A 4.2GiB file isn’t a heist of every single artwork on the Internet, and those who think it is are the ones undervaluing their own contributions and creativity. It’s an amazing summary of what we know about art, and everyone should be able to use it to learn, grow, and create.

How should I decide, not just about this, but about other tools of this type?

Several people I know (such as Luis Villa) are working on how to think about these kinds of ethical choices in the absence of clear, canonical, and thoughtful guidance from religious and spiritual leaders, trusted charities, governments, et alia. We have some guidance from ethicists and philosophers on whether to make particular kinds of tools (along the lines of the Declaration of Digital Autonomy); I am still looking for frameworks for deciding which tools to use (along the lines of the Franklin Street Statement on Freedom and Network Services).

Software Freedom Conservancy has shared its thinking regarding GitHub Copilot, which is an aid to writing software; their focus is particularly on "AI-assisted authorship of software using copylefted training sets"; they advise against using Copilot, and I haven't used it.

More generally, Simon Willison has popularized a "vegan" analogy for people who abstain from using these kinds of machine learning-trained tools. I want to dig into that a little more. Some people are vegans for personal health reasons, some as part of an organized political movement trying to improve animal welfare or working to reduce environmental damage, some as part of an organized movement aiming to decrease harm to humans (from factory farming, etc.), some because reverence/purity/disgust/sacredness make them viscerally averse to eating animal-based food, and some people want to fit in with vegan peers -- and I've probably missed other reasons, and some folks have multiple reasons, and some people are just trying it out of curiosity.

So I'm reviewing these rationales to check whether they apply here.

Personal health: Will it harm me to use Whisper? Possibly. For instance, if I start to trust the output and let my guard down -- as I likely will over time, when I'm in a hurry -- then I could read a biased inaccurate transcription and assume it's correct, and that could lead me to believe something false. There's also a tiny chance that courts could decide that, by publishing a Whisper-generated transcript, I am infringing the copyright of those who wrote the transcripts that researchers trained Whisper on, and then the courts could penalize me for that. I consider that risk negligible.

Environmental advocacy: Is there an organized political movement asking people to boycott Whisper as part of a push for carbon reduction or a similar environmental effort? If so, I'm unaware of it, but please let me know if there is one. Whisper exists in the world whether I use it or not, and -- unlike with a paid or ad-supported commercial product -- my usage does not increase the likelihood it will continue. One single person choosing not to use it would not amount to a boycott and would not affect OpenAI's choices.

Reducing harm to/exploitation of other humans: As researchers collected data to use to train Whisper, they probably swept in data that the authors didn't mean to offer for that usage. As with the previous point, I am unaware of an advocacy group that asks me to boycott Whisper as part of a push for better consent standards for data collection that will be used for machine learning training. In the world of visual art, artists are banding together to campaign for limitations on scrapers using their art as fodder to train AI on (in particular, here's a thread about protest against a planned change by DeviantArt). Please let me know if there is something like this regarding captions and transcripts.

What about the decrease in work available for transcribers and captioners, or that work getting worse? Conversations on MetaFilter and elsewhere have helped me understand that there's a lot of nuance here, because, for instance, medical transcribers already do a lot of work with editing and fixing computer-generated transcripts rather than doing the whole transcription from scratch. There's a related ethical question here which is: under what circumstances should we take extraordinary steps to retain and sustain a particular occupation beyond what the market would do on solely economic grounds? Perhaps there's an argument that this is a skill we need to preserve in case we have need of it in the future, or there's a general social welfare argument about keeping people who cannot easily find another career from falling into poverty. Again, if there is a group (especially a workers' guild) asking me to boycott Whisper for the sake of transcribers, I'd like to know about that, but I don't yet.

Reverence: I have tried to be assiduous in responsibly reusing clip art, music, etc. in work that I publish, partly for cerebral reasons but also partly because of a deep-seated, beyond-rational respect for creative work and a visceral distaste for broken attribution chains. The Creative Commons framework makes this easier, of course. When I reflect on the people who made transcripts that researchers trained Whisper on, and on how I reuse a derivative of their work using Whisper, I have a mix of emotions, including gratitude, worry, and a kind of eeriness, but not the kind of NO NO NO gut-level reaction that stops me from using some other products. I'm guessing people often have more gut reactions to AI-prompted art -- artifacts laden with emotional meaning -- than to functional tools like transcripts. And, as my writing partner Jacob mentioned as I was discussing this with him, programmers who identify with their work a lot will have pretty visceral feelings about tools like Copilot.

Fitting in: Within the subset of my social circles where people have opinions on machine learning tools, people are split between really enjoying and really despising AI-prompted/generated visual art and GPT-enabled chatbots. I haven't read anyone making arguments against using Whisper yet, and some people I know and like use it, but I also feel bad that my friends who make money from transcribing audio will know I am sometimes using this alternative.

Given this assessment I am pretty sure I'm going to continue to use Whisper. How ought I use it and similar classification tools responsibly? I figure I should try to be on my guard about inaccuracies, especially biased ones. If I publish output, I should be transparent and acknowledge the provenance of the transcription, sharing authorship in the metadata. Maybe I should make sure to tell people I know about it, especially those who currently make captions and transcripts manually, especially those who do so for money, so they have a shot at automating the tedious work and spending their time on more valuable things. And I should try to be aware of social movements that articulate ethical frameworks for using or abjuring these tools, in case I ought to change my mind.

Comments

David Weinberger
https://www.weinberger.org/writings/index.html
23 Dec 2022, 11:51 a.m.

Really helpful and thorough! I had been using happyscribe.com which is quite accurate and relatively inexpensive, but you make a great case for Whisper. Thanks, Sumana!

Eric Johansson
26 Dec 2022, 13:22

I think Stalin addressed the issue of a large corpus versus individual ownership when he said, "One death is a tragedy; one million is a statistic." If a person's artwork, speech sample, or whatever is used in training a machine learning system, that individual contribution has infinitesimal value. It's only when millions of samples are used that the aggregate has any value.

A second perspective: are there natural analogs to machine learning? Many artists, upon achieving commercial success, had studios with dozens or even hundreds of wet neural networks that trained using training sets and corrections to replicate the artist's work. Today, we use dry neural nets to do the same thing. It's just faster and more accurate.

When the wet neural nets tried to create their own work, they were influenced by what they had seen and done in the past, hence the phrase "in the style of". Dry neural nets again do the same thing, just faster and with greater flexibility.

Wet neural nets have been and will be accused of plagiarism and forgery when they create something too close "to the style of" .... Dry neural nets can do the same thing if told to do so, but in that case, the person telling them what to do should be held responsible for the plagiarism and forgery.

The natural conclusion of the anti-machine-learning position is that at some point, one will not be able to create without paying a tax to whatever existing work is similar to one's own. Creation will slow down and stop, because all current work is derived from what a person learned in the past.

At the end of the day, human learning and machine learning are fundamentally the same process. There are different mechanisms internally, and different kinds of errors that may lead to new creations, but in the end, stimulus plus training generates output. Similar stimulus and training generate similar output.