Getting Started with OpenAI Whisper
By Justin

OpenAI Whisper is an incredible, now open-source, tool for transcribing audio with near-perfect quality.
This post covers the basics, from installation all the way to exporting your automated transcription in the popular WebVTT (.vtt) format for video captions.
Fun fact: the thumbnail image on this post was generated by OpenAI's DALL·E 2.
python
# install ffmpeg if you don't already have it on your system
# !apt update && apt install ffmpeg
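Both Whisper and the conversion step later in this post shell out to ffmpeg, so it can help to confirm it is on your PATH before going further. Here is a minimal sketch using only the standard library; the helper name `find_ffmpeg` is made up for illustration.

```python
import shutil


def find_ffmpeg():
    """Return the full path to the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


ffmpeg_path = find_ffmpeg()
if ffmpeg_path is None:
    print("ffmpeg not found; install it before continuing.")
else:
    print(f"ffmpeg found at {ffmpeg_path}")
```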
Install OpenAI Whisper
python
python -m pip install git+https://github.com/openai/whisper.git
Install PyTube for our Video Download
python
python -m pip install pytube
Download a video to transcribe. In this case, I'm using one of my own videos from https://cfe.sh/youtube.
PyTube is awesome, but please let's not abuse downloading videos this way.
python
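from pytube import YouTube
# Progressive streams bundle audio and video in a single file.
# Sorting by resolution and taking the first keeps the download small;
# we only need the audio track for transcription.
(
    YouTube('https://www.youtube.com/watch?v=VtHrTX_nTto')
    .streams
    .filter(progressive=True, file_extension='mp4')
    .order_by('resolution')
    .first()
    .download()
)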
Automatically rename your download to video.mp4
python
import pathlib
# Assumes the download is the only .mp4 file in the current directory.
videos = list(pathlib.Path().glob("*.mp4"))
video = videos[0].rename("video.mp4")
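The glob above grabs whichever .mp4 happens to come first, which can be the wrong file if the folder already holds other videos. As a rough sketch, one way around that is to pick the most recently modified file instead. The helper name `rename_newest_mp4` and its parameters are hypothetical, not part of PyTube.

```python
import pathlib


def rename_newest_mp4(directory=".", target_name="video.mp4"):
    """Rename the most recently modified .mp4 in `directory` to `target_name`.

    Returns the new path, or None when there is nothing to rename.
    """
    folder = pathlib.Path(directory)
    # Ignore a leftover target file from a previous run.
    candidates = [p for p in folder.glob("*.mp4") if p.name != target_name]
    if not candidates:
        return None
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    target = folder / target_name
    newest.replace(target)  # overwrites an existing target file
    return target
```

Calling `rename_newest_mp4()` right after the download should leave you with `video.mp4` in the current directory, assuming the download was the last .mp4 written there.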
Convert video.mp4 into audio.mp3 with ffmpeg
python
import subprocess
command = "ffmpeg -i video.mp4 -ab 160k -ac 2 -ar 44100 -vn audio.mp3"
subprocess.call(command, shell=True)
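The same conversion can be written without `shell=True` by passing the arguments as a list, which avoids quoting problems if a filename contains spaces. This is a sketch rather than a drop-in replacement: the helper names are made up, and the `-y` flag (overwrite the output file without prompting) is an addition that is not in the command above.

```python
import subprocess


def build_ffmpeg_command(source="video.mp4", target="audio.mp3"):
    """Build the ffmpeg argument list for extracting an mp3 audio track."""
    return [
        "ffmpeg",
        "-y",            # assumption: overwrite target if it already exists
        "-i", source,    # input file
        "-vn",           # drop the video stream
        "-ab", "160k",   # audio bitrate
        "-ac", "2",      # two audio channels (stereo)
        "-ar", "44100",  # 44.1 kHz sample rate
        target,
    ]


def extract_audio(source="video.mp4", target="audio.mp3"):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_command(source, target), check=True)
```

With `check=True`, a failed conversion raises an exception instead of silently leaving you without `audio.mp3`.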
Use OpenAI Whisper to transcribe our video:
python
import whisper
# We're using the `base` size model. Check out
# https://github.com/openai/whisper#available-models-and-languages
# for larger, more accurate (but slower) models.
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"][:240])
Results are:
Software and automation are things of the present as well as the future. And me personally, I've wanted to be a part of that for a very long time. See, I started this channel to show you how you can do the same through code, right? How you
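Besides the full `text`, the result dictionary also carries a `segments` list whose entries include `start`, `end`, and `text`; the caption export later in this post relies on those. Here is a small sketch for previewing them. The helper name is made up, and `sample_result` is a hand-written stand-in with invented values, not real Whisper output.

```python
def preview_segments(result, limit=3):
    """Format the first `limit` segments as '[start -> end] text' lines."""
    lines = []
    for segment in result["segments"][:limit]:
        start, end = segment["start"], segment["end"]
        lines.append(f"[{start:.2f}s -> {end:.2f}s] {segment['text'].strip()}")
    return lines


# Hand-written stand-in for a transcription result (illustrative values only).
sample_result = {
    "text": " Hello world. This is a test.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello world."},
        {"start": 1.5, "end": 3.0, "text": " This is a test."},
    ],
}

for line in preview_segments(sample_result):
    print(line)
```

On a real run you would pass the `result` returned by `model.transcribe` instead of the sample dictionary.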
python
import datetime
def timedelta_to_videotime(delta):
    """
    Here's a janky way to format the string form of a
    datetime.timedelta (e.g. "0:00:06.120000") to match
    the format of vtt timecodes (HH:MM:SS.mmm).
    """
    parts = delta.split(":")
    if len(parts[0]) == 1:
        # Pad single-digit hours: "0:00:06" becomes "00:00:06"
        parts[0] = f"0{parts[0]}"
    new_data = ":".join(parts)
    parts2 = new_data.split(".")
    if len(parts2) == 1:
        # Whole seconds have no fractional part
        parts2.append("000")
    elif len(parts2) == 2:
        # Trim microseconds to milliseconds; WebVTT expects exactly three digits
        parts2[1] = parts2[1][:3]
    final_data = ".".join(parts2)
    return final_data
python
def whisper_segments_to_vtt_data(result_segments):
    """
    This function iterates through all whisper
    segments to format them into WebVTT.
    """
    data = "WEBVTT\n\n"
    for idx, segment in enumerate(result_segments):
        # Cue identifiers start at 1
        num = idx + 1
        data += f"{num}\n"
        start_ = datetime.timedelta(seconds=segment.get('start'))
        start_ = timedelta_to_videotime(str(start_))
        end_ = datetime.timedelta(seconds=segment.get('end'))
        end_ = timedelta_to_videotime(str(end_))
        data += f"{start_} --> {end_}\n"
        text = segment.get('text').strip()
        # A blank line ends each cue
        data += f"{text}\n\n"
    return data
python
caption_data = whisper_segments_to_vtt_data(result['segments'])
python
print(caption_data)
WEBVTT

1
00:00:00.000 --> 00:00:06.120
Software and automation are things of the present as well as the future.

2
00:00:06.120 --> 00:00:10.920
And me personally, I've wanted to be a part of that for a very long time.

3
00:00:10.920 --> 00:00:17.200
See, I started this channel to show you how you can do the same through code, right?

4
00:00:17.200 --> 00:00:22.680
How you can build something real step by step and start thinking through how you can

5
00:00:22.680 --> 00:00:25.000
automate things with software.

6
00:00:25.000 --> 00:00:30.760
And I do it with more of a business flair. That's why the entrepreneurial side of things.

7
00:00:30.760 --> 00:00:35.360
So what I hope that you do is subscribe and watch these tutorials.

8
00:00:35.360 --> 00:00:39.840
But more importantly, I hope that you do them and do something with them.

9
00:00:39.840 --> 00:00:43.760
Please let me know who you are. Tell me in the comments. I would love to check out the

10
00:00:43.760 --> 00:00:49.040
things that you're working on and what you're doing as well so we can all do this together.

11
00:00:49.040 --> 00:00:53.400
Because that's what it's all about. It's what you do. It's what we do together as a community

12
00:00:53.400 --> 00:00:57.400
to build real things and solve real problems.

13
00:00:57.400 --> 00:01:26.400
Thanks so much. Hope to see you in the series.
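If the string juggling in `timedelta_to_videotime` feels fragile, the timecode can also be built arithmetically from the raw seconds value that Whisper gives for each segment. This is an alternative sketch, not the approach used above; the function name `seconds_to_vtt_timecode` is made up for illustration.

```python
def seconds_to_vtt_timecode(seconds):
    """Format a number of seconds as an HH:MM:SS.mmm timecode."""
    # Work in whole milliseconds to sidestep floating-point formatting quirks.
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


print(seconds_to_vtt_timecode(6.12))    # 00:00:06.120
print(seconds_to_vtt_timecode(3661.5))  # 01:01:01.500
```

Because the milliseconds are always zero-padded to three digits, every timecode comes out the same width.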
Save transcription to local file.
python
export_file = pathlib.Path('captions.vtt')
# WebVTT files must be UTF-8 encoded, so set the encoding explicitly.
export_file.write_text(caption_data, encoding="utf-8")
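Video players can be picky about caption files, and WebVTT timecodes need exactly three millisecond digits. Before uploading, a quick sanity check on the timing lines can catch formatting slips. The sketch below is deliberately simplified: it only accepts the `HH:MM:SS.mmm --> HH:MM:SS.mmm` shape this post generates, not every variant the WebVTT format allows (such as omitted hours or trailing cue settings), and the helper name is hypothetical.

```python
import re

# Simplified pattern: hours are required and no cue settings are allowed.
TIMECODE = r"\d{2,}:\d{2}:\d{2}\.\d{3}"
CUE_TIMING = re.compile(rf"^{TIMECODE} --> {TIMECODE}$")


def find_bad_timing_lines(vtt_text):
    """Return every line containing '-->' that does not match the expected shape."""
    return [
        line
        for line in vtt_text.splitlines()
        if "-->" in line and not CUE_TIMING.match(line.strip())
    ]


sample = "WEBVTT\n\n1\n00:00:00.000 --> 00:00:06.12\nHello\n\n"
print(find_bad_timing_lines(sample))  # ['00:00:00.000 --> 00:00:06.12']
```

An empty list from `find_bad_timing_lines(caption_data)` suggests the timing lines are well-formed, though it says nothing about whether the captions themselves are accurate.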