Getting Started with OpenAI Whisper

OpenAI Whisper is an incredible, now open source tool for transcribing audio with remarkably high accuracy.

This post walks you through the basics, from installation all the way to exporting your automated transcription as WebVTT (.vtt), a popular format for video captions.

Fun fact: the thumbnail image on this post was generated by OpenAI's DALL·E 2.

python

# install ffmpeg if you don't already have it on your system
# !apt update && apt install ffmpeg

Install OpenAI Whisper

python

python -m pip install git+https://github.com/openai/whisper.git

Install PyTube for our Video Download

python

python -m pip install pytube
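Before going further, it can help to confirm that everything installed above is actually visible to Python. The following is a minimal sketch using only the standard library; the function name `missing_prerequisites` is made up for this example and is not part of Whisper or PyTube.

```python
import importlib.util
import shutil


def missing_prerequisites():
    """Return the names of the tools and packages used in this post that were not found."""
    missing = []
    # ffmpeg is a command line program, so look for it on the PATH
    if shutil.which("ffmpeg") is None:
        missing.append("ffmpeg")
    # whisper and pytube are Python packages, so check that they are importable
    for module_name in ("whisper", "pytube"):
        if importlib.util.find_spec(module_name) is None:
            missing.append(module_name)
    return missing


print(missing_prerequisites() or "all prerequisites found")
```

An empty list means the rest of the steps should at least be able to start.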

Download a video to transcribe. In this case, I'm using one of my videos from https://cfe.sh/youtube

PyTube is awesome, but please don't abuse it for downloading videos.

python

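from pytube import YouTube

# Progressive streams bundle audio and video in a single file.
# Sorting by resolution and taking the first result should pick the
# lowest-resolution stream, which is enough since we only need the audio.
stream = (
  YouTube('https://www.youtube.com/watch?v=VtHrTX_nTto')
  .streams
  .filter(progressive=True, file_extension='mp4')
  .order_by('resolution')
  .first()
)
stream.download()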

Automatically rename your download to video.mp4

python

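import pathlib

# Assumes the download above is the only .mp4 in the current directory
videos = list(pathlib.Path().glob("*.mp4"))
# Path.rename returns the new path on Python 3.8+
video = videos[0].rename("video.mp4")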
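If the folder already holds other .mp4 files, `videos[0]` may not be the file you just downloaded. One possible workaround, sketched here with a hypothetical helper called `newest_mp4`, is to pick the most recently modified file instead.

```python
import pathlib


def newest_mp4(directory="."):
    """Return the most recently modified .mp4 in `directory`, or None if there is none."""
    candidates = list(pathlib.Path(directory).glob("*.mp4"))
    if not candidates:
        return None
    # st_mtime is the last-modified timestamp; the largest value is the newest file
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

You could then call `newest_mp4().rename("video.mp4")`, after checking that the result is not `None`.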

Convert video.mp4 into audio.mp3 with ffmpeg

python

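import subprocess

# -vn drops the video; -ab, -ac, -ar set audio bitrate, channels, sample rate
command = ["ffmpeg", "-i", "video.mp4", "-ab", "160k", "-ac", "2",
           "-ar", "44100", "-vn", "audio.mp3"]

# An argument list avoids shell=True; check=True raises if ffmpeg fails
subprocess.run(command, check=True)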
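As far as I know, Whisper resamples whatever audio it loads to 16 kHz mono, so a smaller file than the 44.1 kHz stereo mp3 above should transcribe just as well. Here is a sketch of a small command builder with those values as defaults; `build_ffmpeg_command` and its parameters are names invented for this example, while the ffmpeg flags themselves are standard.

```python
def build_ffmpeg_command(source, target, sample_rate=16000, channels=1):
    """Build an ffmpeg argument list that extracts the audio track from `source`."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the target file if it already exists
        "-i", str(source),        # input file
        "-vn",                    # drop the video stream
        "-ac", str(channels),     # number of audio channels
        "-ar", str(sample_rate),  # audio sample rate in Hz
        str(target),
    ]
```

Pass the result to `subprocess.run(..., check=True)`, for example with `build_ffmpeg_command("video.mp4", "audio.mp3")`.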

Use OpenAI Whisper to transcribe our video:

python

import whisper

# We're using the `base` size model. Check out
# https://github.com/openai/whisper#available-models-and-languages
# for larger, more accurate models.
model = whisper.load_model("base")

result = model.transcribe("audio.mp3")

print(result["text"][:240])

The first 240 characters of the result:

Software and automation are things of the present as well as the future. And me personally, I've wanted to be a part of that for a very long time. See, I started this channel to show you how you can do the same through code, right? How you
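The caption code further down relies on `result["segments"]`, a list of dictionaries that each carry at least a `start` time, an `end` time (both in seconds) and a `text`. To get a feel for that structure, here is a small sketch that prints the first few segments. The helper `describe_segments` is hypothetical, and `sample_segments` is a hand-written stand-in (with values taken from the captions shown in this post) so the snippet runs without a model.

```python
def describe_segments(segments, limit=3):
    """Return one readable line per segment, for at most `limit` segments."""
    lines = []
    for segment in segments[:limit]:
        text = segment["text"].strip()
        lines.append(f"{segment['start']:.2f}s -> {segment['end']:.2f}s: {text}")
    return lines


# Hand-written stand-in for result["segments"]; real segments carry more keys
sample_segments = [
    {"start": 0.0, "end": 6.12,
     "text": " Software and automation are things of the present as well as the future."},
    {"start": 6.12, "end": 10.92,
     "text": " And me personally, I've wanted to be a part of that for a very long time."},
]

for line in describe_segments(sample_segments):
    print(line)
```

With a real transcription you would call `describe_segments(result["segments"])` instead.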

python

import datetime

def timedelta_to_videotime(delta):
  """
  Here's a janky way to format the string form of a
  datetime.timedelta (e.g. "0:00:06.120000") to match
  the format of vtt timecodes ("00:00:06.120").
  """
  parts = delta.split(":")
  if len(parts[0]) == 1:
    # zero-pad single-digit hours
    parts[0] = f"0{parts[0]}"
  new_data = ":".join(parts)
  parts2 = new_data.split(".")
  if len(parts2) == 1:
    # whole seconds: str(timedelta) leaves out the fraction
    parts2.append("000")
  elif len(parts2) == 2:
    # WebVTT needs exactly three millisecond digits
    parts2[1] = parts2[1][:3]
  final_data = ".".join(parts2)
  return final_data
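If the string slicing feels fragile, the same timecode can be computed with plain arithmetic on the number of seconds. This is a sketch of an alternative, not what the rest of this post uses; `seconds_to_vtt_timestamp` is a name invented here, and it assumes a non-negative input.

```python
def seconds_to_vtt_timestamp(seconds):
    """Format a non-negative number of seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)  # milliseconds per hour
    minutes, remainder = divmod(remainder, 60_000)  # milliseconds per minute
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


print(seconds_to_vtt_timestamp(6.12))
```

Because it works on the raw `start` and `end` values, there is no need to build a `datetime.timedelta` first.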

python

def whisper_segments_to_vtt_data(result_segments):
  """
  This function iterates through all Whisper
  segments to format them into WebVTT.
  """
  data = "WEBVTT\n\n"
  for idx, segment in enumerate(result_segments):
    # cue identifier, counting from 1
    num = idx + 1
    data += f"{num}\n"
    # segment times are in seconds
    start_ = datetime.timedelta(seconds=segment.get('start'))
    start_ = timedelta_to_videotime(str(start_))
    end_ = datetime.timedelta(seconds=segment.get('end'))
    end_ = timedelta_to_videotime(str(end_))
    data += f"{start_} --> {end_}\n"
    text = segment.get('text').strip()
    # a blank line ends each cue
    data += f"{text}\n\n"
  return data

python

caption_data = whisper_segments_to_vtt_data(result['segments'])

python

print(caption_data)

WEBVTT

1
00:00:00.000 --> 00:00:06.120
Software and automation are things of the present as well as the future.

2
00:00:06.120 --> 00:00:10.920
And me personally, I've wanted to be a part of that for a very long time.

3
00:00:10.920 --> 00:00:17.200
See, I started this channel to show you how you can do the same through code, right?

4
00:00:17.200 --> 00:00:22.680
How you can build something real step by step and start thinking through how you can

5
00:00:22.680 --> 00:00:25.000
automate things with software.

6
00:00:25.000 --> 00:00:30.760
And I do it with more of a business flair. That's why the entrepreneurial side of things.

7
00:00:30.760 --> 00:00:35.360
So what I hope that you do is subscribe and watch these tutorials.

8
00:00:35.360 --> 00:00:39.840
But more importantly, I hope that you do them and do something with them.

9
00:00:39.840 --> 00:00:43.760
Please let me know who you are. Tell me in the comments. I would love to check out the

10
00:00:43.760 --> 00:00:49.040
things that you're working on and what you're doing as well so we can all do this together.

11
00:00:49.040 --> 00:00:53.400
Because that's what it's all about. It's what you do. It's what we do together as a community

12
00:00:53.400 --> 00:00:57.400
to build real things and solve real problems.

13
00:00:57.400 --> 00:01:26.400
Thanks so much. Hope to see you in the series.

Save the transcription to a local file.

python

export_file = pathlib.Path('captions.vtt')
# WebVTT files are expected to be UTF-8 encoded
export_file.write_text(caption_data, encoding="utf-8")
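Before uploading the captions anywhere, a quick sanity check can catch malformed timing lines, since WebVTT timestamps need exactly three digits after the decimal point. The sketch below is deliberately simplified: `invalid_timing_lines` is a hypothetical helper, and it assumes a single space around `-->` and no cue settings after the end time, which matches the captions generated in this post but not everything the format allows.

```python
import re

# HH:MM:SS.mmm, with the hours part optional
TIMESTAMP = r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}"
CUE_TIMING = re.compile(rf"{TIMESTAMP} --> {TIMESTAMP}")


def invalid_timing_lines(vtt_text):
    """Return (line_number, line) pairs for cue timing lines that look malformed."""
    problems = []
    for number, line in enumerate(vtt_text.splitlines(), start=1):
        # only lines containing the arrow are treated as timing lines
        if "-->" in line and not CUE_TIMING.fullmatch(line.strip()):
            problems.append((number, line))
    return problems
```

Calling `invalid_timing_lines(caption_data)` should return an empty list when every cue is well formed.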