Audio to Text with Python (Whisper) - Full Guide

Documentation: https://github.com/openai/whisper

 

Overview

Whisper is simple to use but a little more demanding to set up. Its dependencies are shown below:

 

Whisper
   │
   ├── PyTorch
   │      └── Visual C++ Runtime
   │
   └── FFmpeg

 


Installation


Visual C++ Runtime

List the installed Visual C++ packages by running this in PowerShell:

Get-WmiObject -Class Win32_Product | Where-Object {$_.Name -like "*Visual C++*"} | Select Name

Confirm that you have the following installed:

  • Microsoft Visual C++ 2022 X64 Minimum Runtime

  • Microsoft Visual C++ 2022 X64 Additional Runtime

If not, install the latest Microsoft Visual C++ Redistributable (x64) from Microsoft.



PyTorch


To avoid conflicts with other Python packages, create a dedicated virtual environment for this project before installing.
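
Before installing anything, it can help to confirm that a virtual environment is actually active. The sketch below uses only the standard library; the function name in_virtualenv is just an illustration, and the check assumes an environment created with python -m venv (or a recent virtualenv), so other environment managers may not be detected.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment detected; packages would install globally.")
```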
 

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu


Confirm the installation:

python -c "import torch; print(torch.__version__)"

 

The expected output will be something like:
2.x.x+cpu
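
The same check can be done from a script, which is handy if you want to see which device PyTorch will use. This is a minimal sketch: torch_status is a hypothetical helper name, and the OSError branch reflects an assumption that a missing runtime library on Windows may surface as that error rather than as ImportError.

```python
def torch_status() -> str:
    """Describe the installed PyTorch build, or explain why it cannot be used."""
    try:
        import torch  # imported lazily so the script still runs without PyTorch
    except (ImportError, OSError) as exc:
        # ImportError: the package is missing. OSError: a native library
        # failed to load (assumed to be the symptom of a missing runtime).
        return f"PyTorch could not be imported: {exc}"
    # The CPU-only wheels installed above report False for CUDA availability.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return f"PyTorch {torch.__version__} (device: {device})"


if __name__ == "__main__":
    print(torch_status())
```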

 

FFmpeg

Whisper uses FFmpeg to decode and read audio files. Check whether FFmpeg is already installed:

ffmpeg -version


If the command is not recognized, you will need to install it:

From https://www.gyan.dev/ffmpeg/builds/, download: ffmpeg-git-essentials.7z

Create a folder named ffmpeg in C:/ and extract the contents of the downloaded archive into it:

C:/
└── ffmpeg/
    ├── bin/
    │   ├── ffmpeg.exe
    │   ├── ffplay.exe
    │   └── ffprobe.exe
    ├── doc/
    └── LICENSE


Add FFmpeg to PATH:

Environment Variables > Edit Environment Variables > User Variables > PATH > New > C:/ffmpeg/bin

Open a new terminal afterwards so that the updated PATH takes effect.
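
To verify that FFmpeg is now reachable through PATH, a short script along these lines can be used. It is only a sketch built on the standard library; find_ffmpeg and ffmpeg_version_line are illustrative names, not part of Whisper or FFmpeg.

```python
import shutil
import subprocess


def find_ffmpeg():
    """Return the full path of the ffmpeg executable found on PATH, or None."""
    return shutil.which("ffmpeg")


def ffmpeg_version_line(executable: str) -> str:
    """Return the first line printed by `ffmpeg -version`."""
    result = subprocess.run(
        [executable, "-version"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg not found on PATH; check the PATH entry and reopen the terminal.")
    else:
        print(f"Found: {path}")
        print(ffmpeg_version_line(path))
```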
 



 

Whisper


Installation

pip install openai-whisper



Usage

whisper audio.mp3 --model base --language English
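
Whisper can also be called from Python instead of the command line. The sketch below wraps the package's load_model and transcribe calls in a hypothetical helper, transcribe_file; the file name audio.mp3 and the choice of fp16=False (to keep inference in 32-bit floats on a CPU-only setup) are assumptions made for this example.

```python
from pathlib import Path


def transcribe_file(audio_path: str, model_name: str = "base", language: str = "en") -> str:
    """Transcribe one audio file with Whisper and return the recognized text."""
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {path}")

    import whisper  # lazy import: loading Whisper also loads PyTorch

    model = whisper.load_model(model_name)  # downloads the weights on first use
    # fp16=False runs inference in 32-bit floats, matching a CPU-only install.
    result = model.transcribe(str(path), language=language, fp16=False)
    return result["text"].strip()


# Example call (requires openai-whisper, FFmpeg, and an existing audio file):
# print(transcribe_file("audio.mp3"))
```

Checking that the file exists first gives a clear error before any model weights are downloaded or loaded.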

 

 

Available models

Whisper provides multiple models, allowing you to balance speed and accuracy:

  • tiny
  • base
  • small
  • medium
  • large
  • turbo

Larger models generally offer better accuracy at the cost of higher processing requirements. 
I've tried the base model on several audio files, and it did a good job with English.
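
As a rough illustration of that trade-off, the sketch below picks a model for a given memory budget. The parameter counts and memory figures are approximate values as listed in the Whisper README at the time of writing, so treat them as assumptions. pick_model is a hypothetical helper, and using parameter count as a stand-in for accuracy is a simplification.

```python
# Approximate figures from the Whisper README:
# model name -> (parameters in millions, required VRAM in GB).
MODEL_SPECS = {
    "tiny": (39, 1),
    "base": (74, 1),
    "small": (244, 2),
    "medium": (769, 5),
    "large": (1550, 10),
    "turbo": (809, 6),
}


def pick_model(available_vram_gb: float) -> str:
    """Return the model with the most parameters that fits the memory budget."""
    candidates = [
        (params, name)
        for name, (params, vram) in MODEL_SPECS.items()
        if vram <= available_vram_gb
    ]
    if not candidates:
        raise ValueError("No Whisper model fits in the given memory budget")
    # max() compares the parameter count first, so the largest model wins.
    return max(candidates)[1]


if __name__ == "__main__":
    for budget in (1, 4, 10):
        print(f"{budget} GB -> {pick_model(budget)}")
```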

 

Language support

The same flexibility applies to languages. Whisper supports a wide range of languages, which can be explicitly specified or automatically detected, making it suitable for multilingual transcription workflows.
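
To make the difference between explicit and automatic language selection concrete, here is a small sketch that builds the whisper command shown earlier. The helper names build_whisper_command and run_whisper, and the file name interview.mp3, are made up for this example; leaving out --language is what lets Whisper detect the language itself.

```python
import subprocess


def build_whisper_command(audio_path, model="base", language=None):
    """Build the argument list for the whisper command-line tool."""
    command = ["whisper", audio_path, "--model", model]
    if language is not None:
        # Pass --language only when it is known; without it, Whisper
        # detects the spoken language automatically.
        command += ["--language", language]
    return command


def run_whisper(audio_path, model="base", language=None):
    """Run the whisper CLI and raise CalledProcessError if it fails."""
    subprocess.run(build_whisper_command(audio_path, model, language), check=True)


if __name__ == "__main__":
    # Explicit language:
    print(build_whisper_command("audio.mp3", language="English"))
    # Automatic detection:
    print(build_whisper_command("interview.mp3", model="small"))
```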

 



When used correctly, this tool can be a real game changer.
Long hours of meetings, interviews, or conversations can be automatically transcribed by a computer and later summarised, all with the help of AI.


 


Django 5.2
openai-whisper==20250625