What is the Whisper model?
The Whisper model is a speech to text model from OpenAI that you can use to transcribe audio files. The model is trained on a large dataset of English audio and text. The model is optimized for transcribing audio files that contain speech in English. The model can also be used to transcribe audio files that contain speech in other languages. The output of the model is English text.
Whisper models are available via the Azure OpenAI Service or via Azure AI Speech. The features differ for those offerings. In Azure AI Speech (batch transcription), Whisper is just one of several models that you can use for speech to text.
You might ask:
Is the Whisper Model a good choice for my scenario, or is an Azure AI Speech model better? What are the API comparisons between the two types of models?
If I want to use the Whisper Model, should I use it via the Azure OpenAI Service or via Azure AI Speech? What are the scenarios that guide me to use one or the other?
Whisper model or Azure AI Speech models
Either the Whisper model or the Azure AI Speech models are appropriate depending on your scenarios. If you decide to use Azure AI Speech, you can choose from several models, including the Whisper model. The following table compares options with recommendations about where to start.
Whisper model via Azure AI Speech or via Azure OpenAI Service?
If you decide to use the Whisper model, you have two options. You can choose whether to use the Whisper Model via Azure OpenAI or via Azure AI Speech (batch transcription). In either case, the readability of the transcribed text is the same. You can input mixed language audio and the output is in English.
Whisper Model via Azure OpenAI Service might be best for:
- Quickly transcribing audio files one at a time
- Translating audio from other languages into English
- Providing a prompt to the model to guide the output
- Supported file formats: mp3, mp4, mpeg, mpga, m4a, wav, and webm
- Only ASCII characters are supported in file names
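The translation and prompt options in the list above can be sketched as follows. This is a minimal illustration, not the service's reference usage: it assumes the `openai` Python package's audio translations call, and the helper name `translate_to_english`, the file name, and the deployment name are hypothetical. The client is passed in so that the endpoint, key, and API version stay outside the helper.

```python
def translate_to_english(client, audio_path, deployment, prompt=None):
    """Translate the speech in one audio file into English text.

    `client` is expected to expose `audio.translations.create`, as the
    `openai` package's clients do. `deployment` is the name of your own
    Whisper deployment (hypothetical here).
    """
    with open(audio_path, "rb") as audio_file:
        request = {"model": deployment, "file": audio_file}
        if prompt:
            # An optional text prompt to guide the output, for example
            # spellings of names or domain terms.
            request["prompt"] = prompt
        result = client.audio.translations.create(**request)
    return result.text


# Example wiring (not executed here; every value is a placeholder to replace):
# from openai import AzureOpenAI
# client = AzureOpenAI(
#     azure_endpoint="https://<your-resource>.openai.azure.com",
#     api_key="<your-key>",
#     api_version="<api-version>",
# )
# print(translate_to_english(client, "interview.mp3", "<your-whisper-deployment>",
#                            prompt="Contoso, Fabrikam"))
```

Because the helper sends one file per call, it matches the "one file at a time" scenario rather than large batches.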
Whisper Model via Azure AI Speech batch transcription might be best for:
- Transcribing files larger than 25 MB (up to 1 GB). The file size limit for the Azure OpenAI Whisper model is 25 MB.
- Transcribing large batches of audio files.
- Diarization to distinguish between the different speakers participating in the conversation. The Speech service provides information about which speaker was speaking a particular part of transcribed speech. The Whisper model via Azure OpenAI doesn't support diarization.
- Word-level timestamps
- Supported file formats: mp3, wav, and ogg.
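The constraints in the two lists above can be turned into a simple pre-flight check. The sketch below only encodes the limits stated in this article (file formats, size limits, ASCII file names, and diarization support). The function names and route labels are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, so verify the exact limits against the service before relying on it.

```python
import os

MB = 1024 * 1024  # assumption: the article does not say whether MB is binary or decimal

# Limits as listed in this article.
AZURE_OPENAI_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
AZURE_OPENAI_MAX_BYTES = 25 * MB
SPEECH_BATCH_FORMATS = {"mp3", "wav", "ogg"}
SPEECH_BATCH_MAX_BYTES = 1024 * MB  # "up to 1 GB"


def _extension(filename):
    """Return the lowercase file extension without the leading dot."""
    return os.path.splitext(filename)[1].lstrip(".").lower()


def fits_azure_openai(filename, size_bytes):
    """Check format, size, and the ASCII-only file name restriction."""
    return (
        _extension(filename) in AZURE_OPENAI_FORMATS
        and size_bytes <= AZURE_OPENAI_MAX_BYTES
        and filename.isascii()
    )


def fits_speech_batch(filename, size_bytes):
    """Check format and size for Azure AI Speech batch transcription."""
    return (
        _extension(filename) in SPEECH_BATCH_FORMATS
        and size_bytes <= SPEECH_BATCH_MAX_BYTES
    )


def candidate_routes(filename, size_bytes, need_diarization=False):
    """List the offerings whose stated constraints this file satisfies."""
    routes = []
    # Whisper via Azure OpenAI doesn't support diarization.
    if fits_azure_openai(filename, size_bytes) and not need_diarization:
        routes.append("azure-openai")
    if fits_speech_batch(filename, size_bytes):
        routes.append("azure-ai-speech-batch")
    return routes
```

For example, `candidate_routes("talk.mp3", 100 * MB)` leaves only the batch transcription route, because the file exceeds the 25 MB limit for Azure OpenAI.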
Regional support is another consideration.
- The Whisper model via Azure OpenAI Service is available in the following regions: East US 2, India South, North Central US, Norway East, Sweden Central, Switzerland North, and West Europe.
- The Whisper model via Azure AI Speech is available in the following regions: Australia East, East US, North Central US, South Central US, Southeast Asia, UK South, and West Europe.
Related content
- Use Whisper models via the Azure AI Speech batch transcription API
- Try the speech to text quickstart for Whisper via Azure OpenAI
- Try the real-time speech to text quickstart via Azure AI Speech
How to Turn Audio to Text using OpenAI Whisper
Do you know what OpenAI Whisper is? It’s the latest AI model from OpenAI that helps you to automatically convert speech to text.
Transforming audio into text is now simpler and more accurate, thanks to OpenAI’s Whisper.
This article will guide you through using Whisper to convert spoken words into written form, providing a straightforward approach for anyone looking to leverage AI for efficient transcription.
Introduction to OpenAI Whisper
OpenAI Whisper is an automatic speech recognition (ASR) model. It is designed to understand spoken language and convert it into written text.
Its capabilities have opened up a wide array of use cases across various industries. Whether you’re a developer, a content creator, or just someone fascinated by AI, Whisper has something for you.
Let's go over some of its key features:
1. Transcription services: Whisper can transcribe audio and video content in real time or from recordings, making it useful for generating accurate meeting notes, interviews, lectures, and any spoken content that needs to be documented in text form.
2. Subtitling and closed captioning: It can automatically generate subtitles and closed captions for videos, improving accessibility for the deaf and hard-of-hearing community, as well as for viewers who prefer to watch videos with text.
3. Language learning and translation: Whisper's ability to transcribe in multiple languages supports language learning applications, where it can help in pronunciation practice and listening comprehension. Combined with translation models, it can also facilitate real-time cross-lingual communication.
4. Accessibility tools: Beyond subtitling, Whisper can be integrated into assistive technologies to help individuals with speech impairments or those who rely on text-based communication. It can convert spoken commands or queries into text for further processing, enhancing the usability of devices and software for everyone.
5. Content searchability: By transcribing audio and video content into text, Whisper makes it possible to search through vast amounts of multimedia data. This capability is crucial for media companies, educational institutions, and legal professionals who need to find specific information efficiently.
6. Voice-controlled applications: Whisper can serve as the backbone for developing voice-controlled applications and devices. It enables users to interact with technology through natural speech. This includes everything from smart home devices to complex industrial machinery.
7. Customer support automation: In customer service, Whisper can transcribe calls in real time. It allows for immediate analysis and response from automated systems. This can improve response times, accuracy in handling queries, and overall customer satisfaction.
8. Podcasting and journalism: For podcasters and journalists, Whisper offers a fast way to transcribe interviews and audio content for articles, blogs, and social media posts, streamlining content creation and making it accessible to a wider audience.
OpenAI's Whisper represents a significant advancement in speech recognition technology.
With its use cases spanning across enhancing accessibility, streamlining workflows, and fostering innovative applications in technology, it's a powerful tool for building modern applications.
How to Work with Whisper
Now let’s look at a simple code example to convert an audio file into text using OpenAI’s Whisper. I would recommend using a Google Colab notebook.
Before we dive into the code, you need two things:
- OpenAI API Key
- Sample audio file
First, install the OpenAI library by running pip install openai. Prefix the command with ! only if you are installing it from a notebook cell.
Now let’s write the code to transcribe a sample speech file to text:
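The original listing isn't reproduced here, so what follows is a minimal sketch rather than the exact script. It assumes the `openai` Python package (version 1 or later) and its `whisper-1` model name, reads your key from the `OPENAI_API_KEY` environment variable, and uses a hypothetical file name, `sample_audio.mp3`, that you should replace with your own. The helper name `transcribe_file` is also just for illustration.

```python
import os

AUDIO_PATH = "sample_audio.mp3"  # hypothetical file name; point this at your own audio


def transcribe_file(client, audio_path, model="whisper-1"):
    """Send one audio file to the transcription endpoint and return its text."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text


def main():
    # Imported here so the helper above can be reused with any compatible client.
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    print(transcribe_file(client, AUDIO_PATH))


# Only call the API when a key and an audio file are actually available.
if os.environ.get("OPENAI_API_KEY") and os.path.exists(AUDIO_PATH):
    main()
```

Keeping the API call in a small function that receives the client makes it easy to swap in a different client or audio file later.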
This script showcases a straightforward way to use OpenAI Whisper for transcribing audio files. By running this script with Python, you’ll see the transcription of your specified audio file printed to the console.
Feel free to experiment with different audio files and explore additional options provided by the Whisper Library to customize the transcription process to your needs.
Tips for Better Transcriptions
Whisper is powerful, but there are ways to get even better results from it. Here are some tips:
- Clear audio: The clearer your audio file, the better the transcription. Try to use files with minimal background noise.
- Language selection: Whisper supports multiple languages. If your audio isn’t in English, make sure to specify the language for better accuracy.
- Customize output: Whisper offers options to customize the output. You can ask it to include timestamps, confidence scores, and more. Explore the documentation to see what’s possible.
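Here's a rough sketch of the language and output tips above. It assumes the `openai` Python package's transcription call with its `language` and `response_format="verbose_json"` parameters, and that a recent version of the package returns segments as objects with `start`, `end`, and `text` attributes. The function names and the French example are hypothetical, so treat this as a starting point and check the documentation for your version.

```python
def transcribe_with_segments(client, audio_path, language, model="whisper-1"):
    """Transcribe one file with a language hint and request segment details.

    `language` is a short language code such as "fr" for French.
    """
    with open(audio_path, "rb") as audio_file:
        return client.audio.transcriptions.create(
            model=model,
            file=audio_file,
            language=language,
            response_format="verbose_json",  # includes timestamped segments
        )


def format_segments(segments):
    """Turn timestamped segments into readable lines."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg.start:.2f}s - {seg.end:.2f}s] {seg.text.strip()}")
    return lines


# Example wiring (not executed here; the file name is a placeholder):
# from openai import OpenAI
# client = OpenAI()
# result = transcribe_with_segments(client, "interview_fr.mp3", language="fr")
# print("\n".join(format_segments(result.segments)))
```

In the verbose output, each segment also carries fields such as `avg_logprob`, which you can use as a rough confidence signal.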
Advanced Features
Whisper isn’t just for simple transcriptions. It has features that cater to more advanced needs:
- Real-time transcription: You can set up Whisper to transcribe the audio in real time. This is great for live events or streaming.
- Multi-language support: Whisper can handle multiple languages in the same audio file. It’s perfect for multilingual meetings or interviews.
- Fine-tuning: If you have specific needs, you can fine-tune Whisper’s models to suit your audio better. This requires more technical skill but can significantly improve results.
Working with OpenAI Whisper opens up a world of possibilities. It’s not just about transcribing audio – it’s about making information more accessible and processes more efficient.
Whether you’re transcribing interviews for a research project, making your podcast more accessible with transcripts, or exploring new ways to interact with technology, Whisper has you covered.