Building Your Own Speech-to-Text Service with Whisper
Introduction
In the realm of English language learning, transcribing audio files is a crucial yet time-consuming task. Manual transcription is not only laborious but also prone to errors. This article explores how to streamline this process by leveraging Speech-to-Text (STT) technology, specifically focusing on building a service using the open-source Whisper model.
Understanding Whisper
Whisper is an open-source speech recognition model developed by OpenAI. It has gained popularity for its:
- High accuracy across multiple languages
- Ability to handle diverse accents and background noises
- Detailed transcriptions suitable for educational purposes
Creating a REST API for Whisper
To make Whisper more accessible and integrate it into our authoring tools, we’ll wrap it in a REST API. This approach allows for easy scalability and integration with various applications.
Setting Up the Whisper API
Follow these steps to set up the Whisper API using Docker:
# Clone the repository
git clone https://github.com/reallyenglish-global/whisper-api-flask
# Build the Docker image
docker build . -t whisper
# Run the service; MODEL selects the Whisper model size
# (tiny, base, small, medium or large; larger models are more accurate but heavier)
docker run -p 9000:5000 -e MODEL=small -d whisper
# Test the API
curl -F "file=@your_audio_file.mp3" http://0.0.0.0:9000/whisper
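A successful request returns JSON. The exact payload depends on the whisper-api-flask version, but judging from how the Ruby client below consumes it, the relevant shape is roughly as follows (the transcript text here is an invented placeholder):

{
  "results": [
    {
      "transcript": "...the transcribed text..."
    }
  ]
}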
Developing a Ruby Client
To interact with our Whisper API, we’ll create a Ruby client. This client will handle API communication and process transcription results.
Ruby Client Implementation
# frozen_string_literal: true

require 'faraday'
require 'faraday/multipart'
require 'active_support/core_ext/object/blank' # provides present?/blank? (already loaded in Rails)

# Thin client around the Whisper REST API.
class SpeechToText
  class ClientError < StandardError; end
  class ServerError < StandardError; end

  # The client is considered enabled when an endpoint is configured.
  def self.enabled?
    ENV.fetch('WHISPER_ENDPOINT', '').present?
  end

  # Accepts a File or a path string; returns the first transcript.
  def convert(audio)
    audio = File.new(audio) unless audio.is_a?(File)
    response = conn.post(endpoint, payload(audio.path))
    raise ServerError, "Error: #{response.status} - #{response.body}" unless response.success?

    results = response.body[:results]
    raise ClientError, 'No results found' if results.blank?

    results[0][:transcript]
  end

  private

  def conn
    @conn ||= Faraday.new do |f|
      f.request :multipart # encodes FileParts and sets the multipart Content-Type (with boundary)
      f.response :json, parser_options: { symbolize_names: true }
      f.adapter :net_http
    end
  end

  def endpoint
    @endpoint ||= ENV.fetch('WHISPER_ENDPOINT', '')
  end

  def payload(file_path)
    {
      file: Faraday::Multipart::FilePart.new(file_path, 'audio/mp3'),
      response_format: 'verbose_json'
    }
  end
end
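The two custom error classes let callers distinguish a misbehaving service from an empty result. For example (the file name is a placeholder):

begin
  transcript = SpeechToText.new.convert('lesson_01.mp3')
  puts transcript
rescue SpeechToText::ServerError => e
  warn "Whisper API failed: #{e.message}" # network problems or non-2xx responses
rescue SpeechToText::ClientError
  warn 'The audio produced no transcription results'
end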
Deploying to Kubernetes
For production environments, deploying the Whisper API within a Kubernetes cluster ensures scalability and reliability.
Kubernetes Deployment Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whisper
  namespace: speech
  labels:
    app: whisper
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whisper
  template:
    metadata:
      labels:
        app: whisper
    spec:
      containers:
        - name: whisper
          image: ghcr.io/reallyenglish-global/whisper-api-flask
          imagePullPolicy: Always
          ports:
            - containerPort: 5000
          env:
            - name: MODEL
              value: base
          readinessProbe:
            httpGet:
              path: /
              port: 5000
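Whisper models are memory-hungry (see the considerations below), so in practice you will likely want resource requests and limits on the container. A sketch of what could sit under the whisper container spec alongside readinessProbe; the values are placeholders to tune for your model size:

          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              memory: 4Gi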
Kubernetes Service Configuration
apiVersion: v1
kind: Service
metadata:
  name: whisper
  namespace: speech
spec:
  selector:
    app: whisper
  ports:
    - port: 5000
      targetPort: 5000
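Inside the cluster, other workloads can now reach the API through the Service's standard Kubernetes DNS name. Assuming the same /whisper route used in the curl example earlier, the Ruby client's environment variable would look like:

# <service>.<namespace>.svc.cluster.local resolves via cluster DNS
WHISPER_ENDPOINT=http://whisper.speech.svc.cluster.local:5000/whisper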
Integrating the Service
Incorporating the Whisper API into your application is straightforward with the Ruby client:
service = SpeechToText.new
transcript = service.convert('path/to/audio_file.mp3')
# Process the transcript as needed
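Because the client is driven entirely by the WHISPER_ENDPOINT environment variable, the enabled? predicate lets calling code degrade gracefully when no endpoint is configured:

if SpeechToText.enabled?
  transcript = SpeechToText.new.convert('path/to/audio_file.mp3')
else
  transcript = nil # fall back to manual transcription
end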
Considerations and Optimizations
While Whisper offers a powerful solution for speech transcription, consider the following:
- Performance: GPU acceleration significantly improves processing speed.
- Resource Usage: The model requires substantial memory, impacting hosting costs.
- Scalability: Different model sizes offer trade-offs between accuracy and resource consumption.
- API Security: Implement authentication to control access to your API.
- Error Handling: Implement robust error handling for scenarios such as network issues, timeouts, and invalid audio files (see the sketch after this list).
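As an example of the last point, transcribing long recordings can take minutes, so the Faraday connection benefits from generous timeouts and retries for transient failures. A minimal sketch, assuming the faraday-retry gem (a separate gem in Faraday 2.x):

require 'faraday'
require 'faraday/retry'

conn = Faraday.new do |f|
  # POST is not retried by default, so opt it in explicitly
  f.request :retry, max: 2, interval: 1, backoff_factor: 2, methods: %i[post]
  f.options.open_timeout = 5  # fail fast when the service is unreachable
  f.options.timeout = 300     # long recordings take a while to transcribe
  f.adapter :net_http
end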
Future Enhancements
To further improve the service, consider:
- Implementing Caching: Store frequently requested transcriptions to reduce processing load (sketched after this list).
- Adding Language Detection: Automatically detect the spoken language for multi-language support.
- Integrating Analytics: Track usage patterns to optimize resource allocation.
- Implementing Batch Processing: Allow multiple audio files to be processed in a single request.
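As a concrete starting point for the caching idea, transcripts can be keyed on a digest of the audio bytes so that identical files are transcribed only once. A minimal sketch; the wrapper class name and in-memory store are illustrative, and a shared store such as Rails.cache or Redis would replace the hash in production:

require 'digest'

class CachedSpeechToText
  def initialize(stt: SpeechToText.new, store: {})
    @stt = stt
    @store = store # any hash-like store works for the sketch
  end

  def convert(path)
    key = Digest::SHA256.file(path).hexdigest # identical audio hashes to the same key
    @store[key] ||= @stt.convert(path)
  end
end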
Conclusion
Building a custom Speech-to-Text service using Whisper can significantly enhance the efficiency of creating English learning resources. By following this guide, you can create a robust, scalable solution tailored to your specific needs, making the transcription process more accurate and less time-consuming.
Remember to stay updated with the latest developments in the Whisper project, as continuous improvements may offer new features and enhanced performance over time.