In today’s fast-paced digital world, the ability to convert speech to text efficiently and accurately is more important than ever. Whether it’s for improving customer service, ensuring compliance, or driving data-driven insights, businesses across many sectors are looking for robust solutions for transcribing phone calls. But how do you build a system that can handle varying call volumes, deliver high accuracy, and scale effortlessly?
In this article, we’ll dive into how we built a scalable speech-to-text transcription service using Azure Kubernetes Service (AKS), Azure Cognitive Services, and Twilio. We’ll explore the architecture, key code snippets, and the challenges we faced along the way.
Our goal was to create a system that could:
- Handle real-time transcription of phone calls
- Scale automatically based on call volume
- Ensure high availability and fault tolerance
- Securely store transcriptions for future analysis
We decided to leverage the power of Kubernetes orchestration, Azure’s cloud services, and Twilio’s communication platform to build our solution. Here’s a high-level overview of the flow:

[Caller] → [Twilio] → [AKS Cluster] → [Azure Speech-to-Text] → [Azure Blob Storage]
Here’s a more detailed diagram of the system architecture:
[Caller] ----> [Twilio] ----> [Azure Load Balancer]
                                      |
                                      v
                               [AKS Cluster]
                                      |
                            [Ingress Controller]
                                      |
                         [Speech-to-Text App Pods]
                         /            |            \
           [Azure Speech API]   [Azure Blob]   [Azure Monitor]
                   |                  |
         [Cognitive Services]  [Storage Account]
Let’s break down each component and look at some key code snippets.
We used Flask, a lightweight Python web framework, to create an API that handles Twilio webhooks. Here’s a simplified version of our main application code:
from flask import Flask, request
from azure.cognitiveservices.speech import SpeechConfig, AudioConfig, SpeechRecognizer
from azure.storage.blob import BlobServiceClient
import os

app = Flask(__name__)

@app.route("/transcribe", methods=['POST'])
def transcribe():
    # Twilio POSTs the recording details here once a recording is ready
    recording_url = request.values.get("RecordingUrl", None)
    if recording_url:
        transcription = perform_transcription(recording_url)
        save_to_blob(transcription, request.values)
        return "Transcription complete", 200
    else:
        return "No recording URL provided", 400

# ... rest of the code
This Flask application exposes a /transcribe endpoint that Twilio can call with the recording URL. We then use Azure’s Speech-to-Text service to transcribe the audio and save the result to Azure Blob Storage.
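For completeness, here’s a sketch of the companion webhook that asks Twilio to record the call in the first place; the /voice route and the public callback URL are assumptions, not part of the snippet above. The Record verb’s recording status callback is what delivers RecordingUrl to /transcribe once the recording is ready.

from twilio.twiml.voice_response import VoiceResponse

@app.route("/voice", methods=['POST'])  # hypothetical entry point for incoming calls
def voice():
    response = VoiceResponse()
    response.say("This call may be recorded for quality purposes.")
    # Twilio POSTs RecordingUrl to /transcribe when the recording is ready
    response.record(
        recording_status_callback="https://your-domain.example/transcribe",  # assumed public URL
        recording_status_callback_method="POST"
    )
    return str(response), 200, {"Content-Type": "text/xml"}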
The heart of the service is the transcription functionality. We use Azure’s Speech SDK to convert the audio to text:
def perform_transcription(audio_url):
    speech_config = SpeechConfig(subscription=os.getenv('AZURE_SPEECH_KEY'), region=os.getenv('AZURE_SPEECH_REGION'))
    audio_config = AudioConfig(filename=audio_url)
    recognizer = SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    result = recognizer.recognize_once()
    return result.text
This function takes the audio URL provided by Twilio, configures the Azure Speech service, and returns the transcribed text.
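One caveat worth noting: the Speech SDK’s AudioConfig(filename=...) expects a local file path rather than a remote URL, so in practice the Twilio recording needs to be downloaded first. A minimal sketch of that step, assuming the Twilio credentials live in environment variables:

import requests

def download_recording(recording_url):
    # Twilio recording URLs are fetched with HTTP basic auth using the account credentials;
    # appending .wav requests the recording in WAV format
    auth = (os.getenv('TWILIO_ACCOUNT_SID'), os.getenv('TWILIO_AUTH_TOKEN'))
    response = requests.get(f"{recording_url}.wav", auth=auth, timeout=30)
    response.raise_for_status()
    local_path = "/tmp/recording.wav"  # assumed scratch location
    with open(local_path, "wb") as f:
        f.write(response.content)
    return local_path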
The Speech-to-Text API powering this service offers a range of features beyond basic transcription, including real-time streaming, batch transcription, and speaker diarization.
Let’s take a closer look at how we implement transcription with diarization:
import time
import azure.cognitiveservices.speech as speechsdk

def perform_transcription_with_diarization(audio_url):
    speech_config = speechsdk.SpeechConfig(subscription=os.getenv('AZURE_SPEECH_KEY'), region=os.getenv('AZURE_SPEECH_REGION'))

    # Enable continuous language identification and tighten segmentation
    speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_LanguageIdMode, "Continuous")
    speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationSilenceTimeoutMs, "100")

    audio_config = speechsdk.audio.AudioConfig(filename=audio_url)
    conversation_transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)

    done = False
    result_text = []
    speaker_count = 0
    speakers = {}

    def transcribed_cb(evt):
        nonlocal speaker_count
        if evt.result.text:
            # Assign a human-readable label the first time we see a speaker ID
            if evt.result.speaker_id not in speakers:
                speaker_count += 1
                speakers[evt.result.speaker_id] = f"Speaker {speaker_count}"
            result_text.append(f"{speakers[evt.result.speaker_id]}: {evt.result.text}")

    def stopped_cb(evt):
        nonlocal done
        done = True

    conversation_transcriber.transcribed.connect(transcribed_cb)
    conversation_transcriber.session_stopped.connect(stopped_cb)
    conversation_transcriber.canceled.connect(stopped_cb)

    conversation_transcriber.start_transcribing_async().get()
    while not done:
        time.sleep(.5)
    conversation_transcriber.stop_transcribing_async().get()

    return "\n".join(result_text)
This enhanced version of our transcription function leverages several key features of the Azure Speech-to-Text API:
- Continuous Language Identification: Setting SpeechServiceConnection_LanguageIdMode to “Continuous” lets the API detect language changes throughout the audio, which is crucial for multi-language conversations.
- Segmentation Silence Timeout: The Speech_SegmentationSilenceTimeoutMs property sets the silence duration used to determine the end of a speech segment, which helps separate utterances from different speakers accurately.
- Conversation Transcriber: Instead of a simple SpeechRecognizer, we use the ConversationTranscriber, a class designed specifically for multi-speaker scenarios that provides speaker diarization capabilities.
Understanding Diarization
Speaker diarization is the process of partitioning an audio stream into homogeneous segments according to speaker identity. In simpler terms, it’s the technology that tells us “who spoke when” in an audio recording.
Azure’s diarization system works by:
- Segmenting the Audio: The audio is first split into small segments, typically a few seconds long.
- Extracting Features: From each segment, the system extracts acoustic features characteristic of individual speakers.
- Clustering: These features are clustered to group segments that likely belong to the same speaker (see the toy sketch after this list).
- Speaker Identification: Each cluster is assigned a unique speaker ID.
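To make the clustering step concrete, here’s a toy illustration, not Azure’s actual implementation: given one acoustic feature vector per segment, a greedy pass assigns each segment to the nearest existing speaker cluster, or opens a new cluster when nothing is similar enough.

import math

def assign_speakers(segment_features, threshold=0.25):
    # segment_features: one feature vector (list of floats) per audio segment
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return 1 - dot / norm if norm else 1.0

    centroids = []  # one representative vector per detected speaker
    labels = []     # speaker index assigned to each segment
    for features in segment_features:
        distances = [cosine_distance(features, c) for c in centroids]
        if distances and min(distances) < threshold:
            labels.append(distances.index(min(distances)))
        else:
            centroids.append(features)  # treat as a newly detected speaker
            labels.append(len(centroids) - 1)
    return labels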
In our implementation, we handle diarization results in the transcribed_cb callback function. When a new speaker is detected (indicated by a new speaker_id), we assign them a human-readable label (e.g., “Speaker 1”, “Speaker 2”). This lets us present the transcription in a format that clearly delineates the different speakers:
Speaker 1: Hello, how can I help you today?
Speaker 2: I'm calling about my recent order.
Speaker 1: Certainly, I'd be happy to assist you with that.
This diarized transcript provides context that can be crucial for applications like customer service analytics, meeting transcription, or interview processing.
While Azure’s diarization capabilities are powerful, it’s important to note some challenges:
- Overlapping Speech: When multiple people speak at once, diarization can struggle to separate the speakers accurately.
- Audio Quality: Background noise or poor audio quality can hurt diarization accuracy.
- Speaker Count: The API doesn’t determine the number of speakers up front; it assigns IDs as it detects new voices.
- Speaker Identification: The API separates speakers but doesn’t identify who they are. Applications that need named speakers require additional processing.
To address these challenges, we implemented several strategies:
- We use the Speech_SegmentationSilenceTimeoutMs property to fine-tune speaker segmentation.
- The system maintains a mapping of speaker IDs to human-readable labels, providing consistency across the transcript.
- We save both the raw diarization results and our processed, labeled transcript, allowing for future reprocessing if needed (see the sketch after this list).
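For that last point, here’s a sketch of how the raw events and the labeled transcript can be produced side by side; the build_outputs helper and the (speaker_id, text) tuple shape are our own assumed structure, not an SDK type:

import json

def build_outputs(raw_events):
    # raw_events: (speaker_id, text) tuples collected in the transcribed_cb callback
    labels = {}
    labeled_lines = []
    for speaker_id, text in raw_events:
        if speaker_id not in labels:
            labels[speaker_id] = f"Speaker {len(labels) + 1}"
        labeled_lines.append(f"{labels[speaker_id]}: {text}")
    raw_json = json.dumps([{"speaker_id": s, "text": t} for s, t in raw_events], indent=2)
    return raw_json, "\n".join(labeled_lines)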
After transcription, we store the results in Azure Blob Storage for later retrieval and analysis:
def save_to_blob(transcription, metadata):
    connection_string = os.getenv('AZURE_STORAGE_CONNECTION_STRING')
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(os.getenv('AZURE_STORAGE_CONTAINER_NAME'))
    # One blob per call, keyed by the Twilio Call SID
    blob_client = container_client.get_blob_client(f"transcription_{metadata['CallSid']}.txt")
    blob_client.upload_blob(transcription)
This function creates a unique blob for each transcription, using the Twilio Call SID as the identifier.
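Retrieving a transcript later is then a simple lookup by Call SID; a usage sketch (the helper name is ours):

def load_transcription(call_sid):
    connection_string = os.getenv('AZURE_STORAGE_CONNECTION_STRING')
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(os.getenv('AZURE_STORAGE_CONTAINER_NAME'))
    blob_client = container_client.get_blob_client(f"transcription_{call_sid}.txt")
    return blob_client.download_blob().readall().decode("utf-8")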
To make the application scalable and easy to deploy, we containerized it with Docker and deployed it to Azure Kubernetes Service. Here’s a snippet from our Kubernetes deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: speech-to-text-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: speech-to-text-app
  template:
    metadata:
      labels:
        app: speech-to-text-app
    spec:
      containers:
      - name: speech-to-text-app
        image: your-acr-name.azurecr.io/speech-to-text-app:v1
        ports:
        - containerPort: 8000
        env:
        - name: AZURE_SPEECH_KEY
          valueFrom:
            secretKeyRef:
              name: azure-secrets
              key: speech-key
        # ... other environment variables
This deployment ensures that we always have three replicas of the application running, with the ability to scale up or down as needed.
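To make those pods reachable from the ingress layer, a Service sits in front of them. A sketch, with the Service name and port values assumed to match the deployment above:

apiVersion: v1
kind: Service
metadata:
  name: speech-to-text-service
spec:
  selector:
    app: speech-to-text-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000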
Building this system wasn’t without its challenges. Here are a few we encountered and how we solved them:
- Twilio Webhook Configuration: We set up an Azure Application Gateway as an ingress controller to provide a stable external IP for Twilio to connect to.
- Azure Blob Storage Permissions: We configured the Managed Identity for our AKS cluster and granted it the necessary permissions on the storage account.
- Kubernetes Secret Management: We implemented Kubernetes Secrets to securely manage sensitive information like API keys.
- Scaling Under Load: We implemented Horizontal Pod Autoscaling in Kubernetes to automatically adjust the number of pods based on CPU utilization (see the sketch after this list).
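For the autoscaling piece, here’s a minimal HorizontalPodAutoscaler sketch; the 70% CPU target and the replica bounds are assumed values, not necessarily what your workload needs:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: speech-to-text-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: speech-to-text-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70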
Building this service on Azure Kubernetes Service, Azure Cognitive Services, and Twilio gave us a robust solution that handles real-time transcription of phone calls. By leveraging cloud-native technologies and a microservices architecture, we created a system that scales easily to meet demand and provides reliable service.
The combination of containerization, Kubernetes orchestration, and managed cloud services provides a powerful framework for building complex, scalable applications. Whether you’re building a transcription service or any other kind of scalable application, these technologies offer a flexible and robust foundation.
Remember, the key to success with a system like this lies not just in the initial implementation, but in continuous monitoring, optimization, and iteration. As you build and deploy your own scalable services, keep learning, keep improving, and most importantly, keep coding!