Let’s start with a demo of the real-time VoiceBot.
Speech-to-speech models take audio as input and produce audio as output, exchanged as streamed audio frames. These models can send and receive bidirectional streams.
Naturally, building a voice bot with such models gives a more human-like experience and faster response times. Ever since speech-to-speech models arrived on the market, everyone has been looking to build voice bots with them. Here, we will talk about AWS Nova Sonic, a speech-to-speech model that gives a very nice real-time experience for voice bots.
So, how can we actually build voice bots with the AWS Nova Sonic model? For this we will take the following approach.
Steps to make Speech to Speech VoiceBot:
– Take any telephony server that can stream the voice call to your application
– Make an application that can handle the voice stream from the telephony server
– Send the voice stream received from telephony to the Nova Sonic model
– Return the audio output received from Nova Sonic back to telephony so the response is played on the call
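The steps above boil down to a bidirectional relay loop. A minimal sketch in Python follows; the `telephony` and `model` objects here are hypothetical placeholders for your actual telephony WebSocket connection and the Nova Sonic bidirectional stream — only the relay pattern itself is the point.

```python
import asyncio

async def relay(telephony, model):
    """Pump audio both ways: caller -> model and model -> caller."""

    async def uplink():
        # Forward caller audio frames from telephony to the model stream.
        async for frame in telephony.frames():
            await model.send_audio(frame)

    async def downlink():
        # Forward synthesized audio from the model back to the call.
        async for chunk in model.audio_output():
            await telephony.play(chunk)

    # Run both directions concurrently until either side closes.
    await asyncio.gather(uplink(), downlink())
```

Because both directions run concurrently, the bot can start speaking while the caller's audio is still being streamed up, which is what makes the experience feel real-time.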
This is the overall idea I followed to build this real-time VoiceBot.
VoiceBot Capabilities:
– Can handle interruption (user can interrupt the bot any time)
– Transcription
– Transcription Analysis (intent, sentiment, summary etc.)
– Call handoff with these intelligent details, so the human agent has better context when receiving the customer’s call
– Bot flow creation and defining the overall voice personality of your assistant
– Calling external logic (APIs) during the conversation with the bot
– Adding filler sounds when needed, for more latency-optimized and human-like responses
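Of these capabilities, interruption handling (barge-in) is the one that most shapes the streaming code: the moment new caller speech is detected, any bot audio still queued for playback must be flushed so the bot stops talking. A minimal sketch, assuming the model surfaces a speech-start event (the event type names here are illustrative, not the actual Nova Sonic event schema):

```python
import queue

def handle_event(event, playback_queue):
    """Apply barge-in: drop pending bot audio when the caller starts speaking."""
    if event["type"] == "user_speech_started":
        # Flush queued audio so the bot goes silent immediately.
        while not playback_queue.empty():
            playback_queue.get_nowait()
    elif event["type"] == "audio_output":
        # Normal case: buffer model audio for playback to the caller.
        playback_queue.put(event["audio"])
```

The same event hook is also a natural place to trigger filler sounds while a slow tool call is running.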
Architecture of the Speech-to-Speech Voice Bot

In the above diagram I have shown all the important components of the speech-to-speech voicebot. A description of each component is given below:
Telephony Server: Can be any switch, such as Asterisk or FreeSWITCH, that can stream the call to a WebSocket application.
Stream Manager: This is a WebSocket application that handles the voice stream sent by telephony and forwards it to the Nova Sonic model; it also returns the audio output received from Nova Sonic back to telephony.
AWS Nova Sonic: This is the AWS speech-to-speech model we are using.
Prompt: Here we write the set of instructions defining how our assistant should behave and which objective it should achieve (like ticket booking).
Tool: This is an external function, defined in the prompt, that the LLM calls at the appropriate time. Here you can invoke external APIs and logic.
External Logic: The Call Handoff and External APIs blocks are examples of tool calling.
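To make the Tool and External Logic components concrete, here is a sketch of a tool declaration and a dispatcher that routes the model's tool calls to local logic. The `book_ticket` tool is a made-up example, and the field names follow the Bedrock-style `toolSpec` convention — verify the exact schema against the Nova Sonic documentation for your SDK version.

```python
import json

# Hypothetical tool specification in Bedrock-style toolSpec form.
BOOK_TICKET_TOOL = {
    "toolSpec": {
        "name": "book_ticket",
        "description": "Book a travel ticket for the caller.",
        "inputSchema": {
            "json": json.dumps({
                "type": "object",
                "properties": {
                    "destination": {"type": "string"},
                    "date": {"type": "string"},
                },
                "required": ["destination", "date"],
            })
        },
    }
}

def dispatch_tool(name, arguments):
    """Route a tool call from the model to local logic.

    `book_ticket` here is a stand-in; real handlers would call your
    external APIs (booking systems, CRM, call-handoff logic, etc.).
    """
    handlers = {
        "book_ticket": lambda args: {"status": "booked", **args},
    }
    if name not in handlers:
        return {"error": f"unknown tool {name}"}
    return handlers[name](arguments)
```

The result returned by `dispatch_tool` is what you would stream back to the model so it can speak the outcome to the caller.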
Here I would like to share some comparisons between our old way of developing voicebots and the new speech-to-speech approach. Initially we used the TTS/STT/LLM approach to build voicebots; nowadays we also use the newer speech-to-speech LLM approach.
| TTS/STT/LLM Approach | Speech-to-Speech Approach |
|---|---|
| Higher response time | Lower response time |
| Can run custom LLM queries before returning an answer | The model generates the speech response directly |
| Good for intent-based routing, with more control over the call flow | Decides automatically when and where to route the call flow |
| Moderate real-time experience | Excellent real-time, human-like experience |
| Call flow may feel robotic | Natural, human-like call flow with emotion |
| Complex to build call flows | Very easy to build call flows |
| More coding required | Less coding required |