And here in the parameters we see the number you use, where you want to call, and the description for the parts we just discussed. For the LLM we select the model and specify the prompt, that is, what it should do. For text-to-speech we select a voice; I use a specific Cartesia voice ID. And for speech-to-text, the Deepgram service with the Nova-2 model. And that's it. Maybe a few more parameters, but it's as simple as that. We can compare this to Pipecat. Pipecat is a do-it-yourself library, and normally you have to set up way more here. But let's check the example.
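To make the parameters concrete, here is a minimal sketch of such an assistant configuration as a Python dict. The field names and structure are assumptions modeled on the description above, not the documented schema of the hosted service; check the provider's API reference for the real shape.

```python
# Hedged sketch of a hosted voice-agent configuration; field names
# are assumptions based on the talk, not a documented schema.
assistant_config = {
    "firstMessage": "Hi, how can I help you today?",
    "model": {
        # LLM: pick the model and give it a prompt (what it should do)
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {"role": "system", "content": "You are a helpful phone assistant."}
        ],
    },
    "voice": {
        # Text-to-speech: a specific Cartesia voice ID (placeholder here)
        "provider": "cartesia",
        "voiceId": "YOUR_CARTESIA_VOICE_ID",
    },
    "transcriber": {
        # Speech-to-text: Deepgram with the Nova-2 model
        "provider": "deepgram",
        "model": "nova-2",
    },
}
```

The point of the comparison in the talk is that this is roughly all the setup a hosted service needs, while Pipecat asks you to wire the same pieces together yourself.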
I simplified the code a bit, but I wanted to highlight the main concepts here. You need your transport, and you need to set up your models, speech-to-text, text-to-speech, and the LLM, with those services. You have to register with those services separately, compared to Vapi. Then you provide your context to the LLM, and you create a Pipecat pipeline. Pipecat works with pipelines and tasks. In the pipeline we define the order: the transport input goes into speech-to-text, then we add the user's message to the context and feed it into the LLM. Then we have a response; the response we convert back to speech, send it back to the transport, and save it back to the context as the assistant's message.
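The pipeline-and-task pattern described above can be sketched with small stand-in stages. This is a conceptual illustration only: the class names below (`SpeechToText`, `LLM`, `TextToSpeech`, `Pipeline`) are stubs I invented to mirror the flow, not Pipecat's real classes, which live in the library's own modules and also include transport input/output and context aggregators.

```python
import asyncio

# Conceptual stand-ins for pipeline stages; NOT Pipecat's real API.
class SpeechToText:
    async def process(self, frame):
        # Turn incoming audio into a text transcript
        return {"text": f"transcript of {frame['audio']}"}

class LLM:
    async def process(self, frame):
        # Generate a reply from the transcribed user message
        return {"text": f"reply to: {frame['text']}"}

class TextToSpeech:
    async def process(self, frame):
        # Convert the reply text back into audio for the transport
        return {"audio": f"speech({frame['text']})"}

class Pipeline:
    """Runs frames through an ordered list of stages, like the talk describes."""
    def __init__(self, stages):
        self.stages = stages

    async def run(self, frame):
        for stage in self.stages:
            frame = await stage.process(frame)
        return frame

async def run_demo():
    # Mirrors the order from the talk:
    # transport input -> STT -> LLM -> TTS -> transport output
    pipeline = Pipeline([SpeechToText(), LLM(), TextToSpeech()])
    return await pipeline.run({"audio": "caller-audio"})

if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

In the real library, the pipeline is then wrapped in a task object which a runner executes, which is the pipeline/task split the talk refers to next.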
Once we have created our pipeline, we can create a task based on it and run that task. That's it, again: that's what you need to place a call to a specific number with your voice agent. It's quite simple. I prepared a whole repository of examples for you, where you can check the code for different tools and different models and play around yourself. The link is also provided. One more thing I wanted to share: some findings from when we started building a voice agent for a real production feature. First of all, as I already mentioned, everything was perfect when we started testing it with English. But our audience is Dutch, and here we faced some challenges. That's something you need to consider: which languages does your system need to support?
And don't assume that if it works for English, it will work fine for all the rest. You will have to adjust your speech-to-text and your text-to-speech, and sometimes a model simply isn't optimized for your specific language yet. So it may be a good idea to start with a limited set of languages, make sure your idea works, and then extend to more languages, searching for models suited to each one. The same goes for infrastructure. For infrastructure, the first thing I noticed, even though we used Vapi, a ready-made service, is that it doesn't have the concept of environments.