No Dependencies, No Problem: Streaming AI Over the Phone


What if you could build a phone agent that listens, thinks, and speaks – without touching a single speech-to-text API or wrangling cloud infrastructure? What if all it took was a WebSocket and some JavaScript you already know?
In this talk, you’ll see how to wire up a minimal AI voice loop using modern tools like Bun, with no dependencies and no boilerplate in the way. It’s a quick, practical demo that puts the focus back on business logic – and shows just how little it takes to get started.

This talk was presented at JSNation 2025.

FAQ

What does Marius do at Twilio?
Marius works on the developer relations team at Twilio, where he talks to developers building on Twilio's APIs.

What problem do developers usually run into when building phone-based AI agents?
Latency: there is a noticeable delay between the caller's spoken input and the AI's response, which disrupts the conversational experience.

How does Twilio help with speech recognition and text-to-speech?
Twilio offers automatic speech recognition and text-to-speech using best-in-class providers, allowing developers to offload those responsibilities to Twilio's infrastructure.

What does the developer's WebSocket server do?
It receives the transcribed text, processes it with a language model running nearby with low latency, and streams text back as the response.

How can developers minimize latency?
By letting Twilio's infrastructure handle automatic speech recognition and text-to-speech, and hosting a WebSocket server that processes the text with low-latency models.

Which runtime hosts the web server in the example?
Bun is used to host a simple web server.

How does the assistant go from a canned reply to real answers?
Initially it was hard-coded to respond with "that's a great question," but later it used GPT-4o mini to provide dynamic, context-aware answers.

Does the web search tool add latency?
Response time increases slightly when the AI performs a web search, but this can be mitigated with latency-reducing strategies.

What example queries were used in the demo?
"What's the capital of France?" and "Who won the UEFA Nations League last weekend?"

Marius Obert
6 min
12 Jun, 2025

Video Summary and Transcription
Marius from Twilio demonstrates building AI agents for phone calls, addressing latency by offloading speech recognition and text-to-speech to Twilio's infrastructure and providers like ElevenLabs and Google Cloud. The walkthrough covers WebSocket message handling, a static placeholder response, and text-to-speech with ElevenLabs. AI integration uses the GPT-4o mini model with conversation history stored server-side. A live demo shows an AI voice assistant responding almost instantly.

1. Building AI Agents for Phone Calls

Short description:

Marius from Twilio discusses building AI agents for phone calls, addressing latency by leaving automatic speech recognition and text-to-speech to Twilio's infrastructure and providers like ElevenLabs and Google Cloud. Host a WebSocket server, process the text with LLMs, and achieve low-latency communication. He demonstrates building an agent in three minutes, using Bun for the web server and WebSocket handling.

Hi, everyone. I'm Marius. I work on the developer relations team at Twilio, and that means I speak to a lot of developers who use our APIs, such as the text messaging API or the voice API. One thing that a lot of developers recently wanted to build is an AI agent that can make or receive a phone call. Let me show you how this story often goes from a developer's perspective. You have all these great models that you want to combine: one for automatic speech recognition, one for interrupt detection, and a text-to-speech model. You combine them all together, and, in theory, it works nicely, but then they quickly run into latency.

The latency means you say something, you wait, nothing happens, you say something again, and only then does the model start to talk, and that kills the entire experience. So they need to find a way to work around this. Something we provide at Twilio is that you can shift a lot of that responsibility onto our infrastructure, such as automatic speech recognition and text-to-speech. We work with best-in-class providers, such as ElevenLabs and Google Cloud, to deliver these services, and you just need to focus on the configuration.

What you actually need to do in the end is host a WebSocket server that receives text, and then you can process it with your own LLMs. You can pass it to an LLM that runs close to your machine with low latency, and you just stream text back. And, actually, you can build an agent in three minutes. Let's do that together. I use Bun to host a simple web server, just for the fun of using a new stack every now and then. I expose it on port 5050, and then I have this fetch function whose only job is to upgrade HTTP to WebSocket; I also attach a data object so I can recognize the same stream again.
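As a rough sketch of that setup, here is what a dependency-free Bun server along those lines could look like. It uses Bun's built-in Bun.serve and WebSocket upgrade; the 426 response and the log line are my own additions, and the callback object gets filled in in the next section.

```js
// server.js -- run with `bun run server.js`; no packages required.

// The WebSocket callbacks (open, close, message) are sketched in the next section.
const websocketHandlers = {};

Bun.serve({
  port: 5050,
  fetch(req, server) {
    // The only thing this fetch function does: upgrade HTTP to WebSocket and
    // attach a data object so the same stream can be recognized again later.
    // A phone number would make sense as the id; a timestamp is enough when
    // you are the only caller.
    if (server.upgrade(req, { data: { id: Date.now() } })) {
      return; // Bun sends the upgrade response itself
    }
    return new Response("Expected a WebSocket upgrade", { status: 426 });
  },
  websocket: websocketHandlers,
});

console.log("WebSocket server listening on port 5050");
```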

2. WebSocket Configuration and AI Integration

Short description:

Configuring WebSocket callbacks for message handling, logging prompts, and returning a static response. Integrating ElevenLabs for text-to-speech. Adding AI with the GPT-4o mini model, with conversation history stored and retrieved server-side.

It would make sense to use the phone number here, but I just use a timestamp, because I'm the only one calling it anyway. Then, in the WebSocket configuration, I have a callback for when the socket is open and one for when it's closed; let's format it a bit. The interesting part happens when a message comes in. I parse the JSON payload, and when the message is of type prompt, which it mostly will be, I log it to the console. For now, let's have a hard-coded answer that says "that's a great question." I log that as well and just stream it back. I also log the other message types so you see them, but we don't have to worry about them for now.
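A sketch of those callbacks, filling in the empty websocketHandlers stub from the previous snippet. It assumes the incoming prompt messages carry the transcript in a voicePrompt field and that replies are sent back as type "text" with token and last fields; those names follow my reading of Twilio's ConversationRelay message schema, so verify them against the docs.

```js
// The handler object plugged into Bun.serve({ websocket: ... }) above.
const websocketHandlers = {
  open(ws) {
    console.log("call connected", ws.data.id);
  },
  close(ws) {
    console.log("call ended", ws.data.id);
  },
  message(ws, raw) {
    const msg = JSON.parse(String(raw));

    if (msg.type === "prompt") {
      // Twilio already did speech-to-text for us; log what the caller said.
      console.log("caller:", msg.voicePrompt); // field name is an assumption

      // For now: a hard-coded answer, streamed straight back as text.
      const answer = "That's a great question.";
      console.log("assistant:", answer);
      ws.send(JSON.stringify({ type: "text", token: answer, last: true }));
      return;
    }

    // Log the other message types so we can see them, but ignore them for now.
    console.log("ignored message of type", msg.type);
  },
};
```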

Okay, let's give it a run. I start my server and expose the port to the internet. If I go over to my configuration, you see that whenever a call comes into a phone number, it connects to my WebSocket server. This is where I say I use ElevenLabs for text-to-speech, this is the particular voice ID, and this is the opening sentence. Let's call it, and let's hope it works. Is the audio on? Let me check. "That's a great question." Hey, what's the capital of France? "That's a great question." You see, I always get the same response back. Why? Because I don't do anything here; I just return a static response. But you saw how low the latency was. If I look at the logs, you see the text-to-speech and speech-to-text happened instantly. Now, let's actually involve some AI. If I drill down into that, it's auto-imported.
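In the demo, that configuration lives on the Twilio side; one way to express roughly the same thing is to answer the phone number's incoming-call webhook with TwiML that hands the call to ConversationRelay. Treat this as a hedged sketch: the attribute names (url, ttsProvider, voice, welcomeGreeting) reflect my reading of Twilio's ConversationRelay docs, and the WebSocket URL and voice ID are placeholders.

```js
// A hypothetical incoming-call webhook that returns TwiML. ConversationRelay
// then handles speech-to-text and text-to-speech (here via ElevenLabs) and
// forwards the transcribed text to our WebSocket server.
Bun.serve({
  port: 3000,
  fetch() {
    const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <ConversationRelay
      url="wss://your-tunnel.example.com/"
      ttsProvider="ElevenLabs"
      voice="your-voice-id"
      welcomeGreeting="Hi! Ask me anything." />
  </Connect>
</Response>`;
    return new Response(twiml, { headers: { "Content-Type": "text/xml" } });
  },
});
```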

You see, what do I do here? Well, I have an if statement to check whether I already know this conversation ID. If I don't know it, I use the GPT-4o mini model, I add a system prompt and the user's prompt, I add a web search tool, and I make sure the history is stored server-side at OpenAI. When the response comes back, I save it. So when I ask another question, the if statement triggers, and I can refer to the previous conversation; I don't have to carry around that messages array all the time, I just add the most recent prompt. Let's try that again. I restart the server.
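A sketch of that AI step, using the official openai package (the one dependency this adds) and OpenAI's Responses API. The gpt-4o-mini model name, the web_search_preview tool type, and the store / previous_response_id fields match my understanding of that API but are worth double-checking; the system prompt text and the function name are made up for illustration.

```js
import OpenAI from "openai"; // add with `bun add openai`

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const previousResponseByCall = new Map(); // conversation id -> last response id

export async function generateAnswer(conversationId, prompt) {
  const previousId = previousResponseByCall.get(conversationId);

  const response = await openai.responses.create({
    model: "gpt-4o-mini",
    // Only the newest prompt is sent; the history lives server-side at OpenAI.
    input: prompt,
    instructions: "You are a friendly voice assistant on a phone call. Keep answers short.",
    tools: [{ type: "web_search_preview" }], // lets the model search the web
    store: true, // ask OpenAI to keep the conversation history
    ...(previousId ? { previous_response_id: previousId } : {}),
  });

  // Remember this response so the next turn can refer back to it.
  previousResponseByCall.set(conversationId, response.id);
  return response.output_text;
}
```

In the message callback sketched earlier, the hard-coded answer would then become something like `const answer = await generateAnswer(ws.data.id, msg.voicePrompt);`, with the callback marked async.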
