By Zach Philips-Gary
Voice AI applications are transforming how we interact with technology. However, building these systems requires a fundamental shift in how we approach infrastructure and telephony integration. This article explores the unique challenges of voice AI infrastructure and provides practical insights for developers and architects looking to deploy these systems at scale.
Voice AI Infrastructure - A New Paradigm
Traditional Web Apps vs. Real-time Voice AI
| Traditional App | Real-time Voice AI |
| --- | --- |
| Short-lived requests | Persistent connections (minutes) |
| Request/response pattern | Streaming bidirectional data |
| Stateless | Stateful sessions |
| High volume, low duration | Lower volume, longer duration |
| Predictable usage | Variable resource consumption |
Unlike traditional applications that rely on short, stateless interactions, voice AI demands persistent connections that can last for minutes rather than milliseconds. This change impacts scaling, resource management, and fault tolerance. Deploying voice AI on AWS Lambda or similar services designed for short tasks won’t work.
The Voice-to-Voice Latency Challenge
In human conversation, quick responses are crucial: the typical gap between speaking turns is around 500 milliseconds. To maintain a natural feel, voice AI aims for a voice-to-voice response time of around 800 milliseconds. Here's how the total latency breaks down:
Audio capture and encoding: ~60ms
Network transit: ~20ms
Audio processing: ~80ms
Transcription and endpointing: ~300ms
LLM response generation: ~350ms
Text-to-speech: ~120ms
Audio output: ~60ms
The user experience hinges on minimizing latency at each step. Optimizing network routing, choosing efficient server locations, and maintaining processing capacity are vital to achieving the target latency.
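It's worth adding these stages up. Here is a minimal sketch using the representative figures above (the numbers are illustrative, not measurements from any particular stack):

```python
# Sum the representative stage timings to see where the pipeline stands
# relative to the ~800 ms voice-to-voice target.
STAGES_MS = {
    "audio capture and encoding": 60,
    "network transit": 20,
    "audio processing": 80,
    "transcription and endpointing": 300,
    "LLM response generation": 350,
    "text-to-speech": 120,
    "audio output": 60,
}

total = sum(STAGES_MS.values())
print(f"total voice-to-voice latency: {total} ms")  # 990 ms
```

Transcription and LLM inference dominate this budget, so they are usually the first stages to optimize when the total overshoots the target.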
Voice AI Architecture: The Bot Runner Pattern
The most common architecture for production voice AI applications is the “Bot Runner” pattern:
The Bot Runner manages sessions and spawns individual bots.
Each bot operates as a dedicated process, handling a single conversation.
These long-running processes need consistent compute resources.
Typical Workflow:
The user initiates a session via an app.
The Bot Runner handles the request.
The Bot Runner spawns a bot/agent.
The bot joins the session, initializes, and signals readiness.
The system returns a “ready” status to the client.
This structured approach helps maintain efficient session management and reduces latency during bot initiation.
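A minimal Bot Runner sketch in Python makes the workflow concrete. Everything here is an assumption for illustration: FastAPI as the HTTP layer, a placeholder `bot.py` entry point, and a bot that prints `READY` on stdout once initialized.

```python
import asyncio
import uuid

from fastapi import FastAPI, HTTPException

app = FastAPI()
sessions: dict[str, asyncio.subprocess.Process] = {}

@app.post("/sessions")
async def start_session() -> dict:
    """Spawn one dedicated bot process per conversation and wait for readiness."""
    session_id = str(uuid.uuid4())
    # Each conversation gets its own long-running process; "bot.py" is a
    # placeholder for your agent's entry point.
    proc = await asyncio.create_subprocess_exec(
        "python", "bot.py", "--session-id", session_id,
        stdout=asyncio.subprocess.PIPE,
    )
    sessions[session_id] = proc
    try:
        # Hold the response until the bot signals readiness, so the client
        # never connects to a half-initialized agent.
        line = await asyncio.wait_for(proc.stdout.readline(), timeout=10)
    except asyncio.TimeoutError:
        proc.kill()
        raise HTTPException(status_code=504, detail="bot failed to start")
    if line.strip() != b"READY":
        raise HTTPException(status_code=502, detail="unexpected bot output")
    return {"session_id": session_id, "status": "ready"}
```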
Choosing the Right Network Transport
WebSockets vs. WebRTC
Voice AI applications rely on real-time data transmission. Two primary protocols serve this purpose:
WebSockets:
TCP-based (higher latency due to retransmission)
Prone to head-of-line blocking
Suitable for server-to-server applications
Widely supported by cloud providers
WebRTC:
UDP-based (handles packet loss gracefully)
Designed for real-time media streaming
Ideal for client-server applications
Features built-in media controls and echo cancellation
Most production systems use WebRTC for client-server connections, while HTTP APIs manage interactions with AI models. Choosing the right protocol for each part of the system ensures optimal performance and reliability.
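For the server-to-server case, a WebSocket relay can be as simple as the sketch below, which uses the Python websockets library; the frame format and downstream handler are assumptions for illustration.

```python
import asyncio
import websockets

async def relay(websocket):
    # Treat each binary message as one encoded audio frame. Over TCP, a lost
    # packet stalls every frame behind it (head-of-line blocking), which is
    # why WebRTC over UDP is preferred on the client leg.
    async for frame in websocket:
        await process_frame(frame)

async def process_frame(frame) -> None:
    # Hypothetical downstream step (e.g., hand off to transcription).
    print(f"received {len(frame)}-byte frame")

async def main():
    async with websockets.serve(relay, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```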
Managing Network Latency
Geographical distances can introduce latency. Implementing edge routing helps minimize this by connecting users to the closest server.
The first hop from the user to the nearest edge server typically adds ~15ms
Routing over a private backbone is more efficient than the public internet, saving around 15-20ms on transcontinental routes and 25+ms on transatlantic routes
Reduced jitter means smaller buffers and faster response times
Optimizing latency through edge networking is crucial for delivering a smooth voice AI experience.
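One simple way to apply this is client-side region selection: probe a few regional endpoints and connect to whichever responds fastest. A minimal sketch, with hypothetical health-check URLs:

```python
import time
import urllib.request

REGIONS = {  # hypothetical regional health-check endpoints
    "us-east": "https://us-east.example.com/health",
    "eu-west": "https://eu-west.example.com/health",
    "ap-south": "https://ap-south.example.com/health",
}

def fastest_region() -> str:
    """Return the region with the lowest measured round-trip time."""
    timings = {}
    for name, url in REGIONS.items():
        start = time.perf_counter()
        try:
            urllib.request.urlopen(url, timeout=2).close()
            timings[name] = time.perf_counter() - start
        except OSError:
            continue  # skip unreachable regions
    if not timings:
        raise RuntimeError("no region reachable")
    return min(timings, key=timings.get)
```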
Resource Allocation and Scaling
Unique Resource Needs
Voice AI applications differ from typical web apps in terms of resource demands:
Network: Low throughput (~40 kbps for WebRTC)
CPU: Approximately 0.5 vCPU per bot instance
Memory: Around 1GB per bot instance (1:2 CPU-to-memory ratio)
Since each conversation requires consistent resources, avoiding CPU spikes is essential to maintain audio quality.
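These per-bot figures make capacity planning straightforward arithmetic. A back-of-the-envelope sketch (the 20% headroom is an assumption):

```python
def capacity_for(concurrent_sessions: int, headroom: float = 0.2) -> tuple[float, float]:
    """Estimate cluster size from ~0.5 vCPU and ~1 GB memory per bot."""
    vcpus = concurrent_sessions * 0.5 * (1 + headroom)
    memory_gb = concurrent_sessions * 1.0 * (1 + headroom)
    return vcpus, memory_gb

print(capacity_for(100))  # (60.0, 120.0): 100 sessions need ~60 vCPUs, ~120 GB
```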
Deployment Phases
Phase 1: Simple Deployment
Start with containerized solutions (like Docker)
Allocate fixed capacity, avoiding premature optimization
Use around 0.5 vCPU and 1GB memory per agent
Phase 2: Basic Optimization
Introduce warm pools to reduce cold start delays (sketched after this list)
Maintain higher idle capacity (~50%) compared to typical web apps
Set long session drain times to avoid disconnections during updates
Phase 3: Advanced Scaling
Implement auto-scaling based on CPU load and session counts
Monitor “time to first word” as a key performance indicator
Utilize regional distribution to lower latency for global users
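As a concrete example of the Phase 2 warm pools mentioned above, a scaler can size the fleet as active sessions plus an idle buffer. A minimal sketch, with illustrative thresholds:

```python
import math

def desired_instances(active_sessions: int,
                      idle_fraction: float = 0.5,
                      min_idle: int = 2) -> int:
    # Keep ~50% idle capacity on top of the bots already in conversations,
    # so new sessions land on a warm, pre-initialized instance.
    idle_target = max(min_idle, math.ceil(active_sessions * idle_fraction))
    return active_sessions + idle_target
```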
Handling Cold Starts
Cold start latency can disrupt the user experience, especially in voice AI applications where responsiveness is key.
Maintain warm instances to handle new sessions without delay
Use predictive scaling based on traffic patterns to minimize cold start occurrences (see the sketch after this list)
Keep cold start times below 5-10 seconds whenever possible
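Predictive scaling can be as simple as enlarging the warm pool ahead of a known daily peak. A minimal sketch; the window and pool sizes are placeholders, not data:

```python
from datetime import datetime, timezone

PEAK_HOURS_UTC = range(14, 22)  # hypothetical daily busy window

def warm_pool_size(now: datetime | None = None) -> int:
    # Pre-warm extra instances before the peak so new sessions skip the
    # cold start entirely.
    now = now or datetime.now(timezone.utc)
    return 20 if now.hour in PEAK_HOURS_UTC else 5
```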
Integrating Telephony for Voice AI
Telephony Options
PSTN (Public Switched Telephone Network):
Suitable for real phone number connections
Used when voice agents need to make or receive standard calls
SIP (Session Initiation Protocol):
Facilitates IP telephony, ideal for call centers
Enables advanced call control and integration
Connection Methods
WebSockets: Simple audio streaming with minimal call control
WebRTC: High-quality audio, real-time transmission, suitable for dynamic interactions
SIP: Robust for enterprise call management and control
Advanced Features
Call Transfers: Cold (the caller is simply redirected to the new destination) or warm (the bot briefs the receiving party before handing the caller over)
DTMF Handling: Processes keypress inputs, useful for navigation and authentication
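DTMF handling often reduces to mapping keypad digits onto agent actions. A minimal sketch; the menu entries are illustrative:

```python
# Map keypad digits to agent actions (names are hypothetical).
MENU = {
    "1": "check_balance",
    "2": "transfer_to_human",  # e.g., trigger a warm transfer
    "0": "repeat_menu",
}

def handle_dtmf(digit: str) -> str:
    return MENU.get(digit, "unrecognized_input")

print(handle_dtmf("2"))  # -> "transfer_to_human"
```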
Final Thoughts
Building voice AI applications requires a thoughtful approach to infrastructure, latency management, and telephony integration. By choosing the right architectural patterns and optimizing for real-time performance, developers can deliver responsive and reliable voice AI experiences that meet user expectations. As the technology advances, understanding these foundational concepts will be essential for success.