Building Voice AI Applications: Infrastructure and Telephony Considerations

By Zach Philips-Gary

Voice AI applications are transforming how we interact with technology. However, building these systems requires a fundamental shift in how we approach infrastructure and telephony integration. This article explores the unique challenges of voice AI infrastructure and provides practical insights for developers and architects looking to deploy these systems at scale.


Voice AI Infrastructure - A New Paradigm

Traditional Web Apps vs. Real-time Voice AI

  Traditional App               Real-time Voice AI
  ---------------------------   --------------------------------
  Short-lived requests          Persistent connections (minutes)
  Request/response pattern      Streaming bidirectional data
  Stateless                     Stateful sessions
  High volume, low duration     Lower volume, longer duration
  Predictable usage             Variable resource consumption

Unlike traditional applications that rely on short, stateless interactions, voice AI demands persistent connections that can last for minutes rather than milliseconds. This change impacts scaling, resource management, and fault tolerance. Deploying voice AI on AWS Lambda or similar services designed for short tasks won’t work.


The Voice-to-Voice Latency Challenge

In human conversation, the gap between turns is typically around 500 milliseconds. To keep an interaction feeling natural, voice AI systems target a voice-to-voice response time of roughly 800 milliseconds. Here is how the total latency typically breaks down:

  • Audio capture and encoding: ~60ms

  • Network transit: ~20ms

  • Audio processing: ~80ms

  • Transcription and endpointing: ~300ms

  • LLM response generation: ~350ms

  • Text-to-speech: ~120ms

  • Audio output: ~60ms

Note that these stages sum to roughly 990 ms, already over the 800 ms target, so the user experience hinges on minimizing latency at every step. Optimizing network routing, choosing server locations close to users, and maintaining adequate processing capacity are all vital to reaching the target.
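The arithmetic behind this budget is worth checking directly; the stage names and values below come from the list above:

```python
# Voice-to-voice latency budget, in milliseconds (values from the list above).
LATENCY_BUDGET_MS = {
    "audio capture and encoding": 60,
    "network transit": 20,
    "audio processing": 80,
    "transcription and endpointing": 300,
    "LLM response generation": 350,
    "text-to-speech": 120,
    "audio output": 60,
}

TARGET_MS = 800

total = sum(LATENCY_BUDGET_MS.values())
print(f"total: {total} ms, over target by {total - TARGET_MS} ms")
```

Because the nominal budget is already overspent, shaving even tens of milliseconds at any single stage is worthwhile.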


Voice AI Architecture: The Bot Runner Pattern

The most common architecture for production voice AI applications is the “Bot Runner” pattern:

  1. The Bot Runner manages sessions and spawns individual bots.

  2. Each bot operates as a dedicated process, handling a single conversation.

  3. These long-running processes need consistent compute resources.

Typical Workflow:

  • The user initiates a session via an app.

  • The Bot Runner handles the request.

  • The Bot Runner spawns a bot/agent.

  • The bot joins the session, initializes, and signals readiness.

  • The system returns a “ready” status to the client.

This structured approach helps maintain efficient session management and reduces latency during bot initiation.
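The workflow above can be sketched with asyncio; the class and method names here are illustrative, not a real framework API:

```python
import asyncio

# Minimal sketch of the Bot Runner pattern. One long-running task per
# conversation; the runner returns "ready" only after the bot signals it.

class BotRunner:
    def __init__(self):
        self.sessions = {}

    async def handle_request(self, session_id: str) -> str:
        """Spawn a dedicated bot task for one conversation."""
        ready = asyncio.Event()
        task = asyncio.create_task(self._run_bot(session_id, ready))
        self.sessions[session_id] = task
        await ready.wait()       # block until the bot has joined and initialized
        return "ready"           # status returned to the client

    async def _run_bot(self, session_id: str, ready: asyncio.Event):
        # Placeholder for joining the session and initializing transports,
        # model clients, etc.
        await asyncio.sleep(0)
        ready.set()              # signal readiness to the runner
        # ...the long-running conversation loop would go here...

async def main():
    runner = BotRunner()
    print(await runner.handle_request("session-1"))  # ready

asyncio.run(main())
```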

Choosing the Right Network Transport

WebSockets vs. WebRTC

Voice AI applications rely on real-time data transmission. Two primary protocols serve this purpose:

WebSockets:

  • TCP-based (higher latency due to retransmission)

  • Prone to head-of-line blocking

  • Suitable for server-to-server applications

  • Widely supported by cloud providers

WebRTC:

  • UDP-based (handles packet loss gracefully)

  • Designed for real-time media streaming

  • Ideal for client-server applications

  • Features built-in media controls and echo cancellation

Most production systems use WebRTC for client-server connections, while HTTP APIs manage interactions with AI models. Choosing the right protocol for each part of the system ensures optimal performance and reliability.
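As a rule of thumb, the guidance above maps each leg of the system to a transport; this helper is purely illustrative:

```python
# Illustrative transport choice per connection leg, following the
# guidance above. Not an exhaustive decision procedure.

def pick_transport(leg: str) -> str:
    transports = {
        "client-server": "WebRTC",     # lossy last mile: UDP degrades gracefully
        "server-server": "WebSocket",  # stable data-center links: TCP is fine
        "server-model": "HTTP",        # AI model APIs are typically HTTP
    }
    if leg not in transports:
        raise ValueError(f"unknown leg: {leg}")
    return transports[leg]

print(pick_transport("client-server"))  # WebRTC
```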

Managing Network Latency

Geographical distances can introduce latency. Implementing edge routing helps minimize this by connecting users to the closest server.

  • The first hop to a nearby edge server typically adds only ~15ms

  • Private backbone routing is more efficient than the public internet, saving roughly 15-20ms on transcontinental routes and 25+ms on transatlantic ones

  • Lower jitter allows smaller playout buffers, which further reduces response time

Optimizing latency through edge networking is crucial for delivering a smooth voice AI experience.
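The geographic component can be estimated from fiber propagation speed (light travels through fiber at roughly 200 km per millisecond, about two-thirds of its speed in a vacuum); this is a back-of-the-envelope model, not a measurement:

```python
FIBER_KM_PER_MS = 200.0  # light in optical fiber covers ~200 km per millisecond

def propagation_latency_ms(route_km: float) -> float:
    """One-way propagation delay over an idealized fiber path."""
    return route_km / FIBER_KM_PER_MS

# A ~4,000 km transcontinental path costs ~20 ms one way before any
# queuing, routing, or processing delay is added.
print(propagation_latency_ms(4000))  # 20.0
```

Real routes are longer than great-circle distance and add router hops, which is why shorter first hops and efficient backbones matter so much.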


Resource Allocation and Scaling

Unique Resource Needs

Voice AI applications differ from typical web apps in terms of resource demands:

  • Network: Low throughput (~40 kbps for WebRTC)

  • CPU: Approximately 0.5 vCPU per bot instance

  • Memory: Around 1GB per bot instance (1:2 CPU-to-memory ratio)

Since each conversation requires consistent resources, avoiding CPU spikes is essential to maintain audio quality.
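These per-bot figures make capacity planning straightforward; the per-bot numbers below come from the list above, while the host size is an assumed example:

```python
import math

def bots_per_host(host_vcpu: float, host_gb: float,
                  bot_vcpu: float = 0.5, bot_gb: float = 1.0) -> int:
    """How many bots fit on one host: whichever resource runs out first wins."""
    return min(math.floor(host_vcpu / bot_vcpu), math.floor(host_gb / bot_gb))

# An 8 vCPU / 16 GB host fits 16 bots at 0.5 vCPU and 1 GB each.
print(bots_per_host(8, 16))  # 16
```

In practice you would leave some host capacity unallocated so that a busy bot's CPU spike does not starve its neighbors and degrade audio.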


Deployment Phases

Phase 1: Simple Deployment

  • Start with containerized solutions (like Docker)

  • Allocate fixed capacity, avoiding premature optimization

  • Use around 0.5 vCPU and 1GB memory per agent

Phase 2: Basic Optimization

  • Introduce warm pools to reduce cold start delays

  • Maintain higher idle capacity (~50%) compared to typical web apps

  • Set long session drain times to avoid disconnections during updates
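A warm pool can be sketched as a queue of pre-started bots that is replenished as bots are handed out; `start_bot` here is a hypothetical placeholder for whatever actually launches a bot process:

```python
import itertools
import queue

_ids = itertools.count(1)

def start_bot() -> str:
    """Placeholder: in reality, launching a bot process takes seconds."""
    return f"bot-{next(_ids)}"

class WarmPool:
    def __init__(self, size: int):
        self.pool = queue.SimpleQueue()
        for _ in range(size):
            self.pool.put(start_bot())    # pre-warm ahead of demand

    def acquire(self) -> str:
        """Hand out a warm bot, then start a replacement."""
        try:
            bot = self.pool.get_nowait()  # caller avoids the cold start
        except queue.Empty:
            bot = start_bot()             # pool exhausted: eat the cold start
        self.pool.put(start_bot())        # replenish (asynchronously, in practice)
        return bot

pool = WarmPool(size=2)
print(pool.acquire())  # bot-1 (served instantly from the warm pool)
```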

Phase 3: Advanced Scaling

  • Implement auto-scaling based on CPU load and session counts

  • Monitor “time to first word” as a key performance indicator

  • Utilize regional distribution to lower latency for global users
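One way to turn session counts into a scaling decision is a simple policy with idle headroom; the per-instance capacity and the 50% headroom figure (echoing Phase 2) are illustrative assumptions:

```python
import math

def desired_instances(active_sessions: int,
                      sessions_per_instance: int = 16,
                      headroom: float = 0.5) -> int:
    """Scale on session count, keeping ~50% idle headroom for new conversations."""
    needed = active_sessions / sessions_per_instance
    return max(1, math.ceil(needed * (1 + headroom)))

print(desired_instances(100))  # 10
```

A production policy would combine this with CPU-load signals and regional "time to first word" measurements rather than session counts alone.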


Handling Cold Starts

Cold start latency can disrupt the user experience, especially in voice AI applications where responsiveness is key.

  • Maintain warm instances to handle new sessions without delay

  • Use predictive scaling based on traffic patterns to minimize cold start occurrences

  • Keep cold start times below 5-10 seconds whenever possible


Integrating Telephony for Voice AI

Telephony Options

PSTN (Public Switched Telephone Network):

  • Suitable for real phone number connections

  • Used when voice agents need to make or receive standard calls

SIP (Session Initiation Protocol):

  • Facilitates IP telephony, ideal for call centers

  • Enables advanced call control and integration


Connection Methods

  • WebSockets: Simple audio streaming with minimal call control

  • WebRTC: High-quality audio, real-time transmission, suitable for dynamic interactions

  • SIP: Robust for enterprise call management and control


Advanced Features

  • Call Transfers: Cold (simple redirection) or warm (handover with interaction)

  • DTMF Handling: Processes keypress inputs, useful for navigation and authentication
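DTMF handling often reduces to dispatching on the received digit; the menu layout and action names here are hypothetical:

```python
# Sketch of DTMF keypress dispatch for an IVR-style menu. Action names are
# stand-ins for real call-control operations.

def on_dtmf(digit: str) -> str:
    menu = {
        "1": "route_to_sales",
        "2": "route_to_support",
        "0": "transfer_to_operator",   # cold or warm transfer, as described above
    }
    return menu.get(digit, "replay_menu")  # unknown keypress: replay the prompt

print(on_dtmf("2"))  # route_to_support
print(on_dtmf("9"))  # replay_menu
```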


Final Thoughts

Building voice AI applications requires a thoughtful approach to infrastructure, latency management, and telephony integration. By choosing the right architectural patterns and optimizing for real-time performance, developers can deliver responsive and reliable voice AI experiences that meet user expectations. As the technology advances, understanding these foundational concepts will be essential for success.
