Building Voice AI Applications: Infrastructure and Telephony Considerations

By Zach Philips-Gary

Voice AI applications are transforming how we interact with technology. However, building these systems requires a fundamental shift in how we approach infrastructure and telephony integration. This article explores the unique challenges of voice AI infrastructure and provides practical insights for developers and architects looking to deploy these systems at scale.


Voice AI Infrastructure - A New Paradigm

Traditional Web Apps vs. Real-time Voice AI

  Traditional App               Real-time Voice AI
  ---------------------------   --------------------------------
  Short-lived requests          Persistent connections (minutes)
  Request/response pattern      Streaming bidirectional data
  Stateless                     Stateful sessions
  High volume, low duration     Lower volume, longer duration
  Predictable usage             Variable resource consumption

Unlike traditional applications that rely on short, stateless interactions, voice AI demands persistent connections that can last for minutes rather than milliseconds. This change impacts scaling, resource management, and fault tolerance. Deploying voice AI on AWS Lambda or similar services designed for short tasks won’t work.


The Voice-to-Voice Latency Challenge

In human conversation, the gap between turns is typically around 500 milliseconds. To keep an interaction feeling natural, voice AI systems target a voice-to-voice response time of roughly 800 milliseconds. Here is how the total latency typically breaks down:

  • Audio capture and encoding: ~60ms

  • Network transit: ~20ms

  • Audio processing: ~80ms

  • Transcription and endpointing: ~300ms

  • LLM response generation: ~350ms

  • Text-to-speech: ~120ms

  • Audio output: ~60ms

Note that these stages sum to roughly 990 ms, already over the 800 ms target, so the user experience hinges on minimizing latency at every step. Optimizing network routing, choosing server locations close to users, and maintaining adequate processing capacity are all vital to reaching the target.
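The arithmetic behind this budget is worth checking directly; the stage names and values below come from the list above:

```python
# Voice-to-voice latency budget, in milliseconds (values from the list above).
LATENCY_BUDGET_MS = {
    "audio capture and encoding": 60,
    "network transit": 20,
    "audio processing": 80,
    "transcription and endpointing": 300,
    "LLM response generation": 350,
    "text-to-speech": 120,
    "audio output": 60,
}

TARGET_MS = 800

total = sum(LATENCY_BUDGET_MS.values())
print(f"total: {total} ms, over target by {total - TARGET_MS} ms")
```

Because the nominal budget is already overspent, shaving even tens of milliseconds at any single stage is worthwhile.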


Voice AI Architecture: The Bot Runner Pattern

The most common architecture for production voice AI applications is the “Bot Runner” pattern:

  1. The Bot Runner manages sessions and spawns individual bots.

  2. Each bot operates as a dedicated process, handling a single conversation.

  3. These long-running processes need consistent compute resources.

Typical Workflow:

  • The user initiates a session via an app.

  • The Bot Runner handles the request.

  • The Bot Runner spawns a bot/agent.

  • The bot joins the session, initializes, and signals readiness.

  • The system returns a “ready” status to the client.

This structured approach helps maintain efficient session management and reduces latency during bot initiation.
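The workflow above can be sketched with asyncio; the class and method names here are illustrative, not a real framework API:

```python
import asyncio

# Minimal sketch of the Bot Runner pattern. One long-running task per
# conversation; the runner returns "ready" only after the bot signals it.

class BotRunner:
    def __init__(self):
        self.sessions = {}

    async def handle_request(self, session_id: str) -> str:
        """Spawn a dedicated bot task for one conversation."""
        ready = asyncio.Event()
        task = asyncio.create_task(self._run_bot(session_id, ready))
        self.sessions[session_id] = task
        await ready.wait()       # block until the bot has joined and initialized
        return "ready"           # status returned to the client

    async def _run_bot(self, session_id: str, ready: asyncio.Event):
        # Placeholder for joining the session and initializing transports,
        # model clients, etc.
        await asyncio.sleep(0)
        ready.set()              # signal readiness to the runner
        # ...the long-running conversation loop would go here...

async def main():
    runner = BotRunner()
    print(await runner.handle_request("session-1"))  # ready

asyncio.run(main())
```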

Choosing the Right Network Transport

WebSockets vs. WebRTC

Voice AI applications rely on real-time data transmission. Two primary protocols serve this purpose:

WebSockets:

  • TCP-based (higher latency due to retransmission)

  • Prone to head-of-line blocking

  • Suitable for server-to-server applications

  • Widely supported by cloud providers

WebRTC:

  • UDP-based (handles packet loss gracefully)

  • Designed for real-time media streaming

  • Ideal for client-server applications

  • Features built-in media controls and echo cancellation

Most production systems use WebRTC for client-server connections, while HTTP APIs manage interactions with AI models. Choosing the right protocol for each part of the system ensures optimal performance and reliability.
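As a rule of thumb, the guidance above maps each leg of the system to a transport; this helper is purely illustrative:

```python
# Illustrative transport choice per connection leg, following the
# guidance above. Not an exhaustive decision procedure.

def pick_transport(leg: str) -> str:
    transports = {
        "client-server": "WebRTC",     # lossy last mile: UDP degrades gracefully
        "server-server": "WebSocket",  # stable data-center links: TCP is fine
        "server-model": "HTTP",        # AI model APIs are typically HTTP
    }
    if leg not in transports:
        raise ValueError(f"unknown leg: {leg}")
    return transports[leg]

print(pick_transport("client-server"))  # WebRTC
```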

Managing Network Latency

Geographical distances can introduce latency. Implementing edge routing helps minimize this by connecting users to the closest server.

  • The first hop to a nearby edge server typically adds only ~15ms

  • Private backbone routing is more efficient than the public internet, saving roughly 15-20ms on transcontinental routes and 25+ms on transatlantic ones

  • Lower jitter allows smaller playout buffers, which further reduces response time

Optimizing latency through edge networking is crucial for delivering a smooth voice AI experience.
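The geographic component can be estimated from fiber propagation speed (light travels through fiber at roughly 200 km per millisecond, about two-thirds of its speed in a vacuum); this is a back-of-the-envelope model, not a measurement:

```python
FIBER_KM_PER_MS = 200.0  # light in optical fiber covers ~200 km per millisecond

def propagation_latency_ms(route_km: float) -> float:
    """One-way propagation delay over an idealized fiber path."""
    return route_km / FIBER_KM_PER_MS

# A ~4,000 km transcontinental path costs ~20 ms one way before any
# queuing, routing, or processing delay is added.
print(propagation_latency_ms(4000))  # 20.0
```

Real routes are longer than great-circle distance and add router hops, which is why shorter first hops and efficient backbones matter so much.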


Resource Allocation and Scaling

Unique Resource Needs

Voice AI applications differ from typical web apps in terms of resource demands:

  • Network: Low throughput (~40 kbps for WebRTC)

  • CPU: Approximately 0.5 vCPU per bot instance

  • Memory: Around 1GB per bot instance (1:2 CPU-to-memory ratio)

Since each conversation requires consistent resources, avoiding CPU spikes is essential to maintain audio quality.
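These per-bot figures make capacity planning straightforward; the per-bot numbers below come from the list above, while the host size is an assumed example:

```python
import math

def bots_per_host(host_vcpu: float, host_gb: float,
                  bot_vcpu: float = 0.5, bot_gb: float = 1.0) -> int:
    """How many bots fit on one host: whichever resource runs out first wins."""
    return min(math.floor(host_vcpu / bot_vcpu), math.floor(host_gb / bot_gb))

# An 8 vCPU / 16 GB host fits 16 bots at 0.5 vCPU and 1 GB each.
print(bots_per_host(8, 16))  # 16
```

In practice you would leave some host capacity unallocated so that a busy bot's CPU spike does not starve its neighbors and degrade audio.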


Deployment Phases

Phase 1: Simple Deployment

  • Start with containerized solutions (like Docker)

  • Allocate fixed capacity, avoiding premature optimization

  • Use around 0.5 vCPU and 1GB memory per agent

Phase 2: Basic Optimization

  • Introduce warm pools to reduce cold start delays

  • Maintain higher idle capacity (~50%) compared to typical web apps

  • Set long session drain times to avoid disconnections during updates
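A warm pool can be sketched as a queue of pre-started bots that is replenished as bots are handed out; `start_bot` here is a hypothetical placeholder for whatever actually launches a bot process:

```python
import itertools
import queue

_ids = itertools.count(1)

def start_bot() -> str:
    """Placeholder: in reality, launching a bot process takes seconds."""
    return f"bot-{next(_ids)}"

class WarmPool:
    def __init__(self, size: int):
        self.pool = queue.SimpleQueue()
        for _ in range(size):
            self.pool.put(start_bot())    # pre-warm ahead of demand

    def acquire(self) -> str:
        """Hand out a warm bot, then start a replacement."""
        try:
            bot = self.pool.get_nowait()  # caller avoids the cold start
        except queue.Empty:
            bot = start_bot()             # pool exhausted: eat the cold start
        self.pool.put(start_bot())        # replenish (asynchronously, in practice)
        return bot

pool = WarmPool(size=2)
print(pool.acquire())  # bot-1 (served instantly from the warm pool)
```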

Phase 3: Advanced Scaling

  • Implement auto-scaling based on CPU load and session counts

  • Monitor “time to first word” as a key performance indicator

  • Utilize regional distribution to lower latency for global users
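One way to turn session counts into a scaling decision is a simple policy with idle headroom; the per-instance capacity and the 50% headroom figure (echoing Phase 2) are illustrative assumptions:

```python
import math

def desired_instances(active_sessions: int,
                      sessions_per_instance: int = 16,
                      headroom: float = 0.5) -> int:
    """Scale on session count, keeping ~50% idle headroom for new conversations."""
    needed = active_sessions / sessions_per_instance
    return max(1, math.ceil(needed * (1 + headroom)))

print(desired_instances(100))  # 10
```

A production policy would combine this with CPU-load signals and regional "time to first word" measurements rather than session counts alone.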


Handling Cold Starts

Cold start latency can disrupt the user experience, especially in voice AI applications where responsiveness is key.

  • Maintain warm instances to handle new sessions without delay

  • Use predictive scaling based on traffic patterns to minimize cold start occurrences

  • Keep cold start times below 5-10 seconds whenever possible


Integrating Telephony for Voice AI

Telephony Options

PSTN (Public Switched Telephone Network):

  • Suitable for real phone number connections

  • Used when voice agents need to make or receive standard calls

SIP (Session Initiation Protocol):

  • Facilitates IP telephony, ideal for call centers

  • Enables advanced call control and integration


Connection Methods

  • WebSockets: Simple audio streaming with minimal call control

  • WebRTC: High-quality audio, real-time transmission, suitable for dynamic interactions

  • SIP: Robust for enterprise call management and control


Advanced Features

  • Call Transfers: Cold (simple redirection) or warm (handover with interaction)

  • DTMF Handling: Processes keypress inputs, useful for navigation and authentication
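DTMF handling often reduces to dispatching on the received digit; the menu layout and action names here are hypothetical:

```python
# Sketch of DTMF keypress dispatch for an IVR-style menu. Action names are
# stand-ins for real call-control operations.

def on_dtmf(digit: str) -> str:
    menu = {
        "1": "route_to_sales",
        "2": "route_to_support",
        "0": "transfer_to_operator",   # cold or warm transfer, as described above
    }
    return menu.get(digit, "replay_menu")  # unknown keypress: replay the prompt

print(on_dtmf("2"))  # route_to_support
print(on_dtmf("9"))  # replay_menu
```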


Final Thoughts

Building voice AI applications requires a thoughtful approach to infrastructure, latency management, and telephony integration. By choosing the right architectural patterns and optimizing for real-time performance, developers can deliver responsive and reliable voice AI experiences that meet user expectations. As the technology advances, understanding these foundational concepts will be essential for success.
