Voice AI is revolutionizing how we interact with technology, transforming everything from customer service to personal assistants. But building a real-time, voice-to-voice AI pipeline that feels natural and responsive remains a complex challenge. Enter the Voice Sandwich Demo from LangChain—an innovative open-source project that showcases how to build production-ready voice AI applications using modern AI frameworks and streaming architecture.
This comprehensive guide explores the Voice Sandwich Demo, a sophisticated real-time voice assistant designed for a sandwich shop ordering system. Whether you’re a data engineer looking to expand into AI applications or a developer curious about voice AI implementation, this project offers valuable insights into building responsive, conversational AI systems.
What Is the Voice Sandwich Demo?
The Voice Sandwich Demo is an open-source project from LangChain that demonstrates a complete voice-to-voice AI pipeline. Built to simulate a sandwich shop order assistant, it showcases how to combine multiple AI services into a seamless conversational experience. The project supports both TypeScript and Python implementations, making it accessible to a wide range of developers.
Key Technologies and Services
The demo leverages several cutting-edge AI services working in harmony:
- LangChain/LangGraph Agents: Provides the intelligent reasoning and decision-making capabilities
- AssemblyAI: Handles real-time speech-to-text conversion with high accuracy
- Cartesia: Delivers natural-sounding text-to-speech synthesis
- Anthropic Claude: Powers the underlying language model for understanding and responding to orders
Why This Architecture Matters
Traditional voice assistants often feel clunky because they process requests sequentially—waiting for complete utterances before responding. The Voice Sandwich Demo uses an async generator pattern with producer-consumer architecture, enabling true real-time streaming. This means the system can start processing and responding while you’re still speaking, creating a more natural conversational flow.
Understanding the Pipeline Architecture
The magic behind this voice AI system lies in its three-stage pipeline architecture. Each stage is implemented as an async generator that transforms a stream of events, allowing for low-latency, real-time processing.
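To make the pattern concrete, here is a minimal sketch of how three async-generator stages can be chained so that each one transforms the event stream produced by the one before it. The function names and event payloads below are illustrative placeholders, not the demo’s actual code.

```python
# Minimal sketch of chaining async-generator stages (illustrative only;
# the real stages wrap AssemblyAI, a LangChain agent, and Cartesia).
import asyncio
from typing import Any, AsyncIterator

Event = dict[str, Any]

async def stt_stage(audio: AsyncIterator[bytes]) -> AsyncIterator[Event]:
    async for _frame in audio:
        yield {"type": "stt_chunk", "text": "..."}             # partial transcript
    yield {"type": "stt_output", "text": "one ham on rye"}     # final transcript

async def agent_stage(events: AsyncIterator[Event]) -> AsyncIterator[Event]:
    async for event in events:
        yield event                                            # pass events through
        if event["type"] == "stt_output":
            yield {"type": "agent_chunk", "text": "Coming right up!"}
            yield {"type": "agent_end"}

async def tts_stage(events: AsyncIterator[Event]) -> AsyncIterator[Event]:
    async for event in events:
        yield event
        if event["type"] == "agent_chunk":
            yield {"type": "tts_chunk", "audio": b"\x00\x01"}  # synthesized audio

async def main() -> None:
    async def microphone() -> AsyncIterator[bytes]:
        yield b"raw-audio-frame"                               # stand-in for mic input

    # Each stage consumes the previous stage's output, so work overlaps in time.
    async for event in tts_stage(agent_stage(stt_stage(microphone()))):
        print(event["type"])

asyncio.run(main())
```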
Stage 1: Speech-to-Text (STT) Stream
The first stage captures audio from the user’s microphone and streams it to AssemblyAI for transcription. Rather than waiting for complete sentences, it yields partial transcriptions in real time, providing immediate feedback to the user. This creates the illusion of the system “listening” as you speak.
The STT stage generates two types of events:
- stt_chunk: Partial transcriptions sent to the client for real-time feedback
- stt_output: Final, complete transcriptions passed to the agent stage
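A rough sketch of this stage in Python is shown below. The transcript source is simulated here; the real implementation streams microphone audio to AssemblyAI and receives partial and final results back over a websocket.

```python
# Sketch of an STT stage that separates partial and final transcripts.
from typing import Any, AsyncIterator

Event = dict[str, Any]

async def stt_stage(transcripts: AsyncIterator[tuple[str, bool]]) -> AsyncIterator[Event]:
    async for text, is_final in transcripts:
        if is_final:
            # Final transcript: handed to the agent stage as the user's utterance.
            yield {"type": "stt_output", "text": text}
        else:
            # Partial transcript: sent to the client purely for live feedback.
            yield {"type": "stt_chunk", "text": text}

async def fake_transcripts() -> AsyncIterator[tuple[str, bool]]:
    # Simulated results; a real source would be the AssemblyAI streaming API.
    yield ("turkey", False)
    yield ("turkey club", False)
    yield ("turkey club on wheat, please", True)
```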
Stage 2: Agent Processing Stream
Once transcription is complete, the agent stage invokes a LangChain agent powered by Claude. This stage is where the intelligence happens—understanding the customer’s order, asking clarifying questions, and managing the conversation state. The agent can even call tools to check inventory, calculate prices, or process orders.
The agent stream produces several event types:
- agent_chunk: Text chunks from the agent’s response
- tool_call: When the agent invokes a function or tool
- tool_result: Results from tool execution
- agent_end: Signals the completion of the agent’s response
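The ordering of these events within a single turn might look like the following sketch. The tool name and payloads are hypothetical, and a real implementation would stream them from a LangChain/LangGraph agent run rather than hard-coding them.

```python
# Mocked agent turn showing the order in which event types are emitted.
import asyncio
from typing import Any, AsyncIterator

Event = dict[str, Any]

async def agent_turn(order_text: str) -> AsyncIterator[Event]:
    # The agent decides it needs a tool before answering.
    yield {"type": "tool_call", "name": "check_inventory", "args": {"item": "rye bread"}}
    yield {"type": "tool_result", "name": "check_inventory", "result": {"in_stock": True}}

    # The reply is streamed as text chunks so TTS can begin before it is complete.
    for chunk in ["One ham on rye. ", "Anything to drink ", "with that?"]:
        yield {"type": "agent_chunk", "text": chunk}
        await asyncio.sleep(0)  # yield control, as a streaming model call would

    yield {"type": "agent_end"}
```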
Stage 3: Text-to-Speech (TTS) Stream
The final stage takes the agent’s text response and converts it to natural-sounding speech using Cartesia’s API. Like the other stages, this happens in a streaming fashion—the system starts speaking as soon as the first chunks of text are available, rather than waiting for the complete response.
The TTS stage outputs tts_chunk events containing audio data that’s immediately played back through the browser’s speakers, completing the voice-to-voice loop.
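One way to express that streaming behavior is to buffer agent text up to a natural boundary and synthesize each piece as soon as it is available. The sketch below uses sentence boundaries and a placeholder synthesize() call; the demo’s actual buffering strategy and Cartesia integration may differ.

```python
# Sketch of a TTS stage that starts speaking before the full reply exists.
from typing import Any, AsyncIterator

Event = dict[str, Any]

async def synthesize(text: str) -> bytes:
    return b"audio-for:" + text.encode()  # placeholder for a streaming TTS call

async def tts_stage(events: AsyncIterator[Event]) -> AsyncIterator[Event]:
    buffer = ""
    async for event in events:
        yield event  # forward upstream events to the client unchanged
        if event["type"] == "agent_chunk":
            buffer += event["text"]
            # Flush on sentence boundaries so playback can begin early.
            while any(p in buffer for p in ".!?"):
                idx = min(i for i in (buffer.find(p) for p in ".!?") if i != -1)
                sentence, buffer = buffer[: idx + 1], buffer[idx + 1:]
                yield {"type": "tts_chunk", "audio": await synthesize(sentence)}
        elif event["type"] == "agent_end" and buffer.strip():
            yield {"type": "tts_chunk", "audio": await synthesize(buffer)}
            buffer = ""
```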
Technical Implementation: TypeScript vs Python
One of the most interesting aspects of the Voice Sandwich Demo is that it provides complete implementations in both TypeScript and Python, allowing developers to choose based on their stack preferences or learn by comparing approaches.
TypeScript Implementation
The Node.js implementation uses modern JavaScript features and the pnpm package manager. It’s built with async iterators and WebSocket connections, making it ideal for developers already working in JavaScript ecosystems or building web-first applications.
The TypeScript version includes:
- A Svelte-based web frontend for clean, reactive UI
- WebSocket server for real-time bidirectional communication
- Modular design with separate clients for AssemblyAI, Cartesia, and even an alternate ElevenLabs TTS option
- Hot reload support for rapid development iteration
Python Implementation
For data engineers and ML practitioners more comfortable with Python, the Python implementation offers the same functionality using modern Python async features. Built with uv (a fast Python package manager), it provides an excellent foundation for integrating with existing data pipelines or ML workflows.
The Python version features:
- Type-safe event definitions for a clear contract between components (sketched below)
- Async generator patterns that mirror the TypeScript approach
- Easy integration with Python-based data tools and frameworks
- Clean separation of concerns with dedicated modules for each service
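A sketch of what those event definitions might look like follows. The exact field names and types in the demo may differ, but the idea is a discriminated union that every stage agrees on.

```python
# Illustrative event definitions (the demo's actual schema may differ).
from typing import Literal, TypedDict, Union

class STTChunkEvent(TypedDict):
    type: Literal["stt_chunk"]
    text: str

class STTOutputEvent(TypedDict):
    type: Literal["stt_output"]
    text: str

class AgentChunkEvent(TypedDict):
    type: Literal["agent_chunk"]
    text: str

class TTSChunkEvent(TypedDict):
    type: Literal["tts_chunk"]
    audio: bytes

# Each pipeline stage accepts and yields members of this union, giving a
# clear, type-checked contract between components.
Event = Union[STTChunkEvent, STTOutputEvent, AgentChunkEvent, TTSChunkEvent]
```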
Real-World Applications and Use Cases
While the demo focuses on a sandwich shop scenario, the architecture and patterns demonstrated have broad applicability across industries. Understanding these use cases can help you envision how to adapt this technology for your own projects.
Customer Service Automation
Replace traditional IVR systems with intelligent voice agents that understand natural language, access customer records in real time, and resolve issues without human intervention. The streaming architecture ensures customers don’t experience awkward pauses or delays.
Accessibility Solutions
Voice interfaces can make applications accessible to users with visual impairments or motor disabilities. The real-time feedback and natural conversation flow make these systems significantly more usable than traditional screen readers or command-based interfaces.
Drive-Through and Kiosk Ordering
The sandwich shop demo isn’t just a toy example—it’s directly applicable to quick-service restaurants looking to automate ordering. The ability to handle complex, multi-step orders while maintaining conversation context makes this ideal for food service applications.
Voice-Controlled IoT and Smart Devices
Integrate this pipeline into smart home systems, industrial control interfaces, or vehicle systems where hands-free operation is essential. The low latency ensures commands are executed promptly, critical for safety and user satisfaction.
Getting Started: Running the Demo Yourself
The Voice Sandwich Demo is designed to be developer-friendly with minimal setup friction. Here’s how to get it running on your local machine.
Prerequisites and API Keys
Before diving in, you’ll need to set up accounts and obtain API keys for three services:
- AssemblyAI API Key: Sign up at AssemblyAI for speech-to-text capabilities
- Cartesia API Key: Register with Cartesia for text-to-speech synthesis
- Anthropic API Key: Get access to Claude through Anthropic’s console
All three services offer free tiers or trial credits, making it easy to experiment without upfront costs.
Quick Start with Make
The project includes a Makefile that simplifies the entire setup process. If you have Make installed, getting started is as simple as:
First, clone the repository and navigate to it. Then run make bootstrap to install all dependencies for both the TypeScript and Python implementations. Finally, launch your preferred implementation with make dev-ts for TypeScript or make dev-py for Python.
The application will be accessible at localhost:8000, where you can immediately start testing the voice interface.
Manual Setup Options
If you prefer more control or want to understand the individual components, you can set up each implementation manually. For TypeScript, navigate to the components directory, install dependencies with pnpm, build the web frontend, and start the server. For Python, use uv to sync dependencies and run the main application file.
Extending and Customizing the Demo
The real value of this project lies in its extensibility. The modular architecture makes it straightforward to swap components, add new capabilities, or integrate with existing systems.
Switching TTS Providers
The project includes support for both Cartesia and ElevenLabs for text-to-speech. You can easily switch between providers or add new ones by implementing the same event interface. This flexibility allows you to optimize for cost, voice quality, or latency based on your specific needs.
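A hypothetical sketch of such a provider interface in Python is shown below. The demo’s actual abstraction may be organized differently, but the principle is the same: any provider that produces the same audio-chunk stream can be dropped into the pipeline.

```python
# Hypothetical provider interface; any implementation that streams audio
# chunks the same way can back the tts_chunk events in the pipeline.
from typing import AsyncIterator, Protocol

class TTSProvider(Protocol):
    def synthesize(self, text: str) -> AsyncIterator[bytes]:
        """Stream audio chunks for the given text."""
        ...

class CartesiaTTS:
    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        yield b"cartesia-audio:" + text.encode()    # placeholder for a real API call

class ElevenLabsTTS:
    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        yield b"elevenlabs-audio:" + text.encode()  # placeholder for a real API call
```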
Adding Custom Tools and Functions
The LangChain agent can be extended with custom tools—functions the AI can call to perform specific actions. For a real sandwich shop, this might include checking ingredient availability, processing payments, or updating order status in a database. For data engineering applications, you could connect to data warehouses, trigger analytics jobs, or query real-time metrics.
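For instance, a custom inventory-check tool might be defined with LangChain’s @tool decorator along these lines; the tool name, logic, and data below are made up for illustration.

```python
# Hypothetical custom tool for the sandwich-shop agent.
from langchain_core.tools import tool

@tool
def check_inventory(ingredient: str) -> str:
    """Check whether an ingredient is currently in stock."""
    in_stock = {"rye bread", "turkey", "swiss cheese"}  # stand-in for a real lookup
    if ingredient.lower() in in_stock:
        return f"{ingredient} is in stock."
    return f"Sorry, we are out of {ingredient}."

# The function would then be added to the list of tools the agent is created
# with, alongside pricing, payment, or order-status tools.
```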
Integrating with Existing Systems
The event-driven architecture makes integration straightforward. You can consume events to log conversations, update databases, trigger workflows, or feed into analytics pipelines. The WebSocket interface can be adapted to work with existing web applications or mobile apps.
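As a small illustration, a “tap” stage can be slotted anywhere in the pipeline to observe events without altering them; the logging sink here is a placeholder for whatever system you integrate with.

```python
# Sketch of a pass-through stage that logs events for later analysis.
import json
import logging
from typing import Any, AsyncIterator

Event = dict[str, Any]
logger = logging.getLogger("voice_pipeline")

async def tap_events(events: AsyncIterator[Event]) -> AsyncIterator[Event]:
    async for event in events:
        # Record everything except raw audio; a real integration might write to
        # a database, publish to a queue, or feed an analytics pipeline instead.
        logger.info(json.dumps({k: v for k, v in event.items() if k != "audio"}, default=str))
        yield event  # pass the event along unchanged
```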
Performance Considerations and Best Practices
When building production voice AI systems, performance becomes critical. Here are key considerations drawn from the Voice Sandwich Demo’s architecture.
Latency Optimization
The streaming architecture minimizes perceived latency by overlapping operations. While the TTS is generating audio for the first part of the response, the agent is already working on the next section. Choose API regions close to your users and consider caching common responses to further reduce latency.
Error Handling and Resilience
Production systems need robust error handling. The demo shows patterns for managing failures in any stage of the pipeline without crashing the entire system. Implement retry logic for transient API failures, provide fallback responses when services are unavailable, and ensure graceful degradation.
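A minimal sketch of such retry logic, assuming transient failures can simply be retried with exponential backoff (the demo may handle errors differently):

```python
# Generic retry helper with exponential backoff for transient API failures.
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def with_retries(
    call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:  # narrow this to the client library's transient errors in practice
            if attempt == attempts - 1:
                raise  # out of retries: surface the error or fall back
            await asyncio.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("unreachable")
```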
Cost Management
Voice AI can be expensive at scale. Monitor API usage across all services, implement request throttling to prevent runaway costs, and consider caching or pre-generating responses for common queries. For data engineering use cases, batch processing might be more cost-effective than real-time for non-critical workflows.
Conclusion: The Future of Voice AI Development
The Voice Sandwich Demo represents a significant step forward in making voice AI accessible to developers. By providing complete, production-quality implementations in both TypeScript and Python, LangChain has created a valuable learning resource and foundation for building real-world applications.
For data engineers, this project opens new possibilities for creating voice-driven data interfaces, enabling natural language interaction with analytics systems, or building accessible data exploration tools. The streaming architecture patterns demonstrated here can be applied far beyond voice AI—they’re relevant for any real-time processing pipeline.
The key takeaways are clear: modern voice AI doesn’t require massive infrastructure or specialized expertise. With the right combination of services and architectural patterns, you can build responsive, natural-feeling voice interfaces that deliver genuine value to users. The async generator pattern enables true streaming processing, making experiences feel instantaneous. And modular design allows you to adapt and extend the system as requirements evolve.
Ready to build your own voice AI application? Start by cloning the Voice Sandwich Demo repository, experiment with the implementations, and adapt the patterns to your specific use case. Whether you’re automating customer service, building accessibility tools, or creating voice-controlled data interfaces, this project provides a solid foundation for innovation.
