voiceChain
voiceChain is a high-performance voice AI framework built from the ground up to run entirely offline on Apple Silicon. It orchestrates STT, LLM, and TTS models into a seamless, parallelized stream, enabling natural, human-like conversations that you can interrupt mid-sentence.
Built With
- MLX
- llama.cpp
- asyncio
Technical Breakdown
The core innovation is the asynchronous, multi-service architecture built with Python's asyncio. It solves the primary latency problem of voice pipelines by overlapping computation.
- STT (Whisper), LLM (llama.cpp), and TTS (Kokoro) run in parallel, managed by a central orchestrator.
- The LLM streams tokens as they are generated, which are immediately buffered into sentences (sketched after this list).
- The TTS engine synthesizes the first sentence while the LLM is still generating the rest of the response, drastically reducing perceived latency.
- The entire flow is managed by non-blocking asyncio.Queues, decoupling each stage of the pipeline.
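To make the overlap concrete, here is a minimal sketch of the sentence-buffering step. The names `buffer_sentences`, `token_stream`, and `tts_queue` are illustrative, not voiceChain's actual API:

import asyncio

SENTENCE_ENDINGS = (".", "!", "?")

async def buffer_sentences(token_stream, tts_queue: asyncio.Queue):
    # Accumulate streamed LLM tokens and flush each completed sentence to
    # the TTS queue, so synthesis can start before generation finishes.
    buffer = ""
    async for token in token_stream:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            await tts_queue.put(buffer.strip())
            buffer = ""
    if buffer.strip():
        await tts_queue.put(buffer.strip())  # flush any trailing fragment

Because the queue hand-off is non-blocking, the TTS stage can begin speaking the first sentence while later tokens are still arriving.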
# In ConversationManager: The main event loop listens for a VAD utterance.
async def run(self):
    logger.info("Conversation Manager started. Listening for speech...")
    while True:
        user_audio_data = await self.services.user_utterance_queue.get()

        if self.state == AgentState.RESPONDING:
            is_barge_in = await self.check_for_barge_in(user_audio_data)
            if is_barge_in:
                await self.handle_barge_in(user_audio_data)

        elif self.state == AgentState.IDLE:
            ...  # start new turn
To enable natural conversation, the agent must be interruptible. This was solved with a combination of a stateful Conversation Manager and software-based echo cancellation.
- The agent is always listening, even while speaking (state == RESPONDING).
- If new speech is detected, it is transcribed in a low-priority thread.
- A pragmatic echo check (sketched below) compares the transcribed interruption to the agent's current speech to differentiate user barge-in from acoustic echo.
- A valid barge-in triggers an immediate, non-blocking interruption sequence: the audio player is silenced, the current processing pipeline is cancelled, and a new pipeline is started for the user's new input.
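A minimal sketch of what such an echo check could look like, using simple word overlap; the heuristic and names here are illustrative, and voiceChain's actual check may differ:

def is_probably_echo(transcript: str, agent_speech: str, threshold: float = 0.7) -> bool:
    # If most of the words we "heard" are words the agent is currently
    # speaking, treat the input as acoustic echo, not a real barge-in.
    heard = set(transcript.lower().split())
    spoken = set(agent_speech.lower().split())
    if not heard:
        return True  # nothing intelligible to act on
    overlap = len(heard & spoken) / len(heard)
    return overlap >= threshold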
# In ConversationManager: The core barge-in logic.
async def handle_barge_in(self, audio_data: np.ndarray):
    logger.warning("Handling barge-in...")
    if self.active_pipeline_task:
        self.active_pipeline_task.cancel()

    await self.services.player.interrupt()
    self.services.reset_executors()  # Hard reset to prevent GPU conflicts
    await self.start_new_turn(audio_data)
Achieving stable, high performance on a local machine required solving several low-level system challenges.
- MLX & Metal: All STT/TTS models run on the Apple Silicon GPU via the MLX framework for maximum performance.
- Dedicated Thread Pools: Solved a critical Metal GPU race condition by isolating STT and TTS operations into their own single-threaded executors (sketched after this list), preventing driver-level crashes during rapid interruptions.
- Real-Time Audio Buffer Management: Engineered a custom AudioPlayer that uses a persistent output stream and active buffer conditioning (priming with silence/micro-tones) to eliminate audio driver "cold start" artifacts like stuttering and echo.
- Resilient Audio Input: Built a two-stage VAD (WebRTCVAD + Silero, sketched below) and a non-blocking audio input thread that gracefully drops frames under high CPU load to prevent stream crashes.
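As referenced above, a minimal sketch of the executor isolation; `transcriber.transcribe` is an assumed method name, not necessarily voiceChain's real interface. Each GPU-bound engine gets its own single-threaded executor, so STT and TTS never issue Metal commands from competing threads:

import asyncio
from concurrent.futures import ThreadPoolExecutor

stt_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="stt")
tts_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="tts")

async def transcribe(transcriber, audio):
    # run_in_executor keeps the event loop responsive while the GPU works;
    # max_workers=1 serializes all STT Metal calls onto a single thread.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(stt_executor, transcriber.transcribe, audio)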
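Likewise, a sketch of a two-stage VAD gate; the library wiring below is an assumption about how the pieces fit, not voiceChain's exact code. The cheap WebRTC VAD screens every 30 ms frame, and the heavier Silero model confirms a buffered utterance before it is treated as speech:

import numpy as np
import torch
import webrtcvad

SAMPLE_RATE = 16000

vad = webrtcvad.Vad(2)  # stage 1: fast, frame-level; aggressiveness 0-3
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

def frame_has_speech(frame_f32: np.ndarray) -> bool:
    # WebRTC VAD expects 16-bit PCM in 10/20/30 ms frames.
    pcm16 = (frame_f32 * 32767).astype(np.int16).tobytes()
    return vad.is_speech(pcm16, SAMPLE_RATE)

def confirm_utterance(audio_f32: np.ndarray) -> bool:
    # Stage 2: Silero only runs on audio that stage 1 already flagged.
    timestamps = get_speech_timestamps(
        torch.from_numpy(audio_f32), model, sampling_rate=SAMPLE_RATE
    )
    return len(timestamps) > 0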
The final application was refactored from a monolithic script into a clean, reusable, and pip-installable library called `voiceChain`.
# In examples/run_agent.py:
# The "Composition Root" cleanly instantiates and wires together all components.

# 1. Instantiate Services
services = ServiceManager(loop)

# 2. Instantiate Core AI Engines
transcriber = Transcriber(...)
tts_engine = TextToSpeechEngine(...)
llm_engine = LLMEngine(...)

# 3. Instantiate and Run the Conversation Manager
manager = ConversationManager(
    services=services,
    transcriber=transcriber,
    llm_engine=llm_engine,
    tts_engine=tts_engine,
)

await services.start()
await manager.run()