voiceChain
voiceChain is a high-performance voice AI framework built from the ground up to run entirely offline on Apple Silicon. It orchestrates STT, LLM, and TTS models into a seamless, parallelized stream, enabling natural, human-like conversations that you can interrupt mid-sentence.
Built With
- MLX
- llama.cpp
- asyncio
Technical Breakdown
The core innovation is the asynchronous, multi-service architecture built with Python's asyncio. It solves the primary latency problem of voice pipelines by overlapping computation.
- STT (Whisper), LLM (llama.cpp), and TTS (Kokoro) run in parallel, managed by a central orchestrator.
- The LLM streams tokens as they are generated, which are immediately buffered into sentences (sketched after this list).
- The TTS engine synthesizes the first sentence while the LLM is still generating the rest of the response, drastically reducing perceived latency.
- The entire flow is managed by non-blocking asyncio.Queues, decoupling each stage of the pipeline.
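To make the overlap concrete, here is a minimal sketch of the sentence-buffering step. The names `buffer_sentences`, `token_stream`, and `tts_queue` are illustrative, not voiceChain's actual API:

import asyncio

SENTENCE_ENDINGS = (".", "!", "?")

async def buffer_sentences(token_stream, tts_queue: asyncio.Queue):
    # Accumulate streamed LLM tokens and flush each completed sentence to
    # the TTS queue, so synthesis can start before generation finishes.
    buffer = ""
    async for token in token_stream:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            await tts_queue.put(buffer.strip())
            buffer = ""
    if buffer.strip():
        await tts_queue.put(buffer.strip())  # flush any trailing fragment

Because the queue hand-off is non-blocking, the TTS stage can begin speaking the first sentence while later tokens are still arriving.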
# In ConversationManager: The main event loop listens for a VAD utterance.
async def run(self):
    logger.info("Conversation Manager started. Listening for speech...")
    while True:
        user_audio_data = await self.services.user_utterance_queue.get()

        if self.state == AgentState.RESPONDING:
            is_barge_in = await self.check_for_barge_in(user_audio_data)
            if is_barge_in:
                await self.handle_barge_in(user_audio_data)

        elif self.state == AgentState.IDLE:
            ...  # start new turn
To enable natural conversation, the agent must be interruptible. This was solved with a combination of a stateful Conversation Manager and software-based echo cancellation.
- The agent is always listening, even while speaking (state == RESPONDING).
- If new speech is detected, it is transcribed in a low-priority thread.
- A pragmatic echo check (sketched below) compares the transcribed interruption to the agent's current speech to differentiate user barge-in from acoustic echo.
- A valid barge-in triggers an immediate, non-blocking interruption sequence: the audio player is silenced, the current processing pipeline is cancelled, and a new pipeline is started for the user's new input.
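A minimal sketch of what such an echo check could look like, using simple word overlap; the heuristic and names here are illustrative, and voiceChain's actual check may differ:

def is_probably_echo(transcript: str, agent_speech: str, threshold: float = 0.7) -> bool:
    # If most of the words we "heard" are words the agent is currently
    # speaking, treat the input as acoustic echo, not a real barge-in.
    heard = set(transcript.lower().split())
    spoken = set(agent_speech.lower().split())
    if not heard:
        return True  # nothing intelligible to act on
    overlap = len(heard & spoken) / len(heard)
    return overlap >= threshold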
# In ConversationManager: The core barge-in logic.
async def handle_barge_in(self, audio_data: np.ndarray):
    logger.warning("Handling barge-in...")
    if self.active_pipeline_task:
        self.active_pipeline_task.cancel()

    await self.services.player.interrupt()
    self.services.reset_executors()  # Hard reset to prevent GPU conflicts
    await self.start_new_turn(audio_data)
Achieving stable, high performance on a local machine required solving several low-level system challenges.
- MLX & Metal: All STT/TTS models run on the Apple Silicon GPU via the MLX framework for maximum performance.
- Dedicated Thread Pools: Solved a critical Metal GPU race condition by isolating STT and TTS operations into their own single-threaded executors (sketched after this list), preventing driver-level crashes during rapid interruptions.
- Real-Time Audio Buffer Management: Engineered a custom AudioPlayer that uses a persistent output stream and active buffer conditioning (priming with silence/micro-tones) to eliminate audio driver "cold start" artifacts like stuttering and echo.
- Resilient Audio Input: Built a two-stage VAD (WebRTCVAD + Silero, sketched below) and a non-blocking audio input thread that gracefully drops frames under high CPU load to prevent stream crashes.
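As referenced above, a minimal sketch of the executor isolation; `transcriber.transcribe` is an assumed method name, not necessarily voiceChain's real interface. Each GPU-bound engine gets its own single-threaded executor, so STT and TTS never issue Metal commands from competing threads:

import asyncio
from concurrent.futures import ThreadPoolExecutor

stt_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="stt")
tts_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="tts")

async def transcribe(transcriber, audio):
    # run_in_executor keeps the event loop responsive while the GPU works;
    # max_workers=1 serializes all STT Metal calls onto a single thread.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(stt_executor, transcriber.transcribe, audio)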
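Likewise, a sketch of a two-stage VAD gate; the library wiring below is an assumption about how the pieces fit, not voiceChain's exact code. The cheap WebRTC VAD screens every 30 ms frame, and the heavier Silero model confirms a buffered utterance before it is treated as speech:

import numpy as np
import torch
import webrtcvad

SAMPLE_RATE = 16000

vad = webrtcvad.Vad(2)  # stage 1: fast, frame-level; aggressiveness 0-3
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

def frame_has_speech(frame_f32: np.ndarray) -> bool:
    # WebRTC VAD expects 16-bit PCM in 10/20/30 ms frames.
    pcm16 = (frame_f32 * 32767).astype(np.int16).tobytes()
    return vad.is_speech(pcm16, SAMPLE_RATE)

def confirm_utterance(audio_f32: np.ndarray) -> bool:
    # Stage 2: Silero only runs on audio that stage 1 already flagged.
    timestamps = get_speech_timestamps(
        torch.from_numpy(audio_f32), model, sampling_rate=SAMPLE_RATE
    )
    return len(timestamps) > 0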
The final application was refactored from a monolithic script into a clean, reusable, and pip-installable library called `voiceChain`.
# In examples/run_agent.py:
# The "Composition Root" cleanly instantiates and wires together all components.

# 1. Instantiate Services
services = ServiceManager(loop)

# 2. Instantiate Core AI Engines
transcriber = Transcriber(...)
tts_engine = TextToSpeechEngine(...)
llm_engine = LLMEngine(...)

# 3. Instantiate and Run the Conversation Manager
manager = ConversationManager(
    services=services,
    transcriber=transcriber,
    llm_engine=llm_engine,
    tts_engine=tts_engine,
)

await services.start()
await manager.run()