April 3, 2026 · jarvis · audio · tts · computer-vision · room-awareness

April 2026: JARVIS Has Ears, A Voice, And Is Watching the Living Room

Phase 9 is live: ambient audio transcription, neural TTS through a JBL PartyBox, real-time room scanning with YOLO, and JARVIS responding to a story told live in the room. The system can now hear, see, speak, and follow a conversation.


This post is being written on a Friday evening. In the living room, there are four people on a couch. JARVIS just responded — out loud, through a Bluetooth speaker — to a story one of them told. The story was about Red Riding Hood. This is what Phase 9 looks like.

What We Shipped

Phase 8 gave JARVIS vision: faces, depth, body tracking, identity. Phase 9 closes the sensory loop.

Ambient audio transcription. A USB microphone feeds a continuous audio stream into Faster-Whisper (tiny.en model, VAD filtering, beam search). Every 10 seconds, a chunk is transcribed and appended to a rolling log. The daemon auto-detects the mic, falls back to the internal laptop mic if the USB device drops, and quietly switches back when the USB mic reconnects. No human intervention required. This is the kind of resilience you only build after watching something fail three times at the wrong moment.
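
For context, here is a minimal sketch of what that loop looks like, assuming faster-whisper and sounddevice. The chunk length matches the description above, but the log path and the recording helper are illustrative, not the daemon's actual code.

```python
# Minimal sketch of the ambient transcription loop.
# Assumed libraries: faster-whisper, sounddevice. Paths are illustrative.
import time
import sounddevice as sd
from faster_whisper import WhisperModel

CHUNK_SECONDS = 10
SAMPLE_RATE = 16000
TRANSCRIPT_LOG = "/tmp/jarvis_transcript.log"  # hypothetical path

model = WhisperModel("tiny.en", device="cpu", compute_type="int8")

while True:
    # Record a 10-second chunk from the current default input device.
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()

    # Transcribe with VAD filtering and beam search, as described above.
    segments, _ = model.transcribe(audio.flatten(), vad_filter=True, beam_size=5)
    text = " ".join(seg.text.strip() for seg in segments).strip()

    # Append anything non-empty to the rolling log.
    if text:
        with open(TRANSCRIPT_LOG, "a") as log:
            log.write(f"{time.strftime('%H:%M:%S')} {text}\n")
```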

Neural TTS via Piper. Gone is the robotic espeak voice. Piper TTS synthesizes speech entirely locally — no API, no cloud, no latency tax — and plays through a Bluetooth speaker. The voice sounds like a person now. Getting here required debugging the synthesizer API signature, the PipeWire audio routing, and ALSA device enumeration. None of it was documented cleanly. All of it is now working.
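
As a rough sketch of the local synthesis path, assuming the piper CLI and a downloaded voice model (the model path and output file below are placeholders, not the actual setup):

```python
# Sketch of fully local TTS via the Piper CLI, then playback on the
# default sink (the paired Bluetooth speaker). Paths are illustrative.
import subprocess

VOICE_MODEL = "/opt/piper/en_US-lessac-medium.onnx"  # hypothetical path
OUT_WAV = "/tmp/jarvis_tts.wav"

def speak(text: str) -> None:
    # Synthesize: text on stdin, WAV file on disk. No API, no cloud.
    subprocess.run(
        ["piper", "--model", VOICE_MODEL, "--output_file", OUT_WAV],
        input=text.encode(), check=True,
    )
    # Play through whatever the default audio sink currently is.
    subprocess.run(["paplay", OUT_WAV], check=True)

speak("Phase nine is live.")
```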

Bluetooth pairing from scratch. We wrote a single script that scans, discovers, trusts, pairs, and connects a Bluetooth audio speaker — while keeping the scan session open. The key insight: BLE random MAC addresses expire when scanning stops, so you must pair while the scan is still running. The script also sets the default audio sink automatically. One command, done, works every time.
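
A sketch of the same idea in Python, driving one long-lived bluetoothctl session so the scan never stops between discovery and pairing. The MAC address and sink name are placeholders for the actual speaker.

```python
# Sketch: pair and connect a Bluetooth speaker in one shot while keeping the
# scan running, then set it as the default sink. MAC and sink are placeholders.
import subprocess
import time

SPEAKER_MAC = "AA:BB:CC:DD:EE:FF"                    # placeholder
SINK_NAME = "bluez_output.AA_BB_CC_DD_EE_FF.1"       # placeholder PipeWire sink

bt = subprocess.Popen(["bluetoothctl"], stdin=subprocess.PIPE, text=True)

def cmd(line: str, wait: float = 2.0) -> None:
    bt.stdin.write(line + "\n")
    bt.stdin.flush()
    time.sleep(wait)

cmd("scan on", wait=10)          # keep discovery running: random MACs expire
cmd(f"trust {SPEAKER_MAC}")      # trust, pair, connect while still scanning
cmd(f"pair {SPEAKER_MAC}", wait=5)
cmd(f"connect {SPEAKER_MAC}", wait=5)
cmd("scan off")
cmd("quit", wait=0)

# Route all audio to the speaker by default.
subprocess.run(["pactl", "set-default-sink", SINK_NAME], check=True)
```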

Room scanning. The depth camera plus YOLO pipeline takes a live room snapshot on demand and returns detected people with confidence scores. With lights off, the camera still worked. With lights on, it was sharper. Both modes returned useful data.
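
A minimal sketch of an on-demand scan, assuming the ultralytics YOLO package and an OpenCV-readable camera as a stand-in for the depth camera's color stream:

```python
# Sketch: grab one frame, run YOLO, return detected people with confidences.
# Assumes ultralytics and OpenCV; the real pipeline reads the depth camera.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # small COCO model; "person" is class 0

def room_scan(camera_index: int = 0) -> list[dict]:
    cap = cv2.VideoCapture(camera_index)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return []

    results = model(frame, verbose=False)[0]
    people = []
    for box in results.boxes:
        if int(box.cls) == 0:  # COCO class 0 == person
            people.append({"confidence": float(box.conf)})
    return people

print(room_scan())  # e.g. [{'confidence': 0.91}, {'confidence': 0.84}]
```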

Integrated speech endpoint. The monitor API now exposes a /audio/play endpoint. Any component of the stack — synthesis, context responses, commands — can trigger speech without touching shell scripts. The MCP server now exposes jarvis_speak, jarvis_transcript, jarvis_room_scan, and jarvis_deploy_post as native tools.
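
For illustration, the endpoint could be as small as this FastAPI-style sketch; the payload shape and the speak() helper are assumptions here, not the monitor API's actual contract:

```python
# Sketch of a /audio/play endpoint: any component POSTs text, JARVIS speaks it.
# Framework choice, payload shape, and paths are illustrative.
import subprocess
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PlayRequest(BaseModel):
    text: str

def speak(text: str) -> None:
    # Same local Piper + paplay path as the TTS sketch above (paths illustrative).
    subprocess.run(["piper", "--model", "/opt/piper/en_US-lessac-medium.onnx",
                    "--output_file", "/tmp/jarvis_tts.wav"],
                   input=text.encode(), check=True)
    subprocess.run(["paplay", "/tmp/jarvis_tts.wav"], check=True)

@app.post("/audio/play")
def audio_play(req: PlayRequest) -> dict:
    speak(req.text)
    return {"status": "ok", "spoken": req.text}
```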

The Living Room Test

The real test of a sensory system is this: put people in a room, let them talk, and see what the system catches.

Tonight, a story was told three times. The first two were lost to mic issues and audio lag. The third time, JARVIS caught it. The transcript was imperfect — a small model in a noisy room will miss things. But there was enough signal to reconstruct the story and respond to it.

The response came through the speaker. Everyone in the room heard it. That's the milestone: not a chatbot responding to a typed prompt, but a system that was present in the room, heard what was said, and spoke back through hardware on the shelf.

What's Still Rough

The mic has a USB dropout issue — it occasionally disappears from the audio device list entirely. The daemon handles this gracefully with auto-fallback, but when the fallback mic picks up the speaker's output, transcription quality degrades. Muting during TTS playback is the next isolation problem.
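
One possible shape for that fix, not yet implemented, is a shared mute flag that the TTS path holds for the duration of playback and the transcription loop checks before recording:

```python
# Sketch of a possible mute-during-playback guard (not yet in the daemon).
import subprocess
import threading

tts_playing = threading.Event()

def play_with_mute(wav_path: str) -> None:
    # Hold the flag for the whole playback so the mic never hears the speaker.
    tts_playing.set()
    try:
        subprocess.run(["paplay", wav_path], check=True)
    finally:
        tts_playing.clear()

# In the transcription loop, skip recording while the flag is set:
#     if tts_playing.is_set():
#         time.sleep(1)
#         continue
```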

The transcription model is fast but imperfect. Upgrading to a larger model requires GPU offload — the path is designed, not yet running.
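
For reference, the GPU path in faster-whisper is a one-line change to the model constructor, assuming CUDA is available on the host; "small.en" below is a stand-in for whichever larger model we settle on.

```python
from faster_whisper import WhisperModel

# Planned GPU offload: same API, different device and precision.
model = WhisperModel("small.en", device="cuda", compute_type="float16")
```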

The monitor dashboard shows synthesized context, not the raw room audio. Those are different streams that need to be unified into a single live caption view.

Why This Is Different

When someone walks into the room, JARVIS sees them. When someone speaks, JARVIS hears them. When JARVIS has something to say, it speaks into the room — not into a chat window, not over a phone. Through speakers, audible to everyone present.

The gap between what we've shipped and what science fiction calls an AI home system is narrowing fast. What's left is calibration: better models, lower latency, tighter feedback loops.

The architecture is right. The system is aware. We're turning the dials.


Written the same evening the system first responded to a live room conversation. JARVIS was listening the whole time.