March 29, 2026 · vision · architecture · identity · hardware · phase-8

Late March 2026: JARVIS Can See. And It Knows Who You Are.

JARVIS Vision is live. Face tracking, body skeleton, Tobii eye tracking, depth mapping, and identity management — running locally on a laptop. Plus a look at six months of lab output that's been hard to keep up with.


This one took a while to write. Not because it's complicated — because it's a lot to explain without sounding like we're making it up.


There's a post that's been sitting on thereisnoissue.com for a while. Vic wrote it back when the question "what if the AI could actually see you" was still theoretical. Not a feature request. More like a thought experiment — the kind that sounds cool until you realize what it would actually take to build it correctly.

We built it correctly. Last night.


JARVIS Vision — What We Actually Shipped

The Duo laptop now runs a full local vision pipeline. No cloud. No API calls. No latency tax. Here's what's live:

Camera input — Orbbec Femto Mega (depth + color) with webcam fallback. The system detects which hardware is present at startup and adapts. When the Femto is running, you get real depth data: actual millimeter distances per face centroid. On webcam, it estimates depth from bounding box size. Both modes work.
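
A minimal sketch of that dual-mode logic, assuming a hypothetical pyorbbecsdk import probe for the Femto and a simple inverse-width estimate for the webcam path; the calibration constants are made up for illustration and the Femto branch is stubbed:

import cv2

REF_FACE_WIDTH_PX = 180     # assumed face width in pixels at REF_DISTANCE_CM (not the lab's real value)
REF_DISTANCE_CM = 60.0

def depth_from_bbox(face_width_px: float) -> float:
    # Webcam fallback: apparent face width scales roughly inversely with distance.
    return REF_DISTANCE_CM * REF_FACE_WIDTH_PX / max(face_width_px, 1.0)

def open_capture() -> str:
    # Probe for the depth camera at startup; fall back to the default webcam.
    try:
        import pyorbbecsdk  # noqa: F401  (importable only when the Femto stack is installed)
        return "femto"      # real code would open an Orbbec pipeline and read per-face depth in mm
    except ImportError:
        return "webcam"

mode = open_capture()
cap = cv2.VideoCapture(0) if mode == "webcam" else None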

YOLO on the NPU — Object and person detection runs on the Intel NPU via OpenVINO. There's a hot-swap architecture: the model compiles on CPU first (fast startup), then swaps to NPU in the background in about 0.3 seconds. The system never stalls waiting for hardware.
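
The shape of that hot-swap, sketched against the OpenVINO Python API with an assumed "yolo.xml" IR export; the production version wraps this in the shared pipeline rather than a bare thread:

import threading
import openvino as ov

core = ov.Core()
model = core.read_model("yolo.xml")   # assumed path to the YOLO IR export

# Compile on CPU first so detection can start serving frames immediately.
active = {"compiled": core.compile_model(model, "CPU")}

def swap_to_npu():
    try:
        npu = core.compile_model(model, "NPU")   # slower compile, done off the hot path
        active["compiled"] = npu                 # reference swap; next frame runs on the NPU
    except Exception:
        pass                                     # no NPU available: stay on CPU

threading.Thread(target=swap_to_npu, daemon=True).start()

# Per-frame inference always goes through whatever active["compiled"] points at:
# detections = active["compiled"](frame_tensor)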

MediaPipe body skeleton — Full pose landmark overlay. Bone wireframe in real time. Not the party trick version — calibrated, stable, and usable as input.
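
For reference, the skeleton overlay in its simplest form looks roughly like this with MediaPipe's Pose solution; the webcam index and window handling are placeholders, not the lab's pipeline:

import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
with mp_pose.Pose(model_complexity=1) as pose:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe wants RGB; OpenCV delivers BGR.
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            mp_draw.draw_landmarks(frame, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)
        cv2.imshow("skeleton", frame)
        if cv2.waitKey(1) & 0xFF == 27:   # Esc to quit
            break
cap.release()
cv2.destroyAllWindows()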

Face recognition with identity management — LBPH-based face registry. Three-phase registration (turn, front, multi-angle), 60+ crops per person, live retraining on registration or purge. The system knows who's in the room. Threshold-gated: LBPH confidence is a distance score, so a value below 110 means identified; above that, unknown.
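
The recognition step reduces to something like this, using OpenCV's LBPH recognizer from opencv-contrib-python; the three-phase capture and retraining plumbing is omitted:

import cv2
import numpy as np

recognizer = cv2.face.LBPHFaceRecognizer_create()

def train(crops: list, labels: list) -> None:
    # crops: grayscale face crops (60+ per person); labels: integer person IDs.
    recognizer.train(crops, np.array(labels))

UNKNOWN = -1
THRESHOLD = 110.0   # LBPH returns a distance: lower means a closer match

def identify(gray_face) -> int:
    label, distance = recognizer.predict(gray_face)
    return label if distance < THRESHOLD else UNKNOWN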

Rav is in the registry. JARVIS knows it's him.

Phi-3 mini running locally on CPU — Object description and auto-tune feedback. Offline, no HomeBase required. Falls back to HomeBase Ollama for full scene reasoning if needed.
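
The fallback pattern itself is simple. In this sketch, local_phi3_describe() is a placeholder for the on-CPU Phi-3 runner, and both the HomeBase address and the Ollama model name are assumptions:

import requests

HOMEBASE_OLLAMA = "http://homebase.local:11434"   # assumed HomeBase endpoint

def local_phi3_describe(prompt: str) -> str:
    # Placeholder for the local Phi-3 mini runner (CPU, fully offline).
    raise NotImplementedError

def describe_scene(prompt: str) -> str:
    try:
        return local_phi3_describe(prompt)   # offline path, no network required
    except Exception:
        # Fall back to HomeBase Ollama for heavier scene reasoning.
        r = requests.post(
            f"{HOMEBASE_OLLAMA}/api/generate",
            json={"model": "llama3", "prompt": prompt, "stream": False},   # model name assumed
            timeout=30,
        )
        r.raise_for_status()
        return r.json()["response"]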

Tobii eye tracking — Wired in. Gaze data feeds the presence layer.

This is running on a laptop. Not a server rack. Not a research cluster. On a Duo laptop, in the lab.


Why This Milestone Is Different

Most demonstration systems do one of these things. Face detection. Or body tracking. Or gaze. The party trick is showing any one of them for the first time and calling it a vision system.

That's not what this is.

What we built is an integrated presence model. The system knows:

  • Who is in the room (identity, by face)
  • Where they are (depth in mm)
  • What their body is doing (skeleton overlay)
  • Where they're looking (Tobii gaze)
  • What's in the scene around them (YOLO + Phi-3 description)

And the next step — which is already designed — is a GLIF context feed. A real-time structured JSON stream that JARVIS can consume as live sensor input:

{
  "present": ["Rav"],
  "gaze_target": "screen",
  "posture": "engaged",
  "distance_cm": 68,
  "speaking": true
}

Not raw pixels. Not video frames. Semantic state. That's what an AI actually needs from a camera — not what the camera saw, but what it means.
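
One way to picture that feed: a small typed record per tick, serialized straight to the JSON shape above. The field names mirror the example; the values would come from the vision pipeline:

import json
from dataclasses import dataclass, asdict

@dataclass
class PresenceState:
    present: list
    gaze_target: str
    posture: str
    distance_cm: int
    speaking: bool

state = PresenceState(["Rav"], "screen", "engaged", 68, True)
print(json.dumps(asdict(state)))   # one line of semantic state per tick, not pixels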


The Broader Picture: Six Months of Output

We don't do a lot of launch announcements. Partly because most of what we build is infrastructure, and infrastructure announcements are boring to everyone except the people who depend on it. But it's worth documenting what the last several months have actually looked like.

Since the last field notes, the lab has shipped approximately:

30+ websites — ranging from personal brands to full business presences with AI chat funnels, CRM backends, lead capture pipelines, and admin dashboards. Several are live production systems for real clients generating real leads.

3–4 business intelligence platforms — data pipelines, signal enrichment, CRM infrastructure, Stripe-connected service catalogs. Real money moving through real systems we built.

5 business applications — deployed, running, used. Not prototypes.

2 turnkey SaaS platforms — end-to-end, white-labeled, deployable. One of them is for a market we're not ready to talk about publicly yet.

Brain infrastructure that's hard to count — GLIF 2.0, the voice loop, the orchestration graph (47 nodes in production), the self-healing system, the failure ledger, the knowledge crystallizer, the skill registry, the Telegram integration, the brain git backup, the sysadmin automation suite. The list goes on long enough that we've largely stopped tracking individual additions.

The honest summary: in the past six months, we've built more production software than most teams build in two years. The difference is that we have JARVIS helping build JARVIS — and the compounding effect of that has been something we didn't fully anticipate when we started.


What Phase 8 Looks Like From Here

Phase 7 was "the system can perceive the physical environment." We're in it.

Phase 8 is what comes next: the system acts on what it perceives. Not just logging "Rav is in the room." Adjusting. Responding. Routing. Understanding context before a word is spoken.

The voice loop hears you. The vision loop sees you. The brain processes both.

The gap between what we've built and a science fiction AI assistant is now mostly a calibration problem, not an architecture problem. We have the right architecture. We're turning the dials.


If you've been following along since thereisnoissue.com — you know where this started. It started with a question. The question now has a working answer.