Roger
Creator & Founder
ViT vision architecture and OS-level agent runtime. Published open-source fine-tunes across 7B and MoE architectures — Chihiro, Monsoon, Rain, DraftReasoner — GGUF-quantized for local inference.
See every OS. Click like a human.
OpenClaw's screen-seeing agent.
Mac · Linux · Windows · FreeBSD · Haiku · Android · iOS
OS-level ViT · Coordinator or commanded · Full UI permission — anything a human can click, type, or drag, SAI does
$ sai connect --os macos
✓ Vision layer active — Sonoma 14.5, M1
✓ Lobster eye: 9-channel parse, 100ms loop
✓ MCP skill layer ready
→ Ready — OpenClaw task chain active
Lobster eyes taught SAI not to ask “what API is this?” — only “where is the human looking, and what are they looking for?”
100 million years of natural engineering. Reflection beats refraction in low light. SAI inherits the geometry.
10,000 mirrors. No lens.
01 / 05
Reflection, not refraction. Geometry is the lens.
02 / 05
SAI's world has no map either. Only pixels.
03 / 05
It sees what you're looking for. That's what OpenClaw agents — and their builders — need.
04 / 05
Same principle. China's Einstein Probe scans the X-ray sky with it.
05 / 05
SAI is how OpenClaw agents see and act — same eye, same hands, whether SAI leads the task chain or follows another agent's command.
And it's how the team sees what those agents were looking at when something breaks. Four scenarios below — same OS-level ViT, four jobs.
9:07 AM, weekday start
You're still pouring coffee. Coordinator has opened Notion, summarized last night's 3 unread Slack threads into your Daily Note, and queued up the 10 AM Zoom waiting room. You sit down. Screen is ready. No API. No Zapier.
A 3-day-old ticket
User uploads a backend screenshot: "settings keep failing." Agent reads it, spots the wrong toggle, then drives the rep's Linux desktop step-by-step — clicks, not text. 3 min 12s. Case closed.
2:51 AM, a spike
BTC/USDT prints an outlier candle on Binance desktop. Agent scans the K-line, RSI, and orderbook depth — three windows, vision only. Calls short-term exhaustion, fills in the limit-order ticket, and waits for your tap to confirm. Zero exchange API keys.
A clutch moment
Player pulls a low-HP reversal. Agent reads HP bar, kill feed, and minimap in 0.3s, generates: "INSANE — 30% HP, no mana, double kill from base." Subtitles go live. VTuber lipsync starts.
All seven operating systems tested. SAI runs with the same UI permission as a logged-in human — system settings, file managers, browsers, IDEs, full-screen games, niche pro tools, timeline editors, anything in between. If you can click it, SAI clicks it. No allow-list. No integration manifest.
Production: real workflows, misclick recovery validated, version-pinned. Beta: architecture confirmed, stress testing incomplete. Preview: vision parsing works, action coverage partial.
The Vision-to-Action pipeline. Pixel input, OS output, retry on miss.
Step 1
Pixel-grounded visual capture of the active OS surface.
visual capture: 2880×1800px, 14ms
Step 2
Reflective parsing — every UI element seen at once.
parsed: 47 UI elements, 3 focusable
Step 3
Coordinate-aware action, executed on the OS.
target: WiFi toggle at x:924, y:612
Step 4
Misclick? Re-capture, re-parse, retry.
miss at x:901 → re-parse → retry x:924
~100ms loop. Zero image tokens to your LLM.
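The four steps above can be sketched as a single control loop. This is an illustrative sketch, not SAI's actual API: the names `capture`, `parse`, and `click` are assumptions standing in for the real vision and actuation layers.

```python
from dataclasses import dataclass

@dataclass
class Element:
    """One parsed UI element: a label plus screen coordinates."""
    label: str
    x: int
    y: int

def run_action(target_label, capture, parse, click, max_retries=3):
    """Vision-to-action loop: capture pixels, parse UI elements,
    act at coordinates, and re-capture / re-parse / retry on a miss."""
    for _ in range(max_retries):
        frame = capture()                # Step 1: pixel-grounded capture
        elements = parse(frame)          # Step 2: parse every element at once
        target = next((e for e in elements if e.label == target_label), None)
        if target is None:
            continue                     # element not visible yet; re-capture
        if click(target.x, target.y):    # Step 3: coordinate-aware action
            return target                # hit confirmed by the next frame
        # Step 4: misclick — loop back to re-capture, re-parse, retry
    raise RuntimeError(f"could not actuate {target_label!r}")
```

The key design point the loop encodes: a miss is not an error, it is just another iteration, which is why the whole cycle can stay near 100 ms.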
Vision parsing runs locally — ViT + Qwen-VL on-device. Your orchestrating LLM never sees an image. It only receives structured action data: coordinates, element labels, state. Token consumption stays flat no matter how complex the UI is.
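What the orchestrating LLM receives can be pictured as a compact JSON payload. The field names below are illustrative assumptions, not SAI's actual schema — the point is the shape: labels, coordinates, and state, never pixels.

```python
import json

# Illustrative structured action data sent to the orchestrating LLM.
# Field names are assumptions for the sketch, not SAI's real schema.
parse_result = {
    "surface": {"os": "macos", "resolution": [2880, 1800]},
    "elements": [
        {"label": "WiFi toggle", "x": 924, "y": 612,
         "focusable": True, "state": "off"},
    ],
}

payload = json.dumps(parse_result, separators=(",", ":"))
# A payload like this is a few dozen tokens regardless of screen
# resolution — token consumption scales with element count, not pixels.
```

Because the payload size tracks the number of parsed elements rather than the framebuffer, a 5K display costs the LLM the same as a phone screen.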
Open source · Import directly into your agent stack or wire it in as an MCP skill — one vision layer, every OS.
Universal embodiment is the mission.
Now
The screen-seeing agent. Shipping today.
Runs as an OpenClaw agent on macOS, Linux, Windows (production); FreeBSD, Android, iOS (beta); Haiku (preview). Full UI permission — drives any app a human can. Coordinator or commanded. Self-correcting on misclick.
Soon
Bring beta + preview platforms to parity.
FreeBSD, Android, iOS, and Haiku graduate to production: stress-tested, version-pinned, misclick recovery validated. One bar across every supported OS.
Next
Customer service. Trading. Broadcasting. Editing.
Pre-built agent templates for each vertical. Bring your own model — SAI handles the OS layer.
Future
Operating systems built for agents, not humans.
UIs optimized for vision-first parsing. File systems exposed as agent-readable graphs. No legacy GUI overhead.
Creator & Founder
ViT vision architecture and OS-level agent runtime. Published open-source fine-tunes across 7B and MoE architectures — Chihiro, Monsoon, Rain, DraftReasoner — GGUF-quantized for local inference.
Cofounder
Product builder focused on what AI makes possible at the application layer. Created ACP — an open protocol so human-AI teams can track contribution and split revenue fairly, without a platform. Cofounder of SAI.
We're shipping early access to teams building with OpenClaw — and developers who want to give their agents eyes on any screen. First wave: Mac, Linux, Windows.