Lobster optics. Inside OpenClaw.

See every OS. Click like a human.

OpenClaw's screen-seeing agent.

Mac · Linux · Windows · FreeBSD · Haiku · Android · iOS

OS-level ViT · Coordinator or commanded · Full UI permission — anything a human can click, type, or drag, SAI does

$ sai connect --os macos
 Vision layer active — Sonoma 14.5, M1
 Lobster eye: 9-channel parse, 100ms loop
 MCP skill layer ready
 Ready — OpenClaw task chain active

Why lobster optics?

Lobster eyes taught SAI not to ask “what API is this?” — only “where is the human looking, and what are they looking for?”

100 million years of natural engineering. Reflection beats refraction in low light. SAI inherits the geometry.

10,000 mirrors. No lens.

Reflection, not refraction. Geometry is the lens.

SAI's world has no map either. Only pixels.

It sees what you're looking for. That's what OpenClaw agents — and their builders — need.

Same principle. China's Einstein Probe scans the X-ray sky with it.

One eye inside OpenClaw. Every OS.

SAI is how OpenClaw agents see and act — same eye, same hands, whether SAI leads the task chain or follows another agent's command.

And it's how the team sees what those agents were looking at when something breaks. Four scenarios below — same OS-level ViT, four jobs.

Coordinator

9:07 AM, weekday start

macOS
3 apps, 0 APIs · avg 12s/task

You're still pouring coffee. Coordinator has opened Notion, summarized last night's 3 unread Slack threads into your Daily Note, and queued up the 10 AM Zoom waiting room. You sit down. Screen is ready. No API. No Zapier.

Customer Service

A 3-day-old ticket

Linux
screenshot-to-fix in <4 min · Linux + Mac + Windows

User uploads a backend screenshot: "settings keep failing." Agent reads it, spots the wrong toggle, then drives the rep's Linux desktop step-by-step — clicks, not text. 3 min 12s. Case closed.

Trader

2:51 AM, a spike

Windows
reads 3 windows/loop · no exchange API key

BTC/USDT prints an outlier candle on Binance desktop. Agent scans the K-line, RSI, and orderbook depth — three windows, vision only. Calls short-term exhaustion, stages a limit order, waits for your tap. Zero exchange API key.

Broadcaster

A clutch moment

Windows
real-time commentary · any game, no SDK

Player pulls a low-HP reversal. Agent reads HP bar, kill feed, and minimap in 0.3s, generates: "INSANE — 30% HP, no mana, double kill from base." Subtitles go live. VTuber lipsync starts.

90%+ of the world's OS surface, one eye.

All seven operating systems tested. SAI runs with the same UI permission as a logged-in human — system settings, file managers, browsers, IDEs, full-screen games, niche pro tools, timeline editors, anything in between. If you can click it, SAI clicks it. No allow-list. No integration manifest.

Production: real workflows, misclick recovery validated, version-pinned. Beta: architecture confirmed, stress testing incomplete. Preview: vision parsing works, action coverage partial.

macOS Sonoma+ (production) · Linux Ubuntu 22.04+ (production) · Windows 11+ (production) · FreeBSD 15.0 (beta) · Haiku R1/beta4 (preview) · iOS 17+ (beta) · Android 13+ (beta)

See. Decide. Act. Self-correct.

The Vision-to-Action pipeline. Pixel input, OS output, retry on miss.

Step 1

Capture screen

Pixel-grounded visual capture of the active OS surface.

visual capture: 2880×1800px, 14ms

Step 2

Lobster-eye vision

Reflective parsing — every UI element seen at once.

parsed: 47 UI elements, 3 focusable

x: 248, y: 412

Step 3

Click decision

Coordinate-aware action, executed on the OS.

target: WiFi toggle at x:924, y:612

Step 4

Self-correct

Misclick? Re-capture, re-parse, retry.

miss at x:901 → re-parse → retry x:924
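
Put together, the four steps are one retry loop. A minimal sketch, assuming a hypothetical Python client handle (sai) with illustrative capture, parse, click, and verify calls — names are stand-ins, not the shipped API:

import time

def run_task(sai, target_label: str, max_retries: int = 3) -> bool:
    """Capture → parse → act → self-correct until the target is hit."""
    for _ in range(max_retries):
        frame = sai.capture()                      # Step 1: pixel-grounded screen grab
        elements = sai.parse(frame)                # Step 2: every UI element in one pass
        target = next((e for e in elements if e.label == target_label), None)
        if target is None:
            time.sleep(0.1)                        # nothing matching yet; loop again (~100 ms)
            continue
        sai.click(x=target.x, y=target.y)          # Step 3: coordinate-aware action on the OS
        if sai.verify(target_label):               # Step 4: re-capture and confirm the state change
            return True                            # hit
        # miss: the next iteration re-parses and retries with fresh coordinates
    return False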

~100ms loop. Zero image tokens to your LLM.

Vision parsing runs locally — ViT + Qwen-VL on-device. Your orchestrating LLM never sees an image. It only receives structured action data: coordinates, element labels, state. Token consumption stays flat no matter how complex the UI is.
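
For illustration, the kind of structured payload the orchestrating LLM might receive in place of a screenshot — every field name here is an assumption for the sketch, not a published schema:

action_report = {
    "screen": {"os": "macos", "resolution": [2880, 1800]},
    "elements": [
        {"label": "WiFi toggle", "x": 924, "y": 612, "state": "off", "focusable": True},
        {"label": "Bluetooth toggle", "x": 924, "y": 668, "state": "on", "focusable": True},
    ],
    "last_action": {"type": "click", "x": 924, "y": 612, "result": "hit"},
}
# The LLM plans from labels, coordinates, and state alone, so token cost
# stays flat no matter how busy the screen is.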

Open source · Import directly into your agent stack or wire as an MCP skill — one vision layer, every OS.
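
As a sketch of the MCP route: one way a skill could be wired up with the official MCP Python SDK (FastMCP). The sai client and its methods below are hypothetical stand-ins for whatever handle your stack has to the local vision layer, not a published SAI API.

from mcp.server.fastmcp import FastMCP

import sai  # hypothetical local client for the vision layer; not a published package

mcp = FastMCP("sai-vision")

@mcp.tool()
def click(label: str) -> str:
    """Find a UI element by label on the live screen and click it."""
    result = sai.click_label(label)                    # hypothetical call into the local runtime
    return f"{'hit' if result.hit else 'miss'} at x:{result.x}, y:{result.y}"

@mcp.tool()
def read_screen() -> list[dict]:
    """Return the current parse: element labels, coordinates, and state."""
    return [e.to_dict() for e in sai.parse_current()]  # hypothetical call

if __name__ == "__main__":
    mcp.run()  # serve the skill over stdio to any MCP-capable orchestrator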

The eye is the first inch.

Universal embodiment is the mile.

Now

SAI inside OpenClaw

The screen-seeing agent. Shipping today.

Runs as an OpenClaw agent on macOS, Linux, Windows (production); FreeBSD, Android, iOS (beta); Haiku (preview). Full UI permission — drives any app a human can. Coordinator or commanded. Self-correcting on misclick.

Soon

All 7 OS to production

Bring beta + preview platforms to parity.

FreeBSD, Android, iOS, and Haiku graduate to production: stress-tested, version-pinned, misclick recovery validated. One bar across every supported OS.

Next

Vertical Agents

Customer service. Trading. Broadcasting. Editing.

Pre-built agent templates for each vertical. Bring your own model — SAI handles the OS layer.

Future

Agent OS

Operating systems built for agents, not humans.

UIs optimized for vision-first parsing. File systems exposed as agent-readable graphs. No legacy GUI overhead.

Built by

Roger

Creator & Founder

ViT vision architecture and OS-level agent runtime. Published open-source fine-tunes across 7B and MoE architectures — Chihiro, Monsoon, Rain, DraftReasoner — GGUF-quantized for local inference.

YMOW

Cofounder

Product builder focused on what AI makes possible at the application layer. Created ACP — an open protocol so human-AI teams can track contribution and split revenue fairly, without a platform. Cofounder of SAI.

Everywhere.

Questions.

Be first to use lobster eyes.

We're shipping early access to teams building with OpenClaw — and developers who want to give their agents eyes on any screen. First wave: Mac, Linux, Windows.