Click a visible button
describe(hint="foreground browser window")
locate(target="Get notified button in the SAI hero")
click(target="Get notified button in the SAI hero")
describe(hint="dialog opened by Get notified")
Developer documents
Conductor exposes SAI's vision-and-action loop as MCP tools. Agents observe with describe, resolve targets with locate, act through OS input, then re-describe to verify progress.
01
1. Call describe with a focused hint, such as foreground browser window.
2. Resolve the visible target before acting.
3. Use click, type_text, scroll, drag, key, or hotkey through OS input.
4. Re-describe the affected area and check whether state changed.
5. If the result is wrong, re-describe, re-locate, and retry with a more specific target.
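The five steps above can be sketched as a small retry loop. This is a minimal illustration, not Conductor's implementation: `call_tool` is a hypothetical stand-in for whatever MCP client the agent uses, the shape of the describe result is treated as opaque, and `succeeded` is a caller-supplied check. The tool names and the `hint` and `target` arguments are the ones used in the examples in this document.

```python
from typing import Any, Callable, Optional, Sequence

# Hypothetical MCP client callable: call_tool("describe", hint="...") -> result.
# It is an assumption of this sketch, not an API shipped by Conductor.
ToolCaller = Callable[..., Any]


def act_and_verify(
    call_tool: ToolCaller,
    hint: str,
    targets: Sequence[str],
    verify_hint: str,
    succeeded: Callable[[Any], bool],
) -> Optional[str]:
    """Try each target phrasing in turn, from least to most specific.

    Returns the phrasing that produced the expected state, or None.
    """
    for target in targets:
        call_tool("describe", hint=hint)                 # 1. observe
        call_tool("locate", target=target)               # 2. resolve the target
        call_tool("click", target=target)                # 3. act through OS input
        after = call_tool("describe", hint=verify_hint)  # 4. verify
        if succeeded(after):
            return target
        # 5. fall through and retry with the next, more specific phrasing
    return None
```

Passing the phrasings as an ordered list keeps step 5 explicit: each retry starts from a fresh describe and uses a more specific target instead of repeating the same click.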
02
On macOS, Conductor runs as a native app and needs Accessibility permission for input control. The grounder endpoint is configurable: local, self-hosted, or remote.
Step 1: clear macOS quarantine flag on first run
xattr -dr com.apple.quarantine /Applications/conductor-mcp.app
Step 2: register with Claude Code
claude mcp add conductor-desktop --scope user \
--env CONDUCTOR_MCP_MOBILE_ANDROID="true" \
--env CONDUCTOR_MCP_SCREENSHOT_MAX_WIDTH="1280" \
--env CONDUCTOR_MCP_BACKEND="resident" \
--env CONDUCTOR_MCP_SCREENSHOT_MODE="file" \
--env CONDUCTOR_MCP_AUTO_SCREENSHOT="false" \
--env CONDUCTOR_MCP_GROUNDER_URL="https://your-grounder.example.com" \
-- /Applications/conductor-mcp.app/Contents/MacOS/conductor-mcp
03
Conductor uses two separate models with distinct roles. They do not share a context window.
Orchestrating LLM
The model driving the task — Claude, GPT-4o, or any MCP-compatible LLM. It receives prose, labels, coordinates, and state from the grounder, not raw image pixels. It decides what to do next and calls Conductor tools.
Grounder VLM
A vision-language model that turns the screen into structured descriptions and coordinates. Powers describe and locate. Configurable via CONDUCTOR_MCP_GROUNDER_URL. Reference model: Qwen3-VL-4B-Instruct (Q4_K_M, GGUF) served via llama.cpp. The grounder_family config key controls prompt formatting.
04
| Tool | Description |
|---|---|
| describe | Read the visible screen and return prose, labels, state, and context. |
| locate | Resolve a natural-language target to screen coordinates. |
| crop | Inspect a focused visual region when the full screen is too broad. |
| wait_for | Wait until a named UI element appears before continuing. |
| list_windows | Read window titles, focus, bounds, z-order, and modality. |
| get_scene_graph | Inspect the window topology tree. |
| click / double_click / right_click | Mouse actions against semantic targets. |
| click_at / drag_at | Coordinate-based precision actions. |
| type_text | Type into the currently focused field. |
| key / hotkey | Send keyboard keys and shortcuts. |
| drag | Drag from one semantic target to another. |
| scroll | Scroll within a visible region. |
| mouse_move | Move without clicking, usually for hover UI. |
| web_list_tabs | List Chrome tabs when DevTools Protocol is available. |
| web_eval | Run JavaScript in a tab and return a result. |
| web_crop | Capture a DOM element by selector, text, or JavaScript expression. |
| web_mark | Outline an element visually, then use visual tools. |
| mobile_list_devices | List connected Android devices. |
| describe(device_id) | Observe a specific mobile device screen. |
| locate(device_id) | Resolve a mobile UI target to coordinates. |
| mobile_tap / mobile_swipe | Touch input on the device. |
| mobile_type / mobile_key | Text and key input on the device. |
| mobile_app_launch / mobile_shell | Launch Android apps or run adb shell commands. |
05
Conductor includes a persistent knowledge base (SQLite) that agents can read and write across sessions. It stores three types of entries: Skills (repeatable action patterns), Facts (discovered environment details), and Experiences (completed task outcomes). Enable it by setting CONDUCTOR_MCP_KB_PATH.
# Add to your claude mcp add command:
--env CONDUCTOR_MCP_KB_PATH="/path/to/brain.db" \
--env CONDUCTOR_MCP_KB_WRITE_ENABLED="true"
| Tool | Description |
|---|---|
| kb_record_skill | Record a repeatable action pattern the agent has learned. |
| kb_record_fact | Store a discovered fact about the environment or UI. |
| kb_record_experience | Log a completed task outcome for future reference. |
| kb_search | Search the KB with a natural-language query. |
| kb_get_skill | Retrieve a specific skill by name. |
| kb_mark_contradicted | Mark a KB entry as outdated or incorrect. |
| kb_brief | Summarise KB contents relevant to the current task. |
kb_write_enabled defaults to false — the agent can read and search the KB but not write new entries unless explicitly enabled. The KB tab in the dashboard shows all stored entries across the three subtabs.
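The read-only default can be pictured with a small SQLite-backed store. This is only a sketch of the gating idea: the `TinyKB` class, its table layout, and the substring search are invented for illustration and do not reflect Conductor's actual schema or the behaviour of kb_search.

```python
import sqlite3

# The three entry types named in this section.
ENTRY_KINDS = ("skill", "fact", "experience")


class TinyKB:
    """Illustrative store; the schema here is an assumption of this sketch."""

    def __init__(self, path: str = ":memory:", write_enabled: bool = False):
        self.write_enabled = write_enabled  # mirrors kb_write_enabled=false
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entries "
            "(kind TEXT NOT NULL, name TEXT NOT NULL, body TEXT NOT NULL)"
        )

    def record(self, kind: str, name: str, body: str) -> None:
        if kind not in ENTRY_KINDS:
            raise ValueError(f"unknown entry kind: {kind}")
        if not self.write_enabled:
            raise PermissionError("KB is read-only; writes are disabled")
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO entries VALUES (?, ?, ?)", (kind, name, body)
            )

    def search(self, text: str) -> list:
        # Plain substring match standing in for a real search.
        rows = self.db.execute(
            "SELECT kind, name, body FROM entries "
            "WHERE body LIKE ? ORDER BY rowid",
            (f"%{text}%",),
        )
        return rows.fetchall()
```

With `write_enabled` left at its default, `search` still works while `record` raises, which matches the read-and-search-only behaviour described above.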
06
When the resident backend is running, a local dashboard is available at http://127.0.0.1:8765/dashboard. The port can be changed via the port config key.
| Field | Description |
|---|---|
| State pill | IDLE / RUNNING: live agent state |
| Role | holder or subordinate agent mode |
| Backend | resident or stdio |
| Grounder URL | Active grounder endpoint |
| KB attached | Whether a knowledge base file is mounted |
| Transport | MCP transport in use |
| Input paused | Whether desktop input is currently suspended |
| PID / Uptime | Process ID and running time |
| Re-run permission test | Re-validates macOS Accessibility access |
Live history of every tool call: name, klass (obs for perception tools, act for input/action tools), status (ok / err), duration, timestamp, and truncated args. Most recent first. The system tray icon also shows the last 5 calls.
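The call history described here (newest first, an obs or act class per call, the last 5 calls in the tray) can be modelled roughly as follows. The `CallHistory` class and the membership of `OBS_TOOLS` are assumptions made for this sketch; the tool names come from the tool tables, but how Conductor classifies each tool is not stated here.

```python
from collections import deque
from dataclasses import dataclass

# Assumed set of perception tools; every other tool is treated as "act".
OBS_TOOLS = {
    "describe", "locate", "crop", "wait_for", "list_windows", "get_scene_graph",
}


@dataclass
class CallRecord:
    name: str
    klass: str          # "obs" or "act"
    status: str         # "ok" or "err"
    duration_ms: float


class CallHistory:
    def __init__(self, tray_size: int = 5):
        self._calls = deque()
        self._tray_size = tray_size

    def add(self, name: str, ok: bool, duration_ms: float) -> None:
        klass = "obs" if name in OBS_TOOLS else "act"
        status = "ok" if ok else "err"
        # appendleft keeps the most recent call at index 0.
        self._calls.appendleft(CallRecord(name, klass, status, duration_ms))

    def all(self) -> list:
        return list(self._calls)

    def tray(self) -> list:
        # The tray shows only the most recent few calls.
        return list(self._calls)[: self._tray_size]
```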
Browse KB entries across three subtabs: Skills, Facts, Experiences. Requires CONDUCTOR_MCP_KB_PATH to be set.
Live view of all config keys split into hot-reloadable (take effect on next tool call) and restart-required. See the Config reference section below for the full key list.
07
All keys are set as environment variables on the claude mcp add command with the CONDUCTOR_MCP_ prefix (e.g. CONDUCTOR_MCP_TEXT_ONLY=true). Hot-reloadable keys take effect on the next tool call without restarting Conductor.
| Key | Default | Description |
|---|---|---|
| text_only | false | Return text descriptions only — no image payload. Eliminates image tokens from the agent loop. |
| auto_screenshot | false | Capture a screenshot after every tool call. |
| screenshot_max_width | 1280 | Maximum pixel width of screenshots sent to the agent. |
| payload_max_width | 540 | Maximum width of inline image payloads in tool results. |
| tool_timing | false | Append execution duration to every tool result. |
| deltas_enabled | false | Include structural UI delta events (navigations, focus changes, node additions) in tool results. |
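The prefix rule can be sketched as a helper that turns config keys into `--env` arguments. The upper-casing rule is inferred from the single text_only example above and is an assumption; some variables may be named differently, so check the names against the dashboard before relying on it. `env_name` and `env_flags` are hypothetical helpers, not part of Conductor.

```python
PREFIX = "CONDUCTOR_MCP_"


def env_name(key: str) -> str:
    """Map a config key such as text_only to its assumed variable name."""
    return PREFIX + key.upper()


def env_flags(settings: dict) -> list:
    """Build the --env arguments for a `claude mcp add` argv list."""
    flags = []
    for key, value in settings.items():
        if isinstance(value, bool):
            # Booleans are written as lowercase true/false, as in the docs.
            value = "true" if value else "false"
        flags += ["--env", f"{env_name(key)}={value}"]
    return flags
```

For example, `env_flags({"text_only": True})` yields `["--env", "CONDUCTOR_MCP_TEXT_ONLY=true"]`, which can be spliced into the registration command before the `--` separator.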
| Key | Default | Description |
|---|---|---|
| backend | resident | Transport backend. resident keeps the process alive; stdio restarts per call. |
| grounder_url | — | URL of the VLM grounder endpoint. Supports local (llama.cpp), self-hosted, or remote. |
| grounder_family | qwen | Grounder model family — determines prompt formatting. Options: qwen, openai. |
| grounder_timeout | 30 | Seconds before a grounder request times out. |
| host | 127.0.0.1 | Host the dashboard and resident backend bind to. |
| port | 8765 | Port the dashboard serves on. |
| transport | stdio | MCP transport layer. stdio for Claude Code; sse for other clients. |
| kb_path | ~/.conductor-mcp/brain.db | Path to the SQLite knowledge base file. Required to enable the KB tab and KB tools. |
| kb_write_enabled | false | Allow the agent to write new entries to the KB. False = read-only. |
| Key | Default | Description |
|---|---|---|
| coord_system | norm1000 | Coordinate space used by locate. norm1000 normalises to 0–1000 on each axis. |
| cdp_host / cdp_port | 127.0.0.1 / 9222 | Chrome DevTools Protocol endpoint for web_eval and web_crop. |
| cdp_default_tab_filter | (empty) | Default tab substring filter when no match_url/match_title is passed to web tools. |
| mobile_android_enabled | true | Enable ADB-based Android device support. |
| mobile_ios_enabled | false | Enable iOS device support via WebDriverAgent. |
| mobile_ios_wda_url | (empty) | WebDriverAgent URL for iOS device control. |
| qdrant_url | (empty) | Qdrant vector DB URL for semantic KB search. Leave empty to use SQLite FTS only. |
| qdrant_collection_prefix | conductor | Prefix for Qdrant collection names. |
| tei_dense_url | (empty) | Text Embeddings Inference URL for dense vector embeddings (KB semantic search). |
| tei_sparse_doc_url / tei_sparse_query_url | (empty) | TEI URLs for sparse SPLADE embeddings. |
| screenshot_mode | file | How screenshots are returned: file writes to disk; inline sends base64. |
| screenshot_dir | ~/.conductor-mcp/screenshots | Directory where screenshot files are saved. |
| wait_for_poll_interval | 0.5 | Seconds between polls when wait_for is watching for an element. |
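For the norm1000 coordinate space, a conversion to pixel coordinates might look like the sketch below. It assumes 0 maps to the left or top edge and 1000 to the right or bottom edge, and it clamps to the last pixel; whether Conductor rounds or clamps the same way is not stated here. `norm1000_to_pixels` is a hypothetical helper.

```python
def norm1000_to_pixels(x_norm: float, y_norm: float,
                       width: int, height: int) -> tuple:
    """Convert 0-1000 normalised coordinates to pixel coordinates."""
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        raise ValueError("norm1000 coordinates must be within 0..1000")
    # Scale each axis independently, then clamp so 1000 lands on the
    # last valid pixel instead of one past the edge.
    x = min(width - 1, round(x_norm * width / 1000))
    y = min(height - 1, round(y_norm * height / 1000))
    return x, y
```

On a 1280 by 800 screen, the centre point (500, 500) maps to pixel (640, 400). Because both axes are normalised separately, the same located point stays valid when screenshot_max_width changes the captured resolution.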
08
Observe with describe before acting.
Use locate for semantic targets.
Prefer click/type/scroll/drag/hotkey through OS input.
After each action, re-describe the affected area and verify progress.
If a click misses, re-describe, re-locate, and retry with a more specific target.
09
describe(hint="foreground browser window")
locate(target="Get notified button in the SAI hero")
click(target="Get notified button in the SAI hero")
describe(hint="dialog opened by Get notified")

describe(hint="browser page content")
scroll(target="browser page content", direction="down", amount=5)
describe(hint="newly visible section after scrolling")

describe(hint="email capture dialog")
click(target="email input in the dialog")
type_text(text="developer@example.com")
click(target="submit button in the dialog")
wait_for(target="success message in the dialog")

describe(hint="settings panel")
locate(target="Wi-Fi toggle in the network settings panel")
click(target="Wi-Fi toggle in the network settings panel")
describe(hint="network settings panel")
locate(target="actual Wi-Fi on/off switch, not the sidebar row")
click(target="actual Wi-Fi on/off switch, not the sidebar row")
10
If click, scroll, or type actions fail with an input-lock error, another Conductor client owns desktop input. Read-only tools may still work until that client exits.
If Chrome DevTools Protocol is not reachable, use the visual path: describe, locate, click, scroll, and type_text.
If locate resolves the wrong element, rephrase the target with more visual context, for example "blue Save button in the top-right toolbar" or "actual Wi-Fi switch, not the sidebar row".
Use wait_for with a named element instead of repeatedly clicking while the screen is changing.
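The idea behind waiting instead of re-clicking is a bounded poll. The sketch below shows that pattern in isolation: `poll_until` is a hypothetical helper, `is_present` stands in for whatever check detects the element, the 0.5 second interval matches the wait_for_poll_interval default, and the 10 second timeout is an arbitrary value chosen for illustration. It does not describe how wait_for is implemented internally.

```python
import time
from typing import Callable


def poll_until(
    is_present: Callable[[], bool],
    timeout: float = 10.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Check is_present repeatedly until it is true or the timeout passes."""
    deadline = clock() + timeout
    while True:
        if is_present():
            return True
        if clock() >= deadline:
            return False
        # Pause between checks so no input is sent while the UI settles.
        sleep(interval)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real delays; in normal use the defaults apply.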