tools / browser.py
Native headless-browser automation via Playwright — Tor-aware, DNS-leak-safe, session-persistent.
Why a native tool
Before this tool, headless-browser usage was LLM-authored raw Playwright in execute(stateful=True). Four recurring footguns came with that path:
- API shape drift: await async_playwright().start() vs async with, top-level await rules, cleanup order. Any slip crashed the Jupyter kernel.
- DNS leaks: Chromium's SOCKS5 proxy tunnels traffic but can still resolve names via the container's /etc/resolv unless --host-resolver-rules is also set. Easy to miss.
- WebRTC IP leaks: STUN/TURN can expose the host IP even over Tor.
- Concurrency races: "store browser/page globally" broke when two plan branches both used the kernel.
The browser tool bakes in correct-by-construction defaults for all four. Raw Playwright remains available in execute for advanced flows (response interception, custom wait conditions, file uploads).
Tool: browser
Six operations. The first five are atomic: each launches a fresh Chromium context and re-navigates via the .last_url sidecar. The sixth, interact, runs a sequence of sub-actions in a single context so transient DOM state survives between steps — required for multi-step SPA flows.
| operation | required | returns |
|---|---|---|
| navigate | url | HTTP status, final URL, page title |
| extract_text | (none) | Rendered innerText of body (or a CSS selector), truncated at max_chars |
| click | selector | Post-click URL + title |
| screenshot | (none) | Writes PNG to out_path inside /workspace; returns /api/download/ URL |
| close | (none) | Deletes the persistent profile so the next session starts fresh |
| interact | actions | Per-action results (ok/error), final URL, final title. One Chromium context across the whole list. |
Atomic ops vs. interact
The atomic ops are well suited to single-page scrapes. They are not suitable for interactive multi-step flows, because every op's fresh context discards the JS-driven DOM mutations made by the previous op. A click that opens a modal is invisible to the next click, which sees a page freshly loaded from its initial HTML. The 2026-04-23 webOS eval hit this wall: the agent couldn't drive the Calculator because the Calculator window vanished between every click.
interact solves this by taking a whole sequence in one call:
browser(operation="interact",
        url="file:///workspace/webos/index.html",
        actions=[
            {"action": "click", "selector": "[data-app='calc']"},
            {"action": "wait_for_selector", "selector": "#calc-display"},
            {"action": "click", "selector": "#calc-btn-7"},
            {"action": "click", "selector": "#calc-btn-plus"},
            {"action": "click", "selector": "#calc-btn-3"},
            {"action": "click", "selector": "#calc-btn-eq"},
            {"action": "extract_text", "selector": "#calc-display"},
            {"action": "screenshot", "out_path": "calc_result.png"},
        ])
Sub-action types:
- {"action":"goto","url":"...","wait_until":"load"}: navigate mid-sequence (e.g. follow a link into a subpage).
- {"action":"click","selector":"...","wait_for_hidden":"#overlay","force":false}: see "Stability guard" below for the optional fields.
- {"action":"dblclick","selector":"..."}: required for ondblclick-bound UIs (desktop-icon launchers etc.). A plain click never fires dblclick listeners; the 2026-04-24 webOS session produced 8 identical screenshots because the agent fired single clicks on .desktop-icon elements whose openApp() handler was wired to dblclick only. Accepts the same wait_for_hidden/force options as click.
- {"action":"extract_text","selector":"...","max_chars":65536}: each extraction is preserved in the results so you can read multiple points in the flow.
- {"action":"fill","selector":"input[name='email']","text":"a@b","wait_for_hidden":"#busy"}: real text input (atomic ops don't support typing). Same overlay-wait option as click.
- {"action":"wait_for_selector","selector":"...","state":"visible|hidden|attached|detached","timeout_ms":5000}: state defaults to visible; use hidden to wait for an overlay to GO AWAY (the LLM-facing missing primitive that drove the 2026-04-26 incident).
- {"action":"screenshot","out_path":"..."}: sandbox-safety + host→container path rewriting is applied per sub-action.
- {"action":"sleep","ms":500}: when you need an explicit pause and there's no selector to wait on.
Stability guard: wait_for_hidden + force + wait_for_selector(state) (2026-04-26)
Bare page.click() auto-waits for its TARGET to be actionable, but knows nothing about an unrelated overlay still intercepting clicks. The 2026-04-26 webOS session lost ~70 minutes to this exact shape:
actions=[
    {"action": "click", "selector": "#unlock-btn"},  # JS sets #lock-screen.style.display='none'
    {"action": "click", "selector": "#start-btn"},   # ← fails: lock screen still intercepts
]
Three new options give the LLM a clean way to express "wait for the overlay to leave, then click":
- wait_for_hidden on click/dblclick/fill: a CSS selector for the overlay that must disappear before the action fires. Internally calls page.wait_for_selector(overlay, state="hidden") first; if the overlay does NOT go away in time, the action fails with wait_for_hidden(...) timed out before click(...), a much clearer signal than Playwright's generic "element intercepts pointer events". An optional wait_for_hidden_ms (default min(5000, op.timeout_ms)) tunes the wait.
- force on click/dblclick: skips Playwright's actionability check (visibility, stability, hit-test). Use sparingly; it is meant for cases where the LLM is certain the target is correct but Playwright is being conservative (mid-CSS-transition, or a target whose stability oracle is overzealous).
- state on wait_for_selector: defaults to visible (Playwright's own default, the prior behaviour). New values hidden/attached/detached mirror Playwright's API. Bare wait_for_selector(sel) waited for the selector to APPEAR, which is useless for an element that's already in the DOM and just needs to fade out; that was the missing primitive that drove the incident.
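The wait_for_hidden_ms default and the clearer failure message can be sketched as two pure helpers. This is a minimal illustration, not the runner's code: the function names resolve_wait_for_hidden_ms and overlay_timeout_message are hypothetical, and filling the parentheses of the message with the raw selectors is an assumption about the elided parts of the real message.

```python
def resolve_wait_for_hidden_ms(action: dict, op_timeout_ms: int) -> int:
    """Overlay wait budget in ms: explicit value, else min(5000, op timeout)."""
    explicit = action.get("wait_for_hidden_ms")
    if explicit is not None:
        return int(explicit)
    return min(5000, op_timeout_ms)


def overlay_timeout_message(action: dict) -> str:
    """Error text when the overlay is still present (selector formatting is assumed)."""
    return (
        f"wait_for_hidden({action['wait_for_hidden']}) "
        f"timed out before {action['action']}({action['selector']})"
    )


step = {"action": "click", "selector": "#start-btn", "wait_for_hidden": "#lock-screen"}
print(resolve_wait_for_hidden_ms(step, op_timeout_ms=120_000))
print(overlay_timeout_message(step))
```

Capping the default at the op timeout keeps a short-timeout op from spending longer on the overlay wait than on the action itself.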
Recommended pattern for a flow with a fading overlay:
actions=[
    {"action": "click", "selector": "#unlock-btn"},
    {"action": "wait_for_selector", "selector": "#lock-screen", "state": "hidden"},
    {"action": "dblclick", "selector": ".desktop-icon[data-app='calc']"},
    {"action": "screenshot", "out_path": "calc.png"},
]
Coverage: tests/test_browser_interact_stability_guard.py.
By default (stop_on_error=false) a failing sub-action is recorded and the sequence keeps going — so you get full visibility into which step is flaky. Pass stop_on_error=true to abort on first failure.
The subprocess wall-clock budget for interact scales with action count (timeout_ms × len(actions)), so a 20-step flow doesn't hit the single-op default cap.
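As a rough illustration of that scaling, the budget can be computed as below. The name interact_budget_ms is hypothetical, and the 120 s per-action timeout is borrowed from the abort-semantics discussion on this page; it is not confirmed as the tool's default.

```python
def interact_budget_ms(timeout_ms: int, actions: list) -> int:
    """Wall-clock budget for one interact call: timeout_ms x len(actions)."""
    return timeout_ms * len(actions)


# 20 sub-actions at an assumed 120 s per-action timeout.
steps = [{"action": "sleep", "ms": 500}] * 20
budget_ms = interact_budget_ms(120_000, steps)
print(budget_ms / 60_000)  # budget in minutes: 40.0
```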
Session persistence
Cookies, localStorage, and auth tokens survive across tool calls via Chromium's launch_persistent_context(user_data_dir=/workspace/.browser_profile). Each op opens → runs → closes cleanly; there is no long-lived subprocess to manage. Call operation="close" to wipe the profile (e.g. after logging out, or when switching user contexts).
Chained-op continuity (.last_url sidecar)
launch_persistent_context restores cookies/localStorage but NOT open pages — each op's browser process is a fresh launch. Without extra plumbing, a "step 2 continues where step 1 left off" call pattern would land on about:blank and every selector query would fail (surfaced by the 2026-04-23 eval, which then caused the LLM to burn a turn misattributing the error to a DOMContentLoaded race).
To let the LLM chain ops ergonomically, the runner writes the final URL of every successful navigation to <profile_dir>/.last_url. Any subsequent extract_text/click/screenshot without an explicit url reads the sidecar and re-navigates first. Typical usage:
browser(operation="navigate", url="https://example.com/login")
browser(operation="click", selector="#login-btn") # no url — uses sidecar
browser(operation="extract_text", selector=".welcome") # no url — uses sidecar
browser(operation="screenshot", out_path="landed.png") # no url — uses sidecar
browser(operation="close") # wipes profile + sidecar
When neither an explicit url nor a sidecar URL is available, the runner errors out with extract_text needs a URL: pass `url=...` or call `operation="navigate"` first — no silent query against about:blank.
The sidecar lives inside the profile dir, so operation="close" (which rmtree's the whole directory) wipes it for free. Empty / None URLs are never written to the sidecar, so a failed op can't blow away the last-known-good URL.
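The sidecar rules above (write only non-empty URLs, fall back to the sidecar when no url is passed, error out otherwise) can be sketched with pathlib. The helper names write_last_url, read_last_url, and resolve_url are hypothetical, and the plain-text file format is an assumption; only the .last_url file name and the error wording come from this page.

```python
from pathlib import Path
from typing import Optional

SIDECAR_NAME = ".last_url"


def write_last_url(profile_dir: Path, url: Optional[str]) -> None:
    """Record the final URL of a successful navigation; ignore empty/None."""
    if not url:
        return  # a failed op must not clobber the last-known-good URL
    profile_dir.mkdir(parents=True, exist_ok=True)
    (profile_dir / SIDECAR_NAME).write_text(url, encoding="utf-8")


def read_last_url(profile_dir: Path) -> Optional[str]:
    """Return the stored URL, or None when no sidecar exists."""
    sidecar = profile_dir / SIDECAR_NAME
    if not sidecar.is_file():
        return None
    return sidecar.read_text(encoding="utf-8").strip() or None


def resolve_url(op: str, explicit_url: Optional[str], profile_dir: Path) -> str:
    """Pick the explicit url, else the sidecar; never fall through to about:blank."""
    url = explicit_url or read_last_url(profile_dir)
    if not url:
        raise ValueError(
            f'{op} needs a URL: pass `url=...` or call `operation="navigate"` first'
        )
    return url
```

Because the file sits inside the profile directory, removing that directory on close needs no separate sidecar cleanup.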
Tor & DNS-leak hardening
When the sandbox was started with a tor_proxy, every op forwards it as Chromium's --proxy-server. The in-sandbox runner (_chromium_args) always adds, alongside the Docker-safe flags:
- --host-resolver-rules="MAP * ~NOTFOUND , EXCLUDE localhost": forces DNS through the SOCKS proxy; EXCLUDE localhost keeps the self-play fixture server + in-container services reachable.
- --webrtc-ip-handling-policy=disable_non_proxied_udp: blocks WebRTC STUN/TURN egress that could expose the host IP.
- --disable-features=WebRtcHideLocalIpsWithMdns: pairs with the above.
For SOCKS5 proxies, Chromium's --proxy-server accepts the socks5:// URL scheme but not socks5h://. If the caller supplies socks5h:// (some repo helpers do, for httpx compatibility), the tool rewrites it to socks5:// automatically; DNS-over-proxy is enforced via --host-resolver-rules regardless of URL scheme.
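A minimal sketch of how the hardening flags and the scheme rewrite might fit together. The names normalize_proxy and hardening_args are hypothetical stand-ins (the real logic lives in _chromium_args, which is not shown here), and the Docker-safe flags are deliberately left out because this page does not list them.

```python
from typing import List, Optional


def normalize_proxy(proxy: str) -> str:
    """Rewrite socks5h:// to socks5://; leave other schemes untouched."""
    prefix = "socks5h://"
    if proxy.startswith(prefix):
        return "socks5://" + proxy[len(prefix):]
    return proxy


def hardening_args(tor_proxy: Optional[str]) -> List[str]:
    """Leak-hardening flags, plus --proxy-server when a Tor proxy is configured."""
    args = [
        # No shell quoting needed: each list item is passed as one argv entry.
        "--host-resolver-rules=MAP * ~NOTFOUND , EXCLUDE localhost",
        "--webrtc-ip-handling-policy=disable_non_proxied_udp",
        "--disable-features=WebRtcHideLocalIpsWithMdns",
    ]
    if tor_proxy:
        args.append(f"--proxy-server={normalize_proxy(tor_proxy)}")
    return args


for flag in hardening_args("socks5h://127.0.0.1:9050"):
    print(flag)
```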
Runner protocol
A tiny script .browser_runner.py is written into /workspace on every call and invoked as python3 .browser_runner.py <op_json>. It emits exactly one sentinel line, in one of two forms:
[BROWSER_OK] {...json payload...}
[BROWSER_ERR] <message>
The tool scans stdout for the sentinel so Chromium's unrelated warnings (GTK, ALSA, libpci) don't corrupt the result.
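The scan can be pictured as a line-by-line search for either prefix. This is a sketch under assumptions: parse_runner_stdout is a hypothetical name, and returning None when no sentinel is found is one possible way to signal the "crashed before printing" case.

```python
import json
from typing import Optional, Tuple, Union

OK_PREFIX = "[BROWSER_OK]"
ERR_PREFIX = "[BROWSER_ERR]"


def parse_runner_stdout(stdout: str) -> Optional[Tuple[bool, Union[dict, str]]]:
    """Return (True, payload) or (False, message); None if no sentinel line exists."""
    for raw in stdout.splitlines():
        line = raw.strip()
        if line.startswith(OK_PREFIX):
            # Everything after the prefix is the JSON payload.
            return True, json.loads(line[len(OK_PREFIX):].strip())
        if line.startswith(ERR_PREFIX):
            return False, line[len(ERR_PREFIX):].strip()
    return None  # Chromium noise only; caller falls back to the stderr tail
```

Lines that do not start with a sentinel prefix are skipped, so GTK, ALSA, or libpci warnings before or after the sentinel cannot corrupt the parsed result.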
Return shape
Human-readable, deterministic per op, so downstream prompts can pattern-match:
--- BROWSER RESULT ---
STATUS: OK
OP: navigate
URL: https://example.com/
HTTP_STATUS: 200
TITLE: Example Domain
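Because the layout is deterministic, a downstream consumer could recover the fields with a simple line split. A minimal sketch: parse_browser_result is a hypothetical helper, not part of the tool, and it assumes every field fits on one KEY: value line as in the sample above.

```python
def parse_browser_result(text: str) -> dict:
    """Collect KEY: value lines that follow the BROWSER RESULT banner."""
    fields = {}
    in_result = False
    for line in text.splitlines():
        if line.strip() == "--- BROWSER RESULT ---":
            in_result = True
            continue
        if in_result and ": " in line:
            # Split on the first ": " only, so URLs keep their own colons.
            key, value = line.split(": ", 1)
            fields[key.strip()] = value.strip()
    return fields


sample = """--- BROWSER RESULT ---
STATUS: OK
OP: navigate
URL: https://example.com/
HTTP_STATUS: 200
TITLE: Example Domain"""
result = parse_browser_result(sample)
print(result["STATUS"], result["URL"])
```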
Failure modes surfaced
- Unknown operation → STATUS: ERROR with valid-ops list.
- Runner exits non-zero with a sentinel → the error is surfaced verbatim, plus a timeout/install hint.
- Chromium crashes before printing a sentinel → raw tail of stderr is returned so the agent has something to debug with.
Interact abort semantics (navigation failures are terminal)
Under the default stop_on_error=False, a per-action failure (e.g. a click on a missing selector) is recorded and the loop continues — useful for "try all these selectors, tell me which matched" exploratory flows.
Navigation failures are the one exception: they always abort the sequence, regardless of stop_on_error. A page.goto(...) that raises (ERR_FILE_NOT_FOUND, ERR_CONNECTION_REFUSED, DNS failure, …) leaves Chromium on an error page; every subsequent click / fill / extract_text would just wait the full per-action timeout for elements that don't exist. Before the fix, a 54-action sequence whose first goto 404'd hung for ~108 minutes (54 × 120 s per-action timeout) before the outer subprocess watchdog killed it.
The fix: op_interact catches the goto exception, records aborted_sequence: True on the result entry, and breaks out of the loop immediately. The agent-facing output gains a banner:
⚠ SEQUENCE ABORTED: goto_failed. Remaining actions were NOT executed
because the initial navigation failed — page.click/fill/extract on an
error page would have just timed out one-by-one. Fix the URL and
retry the whole interact call.
…so the next-turn planner reads the failure as "bad URL, retry the whole interact" rather than misinterpreting it as 53 mysterious per-action selector bugs. The return dict grows two fields: aborted: bool and abort_reason: "goto_failed" | "initial_goto_failed" | None. Non-navigation failures still honour the stop_on_error contract. Covered by tests/test_browser_interact_abort.py (9 cases including initial-goto failure, explicit-goto failure, implicit-nav failure, mid-sequence goto, stop_on_error preservation, and happy path).
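The control flow described in this section can be sketched as a small loop. This is an illustration under assumptions, not op_interact itself: run_sequence, NavigationError, and the fake_runner stand-in are hypothetical, only the "goto_failed" reason is modelled, and leaving aborted as False when stop_on_error ends the loop is a guess about the real contract.

```python
from typing import Callable, List


class NavigationError(Exception):
    """Hypothetical stand-in for a page.goto(...) failure."""


def run_sequence(actions: List[dict], run_action: Callable[[dict], object],
                 stop_on_error: bool = False) -> dict:
    results, aborted, abort_reason = [], False, None
    for action in actions:
        name = action["action"]
        try:
            value = run_action(action)
            results.append({"action": name, "ok": True, "value": value})
        except NavigationError as exc:
            # Navigation failures are terminal regardless of stop_on_error.
            results.append({"action": name, "ok": False, "error": str(exc),
                            "aborted_sequence": True})
            aborted, abort_reason = True, "goto_failed"
            break
        except Exception as exc:
            # Ordinary failures are recorded; the loop continues unless asked to stop.
            results.append({"action": name, "ok": False, "error": str(exc)})
            if stop_on_error:
                break
    return {"results": results, "aborted": aborted, "abort_reason": abort_reason}


def fake_runner(action: dict) -> str:
    """Test double: fails goto on file:///missing and clicks on #nope."""
    if action["action"] == "goto" and action["url"] == "file:///missing":
        raise NavigationError("ERR_FILE_NOT_FOUND")
    if action.get("selector") == "#nope":
        raise ValueError("selector not found")
    return "done"
```

With this shape, a failed goto in the middle of a list stops the sequence immediately, while a missing selector only marks its own entry as failed.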
Related
- execute — raw Playwright fallback path.
- search — for static-HTML fetches; cheaper than a full browser render.
- sandbox/docker — Chromium install + Tor daemon bootstrap.