ANCHOR: Branch-Point Data Generation for GUI Agents
- URL: http://arxiv.org/abs/2602.07153v1
- Date: Fri, 06 Feb 2026 19:55:26 GMT
- Title: ANCHOR: Branch-Point Data Generation for GUI Agents
- Authors: Jinbiao Wei, Yilun Zhao, Kangqi Ni, Arman Cohan
- Abstract summary: End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data. We present a trajectory expansion framework, Anchor, that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements.
- Score: 52.22377425487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data, yet collecting human demonstrations is expensive and existing synthetic pipelines often suffer from limited task diversity or noisy, goal-drifting trajectories. We present a trajectory expansion framework, Anchor, that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Starting from each seed, we identify branch points that correspond to meaningful state changes and propose new, state-grounded task variants conditioned on the current GUI context. An executing agent then follows the proposed instructions to generate new trajectories, while a verifier enforces task completion via state-aware checks and trajectory-level consistency. To improve supervision quality, we further apply task-conditioned step-level filtering to remove ungrounded actions and denoise post-branch segments to maintain coherent intent. Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements over zero-shot agents and representative synthesis baselines, and generalize across applications and operating systems.
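The expansion loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Step` dataclass, the `state_changed` flag, and the `propose`/`execute`/`verify` callables are all hypothetical stand-ins for the paper's branch-point detector, task proposer, executing agent, and state-aware verifier.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    state_changed: bool  # did this action produce a meaningful GUI state change?

def find_branch_points(trajectory):
    """Branch points are steps whose execution meaningfully changed the state."""
    return [i for i, step in enumerate(trajectory) if step.state_changed]

def expand_seed(trajectory, propose, execute, verify):
    """Expand one verified seed demonstration into new (task, trajectory) pairs.

    propose(prefix)        -> a state-grounded task variant for the current context
    execute(task, prefix)  -> steps an executing agent appends after the branch
    verify(task, candidate)-> True if state-aware checks confirm task completion
    """
    expanded = []
    for i in find_branch_points(trajectory):
        prefix = trajectory[: i + 1]       # shared history up to the branch point
        task = propose(prefix)             # new task conditioned on current GUI state
        suffix = execute(task, prefix)     # agent continues from the branched state
        candidate = prefix + suffix
        if verify(task, candidate):        # keep only verified trajectories
            expanded.append((task, candidate))
    return expanded
```

In a real pipeline the kept trajectories would additionally pass through the task-conditioned step-level filter before being added to the training corpus; that stage is omitted here for brevity.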
Related papers
- Constitutional Black-Box Monitoring for Scheming in LLM Agents [1.4619913143519836]
We use language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors that detect scheming using only externally observable inputs and outputs. We find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization.
arXiv Detail & Related papers (2026-02-28T22:31:32Z)
- Computer-Using World Model [58.59112582915026]
We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next user interface (UI) state. CUWM first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution.
arXiv Detail & Related papers (2026-02-19T13:48:29Z)
- AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis [30.512393568258105]
Large Language Model agents demonstrate potential in solving real-world problems via tools, yet generalist intelligence is bottlenecked by scarce high-quality, long-horizon data. We propose AgentSkiller, a fully automated framework synthesizing multi-turn interaction data across realistic, semantically linked domains.
arXiv Detail & Related papers (2026-02-10T03:21:42Z)
- GEBench: Benchmarking Image Generation Models as GUI Environments [49.513441724802135]
We introduce GEBench, a benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GE-Score is a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks.
arXiv Detail & Related papers (2026-02-09T18:52:02Z)
- ProBench: Benchmarking GUI Agents with Accurate Process Information [15.519853892615272]
We introduce ProBench, a comprehensive benchmark with over 200 challenging GUI tasks covering widely-used scenarios. We extend our dataset to include Process-related Tasks and design a specialized evaluation method. Our evaluation of advanced GUI agents reveals significant limitations in real-world GUI scenarios.
arXiv Detail & Related papers (2025-11-12T09:49:31Z)
- GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents [59.107657859025586]
GUI-360$^\circ$ is a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications. The dataset supports three canonical tasks, GUI grounding, screen parsing, and action prediction, and a hybrid GUI+API action space.
arXiv Detail & Related papers (2025-11-06T12:19:02Z)
- XBOUND: Exploring Capability Boundaries of Device-Control Agents at the State Level [43.73689966281675]
Device-Control Agents (DC agents) manage graphical user interfaces (GUIs). We propose a new evaluation method, XBOUND, to evaluate the accuracy of instruction completion on a per-state basis. Our evaluation yields several key insights: UI-TARS stands out as the strongest 7B model, current agents display a bimodal performance pattern in instruction unification, and sub-7B models remain limited in state mastery.
arXiv Detail & Related papers (2025-05-27T14:49:30Z)
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.390060529534644]
We propose OS-Genesis, a novel data synthesis pipeline for Graphical User Interface (GUI) agents. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks.
arXiv Detail & Related papers (2024-12-27T16:21:58Z)
- Robust Object Detection via Instance-Level Temporal Cycle Confusion [89.1027433760578]
We study the effectiveness of auxiliary self-supervised tasks to improve the out-of-distribution generalization of object detectors.
Inspired by the principle of maximum entropy, we introduce a novel self-supervised task, instance-level temporal cycle confusion (CycConf).
For each object, the task is to find the most different object proposals in the adjacent frame in a video and then cycle back to itself for self-supervision.
arXiv Detail & Related papers (2021-04-16T21:35:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.