ANCHOR: Branch-Point Data Generation for GUI Agents
- URL: http://arxiv.org/abs/2602.07153v1
- Date: Fri, 06 Feb 2026 19:55:26 GMT
- Title: ANCHOR: Branch-Point Data Generation for GUI Agents
- Authors: Jinbiao Wei, Yilun Zhao, Kangqi Ni, Arman Cohan
- Abstract summary: End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data. We present a trajectory expansion framework, Anchor, that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements.
- Score: 52.22377425487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data, yet collecting human demonstrations is expensive and existing synthetic pipelines often suffer from limited task diversity or noisy, goal-drifting trajectories. We present a trajectory expansion framework, Anchor, that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Starting from each seed, we identify branch points that correspond to meaningful state changes and propose new, state-grounded task variants conditioned on the current GUI context. An executing agent then follows the proposed instructions to generate new trajectories, while a verifier enforces task completion via state-aware checks and trajectory-level consistency. To improve supervision quality, we further apply task-conditioned step-level filtering to remove ungrounded actions and denoise post-branch segments to maintain coherent intent. Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements over zero-shot agents and representative synthesis baselines, and generalize across applications and operating systems.
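The expansion loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Step` dataclass, the `state_changed` flag, and the `propose`/`execute`/`verify` callables are all hypothetical stand-ins for the paper's branch-point detector, task proposer, executing agent, and state-aware verifier.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    state_changed: bool  # did this action produce a meaningful GUI state change?

def find_branch_points(trajectory):
    """Branch points are steps whose execution meaningfully changed the state."""
    return [i for i, step in enumerate(trajectory) if step.state_changed]

def expand_seed(trajectory, propose, execute, verify):
    """Expand one verified seed demonstration into new (task, trajectory) pairs.

    propose(prefix)        -> a state-grounded task variant for the current context
    execute(task, prefix)  -> steps an executing agent appends after the branch
    verify(task, candidate)-> True if state-aware checks confirm task completion
    """
    expanded = []
    for i in find_branch_points(trajectory):
        prefix = trajectory[: i + 1]       # shared history up to the branch point
        task = propose(prefix)             # new task conditioned on current GUI state
        suffix = execute(task, prefix)     # agent continues from the branched state
        candidate = prefix + suffix
        if verify(task, candidate):        # keep only verified trajectories
            expanded.append((task, candidate))
    return expanded
```

In a real pipeline the kept trajectories would additionally pass through the task-conditioned step-level filter before being added to the training corpus; that stage is omitted here for brevity.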
Related papers
- Constitutional Black-Box Monitoring for Scheming in LLM Agents [1.4619913143519836]
We use language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors that detect scheming using only externally observable inputs and outputs. We find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization.
arXiv Detail & Related papers (2026-02-28T22:31:32Z)
- Computer-Using World Model [58.59112582915026]
We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next user interface (UI) state. CUWM first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution.
arXiv Detail & Related papers (2026-02-19T13:48:29Z)
- AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis [30.512393568258105]
Large Language Model agents demonstrate potential in solving real-world problems via tools, yet generalist intelligence is bottlenecked by scarce high-quality, long-horizon data. We propose AgentSkiller, a fully automated framework synthesizing multi-turn interaction data across realistic, semantically linked domains.
arXiv Detail & Related papers (2026-02-10T03:21:42Z)
- GEBench: Benchmarking Image Generation Models as GUI Environments [49.513441724802135]
We introduce GEBench, a benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GE-Score is a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks.
arXiv Detail & Related papers (2026-02-09T18:52:02Z)
- ProBench: Benchmarking GUI Agents with Accurate Process Information [15.519853892615272]
We introduce ProBench, a comprehensive benchmark with over 200 challenging GUI tasks covering widely-used scenarios. We extend our dataset to include Process-related Tasks and design a specialized evaluation method. Our evaluation of advanced GUI agents reveals significant limitations in real-world GUI scenarios.
arXiv Detail & Related papers (2025-11-12T09:49:31Z)
- GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents [59.107657859025586]
GUI-360$^\circ$ is a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications. The dataset supports three canonical tasks, GUI grounding, screen parsing, and action prediction, and a hybrid GUI+API action space.
arXiv Detail & Related papers (2025-11-06T12:19:02Z)
- XBOUND: Exploring Capability Boundaries of Device-Control Agents at the State Level [43.73689966281675]
Device-Control Agents (DC agents) manage graphical user interfaces (GUIs). We propose a new evaluation method, XBOUND, to evaluate the accuracy of instruction completion on a per-state basis. Our evaluation yields several key insights: UI-TARS stands out as the strongest 7B model, current agents display a bimodal performance pattern in instruction unification, and sub-7B models remain limited in state mastery.
arXiv Detail & Related papers (2025-05-27T14:49:30Z)
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.390060529534644]
We propose OS-Genesis, a novel data synthesis pipeline for Graphical User Interface (GUI) agents. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks.
arXiv Detail & Related papers (2024-12-27T16:21:58Z)
- Robust Object Detection via Instance-Level Temporal Cycle Confusion [89.1027433760578]
We study the effectiveness of auxiliary self-supervised tasks to improve the out-of-distribution generalization of object detectors.
Inspired by the principle of maximum entropy, we introduce a novel self-supervised task, instance-level temporal cycle confusion (CycConf).
For each object, the task is to find the most different object proposals in the adjacent frame in a video and then cycle back to itself for self-supervision.
arXiv Detail & Related papers (2021-04-16T21:35:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.