NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset
- URL: http://arxiv.org/abs/2508.01330v2
- Date: Thu, 07 Aug 2025 09:42:28 GMT
- Title: NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset
- Authors: Zihan Zheng, Tianle Cui, Chuwen Xie, Jiahui Zhang, Jiahui Pan, Lewei He, Qianglong Chen,
- Abstract summary: We introduce NaturalGAIA, a novel benchmark engineered on the principle of Causal Pathways.<n>This paradigm structures complex tasks into a series of verifiable atomic steps, ensuring rigorous, fully automated, and reproducible standard for assessment.<n>We then utilize this dataset to perform Reinforcement FineTuning (RFT) on the Q2.5-VL-7B model.
- Score: 16.676904484703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of Large Language Model (LLM)-driven Graphical User Interface (GUI) agents is significantly hampered by the profound limitations of existing evaluation benchmarks in terms of accuracy, reproducibility, and scalability. To address this critical gap, we introduce NaturalGAIA, a novel benchmark engineered on the principle of Causal Pathways. This design paradigm structures complex tasks into a series of programmatically verifiable atomic steps, ensuring a rigorous, fully automated, and reproducible standard for assessment. Concurrently, to mitigate the inherent capability deficits of agents, we developed LightManus, a hierarchical agent architecture specifically optimized for long-horizon tasks. We leveraged this agent to generate a high-quality, human-verified trajectory dataset that uniquely captures diverse and even self-correcting interaction patterns of LLMs. We then utilized this dataset to perform Reinforcement Fine-Tuning (RFT) on the Qwen2.5-VL-7B model. Our experiments reveal that NaturalGAIA presents a formidable challenge to current state-of-the-art LLMs; even the top-performing Claude-sonnet-4 achieved a Weighted Pathway Success Rate (WPSR) of only 34.6%. Moreover, while RFT substantially improved the smaller model's GUI execution capabilities (WPSR increased from 3.3% to 10.8%), its performance degraded sharply when handling complex scenarios. This outcome highlights the inherent capability ceiling of smaller models when faced with comprehensive tasks that integrate perception, decision-making, and execution. This research contributes a rigorous evaluation standard and a high-quality dataset to the community, aiming to guide the future development of GUI agents.
Related papers
- Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models [15.709482146201283]
A simple linear classifier trained on the frozen features of modern Vision Foundation Models establishes a new state-of-the-art.<n>We show that this baseline matches specialized detectors on standard benchmarks but also decisively outperforms them on in-the-wild datasets.<n>We conclude by advocating for a paradigm shift in AI forensics, moving from overfitting on static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.
arXiv Detail & Related papers (2026-02-02T07:20:02Z) - FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution [87.57784204422218]
Reinforcement Learning with Human Feedback has proven effective in image generation field guided by reward models to align human preferences.<n>We propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on ancoder-Decoder architecture.<n>While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects.
arXiv Detail & Related papers (2025-12-27T16:55:21Z) - IRPO: Boosting Image Restoration via Post-training GRPO [59.588079259093035]
We propose IRPO, a low-level GRPO-based post-training paradigm.<n>We first explore a data formulation principle for low-level post-training paradigm.<n>We then model a reward-level criteria system that balances objective accuracy and human perceptual preference.
arXiv Detail & Related papers (2025-11-30T09:42:24Z) - Pushing the Boundaries of Interpretability: Incremental Enhancements to the Explainable Boosting Machine [1.2461503242570642]
This paper aims at improving the Explainable Boosting Machine (EBM), a state-of-the-art glassbox model that delivers both high accuracy and complete transparency.<n>The work is positioned as a critical step toward developing machine learning systems that are robust, equitable, and transparent.
arXiv Detail & Related papers (2025-11-29T15:46:13Z) - EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence [17.644658293987955]
Embodied AI agents are capable of robust spatial perception, effective task planning, and adaptive execution in physical environments.<n>Current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations.<n>We propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes.
arXiv Detail & Related papers (2025-10-23T14:05:55Z) - Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization [53.82400605816587]
Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation.<n>A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios.<n>We introduce Continual AQA (CAQA), which equips with Continual Learning capabilities to handle evolving distributions.
arXiv Detail & Related papers (2025-10-08T10:09:47Z) - LLMShot: Reducing snapshot testing maintenance via LLMs [0.5218155982819203]
Snapshot testing has emerged as a critical technique for UI validation in modern software development.<n>This paper introduces LLMShot, a novel framework that leverages Vision-Language Models (VLMs) to automatically analyze snapshot test failures.
arXiv Detail & Related papers (2025-07-14T08:47:19Z) - GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies [34.63675989928621]
We introduce GUI-Robust, a novel dataset designed for comprehensive GUI agent evaluation.<n>We also propose a semi-automated dataset construction paradigm that collects user action sequences from natural interactions via RPA tools.<n>This paradigm significantly reduces annotation time cost by a factor of over 19 times.<n>We assess state-of-the-art GUI agents using the GUI-Robust dataset, revealing their substantial performance degradation in abnormal scenarios.
arXiv Detail & Related papers (2025-06-17T12:50:35Z) - Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation [83.92224427735859]
We introduce a pre-operative critic mechanism that provides effective feedback prior to the actual execution.<n>We develop a reasoning-bootstrapping based data collection pipeline to create a GUI-Critic-Train and a GUI-Critic-Test.<n>Our model offers significant advantages in critic accuracy compared to current MLLMs.
arXiv Detail & Related papers (2025-06-05T04:12:36Z) - UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents [37.871793585090586]
We introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents.<n> verification of trajectory outcome is challenging and high-quality training data are not scalable.<n>We show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks.
arXiv Detail & Related papers (2025-05-27T17:58:06Z) - Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning [28.291759852111586]
We introduce a reinforcement learning (RL) based framework that incorporates three core strategies to improve GUI agent performance.<n>With only 3k training samples, our 7B- parameter model achieves state-of-the-art results among similarly sized models.<n> Notably, it attains 47.3% accuracy on the ScreenSpot-Pro dataset, outperforming much larger models, such as UI-TARS-72B, by a margin of 24.2%.
arXiv Detail & Related papers (2025-05-18T11:22:04Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.<n>Our framework incorporates two complementary strategies: internal TTC and external TTC.<n>We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility.<n>This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.<n>We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z) - OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.390060529534644]
We propose OS-Genesis, a novel data synthesis pipeline for Graphical User Interface (GUI) agents.<n>Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions.<n>We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks.
arXiv Detail & Related papers (2024-12-27T16:21:58Z) - Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [44.34340798542]
Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning.
Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities.
We propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions.
arXiv Detail & Related papers (2024-08-13T20:52:13Z) - Improving Retrieval Augmented Language Model with Self-Reasoning [20.715106330314605]
We propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs.<n>The framework involves constructing self-reason trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process.<n>We have evaluated our framework across four public datasets to demonstrate the superiority of our method.
arXiv Detail & Related papers (2024-07-29T09:05:10Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood
Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task has applications in safety-sensitive applications such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.