Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms
- URL: http://arxiv.org/abs/2510.13913v1
- Date: Wed, 15 Oct 2025 06:34:46 GMT
- Title: Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms
- Authors: Shrey Pandit, Xuan-Phi Nguyen, Yifei Ming, Austin Xu, Jiayu Wang, Caiming Xiong, Shafiq Joty,
- Abstract summary: Web-based 'deep research' agents aim to solve complex question - answering tasks through long-horizon interactions with online tools.<n>These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning.<n>We introduce a two-pronged data synthesis pipeline that generates question - answer pairs by progressively increasing complexity.
- Score: 81.90219895125178
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Web-based 'deep research' agents aim to solve complex question - answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question - answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset - despite being smaller - enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.
Related papers
- From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents [23.583947864141162]
EigenData is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers.<n>Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training.<n>Our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
arXiv Detail & Related papers (2026-01-30T06:01:23Z) - SAGE: Steerable Agentic Data Generation for Deep Search with Execution Feedback [68.60326181052658]
We propose an agentic pipeline that automatically generates high quality, difficulty-controlled deep search question-answer pairs.<n>Our pipeline, SAGE, consists of a data generator which proposes QA pairs and a search agent which attempts to solve the generated question.<n>Our intrinsic evaluation shows SAGE generates questions that require diverse reasoning strategies, while significantly increases the correctness and difficulty of the generated data.
arXiv Detail & Related papers (2026-01-26T06:37:56Z) - Autonomous Data Agents: A New Opportunity for Smart Data [50.02229219403014]
Report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems.<n>DataAgents transform complex and unstructured data into coherent and actionable knowledge.<n>We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend.
arXiv Detail & Related papers (2025-09-23T06:46:41Z) - WebSynthesis: World-Model-Guided MCTS for Efficient WebUI-Trajectory Synthesis [34.998277998052444]
We propose WebSynthesis, a novel framework for trajectory synthesis and training.<n>We show that an agent trained using WebSynthesis on a small-scale synthetic dataset achieves performance comparable to or even surpassing that of models trained on large-scale real-world data.
arXiv Detail & Related papers (2025-07-06T12:31:10Z) - What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Long language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.<n>Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement.<n>We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.<n>Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z) - Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior.
The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.
We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
arXiv Detail & Related papers (2022-02-17T02:44:05Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person
Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.