Related papers: Self-Challenging Language Model Agents

Self-Challenging Language Model Agents

URL: http://arxiv.org/abs/2506.01716v1
Date: Mon, 02 Jun 2025 14:23:33 GMT
Title: Self-Challenging Language Model Agents
Authors: Yifei Zhou, Sergey Levine, Jason Weston, Xian Li, Sainbayar Sukhbaatar,
Abstract summary: We propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself.<n>The framework achieves over a two-fold improvement in Llama-3.1-8B-Instruct, despite using only self-generated training data.
Score: 98.62637336505242
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models are quickly becoming the foundation for intelligent agents that are capable of using tools. However, training such agents is challenging because it requires human creation and annotation of a diverse set of tasks, tools, and evaluation criteria. In this paper, we propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself. The agent first plays the role of challenger and generates a task after interacting with the given tools. The tasks take the form of a novel general class of problems termed Code-as-Task, which are defined by an instruction, a verification function and solution and failure cases which serve as tests, allowing to filter only for high-quality tasks. The agent then takes an executor role and trains on those tasks with reinforcement learning using the evaluation feedback as a reward. Evaluation on two existing multi-turn tool-use agent benchmarks, M3ToolEval and TauBench, shows the Self-Challenging framework achieves over a two-fold improvement in Llama-3.1-8B-Instruct, despite using only self-generated training data.

Related papers

Synthesizing High-Quality Programming Tasks with LLM-based Expert and Student Agents [26.884829816265174]
PyTaskSyn is a novel synthesis technique that first generates a programming task and then decides whether it meets certain quality criteria to be given to students.<n>We show that PyTaskSyn significantly improves task quality compared to baseline techniques and showcases the importance of each specialized agent type in our validation pipeline.
arXiv Detail & Related papers (2025-04-10T11:08:39Z)
Learning to Use Tools via Cooperative and Interactive Agents [58.77710337157665]
Tool learning empowers large language models (LLMs) as agents to use external tools and extend their utility. We propose ConAgents, a Cooperative and interactive Agents framework, which coordinates three specialized agents for tool selection, tool execution, and action calibration separately. Our experiments on three datasets show that the LLMs, when equipped with ConAgents, outperform baselines with substantial improvement.
arXiv Detail & Related papers (2024-03-05T15:08:16Z)
TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation. Specifically, task decomposition, tool selection, and parameter prediction are assessed. Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
A Zero-Shot Language Agent for Computer Control with Structured Reflection [19.526676887048662]
Large language models (LLMs) have shown increasing capacity at planning and executing a high-level goal in a live computer environment. To perform a task, recent works often require a model to learn from trace examples of the task via either supervised learning or few/many-shot prompting. We approach this problem with a zero-shot agent that requires no given expert traces.
arXiv Detail & Related papers (2023-10-12T21:53:37Z)
Collaborating with language models for embodied reasoning [30.82976922056617]
Reasoning in a complex and ambiguous environment is a key goal for Reinforcement Learning (RL) agents. We present a set of tasks that require reasoning, test this system's ability to generalize zero-shot and investigate failure cases.
arXiv Detail & Related papers (2023-02-01T21:26:32Z)
Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning [13.468555224407764]
We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.
arXiv Detail & Related papers (2022-07-05T18:23:20Z)
Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph. Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks. Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain. We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a First-person Simulated 3D Environment [73.9469267445146]
First-person object-interaction tasks in high-fidelity, 3D, simulated environments such as the AI2Thor pose significant sample-efficiency challenges for reinforcement learning agents. We show that one can learn object-interaction tasks from scratch without supervision by learning an attentive object-model as an auxiliary task.
arXiv Detail & Related papers (2020-10-28T19:27:26Z)
Automatic Curriculum Learning through Value Disagreement [95.19299356298876]
Continually solving new, unsolved tasks is the key to learning diverse behaviors. In the multi-task domain, where an agent needs to reach multiple goals, the choice of training goals can largely affect sample efficiency. We propose setting up an automatic curriculum for goals that the agent needs to solve. We evaluate our method across 13 multi-goal robotic tasks and 5 navigation tasks, and demonstrate performance gains over current state-of-the-art methods.
arXiv Detail & Related papers (2020-06-17T03:58:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.