Related papers: $τ$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

URL: http://arxiv.org/abs/2406.12045v1
Date: Mon, 17 Jun 2024 19:33:08 GMT
Title: $τ$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Authors: Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan
Abstract summary: $\tau$-bench is a benchmark emulating dynamic conversations between a user and a language agent. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state.
Score: 43.43344028212623
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.
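The abstract names the pass^k metric without defining it. The following is a minimal sketch, assuming pass^k means the probability that all k independent trials of a task succeed, averaged over tasks. The combinatorial estimator C(c, k) / C(n, k), with c successes out of n recorded trials, and the names pass_hat_k and benchmark_pass_hat_k are illustrative assumptions, not details taken from the abstract.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials, drawn without replacement
    from the recorded trials of one task, all succeed (assumed definition)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb(successes, k) is 0 when k > successes, so the estimate drops to 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks.

    `results` maps a hypothetical task id to its list of trial outcomes.
    """
    scores = [pass_hat_k(sum(r), len(r), k) for r in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes: one fully reliable task, one that succeeds half the time.
example_results = {
    "task_a": [True] * 8,
    "task_b": [True, True, True, True, False, False, False, False],
}

for k in (1, 2, 8):
    print(k, benchmark_pass_hat_k(example_results, k))
```

Under this assumed definition the score can only stay flat or fall as k grows, which is why an agent with a reasonable single-trial success rate can still score low at k = 8.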

Related papers

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories [59.214178488091584]
We propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents.
arXiv Detail & Related papers (2025-04-11T19:49:22Z)
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation [13.440594349043916]
We develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs.
arXiv Detail & Related papers (2025-02-24T13:58:42Z)
Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems [6.8738526619759535]
Offline datasets have been used to evaluate task-oriented dialogue (TOD) models. User-agents, which are context-aware, can simulate the variability and unpredictability of human conversations.
arXiv Detail & Related papers (2024-11-15T06:05:45Z)
Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs [29.72874725703848]
We introduce two concepts: Benchmark+, which extends traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process. We propose an agent-based evaluation framework called TestAgent, which implements these concepts through retrieval augmented generation and reinforcement learning.
arXiv Detail & Related papers (2024-10-15T11:20:42Z)
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation [73.454943870226]
Language models have shown impressive in-context-learning capabilities. We propose a measure called FamiCom, providing a more comprehensive measure for task-agnostic performance estimation.
arXiv Detail & Related papers (2024-06-17T06:14:55Z)
Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level [23.833528781431884]
The Social Simulation Tasks in Sandbox (STSS) benchmark is a language-level benchmark for multi-agent simulation. Our evaluative findings highlight that the STSS benchmark is challenging for state-of-the-art language agents.
arXiv Detail & Related papers (2024-04-08T09:25:32Z)
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing. As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework. This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [31.509994889286183]
We introduce Language Agent Tree Search (LATS) -- the first general framework that synergizes the capabilities of language models (LMs) in reasoning, acting, and planning. A key feature of our approach is the incorporation of an environment for external feedback, which offers a more deliberate and adaptive problem-solving mechanism. LATS achieves state-of-the-art pass@1 accuracy (92.7%) for programming on HumanEval with GPT-4 and demonstrates gradient-free performance (average score of 75.9) comparable to gradient-based fine-tuning for web navigation on WebShop with GPT
arXiv Detail & Related papers (2023-10-06T17:55:11Z)
WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible. We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains. We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
arXiv Detail & Related papers (2023-07-25T22:59:32Z)
Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions. This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.