Towards better Human-Agent Alignment: Assessing Task Utility in
LLM-Powered Applications
- URL: http://arxiv.org/abs/2402.09015v3
- Date: Thu, 22 Feb 2024 23:49:10 GMT
- Title: Towards better Human-Agent Alignment: Assessing Task Utility in
LLM-Powered Applications
- Authors: Negar Arabzadeh and Julia Kiseleva and Qingyun Wu and Chi Wang and
Ahmed Awadallah and Victor Dibia and Adam Fourney and Charles Clarke
- Abstract summary: AgentEval is a framework designed to simplify the utility verification process.
We present a comprehensive analysis of the robustness of the quantifier's work.
- Score: 37.047117782796064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development in the field of Large Language Models (LLMs) has led to
a surge in applications that facilitate collaboration among multiple agents to
assist humans in their daily tasks. However, a significant gap remains in
assessing whether LLM-powered applications genuinely enhance user experience
and task execution efficiency. This highlights the pressing need for methods to
verify utility of LLM-powered applications, particularly by ensuring alignment
between the application's functionality and end-user needs. We introduce
AgentEval, a novel framework
designed to simplify the utility verification process by automatically
proposing a set of criteria tailored to the unique purpose of any given
application. This allows for a comprehensive assessment, quantifying the
utility of an application against the suggested criteria. We present a
comprehensive analysis of the robustness of the quantifier's work.
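The two-step process the abstract describes (automatically proposing task-tailored criteria, then quantifying an application's output against them) can be sketched as below. This is a minimal illustration, not the paper's implementation: in AgentEval both steps are LLM-driven, whereas here `propose_criteria` returns a fixed example set and `quantify` uses a trivial placeholder heuristic; all names and the criteria shown are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    description: str
    accepted_values: list[str]  # ordered rating scale, worst to best

def propose_criteria(task_description: str) -> list[Criterion]:
    """Stand-in for the criteria-proposal step. In AgentEval an LLM agent
    would derive criteria from the task description; here we hard-code
    two illustrative criteria for a math-problem task."""
    return [
        Criterion("clarity", "Is the solution clearly explained?",
                  ["poor", "fair", "good"]),
        Criterion("correctness", "Does the answer solve the task?",
                  ["no", "partially", "yes"]),
    ]

def quantify(criterion: Criterion, task_output: str) -> str:
    """Stand-in for the quantification step. An LLM agent would rate the
    output against the criterion; this placeholder just checks whether
    any output was produced at all."""
    if task_output.strip():
        return criterion.accepted_values[-1]
    return criterion.accepted_values[0]

def assess(task_description: str, task_output: str) -> dict[str, str]:
    """Run the full pipeline: propose criteria, then score each one."""
    criteria = propose_criteria(task_description)
    return {c.name: quantify(c, task_output) for c in criteria}

scores = assess("Solve: 2x + 3 = 11", "x = 4, because 2*4 + 3 = 11")
print(scores)  # {'clarity': 'good', 'correctness': 'yes'}
```

The key design point the abstract emphasizes is that the criteria are not fixed in advance but generated per application, so the same `assess` interface could serve any task whose purpose can be described in text.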
Related papers
- WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [85.95607119635102]
Large language models (LLMs) can mimic human-like intelligence.
WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
arXiv Detail & Related papers (2024-07-07T07:15:49Z) - Benchmarking Open-Source Language Models for Efficient Question Answering in Industrial Applications [0.0]
Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks such as question answering (QA)
This paper presents a comprehensive benchmarking study comparing open-source LLMs with their non-open-source counterparts on the task of question answering.
arXiv Detail & Related papers (2024-06-19T17:11:51Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - Assessing and Verifying Task Utility in LLM-Powered Applications [28.41607905656699]
Large Language Models (LLMs) have led to a surge in applications that facilitate collaboration among agents, assisting humans in their daily tasks.
This highlights the need to verify utility of LLM-powered applications, particularly by ensuring alignment between the application's functionality and end-user needs.
We introduce AgentEval, a novel framework designed to simplify the utility verification process by automatically proposing a set of criteria tailored to the unique purpose of any given application.
arXiv Detail & Related papers (2024-05-03T15:26:27Z) - RepEval: Effective Text Evaluation with LLM Representation [54.07909112633993]
We introduce RepEval, the first metric leveraging the projection of LLM representations for evaluation.
RepEval requires minimal sample pairs for training, and through simple prompt modifications, it can easily transition to various tasks.
Results on ten datasets from three tasks demonstrate the high effectiveness of our method.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - TaskBench: Benchmarking Large Language Models for Task Automation [85.3879908356586]
We introduce TaskBench to evaluate the capability of large language models in task automation.
To generate high-quality evaluation datasets, we introduce the concept of Tool Graph.
We also propose TaskEval to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction.
arXiv Detail & Related papers (2023-11-30T18:02:44Z) - The Shifted and The Overlooked: A Task-oriented Investigation of
User-GPT Interactions [114.67699010359637]
We analyze a large-scale collection of real user queries to GPT.
We find that tasks such as "design" and "planning" are prevalent in user interactions but are largely neglected by or different from traditional NLP benchmarks.
arXiv Detail & Related papers (2023-10-19T02:12:17Z) - Formally Specifying the High-Level Behavior of LLM-Based Agents [24.645319505305316]
LLMs have emerged as promising tools for solving challenging problems without the need for task-specific finetuned models.
Currently, the design and implementation of such agents is ad hoc, as the wide variety of tasks that LLM-based agents may be applied to naturally means there can be no one-size-fits-all approach to agent design.
We propose a minimalistic generation framework that simplifies the process of building agents.
arXiv Detail & Related papers (2023-10-12T17:24:15Z) - TPTU: Large Language Model-based AI Agents for Task Planning and Tool
Usage [28.554981886052953]
Large Language Models (LLMs) have emerged as powerful tools for various real-world applications.
Despite their prowess, intrinsic generative abilities of LLMs may prove insufficient for handling complex tasks.
This paper proposes a structured framework tailored for LLM-based AI Agents.
arXiv Detail & Related papers (2023-08-07T09:22:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.