Related papers: Assessing and Verifying Task Utility in LLM-Powered Applications

Assessing and Verifying Task Utility in LLM-Powered Applications

URL: http://arxiv.org/abs/2405.02178v2
Date: Sun, 12 May 2024 15:52:49 GMT
Title: Assessing and Verifying Task Utility in LLM-Powered Applications
Authors: Negar Arabzadeh, Siqing Huo, Nikhil Mehta, Qinqyun Wu, Chi Wang, Ahmed Awadallah, Charles L. A. Clarke, Julia Kiseleva,
Abstract summary: Large Language Models (LLMs) have led to a surge in applications that facilitate collaboration among agents, assisting humans in their daily tasks. This highlights the need to verify utility of LLM-powered applications, particularly by ensuring alignment between the application's functionality and end-user needs. We introduce AgentEval, a novel framework designed to simplify the utility verification process by automatically proposing a set of criteria tailored to the unique purpose of any given application.
Score: 28.41607905656699
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid development of Large Language Models (LLMs) has led to a surge in applications that facilitate collaboration among multiple agents, assisting humans in their daily tasks. However, a significant gap remains in assessing to what extent LLM-powered applications genuinely enhance user experience and task execution efficiency. This highlights the need to verify utility of LLM-powered applications, particularly by ensuring alignment between the application's functionality and end-user needs. We introduce AgentEval, a novel framework designed to simplify the utility verification process by automatically proposing a set of criteria tailored to the unique purpose of any given application. This allows for a comprehensive assessment, quantifying the utility of an application against the suggested criteria. We present a comprehensive analysis of the effectiveness and robustness of AgentEval for two open source datasets including Math Problem solving and ALFWorld House-hold related tasks. For reproducibility purposes, we make the data, code and all the logs publicly available at https://bit.ly/3w3yKcS .

Related papers

EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [65.48902212293903]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs)<n>EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently.<n>We also propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflow.
arXiv Detail & Related papers (2025-06-10T02:39:55Z)
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications.<n>AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z)
Goal2Story: A Multi-Agent Fleet based on Privately Enabled sLLMs for Impacting Mapping on Requirements Elicitation [6.547589336272875]
Goal2Story is a multi-agent fleet that adopts the Impact Mapping (IM) framework while merely using cost-effective sLLMs for goal-driven RE. StorySeek dataset contains over 1,000 user stories (USs) with corresponding goals and project context information. For evaluation, we proposed two metrics: Factuality Hit Rate (FHR) to measure consistency between the generated USs with the dataset and Quality And Consistency Evaluation (QuACE) to evaluate the quality of the generated USs.
arXiv Detail & Related papers (2025-03-17T15:31:20Z)
Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely [8.507599833330346]
Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Retrieval-Augmented Generation (RAG) and fine-tuning are gaining increasing attention and widespread application. However, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges.
arXiv Detail & Related papers (2024-09-23T11:20:20Z)
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation [51.27062359412488]
Office automation significantly enhances human productivity by automatically finishing routine tasks in the workflow. We introduce OfficeBench, one of the first office automation benchmarks for evaluating current LLM agents' capability to address office tasks in realistic office. Applying our customized evaluation methods on each task, we find that GPT-4 Omni achieves the highest pass rate of 47.00%, demonstrating a decent performance in handling office tasks.
arXiv Detail & Related papers (2024-07-26T19:27:17Z)
Benchmarking Open-Source Language Models for Efficient Question Answering in Industrial Applications [0.0]
Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks such as question answering (QA) This paper presents a comprehensive benchmarking study comparing open-source LLMs with their non-open-source counterparts on the task of question answering.
arXiv Detail & Related papers (2024-06-19T17:11:51Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and hallucinations. Here, we introduce AvaTaR, a novel and automated framework that optimize an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications [37.047117782796064]
AgentEval is a framework designed to simplify the utility verification process. We present a comprehensive analysis of the robustness of quantifier's work.
arXiv Detail & Related papers (2024-02-14T08:46:15Z)
T-RAG: Lessons from the LLM Trenches [7.545277950323593]
Application area is question answering over private enterprise documents. Retrieval-Augmented Generation is most prominent framework for building LLM-based applications. System, which we call Tree-RAG (T-RAG), uses a tree structure to represent entity hierarchies.
arXiv Detail & Related papers (2024-02-12T08:45:08Z)
TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation. Specifically, task decomposition, tool selection, and parameter prediction are assessed. Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions [114.67699010359637]
We analyze a large-scale collection of real user queries to GPT. We find that tasks such as design'' and planning'' are prevalent in user interactions but are largely neglected or different from traditional NLP benchmarks.
arXiv Detail & Related papers (2023-10-19T02:12:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.