Related papers: The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests

The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests

URL: http://arxiv.org/abs/2409.14371v1
Date: Sun, 22 Sep 2024 09:27:42 GMT
Title: The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests
Authors: Lior Madmoni, Amir Zait, Ilia Labzovsky, Danny Karmon,
Abstract summary: We develop and release a novel Arithmetic Constraint-Satisfaction (ACS) benchmarking dataset. This dataset consists of complex user requests with corresponding constraints, agent responses and human labels indicating each constraint's satisfaction level in the response. We show that most models still have a significant headroom for improvement, and that errors primarily stem from reasoning issues.
Score: 0.6249768559720121
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generative AI agents are often expected to respond to complex user requests that have No One Right Answer (NORA), e.g., "design a vegetarian meal plan below 1800 calories". Such requests may entail a set of constraints that the agent should adhere to. To successfully develop agents for NORA scenarios, an accurate automatic evaluation framework is essential, and specifically - one capable of validating the satisfaction of constraints in the agent's response. Recently, large language models (LLMs) have been adopted as versatile evaluators for many NORA tasks, but their ability to evaluate constraint-satisfaction in generated text remains unclear. To study this, we develop and release a novel Arithmetic Constraint-Satisfaction (ACS) benchmarking dataset. The dataset consists of complex user requests with corresponding constraints, agent responses and human labels indicating each constraint's satisfaction level in the response. A unique property of this dataset is that validating many of its constraints requires reviewing the response as a whole (in contrast to many other benchmarks that require the validation of a single independent item). Moreover, it assesses LLMs in performing reasoning, in-context data extraction, arithmetic calculations, and counting. We then benchmark both open and proprietary LLMs on evaluating constraint-satisfaction, and show that most models still have a significant headroom for improvement, and that errors primarily stem from reasoning issues. In addition, most models exhibit a skewed constraint-satisfaction prediction pattern, with higher accuracy where the ground-truth label is "satisfied". Lastly, few-shot prompting for our task proved to be rather challenging, since many of the studied models showed a degradation in performance when it was introduced.

Related papers

RECAST: Strengthening LLMs' Complex Instruction Following with Constraint-Verifiable Data [37.631782007066214]
RECAST is a novel framework for synthesizing datasets where each example incorporates far more constraints than those in existing benchmarks.<n>We construct RECAST-30K, a large-scale, high-quality dataset comprising 30k instances spanning 15 constraint types.<n> Experimental results demonstrate that models fine-tuned on RECAST-30K show substantial improvements in following complex instructions.
arXiv Detail & Related papers (2025-05-25T08:31:08Z)
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA) In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark [16.55516587540082]
We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval. We propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing.
arXiv Detail & Related papers (2025-02-17T22:10:47Z)
VERA: Validation and Enhancement for Retrieval Augmented systems [0.0]
We propose textbfVERA (textbfValidation and textbfEnhancement for textbfRetrieval textbfAugmented systems), a system designed to evaluate and enhance the retrieved context before response generation. VERA employs an evaluator-cum-enhancer LLM that first checks if external retrieval is necessary, evaluates the relevance and redundancy of the retrieved context, and refines it to eliminate non-essential information.
arXiv Detail & Related papers (2024-09-18T16:10:47Z)
KaPQA: Knowledge-Augmented Product Question-Answering [59.096607961704656]
We introduce two product question-answering (QA) datasets focused on Adobe Acrobat and Photoshop products. We also propose a novel knowledge-driven RAG-QA framework to enhance the performance of the models in the product QA task.
arXiv Detail & Related papers (2024-07-22T22:14:56Z)
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present textscPuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing fewer supersized data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z)
Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA) We propose a novel adaptive QA framework, that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on the query complexity. We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems.
arXiv Detail & Related papers (2024-03-21T13:52:30Z)
From Instructions to Constraints: Language Model Alignment with Automatic Constraint Verification [70.08146540745877]
We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments. We propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints.
arXiv Detail & Related papers (2024-03-10T22:14:54Z)
KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval [23.3454086714842]
We study the ability of state-of-the art models to answer constraint satisfaction queries for information retrieval. We present KITAB, a new dataset for measuring constraint satisfaction abilities of language models.
arXiv Detail & Related papers (2023-10-24T04:40:38Z)
From Chaos to Clarity: Claim Normalization to Empower Fact-Checking [57.024192702939736]
Claim Normalization (aka ClaimNorm) aims to decompose complex and noisy social media posts into more straightforward and understandable forms. We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation. Our experiments demonstrate that CACN outperforms several baselines across various evaluation measures.
arXiv Detail & Related papers (2023-10-22T16:07:06Z)
Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data [0.0]
We leverage advances in high resolution text-to-image generation to develop a framework for generating evaluation data for multi-modal reasoning tasks. We apply this framework to generate context-dependent anomaly data, creating a synthetic dataset on a challenging task. We demonstrate that while the task is tractable, the model performs significantly worse on the context-dependent anomaly detection task than on standard VQA tasks.
arXiv Detail & Related papers (2023-06-01T20:56:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.