RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models
- URL: http://arxiv.org/abs/2402.13463v4
- Date: Wed, 24 Jul 2024 06:50:18 GMT
- Title: RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models
- Authors: Jianhao Yan, Yun Luo, Yue Zhang
- Abstract summary: This paper proposes a benchmark, RefuteBench, covering tasks such as question answering, machine translation, and email writing.
The evaluation aims to assess whether models can positively accept feedback in the form of refuting instructions and whether they can consistently adhere to user demands throughout the conversation.
- Score: 17.782410287625645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The application scope of large language models (LLMs) is continually expanding. In practical use, users often provide feedback on a model's output and expect a responsive model that revises its responses accordingly. Whether a model can appropriately respond to users' refuting feedback and consistently follow through on it has not been thoroughly analyzed. In light of this, this paper proposes a comprehensive benchmark, RefuteBench, covering tasks such as question answering, machine translation, and email writing. The evaluation aims to assess whether models can positively accept feedback in the form of refuting instructions and whether they can consistently adhere to user demands throughout the conversation. We conduct evaluations on numerous LLMs and find that LLMs are stubborn: they lean on their internal knowledge and often fail to comply with user feedback. Additionally, as the conversation grows longer, models gradually forget the user's stated feedback and roll back to their own original responses. We further propose recall-and-repeat prompting as a simple and effective way to enhance the model's responsiveness to feedback.
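The recall-and-repeat prompting mentioned above is described only at a high level in this abstract. Below is a minimal sketch of one plausible reading, assuming the method amounts to restating the user's earlier refuting feedback before every new request; the function names, prompt wording, and chat backend are illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable, Dict, List


def build_recall_and_repeat_prompt(feedback_so_far: List[str], user_message: str) -> str:
    """Prepend previously given refuting feedback to the new request so the model
    is reminded to keep following it."""
    if not feedback_so_far:
        return user_message
    recalled = "\n".join(f"- {fb}" for fb in feedback_so_far)
    return (
        "Before answering, recall and restate the feedback I have given you so far:\n"
        f"{recalled}\n"
        "Now answer the next request while continuing to follow all of that feedback.\n\n"
        f"Request: {user_message}"
    )


def run_dialogue(chat: Callable[[List[Dict[str, str]]], str], turns: List[str]) -> List[str]:
    """Run a multi-turn conversation, treating every turn after the first as
    refuting feedback that later turns must keep honoring."""
    messages: List[Dict[str, str]] = []
    feedback: List[str] = []
    replies: List[str] = []
    for i, turn in enumerate(turns):
        prompt = build_recall_and_repeat_prompt(feedback, turn)
        messages.append({"role": "user", "content": prompt})
        reply = chat(messages)          # any chat-completion backend (assumed interface)
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
        if i > 0:                       # assume follow-up turns carry refuting feedback
            feedback.append(turn)
    return replies
```

Plugged into any chat-completion backend via the `chat` callable, this wrapper keeps all prior feedback visible at every turn, which is the behavior the abstract credits with preventing models from rolling back to their original responses.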
Related papers
- LFOSum: Summarizing Long-form Opinions with Large Language Models [7.839083566878183]
This paper introduces (1) a new dataset of long-form user reviews, each entity comprising over a thousand reviews, (2) two training-free LLM-based summarization approaches that scale to long inputs, and (3) automatic evaluation metrics.
Our dataset of user reviews is paired with in-depth and unbiased critical summaries by domain experts, serving as a reference for evaluation.
Our evaluation reveals that LLMs still face challenges in balancing sentiment and format adherence in long-form summaries, though open-source models can narrow the gap when relevant information is retrieved in a focused manner.
arXiv Detail & Related papers (2024-10-16T20:52:39Z) - Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in how annotators assess the turns of a conversation has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z) - RLVF: Learning from Verbal Feedback without Overgeneralization [94.19501420241188]
We study the problem of incorporating verbal feedback without such overgeneralization.
We develop a new method, Contextualized Critiques with Constrained Preference Optimization (C3PO).
Our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts.
arXiv Detail & Related papers (2024-02-16T18:50:24Z) - What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception [53.4840989321394]
We analyze the effect of rationales generated by QA models to support their answers.
We present users with incorrect answers and corresponding rationales in various formats.
We measure the effectiveness of this feedback in patching these rationales through in-context learning.
arXiv Detail & Related papers (2023-11-16T04:26:32Z) - EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria [43.944632774725484]
We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria.
By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail.
A comparative study showed that EvalLM, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions.
arXiv Detail & Related papers (2023-09-24T13:19:38Z) - Learning from Negative User Feedback and Measuring Responsiveness for Sequential Recommenders [13.762960304406016]
We introduce explicit and implicit negative user feedback into the training objective of sequential recommenders.
We demonstrate the effectiveness of this approach using live experiments on a large-scale industrial recommender system.
arXiv Detail & Related papers (2023-08-23T17:16:07Z) - Continually Improving Extractive QA via Human Feedback [59.49549491725224]
We study continually improving an extractive question answering (QA) system via human user feedback.
We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time.
arXiv Detail & Related papers (2023-05-21T14:35:32Z) - When Life Gives You Lemons, Make Cherryade: Converting Feedback from Bad Responses into Good Labels [34.6235464256814]
Juicer is a framework to make use of both binary and free-form textual human feedback.
We find that augmenting training with model-corrected replies improves the final dialogue model.
arXiv Detail & Related papers (2022-10-28T04:57:21Z) - Simulating Bandit Learning from User Feedback for Extractive Question Answering [51.97943858898579]
We study learning from user feedback for extractive question answering by simulating feedback using supervised data.
We show that systems initially trained on a small number of examples can dramatically improve given feedback from users on model-predicted answers.
arXiv Detail & Related papers (2022-03-18T17:47:58Z) - Automating App Review Response Generation [67.58267006314415]
We propose a novel approach, RRGen, that automatically generates review responses by learning knowledge relations between reviews and their responses.
Experiments on 58 apps and 309,246 review-response pairs highlight that RRGen outperforms the baselines by at least 67.4% in terms of BLEU-4.
arXiv Detail & Related papers (2020-02-10T05:23:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.