Related papers: Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment

URL: http://arxiv.org/abs/2311.08596v2
Date: Wed, 21 Feb 2024 18:15:47 GMT
Title: Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment
Authors: Philippe Laban and Lidiya Murakhovs'ka and Caiming Xiong and Chien-Sheng Wu
Abstract summary: We propose the FlipFlop experiment to study the multi-turn behavior of Large Language Models (LLMs). We show that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17% (the FlipFlop effect). We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate - reducing performance deterioration by 60% - but not resolve sycophantic behavior entirely.
Score: 82.60594940370919
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The interactive nature of Large Language Models (LLMs) theoretically allows models to refine and improve their answers, yet systematic analysis of the multi-turn behavior of LLMs remains limited. In this paper, we propose the FlipFlop experiment: in the first round of the conversation, an LLM completes a classification task. In a second round, the LLM is challenged with a follow-up phrase like "Are you sure?", offering an opportunity for the model to reflect on its initial answer, and decide whether to confirm or flip its answer. A systematic study of ten LLMs on seven classification tasks reveals that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17% (the FlipFlop effect). We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate - reducing performance deterioration by 60% - but not resolve sycophantic behavior entirely. The FlipFlop experiment illustrates the universality of sycophantic behavior in LLMs and provides a robust framework to analyze model behavior and evaluate future models.
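The two headline quantities in the abstract, the flip rate and the change in accuracy between the first and final prediction, can be illustrated with a minimal sketch. The `Trial` record, the `flipflop_metrics` function, and the toy data are hypothetical names and values chosen for illustration; the paper's exact metric definitions and aggregation across models and tasks may differ.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One two-round conversation (hypothetical record layout)."""
    gold: str     # correct label for the classification task
    initial: str  # model's answer in the first round
    final: str    # model's answer after the "Are you sure?" challenge


def flipflop_metrics(trials):
    """Return flip rate and initial/final accuracy over a list of trials."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    flips = sum(t.initial != t.final for t in trials)
    acc_initial = sum(t.initial == t.gold for t in trials) / n
    acc_final = sum(t.final == t.gold for t in trials) / n
    return {
        "flip_rate": flips / n,
        "initial_accuracy": acc_initial,
        "final_accuracy": acc_final,
        # Negative values mean accuracy deteriorated after the challenge.
        "accuracy_change": acc_final - acc_initial,
    }


# Toy data, invented for illustration only (not results from the paper).
trials = [
    Trial(gold="A", initial="A", final="B"),  # correct, then flipped to wrong
    Trial(gold="A", initial="A", final="A"),  # correct, confirmed
    Trial(gold="B", initial="A", final="B"),  # wrong, then flipped to correct
    Trial(gold="B", initial="B", final="A"),  # correct, then flipped to wrong
]
metrics = flipflop_metrics(trials)
print(metrics)
```

A flip is counted whenever the final answer differs from the initial one, regardless of correctness, so a high flip rate alone does not imply an accuracy drop; the accuracy change captures that separately.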

Related papers

Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback [20.985320124495566]
LLMs possess some ability to improve their responses when given external feedback. It remains unclear how effectively and thoroughly these models can incorporate external feedback.
arXiv Detail & Related papers (2025-06-13T16:31:51Z)
"Check My Work?": Measuring Sycophancy in a Simulated Educational Context [0.0]
This study examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context. We show that response quality varies dramatically based on query framing. Our results highlight the need to better understand the mechanism of, and ways to mitigate, such bias in the educational context.
arXiv Detail & Related papers (2025-06-12T02:21:43Z)
LENSLLM: Unveiling Fine-Tuning Dynamics for LLM Selection [11.353302879735862]
Open-sourced Large Language Models (LLMs) and diverse downstream tasks require efficient model selection. We propose a novel theoretical framework that provides a proper lens to assess the generalization capabilities of LLMs. In particular, we first derive a PAC-Bayesian Generalization Bound that unveils fine-tuning dynamics of LLMs. We then introduce LENSLLM, a Neural Tangent Kernel (NTK)-based Rectified Scaling Model that enables accurate performance predictions.
arXiv Detail & Related papers (2025-05-01T15:07:32Z)
None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering [23.176280217718002]
Multiple-choice exam questions with "None of the above" (NA) options have been extensively studied in educational testing. We examine how NA options affect model performance and confidence calibration. Our results highlight important implications for benchmark design and raise questions about LLMs' ability to handle uncertainty in real-world applications.
arXiv Detail & Related papers (2025-03-03T13:55:29Z)
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
Dynamic Uncertainty Ranking: Enhancing In-Context Learning for Long-Tail Knowledge in LLMs [50.29035873837]
Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training. Long-tail knowledge from specialized domains is often scarce and underrepresented, rarely appearing in the models' memorization. We propose a reinforcement learning-based dynamic uncertainty ranking method for ICL that accounts for the varying impact of each retrieved sample on LLM predictions.
arXiv Detail & Related papers (2024-10-31T03:42:17Z)
Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification? [2.1861408994125253]
Large Language Models (LLMs) have recently shown promising performance in temporal reasoning tasks. Recent studies have tested only closed-source LLMs on detecting temporal relations.
arXiv Detail & Related papers (2024-10-14T13:10:45Z)
I've got the "Answer"! Interpretation of LLMs Hidden States in Question Answering [0.0]
This paper investigates the interpretation of large language models (LLMs) in the context of knowledge-based question answering. The main hypothesis of the study is that correct and incorrect model behavior can be distinguished at the level of hidden states.
arXiv Detail & Related papers (2024-06-04T07:43:12Z)
Language Models can Evaluate Themselves via Probability Discrepancy [38.54454263880133]
We propose a new self-evaluation method, ProbDiff, for assessing the efficacy of various Large Language Models (LLMs). It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions. Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4.
arXiv Detail & Related papers (2024-05-17T03:50:28Z)
Large Language Models Show Human-like Social Desirability Biases in Survey Responses [12.767606361552684]
We show that Large Language Models (LLMs) skew their scores towards the desirable ends of trait dimensions when personality evaluation is inferred. This bias exists in all tested models, including GPT-4/3.5, Claude 3, Llama 3, and PaLM-2. Reverse-coding all the questions decreases bias levels but does not eliminate them, suggesting that this effect cannot be attributed to acquiescence bias.
arXiv Detail & Related papers (2024-05-09T19:02:53Z)
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others. We formally define LLM's self-bias - the tendency to favor its own generation. We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z)
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN). At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [76.410400238974]
We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident. A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM. The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal.
arXiv Detail & Related papers (2023-05-29T11:03:59Z)
