SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
- URL: http://arxiv.org/abs/2404.04298v3
- Date: Fri, 6 Sep 2024 01:14:26 GMT
- Title: SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
- Authors: Dongwei Jiang, Jingyu Zhang, Orion Weller, Nathaniel Weir, Benjamin Van Durme, Daniel Khashabi
- Abstract summary: We show that models are not reliably better at discriminating among previously-generated alternatives than at generating initial responses.
This finding challenges the notion that LLMs can enhance their performance solely through their own judgment.
- Score: 49.148206387394936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously-generated alternatives than at generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than generating initial responses. This finding challenges the notion that LLMs can enhance their performance solely through their own judgment.
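As a concrete (if simplified) illustration of such a two-phase comparison, the sketch below scores a model once as a generator and once as a discriminator over its own candidates. The `generate_fn` and `discriminate_fn` callables, the candidate count, and the task format are assumptions for illustration, not the paper's released code:
```python
from typing import Callable, Dict, List

def compare_gen_disc(tasks: List[Dict],
                     generate_fn: Callable[[str, int], List[str]],
                     discriminate_fn: Callable[[str, List[str]], str]) -> Dict[str, float]:
    """Score one model as generator vs. discriminator on the same tasks.

    Each task is {"prompt": str, "is_correct": callable judging an answer}.
    """
    gen_hits, disc_hits, disc_total = 0, 0, 0
    for task in tasks:
        candidates = generate_fn(task["prompt"], 4)    # sample a few answers
        gen_hits += task["is_correct"](candidates[0])  # generative: first sample
        # Discrimination is only measured when the pool contains at least
        # one correct answer for the model to find.
        if any(task["is_correct"](c) for c in candidates):
            choice = discriminate_fn(task["prompt"], candidates)
            disc_hits += task["is_correct"](choice)
            disc_total += 1
    return {"generative_acc": gen_hits / len(tasks),
            "discriminative_acc": disc_hits / max(disc_total, 1)}

# Toy usage with stub functions, for illustration only:
if __name__ == "__main__":
    tasks = [{"prompt": "2+2=?", "is_correct": lambda a: a.strip() == "4"}]
    print(compare_gen_disc(tasks,
                           generate_fn=lambda p, n: ["4", "5", "4", "3"],
                           discriminate_fn=lambda p, cands: cands[0]))
```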
Related papers
- Generative Verifiers: Reward Modeling as Next-Token Prediction [29.543787728397643]
Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs).
We propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation.
We demonstrate that GenRM outperforms discriminative verifiers, DPO verifiers, and LLM-as-a-Judge.
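As a rough sketch of what next-token-prediction verification can look like, the snippet below scores a solution by the probability the model assigns to "Yes" in a verification prompt, using Hugging Face transformers. The prompt wording and the two-token renormalization are illustrative assumptions, not GenRM's released recipe:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def genrm_score(model, tokenizer, question: str, solution: str) -> float:
    """Score a candidate solution as P('Yes') vs. P('No') for a verification prompt."""
    prompt = (f"Question: {question}\nProposed solution: {solution}\n"
              "Is the solution correct? Answer Yes or No. Answer:")
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]       # next-token logits
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(next_logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()                                # renormalized P(Yes)

# Usage (any causal LM works as a stand-in):
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# print(genrm_score(model, tokenizer, "2+2=?", "4"))
```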
arXiv Detail & Related papers (2024-08-27T17:57:45Z)
- Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models [0.0]
Large language models (LLMs) have generated significant attention since their inception, finding applications across various academic and industrial domains.
LLMs often suffer from the "hallucination problem", where outputs, though grammatically and logically coherent, lack factual accuracy or are entirely fabricated.
arXiv Detail & Related papers (2024-08-09T14:34:32Z)
- The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism [39.392450788666814]
Current evaluations of large language models (LLMs) often overlook non-determinism.
Greedy decoding generally outperforms sampling methods for most evaluated tasks.
Smaller LLMs can match or surpass larger models such as GPT-4-Turbo.
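A minimal sketch of the evaluation concern: score the model once with greedy decoding and several times with sampling, then report the spread. The `generate(prompt, do_sample, seed)` callable and task format are hypothetical stand-ins for any LLM generation API:
```python
import statistics
from typing import Callable, Dict, List

def eval_with_nondeterminism(tasks: List[Dict],
                             generate: Callable[[str, bool, int], str],
                             n_sampled_runs: int = 5) -> Dict[str, float]:
    """Report greedy accuracy alongside the mean/stdev across sampled runs."""
    def accuracy(do_sample: bool, seed: int) -> float:
        hits = sum(t["is_correct"](generate(t["prompt"], do_sample, seed))
                   for t in tasks)
        return hits / len(tasks)

    greedy = accuracy(do_sample=False, seed=0)  # one deterministic run
    sampled = [accuracy(do_sample=True, seed=s) for s in range(n_sampled_runs)]
    return {"greedy_acc": greedy,
            "sampled_mean": statistics.mean(sampled),
            "sampled_stdev": statistics.stdev(sampled)}
```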
arXiv Detail & Related papers (2024-07-15T06:12:17Z)
- Direct-Inverse Prompting: Analyzing LLMs' Discriminative Capacity in Self-Improving Generation [15.184067502284007]
Even the most advanced LLMs experience uncertainty in their outputs, often producing varied results on different runs or when faced with minor changes in input.
We propose and analyze three discriminative prompts: direct, inverse, and hybrid.
Our insights reveal which discriminative prompt is most promising and when to use it.
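A minimal sketch of the three prompt styles, assuming simple template wording (the paper's actual templates may differ):
```python
def build_discriminative_prompt(question: str, candidates: list,
                                style: str = "direct") -> str:
    """Build a direct, inverse, or hybrid discrimination prompt."""
    listing = "\n".join(f"({i + 1}) {c}" for i, c in enumerate(candidates))
    if style == "direct":      # ask which candidate is correct
        ask = "Which candidate answer is correct?"
    elif style == "inverse":   # ask which candidates are incorrect
        ask = "Which candidate answers are incorrect?"
    elif style == "hybrid":    # combine both judgments
        ask = ("Label each candidate correct or incorrect, "
               "then name the single best answer.")
    else:
        raise ValueError(f"unknown style: {style!r}")
    return f"Question: {question}\nCandidates:\n{listing}\n{ask}"
```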
arXiv Detail & Related papers (2024-06-27T02:26:47Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
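As a loose, simplified proxy for this idea: sample several explanation-plus-answer generations and use the entropy of the induced answer distribution as the uncertainty signal. The paper's framework additionally weights explanations by their stability, which this stand-in omits; `sample_fn` is a hypothetical LLM call:
```python
import math
from collections import Counter
from typing import Callable, Tuple

def explanation_uncertainty(prompt: str,
                            sample_fn: Callable[[str], Tuple[str, str]],
                            n: int = 10) -> float:
    """Entropy of the answer distribution induced by n sampled explanations."""
    answers = [sample_fn(prompt)[1] for _ in range(n)]  # keep the answer part
    counts = Counter(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```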
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal [49.24054920683246]
Large language models (LLMs) suffer from catastrophic forgetting during continual learning.
We propose a framework called Self-Synthesized Rehearsal (SSR) that uses the LLM to generate synthetic instances for rehearsal.
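A loose sketch of the rehearsal mix, assuming a hypothetical `synthesize(task_description)` LLM call; SSR's actual pipeline also refines the synthetic outputs with the latest model, which is omitted here:
```python
import random
from typing import Callable, Dict, List

def build_training_mix(new_task_data: List[Dict],
                       old_task_descriptions: List[str],
                       synthesize: Callable[[str], Dict],
                       rehearsal_per_task: int = 100) -> List[Dict]:
    """Mix new-task data with model-synthesized rehearsal instances."""
    rehearsal = [synthesize(desc)                      # LLM-written instance
                 for desc in old_task_descriptions
                 for _ in range(rehearsal_per_task)]
    mix = new_task_data + rehearsal
    random.shuffle(mix)                                # interleave old and new
    return mix
```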
arXiv Detail & Related papers (2024-03-02T16:11:23Z)
- The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning-with-refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
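A schematic sketch of an Ask-Refine-Trust loop. The prompts and the single `llm(prompt)` callable are hypothetical placeholders, not the paper's trained components:
```python
from typing import Callable

def art_refine(task: str, llm: Callable[[str], str]) -> str:
    """Ask whether refinement is needed, refine, then choose which answer to trust."""
    answer = llm(task)                                          # initial output
    issues = llm(f"{task}\nAnswer: {answer}\n"                  # Ask
                 "List sub-questions that must be checked, or reply NONE.")
    if issues.strip() == "NONE":
        return answer
    refined = llm(f"{task}\nAnswer: {answer}\nIssues: {issues}\n"
                  "Give a corrected answer.")                   # Refine
    verdict = llm(f"{task}\nA: {answer}\nB: {refined}\n"
                  "Which answer is better, A or B?")            # Trust
    return refined if "B" in verdict else answer
```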
arXiv Detail & Related papers (2023-11-14T07:26:32Z)
- LLMRec: Benchmarking Large Language Models on Recommendation Task [54.48899723591296]
The application of Large Language Models (LLMs) in the recommendation domain has not been thoroughly investigated.
We benchmark several popular off-the-shelf LLMs on five recommendation tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization.
The benchmark results indicate that LLMs display only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation.
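For flavor, here is a small sketch of prompting an off-the-shelf LLM for one benchmarked task (rating prediction); the template and parsing are illustrative assumptions, not LLMRec's prompts:
```python
import re
from typing import Callable, List

def predict_rating(user_history: List[str], item: str,
                   llm: Callable[[str], str]) -> float:
    """Ask an off-the-shelf LLM for a 1-5 rating and parse the reply."""
    history = "\n".join(f"- {h}" for h in user_history)
    prompt = (f"A user liked these items:\n{history}\n"
              f"On a scale of 1 to 5, how would they rate '{item}'? "
              "Reply with a single number.")
    match = re.search(r"[1-5](\.\d+)?", llm(prompt))
    return float(match.group()) if match else 3.0  # midpoint fallback
```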
arXiv Detail & Related papers (2023-08-23T16:32:54Z)
- Self-Refine: Iterative Refinement with Self-Feedback [62.78755306241981]
Self-Refine is an approach for improving initial outputs from large language models (LLMs) through iterative feedback and refinement.
We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art LLMs (GPT-3.5, ChatGPT, and GPT-4).
Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
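A compact sketch of the generate-feedback-refine loop, assuming a single prompt-in/text-out `llm` callable and a simple "DONE" stop signal (both illustrative; the paper uses task-specific feedback prompts):
```python
from typing import Callable

def self_refine(task: str, llm: Callable[[str], str], max_iters: int = 3) -> str:
    """Generate, then iterate feedback -> refinement until feedback says DONE."""
    output = llm(task)                                      # initial generation
    for _ in range(max_iters):
        feedback = llm(f"Task: {task}\nOutput: {output}\n"
                       "Give actionable feedback, or reply DONE if it is good.")
        if "DONE" in feedback:
            break
        output = llm(f"Task: {task}\nOutput: {output}\nFeedback: {feedback}\n"
                     "Rewrite the output to address the feedback.")
    return output
```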
arXiv Detail & Related papers (2023-03-30T18:30:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.