Understanding the Dark Side of LLMs' Intrinsic Self-Correction
- URL: http://arxiv.org/abs/2412.14959v1
- Date: Thu, 19 Dec 2024 15:39:31 GMT
- Title: Understanding the Dark Side of LLMs' Intrinsic Self-Correction
- Authors: Qingjie Zhang, Han Qiu, Di Wang, Haoting Qian, Yiming Li, Tianwei Zhang, Minlie Huang
- Abstract summary: Intrinsic self-correction was proposed to improve LLMs' responses via feedback prompts solely based on their inherent capability.
Recent works show that LLMs' intrinsic self-correction fails without oracle labels as feedback prompts.
We identify that intrinsic self-correction can cause LLMs to waver on both intermediate and final answers and can lead to prompt bias on simple factual questions.
- Score: 55.51468462722138
- License:
- Abstract: Intrinsic self-correction was proposed to improve LLMs' responses via feedback prompts based solely on their inherent capability. However, recent works show that LLMs' intrinsic self-correction fails without oracle labels as feedback prompts. In this paper, we aim to interpret LLMs' intrinsic self-correction across different tasks, especially the failure cases. Covering one simple task and three complex tasks with state-of-the-art (SOTA) LLMs such as the ChatGPT family (o1, 4o, 3.5-turbo) and the Llama family (2-7B, 3-8B, and 3.1-8B), we design three interpretation methods to reveal the dark side of LLMs' intrinsic self-correction. We identify that intrinsic self-correction can (1) cause LLMs to waver on both intermediate and final answers and lead to prompt bias on simple factual questions; (2) introduce human-like cognitive biases on complex tasks. In light of our findings, we also provide two simple yet effective mitigation strategies: question repeating and supervised fine-tuning with a few samples. We open-source our work at https://x-isc.info/.
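To make the setup concrete, below is a minimal sketch of an intrinsic self-correction loop together with the question-repeating mitigation mentioned in the abstract. The OpenAI-style client, model name, and prompt wordings are assumptions for illustration; they are not the authors' exact prompts or evaluation harness.

```python
# Minimal sketch of intrinsic self-correction with the "question repeating"
# mitigation. Client, model name, and prompt wordings are illustrative
# assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o"   # stand-in for any of the evaluated chat models

def ask(messages):
    """Send a chat history and return the assistant's reply text."""
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

def intrinsic_self_correction(question, rounds=2, repeat_question=False):
    """Answer a question, then repeatedly ask the model to review its own answer.

    With repeat_question=True, the original question is restated in every
    feedback prompt; this is the simple mitigation the abstract proposes to
    keep the model anchored on the task rather than on the feedback itself.
    """
    messages = [{"role": "user", "content": question}]
    answer = ask(messages)
    for _ in range(rounds):
        feedback = "Review your previous answer and correct it if needed."
        if repeat_question:
            feedback = f"The question was: {question}\n" + feedback
        messages += [{"role": "assistant", "content": answer},
                     {"role": "user", "content": feedback}]
        answer = ask(messages)  # intermediate answers may waver across rounds
    return answer

if __name__ == "__main__":
    q = "What is the capital of Australia?"
    print(intrinsic_self_correction(q, rounds=2, repeat_question=True))
```

Running the loop with repeat_question=False versus True makes it easy to observe whether intermediate answers waver under contentless feedback prompts.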
Related papers
- Are LLMs Aware that Some Questions are not Open-ended? [58.93124686141781]
We study whether Large Language Models are aware that some questions have limited answers and need to respond more deterministically.
The lack of question awareness in LLMs leads to two phenomena: (1) responding too casually to non-open-ended questions and (2) responding too blandly to open-ended questions.
arXiv Detail & Related papers (2024-10-01T06:07:00Z) - Large Language Models have Intrinsic Self-Correction Ability [18.79203446847577]
Large language models (LLMs) have attracted significant attention for their exceptional abilities in various natural language processing tasks.
One promising solution to improve the LLMs' performance is to ask LLMs to revise their answer after generation.
Intrinsic self-correction is considered a promising direction because it does not rely on external knowledge.
arXiv Detail & Related papers (2024-06-21T22:29:40Z) - LLM-Generated Black-box Explanations Can Be Adversarially Helpful [16.49758711633611]
Large Language Models (LLMs) help us solve and understand complex problems by acting as digital assistants.
Our research uncovers a hidden risk tied to this approach, which we call *adversarial helpfulness*.
This happens when an LLM's explanations make a wrong answer look right, potentially leading people to trust incorrect solutions.
arXiv Detail & Related papers (2024-05-10T20:23:46Z) - Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [56.75702900542643]
We introduce AlphaLLM for the self-improvement of Large Language Models.
It integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop.
Our experimental results show that AlphaLLM significantly enhances the performance of LLMs without additional annotations.
arXiv Detail & Related papers (2024-04-18T15:21:34Z) - Enhancing Confidence Expression in Large Language Models Through Learning from Past Experience [41.06726400259579]
Large Language Models (LLMs) have exhibited remarkable performance across various downstream tasks.
We propose a Learning from Past Experience (LePe) method to enhance the capability for confidence expression.
arXiv Detail & Related papers (2024-04-16T06:47:49Z) - Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models [84.94220787791389]
We propose Fact-and-Reflection (FaR) prompting, which improves the LLM calibration in two steps.
Experiments show that FaR achieves significantly better calibration; it lowers the Expected Calibration Error by 23.5%.
FaR even elicits the capability of verbally expressing concerns in less confident scenarios.
arXiv Detail & Related papers (2024-02-27T01:37:23Z) - Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others.
We formally define an LLM's self-bias, i.e., the tendency to favor its own generation.
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z) - Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named 'Rephrase and Respond' (RaR) that allows Large Language Models to rephrase and expand questions posed by humans (a minimal prompting sketch appears after this list).
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z) - Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method [36.24876571343749]
Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks.
Recent literature reveals that LLMs generate nonfactual responses intermittently.
We propose a novel self-detection method to identify the questions that an LLM does not know, which are prone to yield nonfactual results.
arXiv Detail & Related papers (2023-10-27T06:22:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.