Self-Reflection Makes Large Language Models Safer, Less Biased, and Ideologically Neutral
- URL: http://arxiv.org/abs/2406.10400v2
- Date: Sun, 16 Feb 2025 09:33:15 GMT
- Title: Self-Reflection Makes Large Language Models Safer, Less Biased, and Ideologically Neutral
- Authors: Fengyuan Liu, Nouar AlDahoul, Gregory Eady, Yasir Zaki, Talal Rahwan
- Abstract summary: We show that self-reflection can lead to safer (75.8% reduction in toxic responses while preserving 97.8% of non-toxic ones), less biased (77% reduction in gender-biased responses while preserving 94.3% of unbiased ones), and more ideologically neutral responses (100% reduction in partisan-leaning responses while preserving 87.7% of non-partisan ones).
The paper concludes by discussing the implications of our findings on the deployment of large language models.
- Abstract: Previous studies proposed that the reasoning capabilities of large language models (LLMs) can be improved through self-reflection, i.e., letting LLMs reflect on their own output to identify and correct mistakes in the initial responses. However, earlier experiments offer mixed results when it comes to the benefits of self-reflection. Furthermore, prior studies on self-reflection are predominantly concerned with the reasoning capabilities of models, ignoring the potential for self-reflection in safety, bias, and ideological leaning. Here, by conducting a series of experiments testing LLMs' self-reflection capabilities on various tasks, using a variety of prompts and different LLMs, we make several contributions to the literature. First, we reconcile conflicting findings regarding the benefit of self-reflection by demonstrating that the outcome of self-reflection is sensitive to prompt wording -- both the original prompt used to elicit an initial answer and the subsequent prompt used to self-reflect. Specifically, although self-reflection may improve the reasoning capability of LLMs when the initial response is simple, the technique cannot improve upon state-of-the-art chain-of-thought (CoT) prompting. Second, we show that self-reflection can lead to safer (75.8% reduction in toxic responses while preserving 97.8% of non-toxic ones), less biased (77% reduction in gender-biased responses while preserving 94.3% of unbiased ones), and more ideologically neutral responses (100% reduction in partisan-leaning responses while preserving 87.7% of non-partisan ones). The paper concludes by discussing the implications of our findings for the deployment of large language models. We release our experiments at https://github.com/Michael98Liu/self-reflection.
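To make the protocol concrete, below is a minimal sketch of the two-stage self-reflection loop the abstract describes: one prompt elicits an initial answer, and a follow-up prompt asks the model to review that answer for toxicity, bias, and partisan leaning and to revise it if necessary. The prompt wording, the `gpt-4o-mini` model name, and the use of the OpenAI chat-completions client are illustrative assumptions, not the authors' exact setup; their prompts and models are in the linked repository.

```python
# Minimal sketch of two-stage self-reflection (assumed prompts and model,
# not the authors' exact setup; see the linked repository for theirs).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"  # placeholder model; the paper evaluates several LLMs

def complete(prompt: str) -> str:
    """Return the model's reply to a single user prompt."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

def self_reflect(question: str) -> str:
    # Stage 1: elicit an initial answer with the original prompt.
    initial = complete(question)

    # Stage 2: ask the model to reflect on its own answer with respect to
    # safety, bias, and ideological neutrality, and to revise it if needed.
    reflection_prompt = (
        f"Question: {question}\n"
        f"Your previous answer: {initial}\n\n"
        "Review your previous answer. Is it toxic, biased, or does it take a "
        "partisan ideological stance? If so, rewrite it to fix the problem; "
        "otherwise, repeat it unchanged."
    )
    return complete(reflection_prompt)

if __name__ == "__main__":
    print(self_reflect("Should schools ban books that some parents find offensive?"))
```

As the abstract notes, both the original prompt and the reflection prompt matter: differently worded variants of either can change whether reflection helps.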
Related papers
- Large Language Models have Intrinsic Self-Correction Ability [18.79203446847577]
Large language models (LLMs) have attracted significant attention for their exceptional abilities in various natural language processing tasks.
One promising solution to improve the LLMs' performance is to ask LLMs to revise their answer after generation.
Intrinsic self-correction is considered a promising direction because it does not utilize external knowledge.
arXiv Detail & Related papers (2024-06-21T22:29:40Z) - A Theoretical Understanding of Self-Correction through In-context Alignment [51.622068973630796]
Large language models (LLMs) are capable of improving their abilities purely by self-correction.
We show that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way.
Inspired by these findings, we also illustrate applications of self-correction, such as defending against LLM jailbreaks.
arXiv Detail & Related papers (2024-05-28T22:33:02Z) - "I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust [51.542856739181474]
We show how different natural language expressions of uncertainty impact participants' reliance, trust, and overall task performance.
We find that first-person expressions decrease participants' confidence in the system and their tendency to agree with the system's answers, while increasing participants' accuracy.
Our findings suggest that using natural language expressions of uncertainty may be an effective approach for reducing overreliance on LLMs, but that the precise language used matters.
arXiv Detail & Related papers (2024-05-01T16:43:55Z) - When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models [15.781930031346105]
Self-reflection enhances performance in TruthfulQA, but adversely affects results in HotpotQA.
We find that self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher.
Based on our findings, we propose guidelines for decisions on when to implement self-reflection.
arXiv Detail & Related papers (2024-04-14T02:47:32Z) - Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks but degrade on others.
We formally define an LLM's self-bias: the tendency to favor its own generation.
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z) - Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z) - Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives [45.87069217634753]
Research indicates that, without external feedback, the intrinsic reflection of large language models is unstable.
Our investigation unveils that the key bottleneck is the quality of the self-evaluated feedback.
We advocate Self-Contrast: it adaptively explores diverse solving perspectives tailored to the request, contrasts their differences, and summarizes the discrepancies into a checklist that can be used to re-examine and resolve them (a minimal sketch of this loop appears after this list).
arXiv Detail & Related papers (2024-01-04T00:32:33Z) - Large Language Models Cannot Self-Correct Reasoning Yet [78.16697476530994]
Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities.
Concerns persist regarding the accuracy and appropriateness of their generated content.
A contemporary methodology, self-correction, has been proposed as a remedy to these issues.
arXiv Detail & Related papers (2023-10-03T04:56:12Z)
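For comparison with the Self-Contrast entry above, here is a hedged sketch of that loop under stated assumptions: sample several solving perspectives, contrast them, distill the disagreements into a checklist, and revise the answer against that checklist. The prompts, the `gpt-4o-mini` model name, and the number of perspectives are assumptions for illustration, not the original implementation.

```python
# Hedged sketch of the Self-Contrast loop summarized above (assumed prompts
# and model; not the original implementation).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"  # placeholder model

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

def self_contrast(question: str, n_perspectives: int = 3) -> str:
    # 1. Explore diverse solving perspectives tailored to the request.
    perspectives = [
        complete(
            f"Solve the following problem using a distinct approach #{i + 1}, "
            f"different from the most obvious one:\n{question}"
        )
        for i in range(n_perspectives)
    ]

    # 2. Contrast the perspectives and summarize the discrepancies as a checklist.
    joined = "\n\n".join(
        f"Solution {i + 1}:\n{p}" for i, p in enumerate(perspectives)
    )
    checklist = complete(
        f"Problem: {question}\n\n{joined}\n\n"
        "Compare these solutions. As a checklist, list every point on which "
        "they disagree and what should be double-checked."
    )

    # 3. Re-examine the problem with the checklist and produce a final answer.
    return complete(
        f"Problem: {question}\n\nChecklist of discrepancies:\n{checklist}\n\n"
        "Work through the checklist, resolve each discrepancy, and give a "
        "final answer."
    )
```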