Related papers: Discourse Heuristics For Paradoxically Moral Self-Correction

Discourse Heuristics For Paradoxically Moral Self-Correction

URL: http://arxiv.org/abs/2507.00985v1
Date: Tue, 01 Jul 2025 17:36:41 GMT
Title: Discourse Heuristics For Paradoxically Moral Self-Correction
Authors: Guangliang Liu, Zimo Qi, Xitong Zhang, Kristen Marie Johnson,
Abstract summary: Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models with human moral values.<n>We show that moral self-correction relies on discourse constructions that reflect shortcuts.<n>We propose a solution to improve moral self-correction by leveraging the generalizations of curated datasets.
Score: 6.360181137608509
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence to support the effectiveness of self-correction, this LLM capability only operates at a superficial level. Second, while LLMs possess the capability of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the existence of the heuristics underlying effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and that the presence of these heuristic shortcuts during self-correction leads to inconsistency when attempting to enhance both self-correction and self-diagnosis capabilities jointly. Based on our findings, we propose a solution to improve moral self-correction by leveraging the heuristics of curated datasets. We also highlight the generalization challenges of this capability, particularly in terms of learning from situated context and model scales.

Related papers

Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs [57.10533368622962]
Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance.<n>This study introduces CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies.<n>Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; and 3) Reasoning LLMs (e.g., DeepSeek-R1) have limited optimization under additional self-correction methods and have high time costs.
arXiv Detail & Related papers (2025-10-17T02:40:19Z)
On the Convergence of Moral Self-Correction in Large Language Models [26.724972162483855]
Large Language Models (LLMs) are able to improve their responses when instructed to do so.<n>LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction.<n>We reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions.
arXiv Detail & Related papers (2025-10-08T17:46:27Z)
ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification [53.80183105328448]
Refine via Intrinsic Self-Verification (ReVISE) is an efficient framework that enables LLMs to self-correct their outputs through self-verification.<n>Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
arXiv Detail & Related papers (2025-02-20T13:50:02Z)
Self-correction is Not An Innate Capability in Large Language Models: A Case Study of Moral Self-correction [8.61034573238112]
We argue that moral self-correction is not an innate capability of Large Language Models (LLMs)<n>We conduct a mechanistic analysis of how key components of self-correction, such as Chain-of-Thought (CoT) reasoning and external feedback, interact to enable moral self-correction.
arXiv Detail & Related papers (2024-10-27T16:52:21Z)
Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis [35.734425912914176]
Large Language Models (LLMs) are capable of producing content that perpetuates stereotypes, discrimination, and toxicity. The recently proposed moral self-correction is a computationally efficient method for reducing harmful content in the responses of LLMs. We argue that self-correction can help LLMs find a shortcut to more morally correct output, rather than truly reducing the immorality stored in hidden states.
arXiv Detail & Related papers (2024-07-21T22:50:11Z)
Large Language Models have Intrinsic Self-Correction Ability [18.79203446847577]
Large language models (LLMs) have attracted significant attention for their exceptional abilities in various natural language processing tasks.<n>One promising solution to improve the LLMs' performance is to ask LLMs to revise their answer after generation.<n>In intrinsic self-correction is considered a promising direction because it does not utilize external knowledge.
arXiv Detail & Related papers (2024-06-21T22:29:40Z)
On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept [36.27550578296276]
Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. In intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. We show that intrinsic self-correction can be progressively improved, allowing it to approach a converged state.
arXiv Detail & Related papers (2024-06-04T14:55:43Z)
A Theoretical Understanding of Self-Correction through In-context Alignment [51.622068973630796]
Large language models (LLMs) are capable of improving their abilities purely by self-correction. We show that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Inspired by these findings, we also illustrate applications of self-correction, such as defending against LLM jailbreaks.
arXiv Detail & Related papers (2024-05-28T22:33:02Z)
Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks. We propose an innovative textitmetacognitive approach, dubbed textbfCLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z)
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation [71.91287418249688]
Large language models (LLMs) often struggle with factual inaccuracies, even when they hold relevant knowledge. We leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality. We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks.
arXiv Detail & Related papers (2024-02-14T15:52:42Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
Large Language Models Cannot Self-Correct Reasoning Yet [78.16697476530994]
Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities. Concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues.
arXiv Detail & Related papers (2023-10-03T04:56:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.