Related papers: Are DeepSeek R1 And Other Reasoning Models More Faithful?

Are DeepSeek R1 And Other Reasoning Models More Faithful?

URL: http://arxiv.org/abs/2501.08156v4
Date: Thu, 20 Feb 2025 02:48:34 GMT
Title: Are DeepSeek R1 And Other Reasoning Models More Faithful?
Authors: James Chua, Owain Evans,
Abstract summary: We evaluate three reasoning models based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base.<n>We test whether models can describe how a cue in their prompt influences their answer to MMLU questions.<n> Reasoning models describe cues that influence them much more reliably than all the non-reasoning models tested.
Score: 2.0429566123690455
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models trained to solve reasoning tasks via reinforcement learning have achieved striking results. We refer to these models as reasoning models. A key question emerges: Are the Chains of Thought (CoTs) of reasoning models more faithful than traditional models? To investigate this, we evaluate three reasoning models (based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base) on an existing test of faithful CoT. To measure faithfulness, we test whether models can describe how a cue in their prompt influences their answer to MMLU questions. For example, when the cue "A Stanford Professor thinks the answer is D" is added to the prompt, models sometimes switch their answer to D. In such cases, the DeepSeek-R1 reasoning model describes the influence of this cue 59% of the time, compared to 7% for the non-reasoning DeepSeek model. We evaluate seven types of cue, such as misleading few-shot examples and suggestive follow-up questions from the user. Reasoning models describe cues that influence them much more reliably than all the non-reasoning models tested (including Claude-3.5-Sonnet and GPT-4). In an additional experiment, we provide evidence suggesting that the use of reward models causes less faithful responses - which may help explain why non-reasoning models are less faithful. Our study has two main limitations. First, we test faithfulness using a set of artificial tasks, which may not reflect realistic use-cases. Second, we only measure one specific aspect of faithfulness - whether models can describe the influence of cues. Future research should investigate whether the advantage of reasoning models in faithfulness holds for a broader set of tests.

Related papers

Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification [23.190823296729732]
We study whether reasoning models encode information about answer correctness through probing the model's hidden states. The resulting probe can verify intermediate answers with high accuracy and produces highly calibrated scores.
arXiv Detail & Related papers (2025-04-07T18:42:01Z)
Chain-of-Thought Reasoning In The Wild Is Not Always Faithful [3.048751803239144]
Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. We show that unfaithful CoT can occur on realistic prompts with no artificial bias. Specifically, we find that models rationalize their implicit biases in answers to binary questions.
arXiv Detail & Related papers (2025-03-11T17:56:30Z)
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [70.77691645678804]
We present the first successful replication of emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT setting by 2%. In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models.
arXiv Detail & Related papers (2025-03-07T04:21:47Z)
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs [28.565225092457897]
Reinforcement learning can drive self-improvement in language models on verifiable tasks. We find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them.
arXiv Detail & Related papers (2025-03-03T08:46:22Z)
Fact-or-Fair: A Checklist for Behavioral Testing of AI Models on Fairness-Related Queries [85.909363478929]
In this study, we focus on 19 real-world statistics collected from authoritative sources. We develop a checklist comprising objective and subjective queries to analyze behavior of large language models. We propose metrics to assess factuality and fairness, and formally prove the inherent trade-off between these two aspects.
arXiv Detail & Related papers (2025-02-09T10:54:11Z)
Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints. We empirically find that this training paradigm limits the one-step generation performance of consistency models. We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
SuperCorrect: Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction [89.56181323849512]
SuperCorrect is a novel two-stage framework that uses a large teacher model to supervise and correct both the reasoning and reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model.
arXiv Detail & Related papers (2024-10-11T17:25:52Z)
What Matters for Model Merging at Scale? [94.26607564817786]
Model merging aims to combine multiple expert models into a more capable single model. Previous studies have primarily focused on merging a few small models. This study systematically evaluates the utility of model merging at scale.
arXiv Detail & Related papers (2024-10-04T17:17:19Z)
A Training Rate and Survival Heuristic for Inference and Robustness Evaluation (TRASHFIRE) [1.622320874892682]
This work addresses the problem of understanding and predicting how particular model hyper- parameters influence the performance of a model in the presence of an adversary. The proposed approach uses survival models, worst-case examples, and a cost-aware analysis to precisely and accurately reject a particular model change. Using the proposed methodology, we show that ResNet is hopelessly against even the simplest of white box attacks.
arXiv Detail & Related papers (2024-01-24T19:12:37Z)
What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception [53.4840989321394]
We analyze the effect of rationales generated by QA models to support their answers. We present users with incorrect answers and corresponding rationales in various formats. We measure the effectiveness of this feedback in patching these rationales through in-context learning.
arXiv Detail & Related papers (2023-11-16T04:26:32Z)
A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check. Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models. The commonly used benchmark, SIGHAN, can not reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
Measuring Faithfulness in Chain-of-Thought Reasoning [19.074147845029355]
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question. It is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question) We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT.
arXiv Detail & Related papers (2023-07-17T01:08:39Z)
Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models [74.40196814292426]
We propose Graph-of-Thought (GoT) reasoning, which models human thought processes not only as a chain but also as a graph. GoT captures the non-sequential nature of human thinking and allows for a more realistic modeling of thought processes. We evaluate GoT's performance on a text-only reasoning task and a multimodal reasoning task.
arXiv Detail & Related papers (2023-05-26T02:15:09Z)
Self-Ensemble Protection: Training Checkpoints Are Good Data Protectors [41.45649235969172]
Self-ensemble protection (SEP) is proposed to prevent training good models on the data. SEP is verified to be a new state-of-the-art, e.g., our small perturbations reduce the accuracy of a CIFAR-10 ResNet18 from 94.56% to 14.68%, compared to 41.35% by the best-known method.
arXiv Detail & Related papers (2022-11-22T04:54:20Z)
Robust Models are less Over-Confident [10.42820615166362]
adversarial training (AT) aims to achieve robustness against such attacks. We empirically analyze a variety of adversarially trained models that achieve high robust accuracies. AT has an interesting side-effect: it leads to models that are significantly less overconfident with their decisions.
arXiv Detail & Related papers (2022-10-12T06:14:55Z)
Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews. We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions. Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions. We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors. We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
arXiv Detail & Related papers (2020-10-06T01:49:52Z)
Roses Are Red, Violets Are Blue... but Should Vqa Expect Them To? [0.0]
We argue that the standard evaluation metric, which consists in measuring the overall in-domain accuracy, is misleading. We propose the GQA-OOD benchmark designed to overcome these concerns.
arXiv Detail & Related papers (2020-06-09T08:50:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.