ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability
- URL: http://arxiv.org/abs/2510.09062v1
- Date: Fri, 10 Oct 2025 07:08:44 GMT
- Title: ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability
- Authors: Chung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng
- Abstract summary: We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. We propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to improve interpretability. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces.
- Score: 23.70973331911138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation's soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness. Our code is available at: https://github.com/Trustworthy-ML-Lab/Training_Trustworthy_LRM_with_Refine
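The abstract describes structured, tag-based traces with cross-section references and a confidence self-assessment. As a minimal sketch of what consuming such output might look like, the snippet below parses a hypothetical trace (the tag names and the example trace are illustrative assumptions, not the paper's exact schema):

```python
import re

# Hypothetical example of a tag-structured reasoning trace in the style the
# abstract describes. Tag names here are invented for illustration.
trace = """
<plan>Factor the quadratic, then read off the roots.</plan>
<facts>x^2 - 5x + 6 = 0, with integer coefficients.</facts>
<derivation>Using the facts above: x^2 - 5x + 6 = (x - 2)(x - 3), so x = 2 or x = 3.</derivation>
<self_check>Both roots satisfy the original equation.</self_check>
<confidence>0.95</confidence>
<answer>x = 2 or x = 3</answer>
"""

def parse_trace(text):
    """Extract each tagged section into a dict; confidence becomes a float."""
    sections = dict(re.findall(r"<(\w+)>(.*?)</\1>", text, re.DOTALL))
    if "confidence" in sections:
        sections["confidence"] = float(sections["confidence"])
    return sections

parsed = parse_trace(trace)
print(parsed["answer"])      # -> x = 2 or x = 3
print(parsed["confidence"])  # -> 0.95
```

A downstream system could, for example, route low-confidence answers to a human reviewer, which is one practical payoff of the reliability property the paper targets.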
Related papers
- Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems [94.9141394384021]
Individual agents in multi-agent systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and an inadequate ability to evaluate peer reliability. We first formalize the learning problem of history-aware reference, introducing the historical interactions of peers as additional input. We then develop Epistemic Context Learning (ECL), a reasoning framework that conditions predictions on peer profiles explicitly built from history.
arXiv Detail & Related papers (2026-01-29T13:59:32Z) - Revisiting the Reliability of Language Models in Instruction-Following [15.281163913211818]
LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. We study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior.
arXiv Detail & Related papers (2025-12-15T02:57:55Z) - ConfTuner: Training Large Language Models to Express Their Confidence Verbally [58.63318088243125]
Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare. LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence".
arXiv Detail & Related papers (2025-08-26T09:25:32Z) - Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions. We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Our code and all baselines used in the paper are available on GitHub.
arXiv Detail & Related papers (2025-08-16T13:29:35Z) - Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints [18.10515528600634]
We propose Deliberative Searcher, the first framework to integrate certainty calibration with retrieval-based search for open-domain question answering. The agent performs multi-step reflection and verification over Wikipedia data and is trained with a reinforcement learning algorithm that optimizes for accuracy under a soft reliability constraint.
arXiv Detail & Related papers (2025-07-22T16:09:34Z) - Aligning Large Language Models for Faithful Integrity Against Opposing Argument [71.33552795870544]
Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. We propose a novel framework named Alignment for Faithful Integrity with Confidence Estimation.
arXiv Detail & Related papers (2025-01-02T16:38:21Z) - Large Language Model Confidence Estimation via Black-Box Access [30.490207799344333]
We explore the problem of estimating confidence for responses of large language models (LLMs) with only black-box or query access to them. We propose a simple and generalizable framework in which we engineer novel features and train an (interpretable) model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating the confidence of Flan-ul2, -13b, Mistral-7b, and GPT-4 on four benchmark Q&A tasks, as well as Pegasus-large and BART-large on two benchmark summarization tasks.
arXiv Detail & Related papers (2024-06-01T02:08:44Z) - Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models [14.5291643644017]
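The black-box recipe above, engineering simple features of a model's responses and fitting an interpretable classifier, can be sketched with a toy logistic regression. The features and data below are invented for illustration and are not the paper's actual feature set; the trainer is plain gradient descent so the sketch needs no external libraries:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Plain gradient-descent logistic regression (no external libraries)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log-loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy "black-box" features per response (both invented for illustration):
#   f1 = fraction of sampled re-generations that agree with the answer
#   f2 = normalized response length
X = [[0.9, 0.4], [0.8, 0.5], [0.2, 0.9], [0.1, 0.8], [0.95, 0.3], [0.3, 0.7]]
y = [1, 1, 0, 0, 1, 0]  # 1 = the answer turned out to be correct

w, b = train_logreg(X, y)
conf = sigmoid(sum(wj * xj for wj, xj in zip(w, [0.85, 0.45])) + b)
print(f"estimated confidence: {conf:.2f}")
```

Because the classifier is a linear model, its learned weights directly show which feature drives the confidence estimate, which is the interpretability argument the abstract makes.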
We introduce the concept of Confidence-Probability Alignment.
We probe the alignment between models' internal and expressed confidence.
Among the models analyzed, OpenAI's GPT-4 showed the strongest confidence-probability alignment.
arXiv Detail & Related papers (2024-05-25T15:42:04Z) - More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness [24.843692458375436]
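Alignment between expressed confidence and actual correctness, as studied above, is commonly quantified with expected calibration error (ECE). A minimal sketch of that metric, with toy data invented for illustration:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by stated confidence, then take the weighted
    average gap between mean confidence and empirical accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Toy data: two answers stated at 0.95 (both right), two at 0.55 (one right).
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
print(ece)
```

A perfectly calibrated model would score 0; overconfident models, the failure mode these papers highlight, push the stated confidence above the empirical accuracy and inflate the gap.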
This study investigates how models aligned with general-purpose preference data perform across five trustworthiness verticals. Our results demonstrate that RLHF on human preferences does not automatically guarantee trustworthiness, and reverse effects are often observed. We propose to adapt efficient influence-function-based data attribution methods to the RLHF setting to better understand the influence of fine-tuning data on individual trustworthiness benchmarks.
arXiv Detail & Related papers (2024-04-29T17:00:53Z) - When to Trust LLMs: Aligning Confidence with Response Quality [49.371218210305656]
We propose CONfidence-Quality-ORDer-preserving alignment approach (CONQORD)
It integrates quality reward and order-preserving alignment reward functions.
Experiments demonstrate that CONQORD significantly improves the alignment performance between confidence and response accuracy.
arXiv Detail & Related papers (2024-04-26T09:42:46Z) - Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression [109.23761449840222]
This study conducts the first thorough evaluation of leading Large Language Models (LLMs) under compression.
We find that quantization is currently a more effective approach than pruning in achieving efficiency and trustworthiness simultaneously.
arXiv Detail & Related papers (2024-03-18T01:38:19Z) - Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection [90.71323430635593]
We propose a novel self-detection paradigm that considers the comprehensive answer space beyond LLM-generated answers.
Building upon this paradigm, we introduce a two-step framework, which firstly instructs LLM to reflect and provide justifications for each candidate answer.
This framework can be seamlessly integrated with existing approaches for superior self-detection.
arXiv Detail & Related papers (2024-03-15T02:38:26Z) - Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.