CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision
- URL: http://arxiv.org/abs/2506.14912v1
- Date: Tue, 17 Jun 2025 18:44:21 GMT
- Title: CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision
- Authors: Dyah Adila, Shuai Zhang, Boran Han, Bonan Min, Yuyang Wang,
- Abstract summary: CrEst is a weakly supervised framework for assessing the credibility of context documents during inference.<n>Experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines.
- Score: 15.604947362541415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The integration of contextual information has significantly enhanced the performance of large language models (LLMs) on knowledge-intensive tasks. However, existing methods often overlook a critical challenge: the credibility of context documents can vary widely, potentially leading to the propagation of unreliable information. In this paper, we introduce CrEst, a novel weakly supervised framework for assessing the credibility of context documents during LLM inference--without requiring manual annotations. Our approach is grounded in the insight that credible documents tend to exhibit higher semantic coherence with other credible documents, enabling automated credibility estimation through inter-document agreement. To incorporate credibility into LLM inference, we propose two integration strategies: a black-box approach for models without access to internal weights or activations, and a white-box method that directly modifies attention mechanisms. Extensive experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines, achieving up to a 26.86% improvement in accuracy and a 3.49% increase in F1 score. Further analysis shows that CrEst maintains robust performance even under high-noise conditions.
Related papers
- Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems [94.9141394384021]
Individual agents in multi-agent systems often lack robustness, tending to blindly conform to misleading peers.<n>We show this weakness stems from both sycophancy and inadequate ability to evaluate peer reliability.<n>We first formalize the learning problem of history-aware reference, introducing the historical interactions of peers as additional input.<n>We then develop Epistemic Context Learning (ECL), a reasoning framework that conditions predictions on explicitly-built peer profiles from history.
arXiv Detail & Related papers (2026-01-29T13:59:32Z) - ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability [23.70973331911138]
We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability.<n>We propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to improve interpretability.<n>Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces.
arXiv Detail & Related papers (2025-10-10T07:08:44Z) - DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding [66.07724324530844]
We propose DocThinker, a rule-based Reinforcement Learning framework for dynamic inference-time reasoning.<n>Our method mitigates catastrophic forgetting and enhances both adaptability and transparency.<n>Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding.
arXiv Detail & Related papers (2025-08-12T03:06:55Z) - LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting.<n>We introduce LLMEval-3, a framework for dynamic evaluation of LLMs.<n>LLEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z) - Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints [18.10515528600634]
We propose textbfDeliberative Searcher, the first framework to integrate certainty calibration with retrieval-based search for open-domain question answering.<n>The agent performs multi-step reflection and verification over Wikipedia data and is trained with a reinforcement learning algorithm that optimize for accuracy under a soft reliability constraint.
arXiv Detail & Related papers (2025-07-22T16:09:34Z) - Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding [59.50808215134678]
This study introduces Trust-videoLLMs, a first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs.<n>Results reveal significant limitations in dynamic scene comprehension, cross-modal resilience and real-world risk mitigation.
arXiv Detail & Related papers (2025-06-14T04:04:54Z) - Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models [15.158475816860427]
Uncertainty is essential for assessing the reliability and trustworthiness of modern AI systems.<n> verbalized uncertainty, where models express their confidence through natural language, has emerged as a lightweight and interpretable solution.<n>However, its effectiveness in vision-language models (VLMs) remains insufficiently studied.
arXiv Detail & Related papers (2025-05-26T17:16:36Z) - Document Attribution: Examining Citation Relationships using Large Language Models [62.46146670035751]
We propose a zero-shot approach that frames attribution as a straightforward textual entailment task.<n>We also explore the role of the attention mechanism in enhancing the attribution process.
arXiv Detail & Related papers (2025-05-09T04:40:11Z) - Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions [8.069858557211132]
Large Language Models (LLMs) have shown remarkable capabilities across various tasks.<n>Their deployment in high-stake domains requires consistent and coherent behavior across multiple rounds of user interaction.<n>This paper introduces a comprehensive framework for evaluating and improving LLM response consistency.
arXiv Detail & Related papers (2025-03-28T11:49:56Z) - Evaluating and Advancing Multimodal Large Language Models in Perception Ability Lens [30.083110119139793]
We introduce textbfAbilityLens, a unified benchmark designed to evaluate MLLMs in six key perception abilities.<n>We identify the strengths and weaknesses of current main-stream MLLMs, highlighting stability patterns and revealing a notable performance gap between state-of-the-art open-source and closed-source models.
arXiv Detail & Related papers (2024-11-22T04:41:20Z) - FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios.<n>FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework.<n>Our study reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness.
arXiv Detail & Related papers (2024-09-30T06:27:53Z) - Improving Retrieval Augmented Language Model with Self-Reasoning [20.715106330314605]
We propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs.<n>The framework involves constructing self-reason trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process.<n>We have evaluated our framework across four public datasets to demonstrate the superiority of our method.
arXiv Detail & Related papers (2024-07-29T09:05:10Z) - MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models [51.19622266249408]
MultiTrust is the first comprehensive and unified benchmark on the trustworthiness of MLLMs.<n>Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts.<n>Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks.
arXiv Detail & Related papers (2024-06-11T08:38:13Z) - Improving Open Information Extraction with Large Language Models: A
Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z) - SAIS: Supervising and Augmenting Intermediate Steps for Document-Level
Relation Extraction [51.27558374091491]
We propose to explicitly teach the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps (SAIS) for relation extraction.
Based on a broad spectrum of carefully designed tasks, our proposed SAIS method not only extracts relations of better quality due to more effective supervision, but also retrieves the corresponding supporting evidence more accurately.
arXiv Detail & Related papers (2021-09-24T17:37:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.