How Reliable are LLMs as Knowledge Bases? Re-thinking Factuality and Consistency
- URL: http://arxiv.org/abs/2407.13578v2
- Date: Mon, 16 Dec 2024 11:23:14 GMT
- Title: How Reliable are LLMs as Knowledge Bases? Re-thinking Factuality and Consistency
- Authors: Danna Zheng, Mirella Lapata, Jeff Z. Pan
- Abstract summary: Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs). Current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance. We propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score.
- Score: 60.25969380388974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs), yet current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance. In this work, we rethink the requirements for evaluating reliable LLM-as-KB usage and highlight two essential factors: factuality, ensuring accurate responses to seen and unseen knowledge, and consistency, maintaining stable answers to questions about the same knowledge. We introduce UnseenQA, a dataset designed to assess LLM performance on unseen knowledge, and propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score. Our experiments on 26 LLMs reveal several challenges regarding their use as KBs, underscoring the need for more principled and comprehensive evaluation.
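The abstract does not spell out its metrics, so the sketch below shows one minimal way factuality and consistency could be quantified and combined into a single reliability score. The scoring rules, the abstention convention for unseen knowledge, and the weighted average are illustrative assumptions, not the paper's definitions.

```python
from collections import Counter

def factuality(answers, gold):
    """Fraction of questions answered correctly. For unseen knowledge
    (no gold answer), the assumed correct behaviour is to abstain
    by answering "unknown"."""
    correct = 0
    for question, answer in answers.items():
        target = gold.get(question)
        correct += (answer == target) if target is not None else (answer == "unknown")
    return correct / len(answers)

def consistency(paraphrase_answers):
    """Mean agreement across paraphrases of the same question:
    the relative frequency of each question's most common answer."""
    scores = []
    for variants in paraphrase_answers.values():
        most_common_count = Counter(variants).most_common(1)[0][1]
        scores.append(most_common_count / len(variants))
    return sum(scores) / len(scores)

def reliability(fact, cons, alpha=0.5):
    """Illustrative weighted combination of the two criteria."""
    return alpha * fact + (1 - alpha) * cons

# Toy usage: one seen fact, one unseen fact, three paraphrases.
answers = {"Capital of France?": "Paris", "Mayor of Atlantis?": "unknown"}
gold = {"Capital of France?": "Paris"}  # no gold entry for the unseen question
paraphrases = {"Capital of France?": ["Paris", "Paris", "Paris"]}
print(reliability(factuality(answers, gold), consistency(paraphrases)))  # 1.0
```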
Related papers
- Decoding Knowledge in Large Language Models: A Framework for Categorization and Comprehension [14.039653386385519]
This paper examines how large language models (LLMs) acquire, retain, and apply knowledge.
This paper introduces a novel framework, K-(CSA)^2, which categorizes LLM knowledge along two dimensions: correctness and confidence.
arXiv Detail & Related papers (2025-01-02T16:34:10Z)
- A Survey on LLM-as-a-Judge [20.228675148114245]
Large Language Models (LLMs) have achieved remarkable success across diverse domains.
LLMs present a compelling alternative to traditional expert-driven evaluations.
This paper addresses the core question: How can reliable LLM-as-a-Judge systems be built?
arXiv Detail & Related papers (2024-11-23T16:03:35Z)
- Traditional Methods Outperform Generative LLMs at Forecasting Credit Ratings [17.109522466982476]
Large Language Models (LLMs) have been shown to perform well for many downstream tasks.
This paper investigates how well LLMs perform in the task of forecasting corporate credit ratings.
arXiv Detail & Related papers (2024-07-24T20:30:55Z)
- Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.
It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
- CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating how well large language models (LLMs) identify and clarify ambiguous information needs, organized around a well-structured taxonomy.
Building upon the taxonomy, we construct 12K high-quality examples to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
- Towards Reliable Latent Knowledge Estimation in LLMs: In-Context Learning vs. Prompting Based Factual Knowledge Extraction [15.534647327246239]
We propose an approach for estimating the latent knowledge embedded inside large language models (LLMs).
We leverage the in-context learning abilities of LLMs to estimate the extent to which an LLM knows the facts stored in a knowledge base.
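As a rough illustration of this in-context estimation idea: prompt the model with a few demonstrations of a relation and check whether it completes the target fact. The cloze-style `subject -> object` template and the `complete` callable below are placeholders for a real prompt format and LLM call, not the paper's protocol.

```python
def knows_fact(complete, demos, subject, gold_object):
    """Prompt the model with in-context demonstrations of a relation
    (e.g. country -> capital) and check whether it completes the
    target fact correctly. `complete` maps a prompt string to the
    model's continuation (placeholder for an actual LLM API call)."""
    prompt = "".join(f"{s} -> {o}\n" for s, o in demos)
    prompt += f"{subject} -> "
    return complete(prompt).strip().startswith(gold_object)

def latent_knowledge_estimate(complete, demos, facts):
    """Estimated fraction of knowledge-base facts the model can recall."""
    hits = sum(knows_fact(complete, demos, s, o) for s, o in facts)
    return hits / len(facts)

# Toy usage with a stub "model" that knows one capital:
stub = lambda p: "Paris" if p.endswith("France -> ") else "?"
demos = [("Italy", "Rome"), ("Japan", "Tokyo")]
print(latent_knowledge_estimate(stub, demos, [("France", "Paris")]))  # 1.0
```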
arXiv Detail & Related papers (2024-04-19T15:40:39Z)
- Certifying Knowledge Comprehension in LLMs [3.6293956720749425]
We introduce the first specification and certification framework for knowledge comprehension in Large Language Models (LLMs).
Instead of a fixed dataset, we design novel specifications that mathematically represent prohibitively large probability distributions of knowledge comprehension prompts with natural noise.
We apply our framework to certify SOTA LLMs in two domains: precision medicine and general question-answering.
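The entry gives no formal machinery, but a common primitive for certificates over a prompt distribution is a concentration bound on sampled accuracy. The Hoeffding-style bound below is an assumption chosen for illustration; the framework's actual certification method may differ.

```python
import math

def certified_accuracy_lower_bound(check, sample_prompt, n=1000, delta=0.01):
    """Sample n prompts from the specification's distribution and return a
    value that, with probability at least 1 - delta, lower-bounds the true
    probability that the model answers a random prompt correctly.

    `sample_prompt` draws one noisy comprehension prompt; `check` returns
    True iff the model answers it correctly (both are placeholders).
    Uses the one-sided Hoeffding bound: t = sqrt(ln(1/delta) / (2n)).
    """
    hits = sum(bool(check(sample_prompt())) for _ in range(n))
    empirical = hits / n
    slack = math.sqrt(math.log(1.0 / delta) / (2 * n))
    return max(0.0, empirical - slack)
```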
arXiv Detail & Related papers (2024-02-24T23:16:57Z)
- TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness [58.721012475577716]
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications.
This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLM's response aligns with its intrinsic knowledge.
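A toy sketch of a behavioral-consistency check in this spirit: re-pose the question as multiple choice with the model's own free-form answer hidden among distractors, and see how often the model re-selects it. This construction and the `answer_fn` callable are illustrative assumptions, not TrustScore's exact procedure.

```python
import random

def behavioral_consistency(answer_fn, question, response, distractors, trials=5):
    """How often the model re-selects its own free-form `response` when the
    question is re-posed as multiple choice with shuffled options."""
    agree = 0
    for _ in range(trials):
        options = distractors + [response]
        random.shuffle(options)
        prompt = question + "\n" + "\n".join(
            f"{chr(65 + i)}. {option}" for i, option in enumerate(options))
        letter = answer_fn(prompt).strip()[0]  # model replies with "A", "B", ...
        agree += options[ord(letter) - 65] == response
    return agree / trials
```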
arXiv Detail & Related papers (2024-02-19T21:12:14Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)
- Eva-KELLM: A New Benchmark for Evaluating Knowledge Editing of LLMs [54.22416829200613]
Eva-KELLM is a new benchmark for evaluating knowledge editing of large language models.
Experimental results indicate that the current methods for knowledge editing using raw documents are not effective in yielding satisfactory results.
arXiv Detail & Related papers (2023-08-19T09:17:19Z)
- Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation [109.8527403904657]
We show that large language models (LLMs) possess unwavering confidence in their knowledge and cannot handle the conflict between internal and external knowledge well.
Retrieval augmentation proves to be an effective approach in enhancing LLMs' awareness of knowledge boundaries.
We propose a simple method to dynamically utilize supporting documents with our judgement strategy.
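A minimal sketch of such a "judge first, retrieve only if needed" strategy: ask the model whether it can answer from internal knowledge and fall back to retrieved documents when it says it cannot. The self-assessment prompt, the YES/NO protocol, and both callables are illustrative assumptions, not necessarily the paper's judgement strategy.

```python
def answer_with_dynamic_retrieval(llm, retrieve, question):
    """Ask the model to judge whether it can answer from internal knowledge,
    and fall back to retrieved supporting documents only when it cannot.
    `llm` maps a prompt to a completion; `retrieve` maps a question to a
    list of document strings (both placeholders for real components)."""
    judgement = llm("Can you answer the following question from your own "
                    f"knowledge alone? Reply YES or NO.\nQuestion: {question}")
    if judgement.strip().upper().startswith("YES"):
        return llm(f"Question: {question}\nAnswer:")
    docs = "\n".join(retrieve(question))
    return llm(f"Answer using these documents:\n{docs}\n"
               f"Question: {question}\nAnswer:")
```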
arXiv Detail & Related papers (2023-07-20T16:46:10Z)
- Fairness of ChatGPT and the Role Of Explainable-Guided Prompts [6.079011829257036]
Our research investigates the potential of Large-scale Language Models (LLMs), specifically OpenAI's GPT, in credit risk assessment.
Our findings suggest that LLMs, when directed by judiciously designed prompts and supplemented with domain-specific knowledge, can parallel the performance of traditional Machine Learning (ML) models.
arXiv Detail & Related papers (2023-07-14T09:20:16Z)
- Give Us the Facts: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling [34.59678835272862]
ChatGPT, a representative large language model (LLM), has gained considerable attention due to its powerful emergent abilities.
This paper proposes knowledge graph-enhanced large language models (KGLLMs), which augment LLMs with structured knowledge from knowledge graphs.
KGLLM provides a solution to enhance LLMs' factual reasoning ability, opening up new avenues for LLM research.
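KGLLMs reportedly integrate knowledge graphs into the model itself; a much simpler prompt-level stand-in is to serialize graph triples into the context, as sketched below. The template is an assumption for illustration and is not the KGLLM architecture.

```python
def fact_aware_prompt(triples, question):
    """Serialize knowledge-graph triples into a textual prefix so a stock
    LLM can condition on explicit facts when answering."""
    facts = "\n".join(f"{s} {r} {o}." for s, r, o in triples)
    return f"Known facts:\n{facts}\n\nQuestion: {question}\nAnswer:"

print(fact_aware_prompt(
    [("Edinburgh", "is the capital of", "Scotland")],
    "What is the capital of Scotland?"))
```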
arXiv Detail & Related papers (2023-06-20T12:21:06Z)
- KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA).
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use both Wikipedia, a corpus on which LLMs are commonly pre-trained, and continuously collected emerging corpora to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)