Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall
- URL: http://arxiv.org/abs/2404.16164v1
- Date: Wed, 24 Apr 2024 19:40:01 GMT
- Title: Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall
- Authors: Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang
- Abstract summary: Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks.
We focus on assessing LLMs' ability to recall factual knowledge learned from pretraining.
We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses.
- Score: 31.45796499298925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and that model scaling helps, as larger models outperform smaller ones across all model families. However, the best performance from GPT-4 still leaves a large gap to the upper bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling known and unknown knowledge, we find the degradation is attributable to exemplars that contradict a model's known knowledge, as well as to the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. We find that fine-tuning on a model's known knowledge is beneficial and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.
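The recall setup described in the abstract can be illustrated with a minimal sketch. The field names below are hypothetical stand-ins mirroring the benchmark's stated axes (domains, property types, answer types, popularity levels), not FACT-BENCH's actual schema:

```python
from dataclasses import dataclass

@dataclass
class FactRecord:
    # Hypothetical fields mirroring the benchmark's described axes:
    # 20 domains, 134 property types, 3 answer types, popularity levels.
    subject: str
    property_type: str
    answer: str
    domain: str
    answer_type: str   # e.g. entity / date / number
    popularity: str    # e.g. high / medium / low

def to_prompt(fact: FactRecord) -> str:
    """Turn a fact record into a cloze-style recall query."""
    return f"Q: What is the {fact.property_type} of {fact.subject}?\nA:"

def is_correct(model_output: str, fact: FactRecord) -> bool:
    """Exact-match style scoring on the model's completion."""
    return fact.answer.lower() in model_output.lower()

fact = FactRecord("France", "capital", "Paris",
                  domain="geography", answer_type="entity", popularity="high")
prompt = to_prompt(fact)
```

A benchmark run would then send each prompt to the model under test and aggregate `is_correct` over domains and popularity buckets to localize recall failures.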
Related papers
- Estimating Knowledge in Large Language Models Without Generating a Single Token [12.913172023910203]
We study whether it is possible to estimate how knowledgeable a model is about a certain entity, only from its internal computation.
Experiments show that KEEN, a simple probe trained over internal subject representations, succeeds at both tasks.
Being simple and lightweight, KEEN can be leveraged to identify gaps and clusters of entity knowledge in large language models.
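The kind of probe the summary describes can be sketched generically: a lightweight classifier trained on internal representations to predict whether the model "knows" an entity. The code below is a minimal logistic-regression probe on toy vectors, an illustration of the general technique rather than KEEN's actual architecture or training data:

```python
import numpy as np

# Toy stand-ins for hidden-state "subject representations" (dim 4):
# label 1 = the model knows the entity well, 0 = it does not.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
w_true = np.array([2.0, -1.0, 0.5, 0.0])
y = (X @ w_true > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train a minimal logistic-regression probe by gradient descent.
w = np.zeros(4)
for _ in range(500):
    p = sigmoid(X @ w)
    w -= 0.1 * X.T @ (p - y) / len(y)

def knowledge_score(hidden_state: np.ndarray) -> float:
    """Probe output in [0, 1]: estimated 'knownness' of the entity."""
    return float(sigmoid(hidden_state @ w))

accuracy = float(np.mean((sigmoid(X @ w) > 0.5) == y))
```

Because the probe reads representations directly, scoring an entity costs one forward pass and a dot product, with no text generation.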
arXiv Detail & Related papers (2024-06-18T14:45:50Z) - Limited Out-of-Context Knowledge Reasoning in Large Language Models [65.72847298578071]
Large Language Models (LLMs) have demonstrated strong capabilities as knowledge bases and significant in-context reasoning capabilities.
This paper focuses on a significant facet of out-of-context reasoning: Out-of-context Knowledge Reasoning (OCKR), which combines multiple pieces of knowledge to infer new knowledge.
arXiv Detail & Related papers (2024-06-11T15:58:59Z) - Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? [33.702498916775426]
We study the impact of new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge.
We demonstrate that large language models struggle to acquire new factual knowledge through fine-tuning.
As the examples with new knowledge are eventually learned, they linearly increase the model's tendency to hallucinate.
arXiv Detail & Related papers (2024-05-09T17:00:22Z) - Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws [51.68385617116854]
Scaling laws describe the relationship between the size of language models and their capabilities.
We focus on factual knowledge represented as (subject, relation, value) tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page.
A 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined.
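The headline figure corresponds to the roughly 2-bits-per-parameter capacity law that paper reports; the arithmetic is direct:

```python
params = 7e9          # 7B-parameter model
stored_bits = 14e9    # 14B bits of factual knowledge

bits_per_param = stored_bits / params  # capacity per parameter
```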
arXiv Detail & Related papers (2024-04-08T11:11:31Z) - Will the Real Linda Please Stand up...to Large Language Models? Examining the Representativeness Heuristic in LLMs [7.100094213474042]
Large language models (LLMs) have demonstrated remarkable proficiency in modeling text and generating human-like text.
LLMs may be susceptible to a common cognitive trap in human decision-making called the representativeness heuristic.
This research investigates the impact of the representativeness heuristic on LLM reasoning.
arXiv Detail & Related papers (2024-04-01T20:15:06Z) - Hallucinations or Attention Misdirection? The Path to Strategic Value Extraction in Business Using Large Language Models [0.0]
This paper characterizes such errors as attention misdirection rather than true hallucinations.
It highlights best practices of the PGI (Persona, Grouping, and Intelligence) method.
arXiv Detail & Related papers (2024-02-21T18:40:24Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination".
We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
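One common way to realize such an accept-or-reject policy (a generic sketch, not the paper's exact method) is to score an answer by its length-normalized token log-probability and abstain below a threshold:

```python
import math

def answer_confidence(token_logprobs: list[float]) -> float:
    """Length-normalized sequence probability from per-token log-probs."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def accept_or_reject(answer: str, token_logprobs: list[float],
                     threshold: float = 0.5) -> str:
    """Return the answer if confidence clears the threshold, else abstain."""
    if answer_confidence(token_logprobs) >= threshold:
        return answer
    return "[ABSTAIN: low confidence]"

confident = accept_or_reject("Paris", [-0.05, -0.10])     # high confidence
uncertain = accept_or_reject("Lyon", [-2.0, -3.0, -2.5])  # low confidence
```

The threshold here is a hypothetical knob; in practice it would be calibrated on held-out data so that abstentions concentrate on likely hallucinations.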
arXiv Detail & Related papers (2023-10-07T12:06:53Z) - An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information while acquiring new knowledge.
This study empirically evaluates the forgetting phenomenon in large language models (LLMs) during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.