Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection
- URL: http://arxiv.org/abs/2505.12306v1
- Date: Sun, 18 May 2025 08:39:05 GMT
- Title: Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection
- Authors: Yuwei Zhang, Wenhao Yu, Shangbin Feng, Yifan Zhu, Letian Peng, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang,
- Abstract summary: We introduce a novel, real-world and large-scale knowledge injection benchmark that evolves continuously over time without requiring human intervention. We propose WikiDYK, which leverages recently-added and human-written facts from Wikipedia's "Did You Know..." entries. WikiDYK contains 12,290 facts and 77,180 questions, and is seamlessly extensible with future updates from Wikipedia editors.
- Score: 48.188285483378664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite significant advances in large language models (LLMs), their knowledge memorization capabilities remain underexplored, due to the lack of a standardized, high-quality testbed. In this paper, we introduce a novel, real-world and large-scale knowledge injection benchmark that evolves continuously over time without requiring human intervention. Specifically, we propose WikiDYK, which leverages recently-added and human-written facts from Wikipedia's "Did You Know..." entries. These entries are carefully selected by expert Wikipedia editors based on criteria such as verifiability and clarity. Each entry is converted into multiple question-answer pairs spanning diverse task formats, from easy cloze prompts to complex multi-hop questions. WikiDYK contains 12,290 facts and 77,180 questions, and is seamlessly extensible with future updates from Wikipedia editors. Extensive experiments using continued pre-training reveal a surprising insight: despite their prevalence in modern LLMs, Causal Language Models (CLMs) demonstrate significantly weaker knowledge memorization capabilities than Bidirectional Language Models (BiLMs), exhibiting 23% lower accuracy in terms of reliability. To compensate for the smaller scales of current BiLMs, we introduce a modular collaborative framework utilizing ensembles of BiLMs as external knowledge repositories to integrate with LLMs. Experiments show that our framework further improves reliability accuracy by up to 29.1%.
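The abstract describes converting each curated "Did You Know..." fact into question-answer pairs, the simplest being cloze prompts. The sketch below illustrates that conversion in a highly simplified form; the function name and the masking scheme are illustrative assumptions, not the paper's actual pipeline, and the real benchmark additionally generates multi-hop questions, which this sketch omits.

```python
def fact_to_cloze(fact: str, answer: str) -> dict:
    """Turn a 'Did You Know...' style fact into a cloze-style QA pair.

    The answer span is masked out of the fact to form the prompt;
    a model that has memorized the fact should fill the blank.
    """
    if answer not in fact:
        raise ValueError("answer span must appear verbatim in the fact")
    # Mask only the first occurrence of the answer span.
    cloze = fact.replace(answer, "____", 1)
    return {"prompt": cloze, "answer": answer}


# Hypothetical example fact (not from the actual WikiDYK data).
pair = fact_to_cloze(
    "the 2024 eruption of Mount Example forced 3,000 residents to evacuate",
    "Mount Example",
)
print(pair["prompt"])
```

A benchmark built this way stays current automatically: as editors add new "Did You Know..." entries, each one can be run through the same conversion to yield fresh test questions.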
Related papers
- Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs [0.0]
Large language models (LLMs) exhibit remarkable capabilities in question answering and reasoning. Updating this knowledge typically requires costly and brittle re-training. We propose a lightweight, agentic framework that incrementally builds a structured, external memory from source documents without requiring re-training.
arXiv Detail & Related papers (2025-06-08T20:13:33Z) - Enhancing LLM Knowledge Learning through Generalization [73.16975077770765]
We show that an LLM's ability to continually predict the same factual knowledge tokens given diverse paraphrased contexts is positively correlated with its capacity to extract that knowledge via question-answering. We propose two strategies to enhance LLMs' ability to predict the same knowledge tokens given varied contexts, thereby enhancing knowledge acquisition.
arXiv Detail & Related papers (2025-03-05T17:56:20Z) - Self-Memory Alignment: Mitigating Factual Hallucinations with Generalized Improvement [37.59724553583446]
Large Language Models (LLMs) often struggle to align their responses with objective facts, resulting in factual hallucinations. We introduce self-memory alignment (SMA), which fine-tunes the model on self-generated responses to precise and simple factual questions. Extensive experiments show that SMA significantly improves LLMs' overall performance, with consistent enhancement across various benchmarks concerning factuality, as well as helpfulness and comprehensive skills.
arXiv Detail & Related papers (2025-02-26T13:34:52Z) - Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs' Memory [15.986679553468989]
Large language models (LLMs) have shown promise as potential knowledge bases. LLMs often struggle with question-answering tasks and are prone to hallucinations. We develop SkipUnsure, a method to improve answer accuracy by leveraging detected but unexpressed knowledge.
arXiv Detail & Related papers (2024-12-30T10:29:18Z) - EvoWiki: Evaluating LLMs on Evolving Knowledge [72.92365627254063]
EvoWiki is an evolving dataset designed to reflect knowledge evolution by categorizing information into stable, evolved, and uncharted states. Our results indicate that current models often struggle with evolved knowledge, frequently providing outdated or incorrect responses. EvoWiki provides a robust benchmark for advancing future research on the knowledge evolution capabilities of large language models.
arXiv Detail & Related papers (2024-12-18T08:04:57Z) - Prompting Large Language Models with Knowledge Graphs for Question Answering Involving Long-tail Facts [50.06633829833144]
Large Language Models (LLMs) are effective in performing various NLP tasks, but struggle to handle tasks that require extensive, real-world knowledge.
We propose a benchmark that requires knowledge of long-tail facts for answering the involved questions.
Our experiments show that LLMs alone struggle with answering these questions, especially when the long-tail level is high or rich knowledge is required.
arXiv Detail & Related papers (2024-05-10T15:10:20Z) - Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z) - FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z) - Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models [46.079902719883414]
We propose Knowledge Card, a modular framework to plug in new factual and relevant knowledge into general-purpose language models.
We first introduce knowledge cards -- specialized language models trained on corpora from specific domains and sources.
We then propose three content selectors to dynamically select and retain information in documents generated by knowledge cards.
arXiv Detail & Related papers (2023-05-17T05:25:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.