Not All Needles Are Found: How Fact Distribution and "Don't Make It Up" Prompts Shape Literal Extraction, Logical Inference, and Hallucination Risks in Long-Context LLMs
- URL: http://arxiv.org/abs/2601.02023v1
- Date: Mon, 05 Jan 2026 11:30:56 GMT
- Title: Not All Needles Are Found: How Fact Distribution and "Don't Make It Up" Prompts Shape Literal Extraction, Logical Inference, and Hallucination Risks in Long-Context LLMs
- Authors: Amirali Ebrahimzadeh, Seyyed M. Salili
- Abstract summary: Large language models (LLMs) increasingly support very long input contexts, yet it remains unclear how reliably they extract and infer information at scale. We study how fact placement, corpus-level fact distributions, and "Don't Make It Up" prompts influence model behavior.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) increasingly support very long input contexts. Yet it remains unclear how reliably they extract and infer information at scale. Performance varies with context length and strongly interacts with how information is distributed in real-world corpora. Motivated by these observations, we study how fact placement, corpus-level fact distributions, and "Don't Make It Up" prompts influence model behavior. We introduce an extended needle-in-a-haystack benchmark across four production-scale models: Gemini-2.5-flash, ChatGPT-5-mini, Claude-4.5-haiku, and Deepseek-v3.2-chat. Unlike prior work, we separately evaluate literal extraction, logical inference, and hallucination risk. Our study considers both positional effects and realistic distributions of evidence across long contexts, as well as prompts that explicitly discourage fabrication. We find that longer contexts alone do not guarantee better performance and can be detrimental when relevant evidence is diluted or widely dispersed. Performance varies substantially across models: some show severe degradation under realistic conditions, while others remain more robust at longer context lengths. Anti-hallucination (AH) instructions can make some models overly conservative, sharply reducing accuracy in literal extraction and logical inference. While we do not directly compare retrieval-augmented generation (RAG) and cache-augmented generation (CAG), our results suggest many failures stem from ineffective context utilization. Models often struggle to identify and prioritize relevant information even when it is present. These findings have direct practical implications, as enterprise workflows increasingly involve pasting large volumes of unfiltered documents into LLM prompts. Effective context length and model-specific robustness to long contexts are therefore critical for reliable LLM deployment in research and business.
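The listing carries no code, so the sketch below illustrates the kind of harness the abstract describes: facts placed at controlled depths in filler text, with an optional anti-hallucination instruction prepended. All names and wording here (FILLER, build_haystack, make_prompt, the AH_INSTRUCTION text) are hypothetical stand-ins, not the paper's implementation.

```python
# Hypothetical sketch of an extended needle-in-a-haystack (NIAH) setup:
# facts are scattered into filler text at chosen depths, with an optional
# "Don't Make It Up" style instruction prepended. Illustrative only.

FILLER = "The afternoon passed without anything of note happening. "

AH_INSTRUCTION = (  # example wording; the paper's exact prompt may differ
    "Answer only from the provided context. If the answer is not in the "
    "context, reply 'I don't know' rather than guessing."
)

def build_haystack(facts, depths, total_chars=50_000):
    """Place each fact at a target depth (0.0 = start, 1.0 = end) of the filler."""
    haystack = (FILLER * (total_chars // len(FILLER) + 1))[:total_chars]
    # Insert deepest-first so earlier insertions do not shift later offsets.
    for fact, depth in sorted(zip(facts, depths), key=lambda p: -p[1]):
        pos = int(depth * len(haystack))
        haystack = haystack[:pos] + " " + fact + " " + haystack[pos:]
    return haystack

def make_prompt(facts, depths, question, anti_hallucination=False):
    prefix = AH_INSTRUCTION + "\n\n" if anti_hallucination else ""
    return f"{prefix}Context:\n{build_haystack(facts, depths)}\n\nQuestion: {question}"

# One literal-extraction needle buried mid-context; dispersing several
# related facts across depths would instead probe logical inference.
prompt = make_prompt(
    facts=["Dr. Ibsen's badge number is 7421."],
    depths=[0.5],
    question="What is Dr. Ibsen's badge number?",
    anti_hallucination=True,
)
```

Running the same prompt with `anti_hallucination` toggled on and off is one way to surface the over-conservatism the abstract reports for AH instructions.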
Related papers
- Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness [6.250095470690937]
Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent. We evaluate both factual knowledge and the impact of evidence placement across varying context lengths.
arXiv Detail & Related papers (2026-02-15T08:15:13Z)
- Context Length Alone Hurts LLM Performance Despite Perfect Retrieval [29.523005523787244]
Large language models (LLMs) often fail to scale their performance on long-context tasks in line with the context lengths they support. This paper presents findings that longer input alone can degrade performance, even when the model retrieves the relevant information perfectly.
arXiv Detail & Related papers (2025-10-06T21:17:13Z)
- LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization [9.410181019585822]
We operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish among them. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context.
arXiv Detail & Related papers (2025-10-05T03:14:05Z)
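As a rough illustration of the model-internals approach summarized in the LLM Microscope entry above, the sketch below fits a linear probe on hidden-state features to predict answer correctness. The random arrays are placeholders for real activations and labels; this is not the paper's pipeline.

```python
# Hypothetical linear probe: predict whether the model's answer was correct
# from hidden-state features. Random arrays stand in for real activations
# (e.g., mean-pooled last-layer states) and real correctness labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))    # placeholder hidden-state features
y = rng.integers(0, 2, size=500)   # 1 = answer correct, 0 = incorrect

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out probe accuracy: {probe.score(X_te, y_te):.2f}")  # ~chance on random data
```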
- Positional Biases Shift as Inputs Approach Context Window Limits [57.00239097102958]
The lost-in-the-middle (LiM) effect is strongest when inputs occupy up to 50% of a model's context window. We further observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input.
arXiv Detail & Related papers (2025-08-10T20:40:24Z)
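A distance-based positional bias like the one above is typically measured by sweeping the needle's depth and tabulating accuracy per position. The sketch below shows that loop; `ask` is a hypothetical callable wrapping whatever model is under test, and the filler text is invented.

```python
# Hypothetical positional sweep: re-ask the same question with the needle
# placed at increasing depths and record whether the expected answer appears.
def positional_sweep(ask, fact, question, answer,
                     depths=(0.0, 0.25, 0.5, 0.75, 1.0), total_chars=20_000):
    """`ask` maps a prompt string to a model's answer string (stand-in)."""
    filler = "Nothing of note happened on this page. "
    results = {}
    for d in depths:
        hay = (filler * (total_chars // len(filler) + 1))[:total_chars]
        pos = int(d * len(hay))
        prompt = f"Context:\n{hay[:pos]} {fact} {hay[pos:]}\n\nQuestion: {question}"
        results[d] = answer.lower() in ask(prompt).lower()
    return results

# A depth-1.0 run succeeding while depth-0.5 fails would be consistent with
# the end-of-input advantage reported above.
```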
- A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces LongBioBench, a benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results. Our further analysis indicates that some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, can undermine their validity.
arXiv Detail & Related papers (2025-06-03T14:23:06Z)
- Towards Long Context Hallucination Detection [49.195854802543714]
Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, they are prone to contextual hallucination, generating information that is either unsubstantiated by or contradictory to the given context. We propose a novel architecture that enables pre-trained encoder models, such as BERT, to process long contexts and effectively detect contextual hallucinations.
arXiv Detail & Related papers (2025-04-28T03:47:05Z)
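Encoders such as BERT are limited to roughly 512 tokens, so adapting them to long contexts typically starts from overlapping windows. The sketch below shows only that chunking step; `score_support` is a hypothetical chunk-vs-claim scorer, not the architecture the paper above proposes.

```python
# Hypothetical building block for long-context hallucination detection:
# split the context into overlapping token windows, then check whether
# any window supports a generated claim.
def sliding_chunks(tokens, window=512, stride=256):
    """Split a long token sequence into overlapping fixed-size windows."""
    if len(tokens) <= window:
        return [tokens]
    return [tokens[i:i + window]
            for i in range(0, len(tokens) - window + stride, stride)]

def is_grounded(tokens, claim, score_support, threshold=0.5):
    """Treat a claim as grounded if any chunk supports it above `threshold`."""
    return max(score_support(c, claim) for c in sliding_chunks(tokens)) >= threshold
```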
- ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models [75.05436691700572]
We introduce ExpliCa, a new dataset for evaluating Large Language Models (LLMs) in explicit causal reasoning. We tested seven commercial and open-source LLMs on ExpliCa through prompting and perplexity-based metrics. Surprisingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events.
arXiv Detail & Related papers (2025-02-21T14:23:14Z)
- CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels [11.614599448394374]
We introduce CNNSum, a multi-scale long-context summarization benchmark based on Chinese novels. CNNSum features human-driven annotations across four subsets totaling 695 samples, with lengths ranging from 16k to 128k tokens. We benchmark numerous LLMs and conduct detailed human assessments to summarize abnormal output types.
arXiv Detail & Related papers (2024-12-03T20:35:57Z)
- NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities [51.07379913779232]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
- FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from tool-enhanced ChatGPT and a LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)