Related papers: How Effective Is Self-Consistency for Long-Context Problems?

Related papers

When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models [16.12256921806929]
Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks.<n>As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach.<n>We systematically investigate how SFT data length influences LLM behavior on short-context tasks.<n>We find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining.
arXiv Detail & Related papers (2025-09-23T07:55:38Z)
Long-Short Alignment for Effective Long-Context Modeling in LLMs [32.13785291956956]
Large language models (LLMs) have exhibited impressive performance and surprising emergent properties.<n>Length generalization -- the ability to generalize to sequences longer than those seen during training -- is a classical and fundamental problem.<n>We highlight the critical role of textbflong-short alignment -- the consistency of output distributions across sequences of varying lengths.
arXiv Detail & Related papers (2025-06-13T13:25:39Z)
A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces $textbfLongBioBench, a benchmark for evaluating long-context language models.<n>We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results.<n>Our further analysis indicates some design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z)
Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision [40.63870977649693]
Chain-of-Thought prompting has shown promise for multi-step reasoning, but its effectiveness for long-context scenarios remains underexplored. We propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios.
arXiv Detail & Related papers (2025-02-28T07:15:12Z)
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization [49.37607974207405]
LongPO harnesses short-to-long preference data to transfer short-context capabilities to long-context tasks. LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks.
arXiv Detail & Related papers (2025-02-19T17:59:03Z)
What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. Perplexity (PPL) has proven unreliable for assessing long-context capabilities. We propose bfLongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models [62.698520962933195]
Large Vision-Language Models (LVLMs) excel in cross-model tasks but experience performance declines in long-context reasoning. We propose a novel training-free context pruning method that selectively removes less critical textual information.
arXiv Detail & Related papers (2024-10-25T17:59:09Z)
FltLM: An Intergrated Long-Context Large Language Model for Effective Context Filtering and Understanding [32.197113821638936]
We propose a novel integrated Long-Context Large Language Model (FltLM) FltLM incorporates a context filter with a soft mask mechanism, identifying and dynamically excluding irrelevant content to concentrate on pertinent information. Experimental results demonstrate that FltLM significantly outperforms supervised fine-tuning and retrieval-based methods in complex QA scenarios.
arXiv Detail & Related papers (2024-10-09T13:47:50Z)
Long-context Language Models Fail in Basic Retrieval Tasks Without Sufficient Reasoning Steps [21.811552521030137]
Long-context language models (LCLMs) characterized by their extensive context window are becoming popular.<n>Despite the fact that they are nearly perfect at standard long-context retrieval tasks, our evaluations demonstrate they fail in some basic cases.<n>We find they can be well addressed with a sufficient number of reasoning steps, guided by specific CoT prompts.
arXiv Detail & Related papers (2024-10-06T09:29:19Z)
A Controlled Study on Long Context Extension and Generalization in LLMs [85.4758128256142]
Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data.
arXiv Detail & Related papers (2024-09-18T17:53:17Z)
Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack [33.178008350124315]
We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn from a sequence of language tasks through in-context learning (ICL) We introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL.
arXiv Detail & Related papers (2024-07-23T17:57:41Z)
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs) This work provides a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z)
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows. We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA) Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign) It is a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs) With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves it's best performance and comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z)
LooGLE: Can Long-Context Language Models Understand Long Contexts? [46.143956498529796]
LooGLE is a benchmark for large language models' long context understanding. It features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains. The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings.
arXiv Detail & Related papers (2023-11-08T01:45:37Z)
Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering [4.1372815372396525]
This paper proposes a method called textitSelective Context that employs self-information to filter out less informative content. We demonstrate the effectiveness of our approach on tasks of summarisation and question answering across different data sources.
arXiv Detail & Related papers (2023-04-24T13:55:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.