Long-Range Tasks Using Short-Context LLMs: Incremental Reasoning With Structured Memories
- URL: http://arxiv.org/abs/2412.18914v1
- Date: Wed, 25 Dec 2024 14:14:31 GMT
- Title: Long-Range Tasks Using Short-Context LLMs: Incremental Reasoning With Structured Memories
- Authors: Dulhan Jayalath, James Bradley Wendt, Nicholas Monath, Sandeep Tata, Beliz Gunel
- Abstract summary: We present PRISM, which alleviates these concerns by processing information as a stream of chunks while maintaining a structured in-context memory. This approach demonstrates superior performance to baselines on diverse tasks while using at least 4x smaller contexts. It achieves up to 54% cost reduction when compared to alternative short-context approaches.
- Score: 12.133230897181594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-range tasks require reasoning over long inputs. Existing solutions either need large compute budgets, training data, access to model weights, or use complex, task-specific approaches. We present PRISM, which alleviates these concerns by processing information as a stream of chunks, maintaining a structured in-context memory specified by a typed hierarchy schema. This approach demonstrates superior performance to baselines on diverse tasks while using at least 4x smaller contexts than long-context models. Moreover, PRISM is token-efficient. By producing short outputs and efficiently leveraging key-value (KV) caches, it achieves up to 54% cost reduction when compared to alternative short-context approaches. The method also scales down to tiny information chunks (e.g., 500 tokens) without increasing the number of tokens encoded or sacrificing quality. Furthermore, we show that it is possible to generate schemas to generalize our approach to new tasks with minimal effort.
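The incremental pattern described in the abstract can be sketched briefly. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the schema contents, the `call_llm` helper, the chunking heuristic, and the prompt wording are all hypothetical stand-ins for whatever short-context model API and task schema are actually used.

```python
# Minimal sketch of PRISM-style incremental reasoning (not the authors' code).
# Assumptions: `call_llm` is a hypothetical wrapper around any short-context
# completion API, and the schema/prompts below are illustrative placeholders.
import json

# A typed hierarchy schema constraining what the in-context memory may contain.
MEMORY_SCHEMA = {
    "characters": [{"name": "str", "role": "str"}],
    "key_events": ["str"],
    "open_questions": ["str"],
}

def call_llm(prompt: str) -> str:
    """Hypothetical short-context LLM call; replace with a real client."""
    raise NotImplementedError

def chunk(text: str, size: int = 500) -> list[str]:
    """Split the long input into small pieces (roughly 'size' words each)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def answer_long_range_task(document: str, question: str) -> str:
    memory = {key: [] for key in MEMORY_SCHEMA}  # structured memory, starts empty
    for piece in chunk(document):
        # Each call sees only the schema, the current memory, and one chunk,
        # so every prompt stays far shorter than the full document.
        prompt = (
            f"Schema: {json.dumps(MEMORY_SCHEMA)}\n"
            f"Current memory: {json.dumps(memory)}\n"
            f"New chunk: {piece}\n"
            "Update the memory with information relevant to the task. "
            "Return only valid JSON matching the schema."
        )
        memory = json.loads(call_llm(prompt))
    # The final answer is produced from the compact memory, not the raw document.
    return call_llm(f"Memory: {json.dumps(memory)}\nQuestion: {question}\nAnswer:")
```

Because each step reuses a stable prompt structure and emits only a short structured update, this kind of loop keeps per-chunk prompts small; that is consistent with the abstract's claims about short outputs, KV-cache reuse, and scaling down to tiny chunks (e.g., 500 tokens).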
Related papers
- Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models [11.012474205717178]
Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question-answering and retrieval-augmented generation. This paper introduces a novel semantic caching approach for storing and reusing contextual summaries. Our method reduces redundant computations by up to 50-60% while maintaining answer accuracy comparable to full document processing.
arXiv Detail & Related papers (2025-05-16T14:04:31Z) - Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation [15.975325252309554]
We introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of Large Language Models.
Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data.
We demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench.
arXiv Detail & Related papers (2025-04-17T04:46:57Z) - Cost-Optimal Grouped-Query Attention for Long-Context LLMs [64.90662568387683]
Building effective Transformer-based large language models (LLMs) has recently become a research focus.
We compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost.
Our studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs.
arXiv Detail & Related papers (2025-03-12T17:50:42Z) - WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale [86.25450054683172]
WildLong extracts meta-information from real user queries to produce scalable data.
It supports multi-document reasoning, such as cross-document comparison and aggregation.
It surpasses existing open-source long-context-optimized models across benchmarks.
arXiv Detail & Related papers (2025-02-23T18:59:09Z) - An Effective Framework to Help Large Language Models Handle Numeric-involved Long-context Tasks [0.0]
Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long texts. However, their performance significantly degrades on numerical calculations in long-context settings. We propose a workflow that decomposes a numeric-involved long-context task into 4 low-level subtasks. Results on 2 numeric-involved long-context benchmarks demonstrate that our workflow not only improves accuracy but also significantly reduces the cost of API calls.
arXiv Detail & Related papers (2024-11-15T12:39:02Z) - KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign)
It is a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs)
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z) - LLoCO: Learning Long Contexts Offline [63.3458260335454]
We propose LLoCO, a novel approach to processing long contexts.
LLoCO learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA.
Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens.
arXiv Detail & Related papers (2024-04-11T17:57:22Z) - OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without significantly degrading task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z) - Instance-wise Prompt Tuning for Pretrained Language Models [72.74916121511662]
Instance-wise Prompt Tuning (IPT) is the first prompt learning paradigm that injects knowledge from the input data instances into the prompts.
IPT significantly outperforms task-based prompt learning methods, and achieves comparable performance to conventional finetuning with only 0.5% - 1.5% of tuned parameters.
arXiv Detail & Related papers (2022-06-04T10:08:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.