Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
- URL: http://arxiv.org/abs/2402.14848v2
- Date: Wed, 10 Jul 2024 17:01:37 GMT
- Title: Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
- Authors: Mosh Levy, Alon Jacoby, Yoav Goldberg
- Abstract summary: This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs).
We isolate the effect of input length using multiple versions of the same sample, each being extended with padding of different lengths, types and locations.
We show that the degradation trend appears in every version of our dataset, although at different intensities.
- Score: 48.35385912526338
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite LLMs' advancements in recent times, their performance consistency across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each extended with padding of different lengths, types and locations. Our findings show a notable degradation in LLMs' reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that the traditional metric of next-word prediction correlates negatively with the performance of LLMs on our reasoning dataset. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.
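The length-isolation setup described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function name, the filler text, and the sample are hypothetical, chosen only to show how one QA sample is expanded with task-irrelevant padding of a chosen length and location so that input length is the only variable between versions.

```python
# Sketch: build multiple versions of the same QA sample, each padded
# with irrelevant filler to a target length, at a chosen location.

def pad_sample(question: str, context: str, filler: str,
               target_words: int, location: str = "prefix") -> str:
    """Return the prompt padded with repeated `filler` text until it
    reaches roughly `target_words` words, with the padding placed
    before ('prefix') or after ('suffix') the original sample."""
    base = f"{context}\nQuestion: {question}"
    need = max(0, target_words - len(base.split()))
    words = filler.split()
    padding = " ".join((words * (need // len(words) + 1))[:need])
    if not padding:
        return base
    return f"{padding}\n{base}" if location == "prefix" else f"{base}\n{padding}"

# Four versions of one sample: two target lengths x two padding locations.
variants = [pad_sample("Is 7 prime?", "Consider the number 7.",
                       "This sentence is irrelevant filler.", n, loc)
            for n in (50, 200) for loc in ("prefix", "suffix")]
```

Comparing model accuracy across such variants attributes any change to length alone, since the task-relevant content is identical in every version.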
Related papers
- Evaluating Large Language Models in Vulnerability Detection Under Variable Context Windows [17.088307683654577]
This study examines the impact of tokenized Java code length on the accuracy and explicitness of ten major LLMs in vulnerability detection.
We found inconsistencies across models: some, like GPT-4, Mistral, and Mixtral, showed robustness, while others exhibited a significant link between tokenized length and performance.
arXiv Detail & Related papers (2025-01-30T20:44:46Z)
- DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities [2.541099732820352]
This study expands on previous NIAH studies by ablating NIAH features beyond typical context length.
We find stark differences between GPT-3.5 and LLaMA 2-7B's performance on DENIAHL, and drops in recall performance when features like item size are increased.
arXiv Detail & Related papers (2024-11-28T20:14:47Z)
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning.
Perplexity (PPL) has proven unreliable for assessing long-context capabilities.
We propose bfLongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
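For context on the critique above: standard perplexity is the exponentiated mean negative log-likelihood of the observed tokens. A minimal sketch of that baseline metric follows (the key-token selection of LongPPL itself is not specified here and is not reproduced):

```python
import math

def perplexity(token_logprobs):
    """Corpus perplexity: exp of the mean negative log-likelihood
    of the observed tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Four tokens, each assigned probability 0.25 by the model:
ppl = perplexity([math.log(0.25)] * 4)  # -> 4.0
```

Because this average weights every token equally, a handful of hard-to-predict key tokens can be swamped by many trivially predictable ones, which is the weakness the contrastive key-token approach targets.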
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
- Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models [62.698520962933195]
Large Vision-Language Models (LVLMs) excel in cross-modal tasks but experience performance declines in long-context reasoning.
We propose a novel training-free context pruning method that selectively removes less critical textual information.
arXiv Detail & Related papers (2024-10-25T17:59:09Z)
- Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models [59.970391602080205]
This study investigates whether such constraints on the generation space impact LLMs' abilities, including reasoning and domain knowledge comprehension.
We evaluate LLMs' performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks.
We find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.
arXiv Detail & Related papers (2024-08-05T13:08:24Z)
- Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost [4.5650448738679605]
Large language models (LLMs) can solve challenging question-answering tasks.
Prompt engineering techniques, such as chain-of-thought (CoT), have gained attention for enhancing the explanation and correctness of outputs.
arXiv Detail & Related papers (2024-07-29T09:21:52Z)
- Adapting Large Language Models for Document-Level Machine Translation [46.370862171452444]
Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks.
Recent research indicates that moderately-sized LLMs often outperform larger ones after task-specific fine-tuning.
This study focuses on adapting LLMs for document-level machine translation (DocMT) for specific language pairs.
arXiv Detail & Related papers (2024-01-12T09:29:13Z)
- MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models [17.322480769274062]
Large language models (LLMs) have shown nearly saturated performance on many natural language processing (NLP) tasks.
This paper constructs Multiple Sensitive Factors Time QA (MenatQA), with a total of 2,853 samples, for evaluating the temporal comprehension and reasoning abilities of LLMs.
arXiv Detail & Related papers (2023-10-08T13:19:52Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal a language model's comprehensive grasp of language, in particular its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We further examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives are all significant.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.