Time's Up! An Empirical Study of LLM Reasoning Ability Under Output Length Constraint
- URL: http://arxiv.org/abs/2504.14350v2
- Date: Tue, 22 Apr 2025 13:31:25 GMT
- Title: Time's Up! An Empirical Study of LLM Reasoning Ability Under Output Length Constraint
- Authors: Yi Sun, Han Wang, Jiaqiang Li, Jiacheng Liu, Xiangyu Li, Hao Wen, Huiwen Zheng, Yan Liang, Yuanchun Li, Yunxin Liu
- Abstract summary: We investigate whether and how the reasoning abilities of Large Language Models (LLMs) remain effective under real-world latency constraints. Specifically, we test more than 25 LLMs on common reasoning datasets under a wide range of output length budgets. The results demonstrate several interesting findings regarding budget-aware LLM reasoning.
- Score: 20.685932824324446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has demonstrated the remarkable potential of Large Language Models (LLMs) in test-time scaling. By making the models think before answering, they are able to achieve much higher accuracy with extra inference computation. However, in many real-world scenarios, models are used under time constraints, where an answer should be given to the user within a certain output length. It is unclear whether and how the reasoning abilities of LLMs remain effective under such constraints. We take a first look at this problem by conducting an in-depth empirical study. Specifically, we test more than 25 LLMs on common reasoning datasets under a wide range of output length budgets, and we analyze the correlation between the inference accuracy and various properties including model type, model size, prompt style, etc. We also consider the mappings between the token budgets and the actual on-device latency budgets. The results have demonstrated several interesting findings regarding the budget-aware LLM reasoning that differ from the unconstrained situation, e.g. the optimal choices of model sizes and prompts change under different budgets. These findings offer practical guidance for users to deploy LLMs under real-world latency constraints.
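To make the token-budget setup concrete, here is a minimal sketch (not the paper's code) of how an on-device latency budget could be mapped to an output token budget and enforced as a hard length cap at generation time. It assumes a HuggingFace-style `generate` API and a separately measured decode throughput; the model name, helper function, and numbers are illustrative placeholders.

```python
# Minimal sketch: convert a latency budget into an output token budget and
# enforce it with max_new_tokens. Decode throughput must be measured on the
# target device beforehand; the values below are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

def latency_to_token_budget(latency_budget_s: float, decode_tokens_per_s: float) -> int:
    """Map a wall-clock latency budget to an output token budget."""
    return max(1, int(latency_budget_s * decode_tokens_per_s))

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# e.g. a 2-second answer deadline on a device decoding roughly 40 tokens/second
token_budget = latency_to_token_budget(latency_budget_s=2.0, decode_tokens_per_s=40.0)

prompt = "Q: A train travels 60 km in 40 minutes. What is its speed in km/h? Think briefly, then answer."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=token_budget)  # hard output-length cap
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Under such a cap, a long chain of thought may be cut off before the final answer is produced, which is one reason the optimal choices of model size and prompt style can change with the budget, as the abstract notes.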
Related papers
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs [52.405085773954596]
We find that large language models (LLMs) tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones.
This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately.
Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy.
arXiv Detail & Related papers (2025-04-30T18:48:06Z) - Cost-Optimal Grouped-Query Attention for Long-Context LLMs [64.90662568387683]
Building effective Transformer-based large language models (LLMs) has recently become a research focus. We compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. Our studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs.
arXiv Detail & Related papers (2025-03-12T17:50:42Z) - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks. However, we explore whether scaling with longer CoTs can instead impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z) - Large Language Models Can Self-Improve in Long-context Reasoning [100.52886241070907]
Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning.
We propose an approach specifically designed for this purpose.
Our approach achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models.
arXiv Detail & Related papers (2024-11-12T19:53:00Z) - Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification? [2.1861408994125253]
Large Language Models (LLM) have recently shown promising performance in temporal reasoning tasks.
Recent studies have tested LLMs' performance in detecting temporal relations using closed-source models only.
arXiv Detail & Related papers (2024-10-14T13:10:45Z) - Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning [20.066249913943405]
Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors.
We introduce novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios.
Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks.
arXiv Detail & Related papers (2024-06-13T14:31:19Z) - Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models [48.35385912526338]
This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs).
We isolate the effect of input length using multiple versions of the same sample, each being extended with padding of different lengths, types and locations.
We show that the degradation trend appears in every version of our dataset, although at different intensities.
arXiv Detail & Related papers (2024-02-19T16:04:53Z) - MenatQA: A New Dataset for Testing the Temporal Comprehension and
Reasoning Abilities of Large Language Models [17.322480769274062]
Large language models (LLMs) have shown nearly saturated performance on many natural language processing (NLP) tasks.
This paper constructs Multiple Sensitive Factors Time QA (MenatQA), with a total of 2,853 samples, for evaluating the time comprehension and reasoning abilities of LLMs.
arXiv Detail & Related papers (2023-10-08T13:19:52Z) - Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs).
We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models achieve close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)