LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
- URL: http://arxiv.org/abs/2505.07897v2
- Date: Mon, 16 Jun 2025 19:55:19 GMT
- Title: LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
- Authors: Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, Tatsunori Hashimoto
- Abstract summary: We identify code comprehension and repair as a natural testbed and challenge task for long-context models. We introduce LongCodeBench, a benchmark to test LLM coding abilities in long-context scenarios. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet.
- Score: 32.93947506522558
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks -- not only due to the cost of collecting million-context tasks but also in identifying realistic scenarios that require significant contexts. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce LongCodeBench (LCB), a benchmark to test LLM coding abilities in long-context scenarios. Our benchmark tests both the comprehension and repair capabilities of LCLMs in realistic and important settings by drawing from real-world GitHub issues and constructing QA (LongCodeQA) and bug fixing (LongSWE-Bench) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across different scales -- ranging from Qwen2.5 14B Instruct to Google's flagship Gemini model. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet, or from 70.2% to 40% for Qwen2.5.
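The construction described in the abstract (drawing on real GitHub repositories and issues to build million-token comprehension tasks) can be illustrated with a minimal sketch. This is not the authors' pipeline: the file selection order, the rough 4-characters-per-token heuristic in place of a real tokenizer, and the `build_long_context_prompt` helper are all assumptions made here for illustration only.

```python
# Hypothetical sketch of a LongCodeQA-style prompt builder (not the LCB pipeline):
# concatenate repository source files up to an approximate token budget, then
# append a question taken from a GitHub issue thread.
from pathlib import Path

APPROX_CHARS_PER_TOKEN = 4  # crude heuristic; a real benchmark would use a tokenizer


def build_long_context_prompt(repo_dir: str, question: str, budget_tokens: int = 1_000_000) -> str:
    """Concatenate .py files from repo_dir until the approximate token budget is reached."""
    budget_chars = budget_tokens * APPROX_CHARS_PER_TOKEN
    parts, used = [], 0
    for path in sorted(Path(repo_dir).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        header = f"\n### File: {path}\n"
        if used + len(header) + len(text) > budget_chars:
            break  # stop once the next file would exceed the budget
        parts.append(header + text)
        used += len(header) + len(text)
    context = "".join(parts)
    return f"{context}\n\n### Question (from a GitHub issue):\n{question}\n### Answer:"


if __name__ == "__main__":
    # Placeholder path and question; replace with a real repository checkout.
    prompt = build_long_context_prompt(
        "path/to/repo",
        "Why does the cache lookup raise a KeyError when the cache is empty?",
        budget_tokens=100_000,
    )
    print(f"Prompt length: ~{len(prompt) // APPROX_CHARS_PER_TOKEN} tokens")
```

Stratifying task complexity, as the benchmark does across model scales, would then amount to varying the token budget and the amount of distractor code included around the files relevant to the issue.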
Related papers
- 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability? [28.694112253150983]
Real-task-based long-context evaluation benchmarks have two major shortcomings. Benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability. We introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities.
arXiv Detail & Related papers (2025-05-25T19:58:31Z)
- MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly [55.14191042936519]
Long-context vision-language models (LCVLMs) are capable of handling hundreds of images with interleaved text tokens in a single forward pass. MMLongBench is the first benchmark covering a diverse set of long-context vision-language tasks.
arXiv Detail & Related papers (2025-05-15T17:52:54Z)
- LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks [74.96182906307654]
This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint.
arXiv Detail & Related papers (2024-12-19T18:59:17Z)
- ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities [53.97515452727115]
ChatQA 2 is a Llama3-based model with a 128K context window. We present a training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens. We find that the performance of strong long-context LLMs using RAG improves when retrieving a larger number of chunks.
arXiv Detail & Related papers (2024-07-19T17:35:47Z)
- NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context? [43.98513461616172]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
- CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models [45.892014195594314]
We present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs.
CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating models with context window sizes from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels.
arXiv Detail & Related papers (2024-03-06T07:43:43Z)
- BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved dramatic proficiency on NLP tasks of normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z)
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
arXiv Detail & Related papers (2023-08-28T11:53:40Z)