LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
- URL: http://arxiv.org/abs/2505.07897v2
- Date: Mon, 16 Jun 2025 19:55:19 GMT
- Title: LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
- Authors: Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, Tatsunori Hashimoto
- Abstract summary: We identify code comprehension and repair as a natural testbed and challenge task for long-context models. We introduce LongCodeBench, a benchmark to test LLM coding abilities in long-context scenarios. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet.
- Score: 32.93947506522558
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks -- not only due to the cost of collecting million-context tasks but also in identifying realistic scenarios that require significant contexts. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce LongCodeBench (LCB), a benchmark to test LLM coding abilities in long-context scenarios. Our benchmark tests both the comprehension and repair capabilities of LCLMs in realistic and important settings by drawing from real-world GitHub issues and constructing QA (LongCodeQA) and bug fixing (LongSWE-Bench) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across different scales -- ranging from Qwen2.5 14B Instruct to Google's flagship Gemini model. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet, or from 70.2% to 40% for Qwen2.5.
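The construction described in the abstract (drawing on real GitHub repositories and issues to build million-token comprehension tasks) can be illustrated with a minimal sketch. This is not the authors' pipeline: the file selection order, the rough 4-characters-per-token heuristic in place of a real tokenizer, and the `build_long_context_prompt` helper are all assumptions made here for illustration only.

```python
# Hypothetical sketch of a LongCodeQA-style prompt builder (not the LCB pipeline):
# concatenate repository source files up to an approximate token budget, then
# append a question taken from a GitHub issue thread.
from pathlib import Path

APPROX_CHARS_PER_TOKEN = 4  # crude heuristic; a real benchmark would use a tokenizer


def build_long_context_prompt(repo_dir: str, question: str, budget_tokens: int = 1_000_000) -> str:
    """Concatenate .py files from repo_dir until the approximate token budget is reached."""
    budget_chars = budget_tokens * APPROX_CHARS_PER_TOKEN
    parts, used = [], 0
    for path in sorted(Path(repo_dir).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        header = f"\n### File: {path}\n"
        if used + len(header) + len(text) > budget_chars:
            break  # stop once the next file would exceed the budget
        parts.append(header + text)
        used += len(header) + len(text)
    context = "".join(parts)
    return f"{context}\n\n### Question (from a GitHub issue):\n{question}\n### Answer:"


if __name__ == "__main__":
    # Placeholder path and question; replace with a real repository checkout.
    prompt = build_long_context_prompt(
        "path/to/repo",
        "Why does the cache lookup raise a KeyError when the cache is empty?",
        budget_tokens=100_000,
    )
    print(f"Prompt length: ~{len(prompt) // APPROX_CHARS_PER_TOKEN} tokens")
```

Stratifying task complexity, as the benchmark does across model scales, would then amount to varying the token budget and the amount of distractor code included around the files relevant to the issue.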
Related papers
- 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability? [28.694112253150983]
Real-task-based long-context evaluation benchmarks have two major shortcomings. Benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability. We introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities.
arXiv Detail & Related papers (2025-05-25T19:58:31Z)
- MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly [55.14191042936519]
Long-context vision-language models (LCVLMs) are capable of handling hundreds of images with interleaved text tokens in a single forward pass. MMLongBench is the first benchmark covering a diverse set of long-context vision-language tasks.
arXiv Detail & Related papers (2025-05-15T17:52:54Z)
- LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks [74.96182906307654]
This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint.
arXiv Detail & Related papers (2024-12-19T18:59:17Z)
- ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities [53.97515452727115]
ChatQA 2 is a Llama3-based model with a 128K context window. We present a training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens. We find that the performance of strong long-context LLMs using RAG improves when retrieving a larger number of chunks.
arXiv Detail & Related papers (2024-07-19T17:35:47Z)
- NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context? [43.98513461616172]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
- CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models [45.892014195594314]
We present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs.
CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating models with context window sizes from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels.
arXiv Detail & Related papers (2024-03-06T07:43:43Z)
- BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved dramatic proficiency on NLP tasks of normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z)
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
arXiv Detail & Related papers (2023-08-28T11:53:40Z)