Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles
- URL: http://arxiv.org/abs/2509.21028v1
- Date: Thu, 25 Sep 2025 11:36:09 GMT
- Title: Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles
- Authors: Miao Li, Alexander Gurung, Irina Saparina, Mirella Lapata
- Abstract summary: SciTrek is a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. Our analysis reveals systematic shortcomings in models' abilities to perform basic numerical operations and accurately locate specific information in long contexts.
- Score: 81.89404347890662
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces SciTrek, a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. Current long-context benchmarks often rely on non-scientific texts, focus on simple information retrieval tasks, or employ artificial contexts. SciTrek addresses these limitations by proposing complex questions that require information aggregation and synthesis across multiple full-text scientific articles. Questions and their ground-truth answers are automatically generated by formulating them as SQL queries over a database constructed from article metadata (titles, authors, and references). The SQL operations provide explicit, verifiable reasoning steps for fine-grained error analysis, and the construction process scales to contexts up to 1M tokens with minimal supervision. Extensive experiments on a diverse set of open-weight and proprietary LLMs demonstrate that SciTrek poses a significant challenge as the context length increases, with supervised fine-tuning and reinforcement learning offering only limited gains. Our analysis reveals systematic shortcomings in models' abilities to perform basic numerical operations and accurately locate specific information in long contexts.
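To make the construction concrete, below is a minimal, hypothetical sketch of how a citation-counting question and its ground-truth answer could be expressed as a SQL query over an article-metadata database, in the spirit the abstract describes. The schema, table and column names, and toy rows are illustrative assumptions, not the benchmark's actual format.

```python
# Hypothetical sketch of a SciTrek-style question grounded in SQL over
# article metadata (titles, authors, references). All names and data here
# are illustrative assumptions, not taken from the released benchmark.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE authors  (article_id INTEGER, name TEXT);
CREATE TABLE refs     (citing_id INTEGER, cited_title TEXT);
""")

# Toy metadata standing in for a set of full-text articles.
cur.executemany("INSERT INTO articles VALUES (?, ?)",
                [(1, "Paper A"), (2, "Paper B"), (3, "Paper C")])
cur.executemany("INSERT INTO authors VALUES (?, ?)",
                [(1, "Smith"), (2, "Jones"), (3, "Jones")])
cur.executemany("INSERT INTO refs VALUES (?, ?)",
                [(2, "Paper A"), (3, "Paper A"), (3, "Paper B")])

# Question: "Which author's papers are cited most often across the
# given articles?" The ground-truth answer is derived by joining the
# reference list back to authorship and counting.
cur.execute("""
SELECT au.name, COUNT(*) AS citations
FROM refs r
JOIN articles a  ON a.title = r.cited_title
JOIN authors au  ON au.article_id = a.id
GROUP BY au.name
ORDER BY citations DESC
LIMIT 1;
""")
print(cur.fetchone())  # ('Smith', 2)
```

In this reading, each SQL operation (the joins, GROUP BY, and ORDER BY) corresponds to one of the explicit, verifiable reasoning steps the abstract mentions, against which a model's free-text answer over the raw articles can be checked.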
Related papers
- CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning [48.56088080889236]
We introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. We show that finetuning on our synthesized data effectively enhances an LLM's general long-context reasoning capabilities. Our findings indicate that memory-augmented agentic architectures offer a more robust alternative.
arXiv Detail & Related papers (2026-01-21T12:52:30Z)
- Tagging-Augmented Generation: Assisting Language Models in Finding Intricate Knowledge In Long Contexts [6.14987671066697]
In this paper, we propose Tagging-Augmented Generation (TAG) as a lightweight data augmentation strategy for long-context scenarios. We validate our hypothesis by augmenting two challenging and directly relevant question-answering benchmarks -- NoLima and NovelQA. We show that tagging the context or even just adding tag definitions into QA prompts leads to consistent performance gains over the baseline.
arXiv Detail & Related papers (2025-10-27T03:23:25Z)
- Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
We introduce the Arranged and Organized Extraction (AOE) benchmark, designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
- LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations. This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z)
- Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study [0.9424565541639368]
We introduce a new benchmark consisting of a curated dataset and a defined evaluation process to assess the compositional reasoning capabilities of large language models within the chemistry domain. Our approach integrates OpenAI reasoning models with named entity recognition (NER) systems to extract chemical entities from recent literature, which are then augmented with external knowledge bases to form a knowledge graph. Our experiments reveal that even state-of-the-art models face significant challenges in multi-hop compositional reasoning.
arXiv Detail & Related papers (2025-04-23T04:36:19Z)
- Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning [103.65680870130839]
We investigate how to design instruction data for the post-training phase of a long-context pre-trained model. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones. Based on these findings, we propose context synthesis, a novel data synthesis framework.
arXiv Detail & Related papers (2025-02-21T17:02:40Z)
- NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities [51.07379913779232]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales.
A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing less supervised data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.