PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading
- URL: http://arxiv.org/abs/2510.22242v1
- Date: Sat, 25 Oct 2025 10:11:29 GMT
- Title: PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading
- Authors: Yutao Wu, Xiao Liu, Yunhao Feng, Jiale Ding, Xingjun Ma,
- Abstract summary: Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability in scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark that systematically evaluates LLMs across four key research tasks. We find consistent reliability failures: citation retrieval fails in 48-98% of multi-reference queries, section-specific content extraction fails in 72-91% of cases, and topical paper discovery yields F1 scores below 0.32, missing over 60% of relevant literature.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability in scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark that systematically evaluates LLMs across four key research tasks: citation retrieval, content extraction, paper discovery, and claim verification. We evaluate GPT-4o, GPT-5, and Gemini-2.5-Flash under realistic usage conditions: via web interfaces where search operations are opaque to the user. Through controlled experiments, we find consistent reliability failures: citation retrieval fails in 48-98% of multi-reference queries, section-specific content extraction fails in 72-91% of cases, and topical paper discovery yields F1 scores below 0.32, missing over 60% of relevant literature. Further human analysis attributes these failures to the uncontrolled expansion of retrieved context and the tendency of LLMs to prioritize semantically relevant text over task instructions. Across basic tasks, the LLMs display distinct failure behaviors: ChatGPT often withholds responses rather than risk errors, whereas Gemini produces fluent but fabricated answers. To address these issues, we develop lightweight reliability classifiers trained on PaperAsk data to identify unreliable outputs. PaperAsk provides a reproducible and diagnostic framework for advancing the reliability evaluation of LLM-based scholarly assistance systems.
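The abstract mentions lightweight reliability classifiers trained on PaperAsk data to flag unreliable outputs. A minimal sketch of that idea is shown below; the training examples, the bag-of-words features, and the Naive Bayes model are all illustrative assumptions, since the paper's actual classifier architecture and features are not described here.

```python
# Minimal sketch: a bag-of-words Naive Bayes classifier that labels an
# LLM output as "reliable" or "unreliable". Purely illustrative; the
# real PaperAsk classifiers and features may differ.
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: list of (text, label), label in {"reliable", "unreliable"}."""
    counts = {"reliable": Counter(), "unreliable": Counter()}
    priors = Counter()
    for text, label in examples:
        priors[label] += 1
        counts[label].update(tokenize(text))
    return counts, priors

def classify(counts, priors, text):
    vocab = set(counts["reliable"]) | set(counts["unreliable"])
    total = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label in counts:
        # log prior + log likelihood with add-one smoothing
        score = math.log(priors[label] / total)
        denom = sum(counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            score += math.log((counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set: fabricated examples standing in for PaperAsk annotations.
data = [
    ("the cited paper appears in the arxiv listing with matching title", "reliable"),
    ("verified against the official abstract and author list", "reliable"),
    ("i could not find this paper but here is a plausible citation", "unreliable"),
    ("the model invented a doi that does not resolve", "unreliable"),
]
counts, priors = train(data)
print(classify(counts, priors, "a plausible citation that does not resolve"))
```

In practice such a classifier would be trained on annotated model outputs from the benchmark rather than a handful of hand-written sentences, but the pipeline shape (featurize the output, score it against labeled examples) is the same.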
Related papers
- SAGE: Benchmarking and Improving Retrieval for Deep Research Agents [60.53966065867568]
We introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000-paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries.
arXiv Detail & Related papers (2026-02-05T18:25:24Z)
- OpenNovelty: An LLM-powered Agentic System for Verifiable Scholarly Novelty Assessment [63.662126457336534]
OpenNovelty is an agentic system for transparent, evidence-based novelty analysis. It grounds all assessments in retrieved real papers, ensuring verifiable judgments. OpenNovelty aims to empower the research community with a scalable tool that promotes fair, consistent, and evidence-backed peer review.
arXiv Detail & Related papers (2026-01-04T15:48:51Z)
- FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers [10.04850395402571]
The identification and localization of errors is a core task in peer review. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks. Despite the growing use of LLMs in review systems, their capabilities to pinpoint errors remain underexplored.
arXiv Detail & Related papers (2025-11-26T19:19:44Z)
- LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews [2.2175470459999636]
We identify problems with the use of traditional metrics for assessing the performance of Gen-AI tools for identifying relevant literature in systematic reviews. Major weaknesses include the failure to use metrics that are robust to imbalanced data and that directly indicate whether results are better than chance. On the positive side, we extract good (evaluation) practices on which our recommendations for researchers, practitioners, and policymakers are built.
arXiv Detail & Related papers (2025-11-16T15:04:50Z)
- ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding [49.67493845115009]
ELAIPBench is a benchmark curated by domain experts to evaluate large language models' comprehension of AI research papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance.
arXiv Detail & Related papers (2025-10-12T11:11:20Z)
- Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation [46.697788643450785]
Large language models (LLMs) have been found to produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies.
arXiv Detail & Related papers (2025-10-09T08:22:24Z)
- Document Attribution: Examining Citation Relationships using Large Language Models [62.46146670035751]
We propose a zero-shot approach that frames attribution as a straightforward textual entailment task. We also explore the role of the attention mechanism in enhancing the attribution process.
arXiv Detail & Related papers (2025-05-09T04:40:11Z)
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
- DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems [99.17123445211115]
We introduce DocBench, a benchmark to evaluate large language model (LLM)-based document reading systems.
Our benchmark involves the recruitment of human annotators and the generation of synthetic questions.
It includes 229 real documents and 1,102 questions, spanning across five different domains and four major types of questions.
arXiv Detail & Related papers (2024-07-15T13:17:42Z)
- SIFiD: Reassess Summary Factual Inconsistency Detection with LLM [27.392514180175283]
This study reassesses summary inconsistency detection with Large Language Models (LLMs).
We propose SIFiD (Summary Inconsistency Detection with Filtered Document), which identifies key sentences within documents by either employing natural language inference or measuring semantic similarity between summaries and documents.
arXiv Detail & Related papers (2024-03-12T11:41:51Z)
- TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness [58.721012475577716]
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications.
This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLM's response aligns with its intrinsic knowledge.
arXiv Detail & Related papers (2024-02-19T21:12:14Z)