SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs
- URL: http://arxiv.org/abs/2508.08742v2
- Date: Wed, 24 Sep 2025 07:37:00 GMT
- Title: SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs
- Authors: Haotian Chen, Qingqing Long, Meng Xiao, Xiao Luo, Wei Ju, Chengrui Wang, Xuezhi Wang, Yuanchun Zhou, Hengshu Zhu,
- Abstract summary: We present a Scientific Rerank-oriented RAG Benchmark (SciRerankBench) for evaluating rerankers within RAG-LLMs systems.<n>To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs.
- Score: 42.88264147977551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scientific literature question answering is a pivotal step towards new scientific discoveries. Recently, \textit{two-stage} retrieval-augmented generated large language models (RAG-LLMs) have shown impressive advancements in this domain. Such a two-stage framework, especially the second stage (reranker), is particularly essential in the scientific domain, where subtle differences in terminology may have a greatly negative impact on the final factual-oriented or knowledge-intensive answers. Despite this significant progress, the potential and limitations of these works remain unexplored. In this work, we present a Scientific Rerank-oriented RAG Benchmark (SciRerankBench), for evaluating rerankers within RAG-LLMs systems, spanning five scientific subjects. To rigorously assess the reranker performance in terms of noise resilience, relevance disambiguation, and factual consistency, we develop three types of question-context-answer (Q-C-A) pairs, i.e., Noisy Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI), and Counterfactual Contexts (CC). Through systematic evaluation of 13 widely used rerankers on five families of LLMs, we provide detailed insights into their relative strengths and limitations. To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs, which provides valuable observations and guidance for their future development.
Related papers
- DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering [28.427433335623217]
We propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning.<n>This work is the first to comprehensively study and empirically validate innegligible SSLI issues in two-stage RAG frameworks.
arXiv Detail & Related papers (2026-01-23T06:19:08Z) - Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents [52.14392337070763]
We introduce CFG-Bench, a new benchmark designed to systematically evaluate fine-grained action intelligence.<n>CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities.<n>Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions.
arXiv Detail & Related papers (2025-11-24T02:02:29Z) - Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [69.10441885629787]
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge.<n>It falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts.<n>This survey synthesizes both strands under a unified reasoning-retrieval perspective.
arXiv Detail & Related papers (2025-07-13T03:29:41Z) - LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models [20.800445482814958]
Large Language Models (LLMs) have gained interest for their potential to leverage embedded scientific knowledge for hypothesis generation.<n>Existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery.<n>In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains.<n>Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR- Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning
arXiv Detail & Related papers (2025-04-14T17:00:13Z) - On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems [5.69361786082969]
Retrieval-augmented generation (RAG) has emerged as an approach to augment large language models (LLMs)<n>We evaluate various context sizes, BM25 and semantic search as retrievers, and eight base LLMs.<n>Our findings indicate that final QA performance improves steadily with up to 15 snippets but stagnates or declines beyond that.
arXiv Detail & Related papers (2025-02-20T17:34:34Z) - LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation [28.61326111959728]
Large language models (LLMs) exhibit significant biases in evaluation tasks, particularly in preferentially rating and favoring self-generated content.
Our study addresses this knowledge gap by simulating two critical phases of the retrieval-augmented generation (RAG) framework.
Contrary to previous findings, our results reveal no significant self-preference effect in RAG frameworks.
arXiv Detail & Related papers (2024-10-28T08:32:09Z) - Trustworthiness in Retrieval-Augmented Generation Systems: A Survey [59.26328612791924]
Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs)
We propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy.
arXiv Detail & Related papers (2024-09-16T09:06:44Z) - Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy [66.95501113584541]
Utility and topical relevance are critical measures in information retrieval.
We propose an Iterative utiliTy judgmEnt fraMework to promote each step of the cycle of Retrieval-Augmented Generation.
arXiv Detail & Related papers (2024-06-17T07:52:42Z) - The Power of Noise: Redefining Retrieval for RAG Systems [19.387105120040157]
Retrieval-Augmented Generation (RAG) has emerged as a method to extend beyond the pre-trained knowledge of Large Language Models.
We focus on the type of passages IR systems within a RAG solution should retrieve.
arXiv Detail & Related papers (2024-01-26T14:14:59Z) - Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.