DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering
- URL: http://arxiv.org/abs/2601.16478v1
- Date: Fri, 23 Jan 2026 06:19:08 GMT
- Title: DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering
- Authors: Haotian Chen, Qingqing Long, Siyu Pu, Xiao Luo, Wei Ju, Meng Xiao, Yuanchun Zhou, Jianghua Zhao, Xuezhi Wang,
- Abstract summary: We propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning. This work is the first to comprehensively study and empirically validate non-negligible SSLI issues in two-stage RAG frameworks.
- Score: 28.427433335623217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid growth of scientific literature, scientific question answering (SciQA) has become increasingly critical for exploring and utilizing scientific knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating knowledge from external sources, thereby providing credible evidence for scientific question answering. However, existing retrieval and reranking methods remain vulnerable to passages that are semantically similar but logically irrelevant, often reducing factual reliability and amplifying hallucinations. To address this challenge, we propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning, enabling more precise evaluation of candidate passages beyond surface-level semantics. To support systematic evaluation, we construct SciRAG-SSLI (Scientific RAG - Semantically Similar but Logically Irrelevant), a large-scale dataset comprising about 300K SciQA instances across 10 subjects, constructed from a 10M-scale scientific corpus. The dataset combines naturally retrieved contexts with systematically generated distractors to test logical robustness and factual grounding. Comprehensive evaluations confirm that our approach achieves superior retrieval performance compared to leading rerankers. To our knowledge, this work is the first to comprehensively study and empirically validate non-negligible SSLI issues in two-stage RAG frameworks.
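To make the reranking idea concrete, here is a minimal, illustrative Python sketch of a retrieve-then-rerank pipeline whose second stage asks an LLM judge to reason step by step about whether each candidate passage actually supports an answer, so that semantically similar but logically irrelevant (SSLI) passages can be demoted. All names (`Passage`, `rerank_with_reasoning`, `toy_judge`) are hypothetical, the judge is a stand-in for a real LLM call, and this is a sketch of the general technique rather than the authors' DeepEra implementation.

```python
# Illustrative sketch (not the authors' code): a two-stage pipeline where a
# first-stage retriever supplies candidates and a reasoning "judge" filters
# out passages that are topically similar but do not answer the question.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    doc_id: str
    text: str
    retrieval_score: float  # similarity score from the first-stage retriever


def rerank_with_reasoning(
    question: str,
    candidates: List[Passage],
    judge: Callable[[str], str],  # e.g. a wrapper around an LLM chat API
    top_k: int = 3,
) -> List[Passage]:
    """Keep only passages the judge deems logically relevant, then order the
    survivors by their first-stage retrieval score and return the top_k."""
    kept: List[Passage] = []
    for p in candidates:
        prompt = (
            "Question: " + question + "\n"
            "Passage: " + p.text + "\n"
            "Think step by step: does this passage provide evidence that answers "
            "the question, or is it only topically related? "
            "End with VERDICT: RELEVANT or VERDICT: IRRELEVANT."
        )
        verdict = judge(prompt)
        if "VERDICT: RELEVANT" in verdict:
            kept.append(p)
    kept.sort(key=lambda p: p.retrieval_score, reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    # Toy stand-in for an LLM judge: flags a passage as relevant only if it
    # mentions boiling; a real system would call a language model here.
    def toy_judge(prompt: str) -> str:
        passage_text = prompt.split("Passage: ")[1]
        return "VERDICT: RELEVANT" if "boils" in passage_text else "VERDICT: IRRELEVANT"

    question = "What is the boiling point of water at sea level?"
    candidates = [
        Passage("d1", "Water boils at 100 degrees Celsius at sea level.", 0.91),
        Passage("d2", "Water is essential for most known forms of life.", 0.88),  # SSLI-style distractor
    ]
    for p in rerank_with_reasoning(question, candidates, toy_judge):
        print(p.doc_id, p.retrieval_score)  # only d1 survives the reasoning filter
```

Filtering on a binary verdict and then sorting survivors by first-stage score is only one simple policy; a reasoning reranker could equally have the judge emit a graded relevance score and sort on that directly.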
Related papers
- FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights [63.32178443510396]
We introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings. Even the strongest agents achieve limited rediscovery success (50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning.
arXiv Detail & Related papers (2026-02-02T23:21:13Z) - Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows [203.3527268311731]
We present an operational SGI definition grounded in the Practical Inquiry Model (PIM). We operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. Our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
arXiv Detail & Related papers (2025-12-18T12:44:36Z) - SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature [52.36039386997026]
We introduce SciRAG, an open-source framework for scientific literature exploration. It introduces three key innovations: (1) adaptive retrieval that flexibly alternates between sequential and parallel evidence gathering; (2) citation-aware symbolic reasoning that leverages citation graphs to organize and filter documents; and (3) outline-guided synthesis that plans, critiques, and refines answers to ensure coherence and transparent attribution.
arXiv Detail & Related papers (2025-11-18T11:09:19Z) - SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA [35.02813727925432]
Large language models (LLMs) show promise in solving scientific problems and can help generate long-form answers to scientific questions. However, they often suffer from hallucination, especially in the challenging task of long-form scientific question answering.
arXiv Detail & Related papers (2025-09-29T20:07:00Z) - Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning [53.82037883518254]
We introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks.
arXiv Detail & Related papers (2025-08-26T17:04:23Z) - SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs [42.88264147977551]
We present a Scientific Rerank-oriented RAG Benchmark (SciRerankBench) for evaluating rerankers within RAG-LLM systems. To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs.
arXiv Detail & Related papers (2025-08-12T08:36:23Z) - Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [69.10441885629787]
Retrieval-Augmented Generation (RAG) improves the factuality of Large Language Models (LLMs) by injecting external knowledge, but it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective.
arXiv Detail & Related papers (2025-07-13T03:29:41Z) - Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning [6.043212666944194]
We present CLAIM-BENCH, a benchmark for evaluating large language models' capabilities in scientific claim-evidence extraction and validation. We show that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall. Strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs' abilities to accurately link dispersed evidence with claims.
arXiv Detail & Related papers (2025-06-09T21:04:39Z) - ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined. We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z) - WithdrarXiv: A Large-Scale Dataset for Retraction Study [33.782357627001154]
We present WithdrarXiv, the first large-scale dataset of withdrawn papers from arXiv. We develop a comprehensive taxonomy of retraction reasons, identifying 10 distinct categories ranging from critical errors to policy violations. We demonstrate a simple yet highly accurate zero-shot automatic categorization of retraction reasons, achieving a weighted average F1-score of 0.96.
arXiv Detail & Related papers (2024-12-04T23:36:23Z)