HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights
- URL: http://arxiv.org/abs/2505.04846v1
- Date: Wed, 07 May 2025 22:50:23 GMT
- Title: HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights
- Authors: Ozan Gokdemir, Carlo Siebenschuh, Alexander Brace, Azton Wells, Brian Hsu, Kyle Hippe, Priyanka V. Setty, Aswathy Ajith, J. Gregory Pauloski, Varuni Sastry, Sam Foreman, Huihuo Zheng, Heng Ma, Bharat Kale, Nicholas Chia, Thomas Gibbs, Michael E. Papka, Thomas Brettin, Francis J. Alexander, Anima Anandkumar, Ian Foster, Rick Stevens, Venkatram Vishwanath, Arvind Ramanathan,
- Abstract summary: HiPerRAG is a workflow to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and two new benchmarks introduced in this work.
- Score: 72.82973609312178
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The volume of scientific literature is growing exponentially, leading to underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration. Retrieval Augmented Generation (RAG) offers a way to assist scientists by improving the factuality of Large Language Models (LLMs) in processing this influx of information. However, scaling RAG to handle millions of articles introduces significant challenges, including the high computational costs associated with parsing documents and embedding scientific knowledge, as well as the algorithmic complexity of aligning these representations with the nuanced semantics of scientific content. To address these issues, we introduce HiPerRAG, a RAG workflow powered by high performance computing (HPC) to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval accuracy by using contrastive learning and late-interaction techniques. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and two new benchmarks introduced in this work, achieving 90% accuracy on SciQ and 76% on PubMedQA, outperforming both domain-specific models like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million-document-scale RAG workflows for unifying scientific knowledge and fostering interdisciplinary innovation.
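The abstract names two of ColTrast's ingredients: contrastive learning and late-interaction retrieval. The scoring side of late interaction can be sketched in the ColBERT style, where each query token is matched against its best document token and the per-token maxima are summed. This is a minimal illustrative sketch, not HiPerRAG's actual implementation; all function names, shapes, and the toy data are assumptions.

```python
import numpy as np

def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim scoring (illustrative, not HiPerRAG's API).

    query_vecs: (n_query_tokens, dim) L2-normalized token embeddings
    doc_vecs:   (n_doc_tokens, dim)  L2-normalized token embeddings
    """
    sim = query_vecs @ doc_vecs.T        # (n_q, n_d) cosine similarities
    return float(sim.max(axis=1).sum())  # best document match per query token

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(4, 8)))  # 4 query tokens, 8-dim embeddings

# One document contains slightly perturbed copies of the query tokens,
# the other is random noise.
d_relevant = normalize(np.vstack([q + 0.05 * rng.normal(size=q.shape),
                                  rng.normal(size=(4, 8))]))
d_random = normalize(rng.normal(size=(8, 8)))

# The document echoing the query tokens should score higher.
assert late_interaction_score(q, d_relevant) > late_interaction_score(q, d_random)
```

Because scoring decomposes over query tokens, document token embeddings can be precomputed and indexed offline, which is what makes late interaction attractive at the multi-million-document scale the paper targets.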
Related papers
- HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery [50.8841471967624]
HiSciBench is a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow. HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines.
arXiv Detail & Related papers (2025-12-28T12:08:05Z) - AInsteinBench: Benchmarking Coding Agents on Scientific Repositories [33.48206557020983]
AInsteinBench is a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents. AInsteinBench measures a model's ability to move beyond surface-level code generation toward the core competencies required for computational scientific research.
arXiv Detail & Related papers (2025-12-24T08:11:11Z) - Executable Knowledge Graphs for Replicating AI Research [65.41207324831583]
Executable Knowledge Graphs (xKG) is a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. Code will be released at https://github.com/zjunlp/xKG.
arXiv Detail & Related papers (2025-10-20T17:53:23Z) - TusoAI: Agentic Optimization for Scientific Methods [16.268579802762247]
Large language models (LLMs) have demonstrated strong capabilities in synthesizing literature, reasoning with empirical data, and generating domain-specific code. Here, we introduce TusoAI, an agentic AI system that takes a scientific task description with an evaluation function. TusoAI integrates domain knowledge into a knowledge tree representation and performs iterative, domain-specific optimization and model diagnosis.
arXiv Detail & Related papers (2025-09-28T17:30:44Z) - SciGPT: A Large Language Model for Scientific Literature Understanding and Knowledge Discovery [3.779883844533933]
This paper presents SciGPT, a domain-adapted model for scientific literature understanding, and ScienceBench, an open source benchmark tailored to evaluate scientific LLMs. Built on the Qwen3 architecture, SciGPT incorporates three key innovations: (1) low-cost domain distillation via a two-stage pipeline to balance performance and efficiency; (2) a Sparse Mixture-of-Experts attention mechanism that cuts memory consumption by 55% for 32,000-token long-context reasoning; and (3) knowledge-aware adaptation integrating domain-specific nuances. Experimental results on ScienceBench show that SciGPT outperforms GPT-4o in core scientific tasks.
arXiv Detail & Related papers (2025-09-09T16:09:19Z) - ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research [15.983924435685553]
We develop ScIRGen, a dataset generation framework for scientific QA & retrieval. We use it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets and papers.
arXiv Detail & Related papers (2025-06-09T11:47:13Z) - ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows [82.07367406991678]
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing. Among these, computer-using agents are capable of interacting with operating systems as humans do. We introduce ScienceBoard, which encompasses a realistic, multi-domain environment featuring dynamic and visually rich scientific software.
arXiv Detail & Related papers (2025-05-26T12:27:27Z) - Benchmarking Retrieval-Augmented Generation for Chemistry [28.592844362931853]
Retrieval-augmented generation is a framework for enhancing large language models with external knowledge. ChemRAG-Bench is a benchmark designed to assess the effectiveness of RAG across a diverse set of chemistry-related tasks. ChemRAG-Toolkit is a modular toolkit that supports five retrieval algorithms and eight LLMs.
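A modular RAG toolkit of the kind the abstract describes typically separates the retriever from the generator behind a common interface, so retrieval algorithms and LLMs can be swapped independently. The sketch below illustrates that separation with a toy keyword retriever and a stub "generator" that only assembles a prompt; all class and function names are hypothetical, not ChemRAG-Toolkit's actual API.

```python
from typing import Protocol

class Retriever(Protocol):
    """Any retrieval algorithm the pipeline can plug in."""
    def retrieve(self, query: str, k: int) -> list[str]: ...

class KeywordRetriever:
    """Toy retriever: ranks documents by query-term overlap."""
    def __init__(self, docs: list[str]):
        self.docs = docs

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        terms = set(query.lower().split())
        scored = sorted(self.docs,
                        key=lambda d: len(terms & set(d.lower().split())),
                        reverse=True)
        return scored[:k]

def answer(query: str, retriever: Retriever) -> str:
    """Retrieve-then-generate; 'generation' here is just prompt assembly."""
    context = "\n".join(retriever.retrieve(query, k=2))
    return f"Context:\n{context}\nQuestion: {query}"

docs = ["benzene is an aromatic hydrocarbon",
        "sodium chloride is table salt",
        "aromatic rings obey Hueckel's rule"]
print(answer("what makes benzene aromatic", KeywordRetriever(docs)))
```

Swapping in a different `Retriever` implementation (dense, sparse, or hybrid) or a different generator leaves the pipeline code unchanged, which is the point of the modular design.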
arXiv Detail & Related papers (2025-05-12T15:34:45Z) - Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents [11.74019905854637]
Large language models (LLMs) are evolving into scientific agents that automate critical tasks. Unlike general-purpose LLMs, specialized agents integrate domain-specific knowledge, advanced tool sets, and robust validation mechanisms. We highlight why they differ from general agents and the ways in which they advance research across various scientific fields.
arXiv Detail & Related papers (2025-03-31T13:11:28Z) - GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation [84.41557981816077]
We introduce GFM-RAG, a novel graph foundation model (GFM) for retrieval augmented generation. GFM-RAG is powered by an innovative graph neural network that reasons over graph structure to capture complex query-knowledge relationships. It achieves state-of-the-art performance while maintaining efficiency and alignment with neural scaling laws.
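The core idea of reasoning over graph structure, as the GFM-RAG summary puts it, is that relevance can propagate along edges from entities mentioned in the query to related entities. The toy below spreads a seed score over a hand-built knowledge graph via repeated adjacency multiplications, a crude stand-in for GNN message passing; the graph, entity names, and damping scheme are all illustrative assumptions, not GFM-RAG's method.

```python
import numpy as np

# Toy knowledge graph over four entities, encoded as a directed adjacency matrix.
entities = ["aspirin", "cox1", "inflammation", "ibuprofen"]
adj = np.array([[0, 1, 0, 0],   # aspirin   -> inhibits -> cox1
                [0, 0, 1, 0],   # cox1      -> mediates -> inflammation
                [0, 0, 0, 0],   # inflammation (no outgoing edges)
                [0, 1, 0, 0]])  # ibuprofen -> inhibits -> cox1

def propagate(seed_scores, adj, steps=2, damping=0.5):
    """Spread relevance one hop per step along incoming edges."""
    scores = seed_scores.astype(float)
    for _ in range(steps):
        scores = scores + damping * (adj.T @ scores)
    return scores

seed = np.array([1.0, 0.0, 0.0, 0.0])  # the query mentions "aspirin"
scores = propagate(seed, adj)

# Two-hop neighbors of the seed now carry nonzero relevance.
ranked = [entities[i] for i in np.argsort(-scores)]
```

After two steps, "inflammation" (two hops from "aspirin") receives a nonzero score while "ibuprofen" (no incoming path from the seed) stays at zero; a learned GNN replaces this fixed update with trained message and aggregation functions.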
arXiv Detail & Related papers (2025-02-03T07:04:29Z) - Large Language Models for Scholarly Ontology Generation: An Extensive Analysis in the Engineering Field [0.0]
This paper offers an analysis of the ability of large language models to identify semantic relationships between different research topics. We developed a gold standard based on the IEEE Thesaurus to evaluate the task. Several models have achieved outstanding results, including Mixtral-8x7B, Dolphin-Mistral, and Claude 3-7B.
arXiv Detail & Related papers (2024-12-11T10:11:41Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z) - The Efficiency Spectrum of Large Language Models: An Algorithmic Survey [54.19942426544731]
The rapid growth of Large Language Models (LLMs) has been a driving force in transforming various domains.
This paper examines the multi-faceted dimensions of efficiency essential for the end-to-end algorithmic development of LLMs.
arXiv Detail & Related papers (2023-12-01T16:00:25Z) - Automated Creation and Human-assisted Curation of Computable Scientific Models from Code and Text [2.3746609573239756]
Domain experts cannot gain a complete understanding of the implementation of a scientific model if they are not familiar with the code.
We develop a system for the automated creation and human-assisted curation of scientific models.
We present experimental results obtained using a dataset of code and associated text derived from NASA's Hypersonic Aerodynamics website.
arXiv Detail & Related papers (2022-01-28T17:31:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.