ScienceDB AI: An LLM-Driven Agentic Recommender System for Large-Scale Scientific Data Sharing Services
- URL: http://arxiv.org/abs/2601.01118v1
- Date: Sat, 03 Jan 2026 08:42:53 GMT
- Title: ScienceDB AI: An LLM-Driven Agentic Recommender System for Large-Scale Scientific Data Sharing Services
- Authors: Qingqing Long, Haotian Chen, Chenyang Zhao, Xiaolei Du, Xuezhi Wang, Pengyao Wang, Chengzan Li, Yuanchun Zhou, Hengshu Zhu,
- Abstract summary: We present ScienceDB AI, a novel agentic recommender system developed on Science Data Bank (ScienceDB)<n>ScienceDB AI leverages natural language conversations and deep reasoning to accurately recommend datasets aligned with researchers' scientific intents.<n>The Trustworthy RAG provides citable references via Citable Task Record (CSTR) identifiers, enhancing recommendation and trustworthiness.
- Score: 36.35068691076956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of AI for Science (AI4S) has underscored the significance of scientific datasets, leading to the establishment of numerous national scientific data centers and sharing platforms. Despite this progress, efficiently promoting dataset sharing and utilization for scientific research remains challenging. Scientific datasets contain intricate domain-specific knowledge and contexts, rendering traditional collaborative filtering-based recommenders inadequate. Recent advances in Large Language Models (LLMs) offer unprecedented opportunities to build conversational agents capable of deep semantic understanding and personalized recommendations. In response, we present ScienceDB AI, a novel LLM-driven agentic recommender system developed on Science Data Bank (ScienceDB), one of the largest global scientific data-sharing platforms. ScienceDB AI leverages natural language conversations and deep reasoning to accurately recommend datasets aligned with researchers' scientific intents and evolving requirements. The system introduces several innovations: a Scientific Intention Perceptor to extract structured experimental elements from complicated queries, a Structured Memory Compressor to manage multi-turn dialogues effectively, and a Trustworthy Retrieval-Augmented Generation (Trustworthy RAG) framework. The Trustworthy RAG employs a two-stage retrieval mechanism and provides citable dataset references via Citable Scientific Task Record (CSTR) identifiers, enhancing recommendation trustworthiness and reproducibility. Through extensive offline and online experiments using over 10 million real-world datasets, ScienceDB AI has demonstrated significant effectiveness. To our knowledge, ScienceDB AI is the first LLM-driven conversational recommender tailored explicitly for large-scale scientific dataset sharing services. The platform is publicly accessible at: https://ai.scidb.cn/en.
Related papers
- SciDER: Scientific Data-centric End-to-end Researcher [9.796056249900785]
SciDER is a data-centric end-to-end system that automates the research lifecycle.<n>Unlike traditional frameworks, our specialized agents collaboratively parse and analyze raw scientific data.<n>We also provide easy-to-use PyPI packages with a lightweight web interface to accelerate autonomous, data-driven research.
arXiv Detail & Related papers (2026-03-02T03:53:20Z) - A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research.<n>This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate.<n>We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z) - MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning [32.21228080662089]
We present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level textbooks.<n>We introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances.<n>Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths.
arXiv Detail & Related papers (2025-07-22T17:59:03Z) - Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team [53.38438460574943]
IDVSCI is a multi-agent framework built on large language models (LLMs)<n>It incorporates two key innovations: a Dynamic Knowledge Exchange mechanism and a Dual-Diversity Review paradigm.<n>Results show that IDVSCI consistently achieves the best performance across two datasets.
arXiv Detail & Related papers (2025-06-23T07:12:08Z) - Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System [62.832818186789545]
Virtual Scientists (VirSci) is a multi-agent system designed to mimic the teamwork inherent in scientific research.<n>VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas.<n>We show that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas.
arXiv Detail & Related papers (2024-10-12T07:16:22Z) - Pennsieve: A Collaborative Platform for Translational Neuroscience and Beyond [0.5130659559809153]
Pennsieve is an open-source, cloud-based scientific data management platform.
It supports complex multimodal datasets and provides tools for data visualization and analyses.
Pennsieve stores over 125 TB of scientific data, with 35 TB of data publicly available across more than 350 high-impact datasets.
arXiv Detail & Related papers (2024-09-16T17:55:58Z) - Modeling Information Change in Science Communication with Semantically
Matched Paraphrases [50.67030449927206]
SPICED is the first paraphrase dataset of scientific findings annotated for degree of information change.
SPICED contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers.
Models trained on SPICED improve downstream performance on evidence retrieval for fact checking of real-world scientific claims.
arXiv Detail & Related papers (2022-10-24T07:44:38Z) - DeepShovel: An Online Collaborative Platform for Data Extraction in
Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
arXiv Detail & Related papers (2022-02-21T12:18:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.