ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery
- URL: http://arxiv.org/abs/2601.14176v1
- Date: Tue, 20 Jan 2026 17:27:12 GMT
- Title: ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery
- Authors: Youran Sun, Yixin Wen, Haizhao Yang
- Abstract summary: We introduce ReSearch, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery as an iterative process of intent interpretation, high-recall retrieval, and context-aware ranking. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture. Experiments demonstrate that ReSearch consistently improves recall and ranking performance over baseline methods.
- Score: 6.780086370528623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid expansion of Earth Science data from satellite observations, reanalysis products, and numerical simulations has created a critical bottleneck in scientific discovery, namely identifying relevant datasets for a given research objective. Existing discovery systems are primarily retrieval-centric and struggle to bridge the gap between high-level scientific intent and heterogeneous metadata at scale. We introduce ReSearch, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery as an iterative process of intent interpretation, high-recall retrieval, and context-aware ranking. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture that explicitly separates recall and precision objectives. To enable realistic evaluation, we construct a literature-grounded benchmark by aligning natural language intent with datasets cited in peer-reviewed Earth Science studies. Experiments demonstrate that ReSearch consistently improves recall and ranking performance over baseline methods, particularly for task-based queries expressing abstract scientific goals. These results underscore the importance of intent-aware, multi-stage search as a foundational capability for reproducible and scalable Earth Science research.
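The abstract's separation of a high-recall retrieval stage from a precision-oriented reranking stage can be illustrated with a minimal sketch. This is not the authors' implementation: the abbreviation table, the token-overlap scorer (a stand-in for lexical search), the character-trigram cosine (a stand-in for semantic embeddings), the `score_fn` argument (a placeholder for an LLM reranker), and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical abbreviation table for illustration; not taken from the paper.
ABBREVIATIONS = {
    "sst": "sea surface temperature",
    "ndvi": "normalized difference vegetation index",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_abbreviations(query):
    """Append the long form of every known abbreviation found in the query."""
    extra = [ABBREVIATIONS[t] for t in tokenize(query) if t in ABBREVIATIONS]
    return " ".join([query] + extra)


def lexical_score(query, doc):
    """Fraction of query tokens present in the document (stand-in for lexical search)."""
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q) if q else 0.0


def trigram_vector(text):
    """Count character trigrams of the normalized text."""
    s = " ".join(tokenize(text))
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def semantic_score(query, doc):
    """Cosine similarity of trigram counts (stand-in for embedding similarity)."""
    a, b = trigram_vector(query), trigram_vector(doc)
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_candidates(query, catalog, k):
    """Recall stage: union of the top-k dataset ids under each scorer."""
    ids = set()
    for scorer in (lexical_score, semantic_score):
        ranked = sorted(catalog, key=lambda i: scorer(query, catalog[i]), reverse=True)
        ids.update(ranked[:k])
    return ids


def rerank(query, catalog, candidate_ids, score_fn):
    """Precision stage: order candidates by score_fn (placeholder for an LLM reranker)."""
    # Sort ids first so that ties are broken deterministically.
    return sorted(
        sorted(candidate_ids), key=lambda i: score_fn(query, catalog[i]), reverse=True
    )


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of relevant ids that appear in the first k ranked results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def search(query, catalog, k, score_fn=lexical_score):
    """Run expansion, candidate retrieval, and reranking in sequence."""
    expanded = expand_abbreviations(query)
    candidates = retrieve_candidates(expanded, catalog, k)
    return rerank(expanded, catalog, candidates, score_fn)
```

Calling `search("SST trends", catalog, k=2)` on a small dictionary mapping dataset ids to metadata strings returns the candidate ids in reranked order. Keeping retrieval and reranking as separate functions lets each stage be measured on its own, for example recall on the candidate set and ranking quality on the final list.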
Related papers
- Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts [0.0]
We introduce a literature-driven framework that discovers datasets from citation contexts in scientific papers. Our approach combines large-scale citation-context extraction, schema-guided dataset recognition, and provenance-preserving entity resolution. We release our code, evaluation datasets, and results on GitHub.
arXiv Detail & Related papers (2026-01-08T16:46:06Z)
- Intelligent Scientific Literature Explorer using Machine Learning (ISLE) [0.797970449705065]
This paper presents an integrated system for scientific literature exploration that combines large-scale data acquisition, hybrid retrieval, semantic topic modeling, and heterogeneous knowledge graph construction. The proposed framework contributes a foundation for AI-assisted scientific discovery.
arXiv Detail & Related papers (2025-12-14T16:54:24Z)
- A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z)
- SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM [19.949137890090814]
We propose an advanced topic discovery method enhanced by large language models (LLMs) to improve scientific topic identification. Specifically, we build a textual encoder to capture the content from scientific publications, including metadata, title, and abstract. We then construct a space optimization module that integrates entropy-based sampling and triplet tasks guided by LLMs. Experiments conducted on three real-world datasets demonstrate that SciTopic outperforms the state-of-the-art (SOTA) scientific topic discovery methods.
arXiv Detail & Related papers (2025-08-25T19:47:21Z)
- How good are LLMs at Retrieving Documents in a Specific Domain? [3.282961543904818]
We propose an automated method to curate a domain-specific evaluation dataset to analyze the capability of a search system. We incorporate Retrieval-Augmented Generation (RAG), powered by Large Language Models (LLMs), for high-quality retrieval of environmental domain data using natural language queries.
arXiv Detail & Related papers (2025-08-25T19:47:21Z)
- From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents [96.65646344634524]
Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.
arXiv Detail & Related papers (2025-06-23T17:27:19Z)
- ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research [15.983924435685553]
We develop ScIRGen, a dataset generation framework for scientific QA & retrieval. We use it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets and papers.
arXiv Detail & Related papers (2025-06-09T11:47:13Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Large Language Models for Information Retrieval: A Survey [83.75872593741578]
Information retrieval has evolved from term-based methods to its integration with advanced neural models. Recent research has sought to leverage large language models (LLMs) to improve IR systems. We delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers.
arXiv Detail & Related papers (2023-08-14T12:47:22Z)
- Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
- AutoOD: Automated Outlier Detection via Curiosity-guided Search and Self-imitation Learning [72.99415402575886]
Outlier detection is an important data mining task with numerous practical applications.
We propose AutoOD, an automated outlier detection framework, which aims to search for an optimal neural network model.
Experimental results on various real-world benchmark datasets demonstrate that the deep model identified by AutoOD achieves the best performance.
arXiv Detail & Related papers (2020-06-19T18:57:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.