WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives
- URL: http://arxiv.org/abs/2510.05336v1
- Date: Mon, 06 Oct 2025 19:58:42 GMT
- Title: WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives
- Authors: Yongan Yu, Xianda Du, Qingchen Hu, Jiahao Liang, Jingwei Ni, Dan Qiang, Kaiyu Huang, Grant McKenzie, Renee Sieber, Fengran Mo
- Abstract summary: We introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system's ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models can classify societal vulnerability and resilience indicators from extreme weather narratives.
- Score: 15.620758706846388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable for climate scientists to understand societal responses. However, their vast scale, noisy digitized quality, and archaic language make it difficult to transform them into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system's ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems from archival contexts. The constructed dataset and evaluation framework are publicly available at https://anonymous.4open.science/r/WeatherArchive-Bench/.
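As a rough illustration of what the WeatherArchive-Retrieval task measures, the sketch below ranks a few passages with a sparse retriever and checks whether the relevant one lands in the top k. It assumes BM25 as the sparse scorer and recall@k as the metric; the abstract names neither, and the passages, the query, and the function names `bm25_rank` and `recall_at_k` are invented for illustration rather than taken from the benchmark.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real archival text would need OCR cleanup.
    return text.lower().split()


def bm25_rank(query, passages, k1=1.5, b=0.75):
    """Return passage indices sorted by BM25 score, best first."""
    docs = [tokenize(p) for p in passages]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def recall_at_k(ranking, relevant, k):
    """Fraction of relevant passage indices found in the top k."""
    hits = sum(1 for i in ranking[:k] if i in relevant)
    return hits / len(relevant)


# Hypothetical passages written for this example, not drawn from the dataset.
passages = [
    "the great freshet carried away the mill bridge and flooded the lower town",
    "the council met on tuesday to discuss the new market hall",
    "a severe frost destroyed the orchards and farmers fear a poor harvest",
]
ranking = bm25_rank("freshet flooded the town", passages)
print(ranking[0], recall_at_k(ranking, {0}, k=1))
```

A lexical scorer like this one matches the archaic term "freshet" only because the query repeats it verbatim. A query phrased as "river flood" would miss that passage, which is the kind of vocabulary gap a benchmark over historical text is likely to expose.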
Related papers
- AI Decodes Historical Chinese Archives to Reveal Lost Climate History [82.46757587387704]
We introduce a generative AI framework that inverts the logic of historical chroniclers by inferring the quantitative climate patterns associated with documented events. Applied to historical Chinese archives, it produces a sub-annual precipitation reconstruction for southeastern China over the period 1368-1911 AD. Our reconstruction not only quantifies iconic extremes like the Ming Dynasty's Great Drought but also, crucially, maps the full spatial and seasonal structure of El Niño influence on precipitation in this region over five centuries.
arXiv Detail & Related papers (2026-01-30T02:06:13Z)
- Hierarchical AI-Meteorologist: LLM-Agent System for Multi-Scale and Explainable Weather Forecast Reporting [3.0354231393746685]
Hierarchical AI-Meteorologist generates explainable weather reports using hierarchical forecast reasoning and weather keyword generation. Our framework performs multi-scale reasoning across hourly, 6-hour, and daily aggregations to capture both short-term dynamics and long-term trends.
arXiv Detail & Related papers (2025-11-28T17:27:06Z)
- Zephyrus: An Agentic Framework for Weather Science [47.611521052984365]
Foundation models for weather science are pre-trained on vast amounts of structured numerical data and outperform traditional weather forecasting systems. Large language models (LLMs) excel at understanding and generating text but cannot reason about high-dimensional meteorological datasets. We bridge this gap by building a novel agentic framework for weather science. We design Zephyrus, a multi-turn LLM-based weather agent that iteratively analyzes weather datasets, observes results, and refines its approach through conversational feedback loops.
arXiv Detail & Related papers (2025-10-05T03:34:08Z)
- Respecting Temporal-Causal Consistency: Entity-Event Knowledge Graphs for Retrieval-Augmented Generation [69.45495166424642]
We develop a robust and discriminative QA benchmark to measure temporal, causal, and character consistency understanding in narrative documents. We then introduce Entity-Event RAG (E2RAG), a dual-graph framework that keeps separate entity and event subgraphs linked by a bipartite mapping. Across ChronoQA, our approach outperforms state-of-the-art unstructured and KG-based RAG baselines, with notable gains on causal and character consistency queries.
arXiv Detail & Related papers (2025-06-06T10:07:21Z)
- WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models [3.9711303420034443]
WXImpactBench is the first benchmark for evaluating the capacity of large language models (LLMs) on disruptive weather impacts. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.
arXiv Detail & Related papers (2025-05-26T17:23:29Z)
- Adversarial Observations in Weather Forecasting [11.130455392128072]
We present a novel attack on autoregressive diffusion models, such as those used in Google's GenCast. The attack introduces subtle perturbations into weather observations that are statistically indistinguishable from natural noise. Our findings highlight a critical security risk with the potential to cause large-scale disruptions and undermine public trust in weather prediction.
arXiv Detail & Related papers (2025-04-22T14:38:13Z)
- ClimateBench-M: A Multi-Modal Climate Data Benchmark with a Simple Generative Method [61.76389719956301]
We contribute a multi-modal climate benchmark, i.e., ClimateBench-M, which aligns time series climate data from ERA5, extreme weather events data from NOAA, and satellite image data from NASA. Under each data modality, we also propose a simple but strong generative method that could produce competitive performance in weather forecasting, thunderstorm alerts, and crop segmentation tasks.
arXiv Detail & Related papers (2025-04-10T02:22:23Z)
- WeatherGFM: Learning A Weather Generalist Foundation Model via In-context Learning [69.82211470647349]
We introduce the first generalist weather foundation model (WeatherGFM). It addresses a wide spectrum of weather understanding tasks in a unified manner. Our model can effectively handle up to ten weather understanding tasks, including weather forecasting, super-resolution, weather image translation, and post-processing.
arXiv Detail & Related papers (2024-11-08T09:14:19Z)
- Generating Fine-Grained Causality in Climate Time Series Data for Forecasting and Anomaly Detection [67.40407388422514]
We design a conceptual fine-grained causal model named TBN Granger Causality.
Second, we propose an end-to-end deep generative model called TacSas, which discovers TBN Granger Causality in a generative manner.
We test TacSas on climate benchmark ERA5 for climate forecasting and the extreme weather benchmark of NOAA for extreme weather alerts.
arXiv Detail & Related papers (2024-08-08T06:47:21Z)
- WeatherQA: Can Multimodal Language Models Reason about Severe Weather? [45.43764278625153]
Severe convective weather events, such as hail, tornadoes, and thunderstorms, often occur quickly yet cause significant damage, costing billions of dollars every year.
This highlights the importance of forecasting severe weather threats hours in advance to better prepare meteorologists and residents in at-risk areas.
We introduce WeatherQA, the first multimodal dataset designed for machines to reason about complex combinations of weather parameters and predict severe weather in real-world scenarios.
arXiv Detail & Related papers (2024-06-17T05:23:18Z)
- Federated Prompt Learning for Weather Foundation Models on Devices [37.88417074427373]
On-device intelligence for weather forecasting uses local deep learning models to analyze weather patterns without centralized cloud computing.
This paper proposes Federated Prompt Learning for Weather Foundation Models on Devices (FedPoD).
FedPoD enables devices to obtain highly customized models while maintaining communication efficiency.
arXiv Detail & Related papers (2023-05-23T16:59:20Z)
- Robustness of Object Detectors in Degrading Weather Conditions [7.91378990016322]
State-of-the-art object detection systems for autonomous driving achieve promising results in clear weather conditions.
These systems need to work in degrading weather conditions, such as rain, fog and snow.
Most approaches evaluate only on the KITTI dataset, which consists only of clear weather scenes.
arXiv Detail & Related papers (2021-06-16T13:56:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.