WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models
- URL: http://arxiv.org/abs/2505.20249v1
- Date: Mon, 26 May 2025 17:23:29 GMT
- Title: WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models
- Authors: Yongan Yu, Qingchen Hu, Xianda Du, Jiayin Wang, Fengran Mo, Renee Sieber
- Abstract summary: WXImpactBench is the first benchmark for evaluating the capacity of large language models (LLMs) on disruptive weather impacts. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.
- Score: 3.9711303420034443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Climate change adaptation requires the understanding of disruptive weather impacts on society, where large language models (LLMs) might be applicable. However, their effectiveness is under-explored due to the difficulty of high-quality corpus collection and the lack of available benchmarks. The climate-related events stored in regional newspapers record how communities adapted and recovered from disasters, yet processing the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a well-crafted four-stage construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks: multi-label classification and ranking-based question answering. Extensive experiments on a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.
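As a rough illustration of how the multi-label classification task can be scored, the sketch below computes micro-F1 over hypothetical impact categories. The label names and the predict_impacts() helper are assumptions for illustration only, not WXImpactBench's actual label set or prompting setup.

```python
# Minimal sketch of a multi-label weather-impact evaluation loop.
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical impact categories, for illustration only.
LABELS = ["infrastructure", "transportation", "public health", "agriculture"]

def predict_impacts(article: str) -> list[str]:
    """Placeholder for an LLM call that tags an article with impact labels."""
    return ["infrastructure"]  # stub: replace with a real model call

articles = ["A blizzard shut down rail lines and damaged power poles."]
gold = [["infrastructure", "transportation"]]

mlb = MultiLabelBinarizer(classes=LABELS)
y_true = mlb.fit_transform(gold)
y_pred = mlb.transform([predict_impacts(a) for a in articles])

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
```

Swapping the stub for a real model call turns this into a working harness for the classification half of such a benchmark.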
Related papers
- Climate-Eval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change [2.8680187920555635]
Climate-Eval aggregates existing datasets along with a newly developed news classification dataset. This results in a benchmark of 25 tasks based on 13 datasets, covering key aspects of climate discourse.
arXiv Detail & Related papers (2025-05-24T11:45:46Z)
- Interpretable Dual-Stream Learning for Local Wind Hazard Prediction in Vulnerable Communities [1.9299285312415735]
Wind hazards such as tornadoes and straight-line winds frequently affect vulnerable communities in the Great Plains of the United States. Existing forecasting systems focus primarily on meteorological elements and often fail to capture community-specific vulnerabilities. We propose a dual-stream learning framework that integrates structured numerical weather data with unstructured textual event narratives. Our architecture combines a Random Forest and a RoBERTa-based transformer through a late fusion mechanism, enabling robust and context-aware wind hazard prediction.
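A minimal sketch of the late-fusion idea under stated assumptions: each stream emits class probabilities, which are combined by a weighted average. The text_model_proba() stub stands in for the RoBERTa-based stream over event narratives, and the fusion weight alpha is a hypothetical hyperparameter rather than a value from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_num = rng.normal(size=(200, 8))   # stand-in for structured weather features
y = rng.integers(0, 2, size=200)    # stand-in for binary hazard labels

# Stream 1: Random Forest over the structured features.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_num, y)

def text_model_proba(narratives: list[str]) -> np.ndarray:
    """Stub standing in for a RoBERTa-based classifier over event narratives."""
    return np.tile([0.5, 0.5], (len(narratives), 1))

def late_fusion(X_batch: np.ndarray, narratives: list[str],
                alpha: float = 0.6) -> np.ndarray:
    """Weighted average of the two streams' class probabilities."""
    return alpha * rf.predict_proba(X_batch) + (1 - alpha) * text_model_proba(narratives)

print(late_fusion(X_num[:3], ["gust front toppled power lines"] * 3).argmax(axis=1))
```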
arXiv Detail & Related papers (2025-05-20T15:46:02Z)
- Air Quality Prediction with A Meteorology-Guided Modality-Decoupled Spatio-Temporal Network [47.699409089023696]
Air quality prediction plays a crucial role in public health and environmental protection. Existing works underestimate the critical role of atmospheric conditions in air quality prediction. MDSTNet is an encoder framework that explicitly captures atmosphere-pollution dependencies for prediction. ChinaAirNet is the first dataset combining air quality records with multi-pressure-level meteorological observations.
arXiv Detail & Related papers (2025-04-14T09:18:11Z)
- ClimateBench-M: A Multi-Modal Climate Data Benchmark with a Simple Generative Method [61.76389719956301]
We contribute a multi-modal climate benchmark, i.e., ClimateBench-M, which aligns time series climate data from ERA5, extreme weather events data from NOAA, and satellite image data from NASA. Under each data modality, we also propose a simple but strong generative method that could produce competitive performance in weather forecasting, thunderstorm alerts, and crop segmentation tasks.
arXiv Detail & Related papers (2025-04-10T02:22:23Z)
- LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
- RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.861013614500024]
We introduce a new benchmark designed to assess the critique capabilities of Large Language Models (LLMs). Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques.
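The closed-loop idea can be sketched in a few lines: rather than rating the critique text directly, score the correction it induces. The critique_fn/revise_fn placeholders and the exact-match judge below are illustrative simplifications, not RealCritic's actual protocol.

```python
def closed_loop_score(question: str, answer: str, reference: str,
                      critique_fn, revise_fn) -> float:
    """Score a critique by the correction it induces, not by its text."""
    critique = critique_fn(question, answer)          # model writes a critique
    revised = revise_fn(question, answer, critique)   # model applies the critique
    return 1.0 if revised.strip() == reference.strip() else 0.0

# Toy usage with trivial stand-ins for the two LLM calls:
print(closed_loop_score(
    "What is 2 + 2?", "5", "4",
    critique_fn=lambda q, a: "The sum is wrong; recompute it.",
    revise_fn=lambda q, a, c: "4",
))  # 1.0
```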
arXiv Detail & Related papers (2025-01-24T13:48:10Z)
- CLLMate: A Multimodal Benchmark for Weather and Climate Events Forecasting [28.560095276214543]
We propose Weather and Climate Event Forecasting (WCEF) to predict weather and climate events. CLLMate is the first dataset for WCEF, aligning 26,156 environmental news articles with ERA5 reanalysis data. We systematically benchmark 23 existing MLLMs on CLLMate, including closed-source, open-source, and our fine-tuned models.
arXiv Detail & Related papers (2024-09-27T18:00:13Z)
- Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather [21.040167521248772]
Existing LiDAR semantic segmentation methods often struggle with performance declines in adverse weather conditions.
Previous work has addressed this issue by simulating adverse weather or employing universal data augmentation during training.
We pinpoint the main causes of this performance degradation and propose targeted data augmentation techniques.
Our method achieves a notable 39.5 mIoU on the SemanticKITTI-to-SemanticSTF benchmark, improving over the baseline by 8.1 percentage points and establishing a new state-of-the-art.
arXiv Detail & Related papers (2024-07-02T14:19:51Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time, enabling dynamic, contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks.
We propose to automate dataset updating and provide systematic analysis regarding its effectiveness.
There are two updating strategies: 1) a mimicking strategy that generates similar samples based on the original data, and 2) an extending strategy that further expands existing samples.
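A hedged sketch of the two strategies as prompt templates; the prompt wording is illustrative rather than the paper's, and call_llm() is a placeholder for any chat-completion client.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; echoes for demonstration."""
    return f"[LLM output for prompt: {prompt[:40]}...]"

def mimic(sample: str) -> str:
    """Mimicking strategy: generate a similar sample from an original one."""
    return call_llm(
        "Write a new test question that probes the same knowledge as the "
        f"one below, using different surface wording:\n{sample}"
    )

def extend(sample: str) -> str:
    """Extending strategy: expand an existing sample into a harder variant."""
    return call_llm(
        "Rewrite the question below so that answering it requires one "
        f"additional reasoning step beyond recall:\n{sample}"
    )

print(mimic("Which year did Hurricane Hazel hit Toronto?"))
```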
arXiv Detail & Related papers (2024-02-19T07:15:59Z)
- TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction [131.7684896032888]
We present TextEE, a standardized, fair, and reproducible benchmark for event extraction.
TextEE comprises standardized data preprocessing scripts and splits for 16 datasets spanning eight diverse domains.
We evaluate five varied large language models on our TextEE benchmark and demonstrate how they struggle to achieve satisfactory performance.
arXiv Detail & Related papers (2023-11-16T04:43:03Z)
- AB2CD: AI for Building Climate Damage Classification and Detection [0.0]
We explore the implementation of deep learning techniques for precise building damage assessment in the context of natural hazards.
We tackle the challenges of generalization to novel disasters and regions while accounting for the influence of low-quality and noisy labels.
Our research findings showcase the potential and limitations of advanced AI solutions in enhancing the impact assessment of climate change-induced extreme weather events.
arXiv Detail & Related papers (2023-09-03T03:37:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.