Related papers: CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation

URL: http://arxiv.org/abs/2508.15846v1
Date: Tue, 19 Aug 2025 21:56:54 GMT
Title: CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation
Authors: Chenchen Kuai, Chenhao Wu, Yang Zhou, Xiubin Bruce Wang, Tianbao Yang, Zhengzhong Tu, Zihao Li, Yunlong Zhang
Abstract summary: Port operators need to rapidly synthesize diverse forecast products, such as probabilistic wind maps, into clear, actionable guidance as cyclones approach. We introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms.
Score: 50.2938859756212
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As tropical cyclones intensify and track forecasts become increasingly uncertain, U.S. ports face heightened supply-chain risk under extreme weather conditions. Port operators need to rapidly synthesize diverse multimodal forecast products, such as probabilistic wind maps, track cones, and official advisories, into clear, actionable guidance as cyclones approach. Multimodal large language models (MLLMs) offer a powerful means to integrate these heterogeneous data sources alongside broader contextual knowledge, yet their accuracy and reliability in the specific context of port cyclone preparedness have not been rigorously evaluated. To fill this gap, we introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms. Each scenario fuses multisource data (i.e., tropical cyclone products, port operational impact records, and port condition bulletins) and is expanded through an automated pipeline into 117,178 structured question-answer pairs. Using this benchmark, we conduct extensive experiments on diverse MLLMs, including both open-source and proprietary models. MLLMs demonstrate great potential in situation understanding but still face considerable challenges in reasoning tasks, including potential impact estimation and decision reasoning.

Related papers

R$^2$Energy: A Large-Scale Benchmark for Robust Renewable Energy Forecasting under Diverse and Extreme Conditions [13.981431155407249]
R$^2$Energy is a large-scale benchmark for NWP-assisted renewable energy forecasting. It comprises over 10.7 million high-fidelity hourly records from 902 wind and solar stations across four provinces in China.
arXiv Detail & Related papers (2026-02-17T19:22:49Z)
Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning [70.56067503630486]
We argue that sixth-generation (6G) intelligence is not fluent token prediction but the calibrated capacity to imagine and choose. We show that WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference.
arXiv Detail & Related papers (2025-11-04T17:22:22Z)
AgentCaster: Reasoning-Guided Tornado Forecasting [2.8271273825420606]
AgentCaster is a framework to evaluate Large Language Models (LLMs) on complex, real-world tasks. We assess model performance over a 40-day period featuring diverse historical data, spanning several major tornado outbreaks and including over 500 tornado reports. Human experts significantly outperform state-of-the-art models, which demonstrate a strong tendency to hallucinate and overpredict risk intensity, struggle with precise geographic placement, and exhibit poor temporal reasoning in complex, dynamically evolving systems.
arXiv Detail & Related papers (2025-10-02T17:57:16Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models. It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
FLP-XR: Future Location Prediction on Extreme Scale Maritime Data in Real-time [0.8937169040399775]
This paper introduces FLP-XR, a model that leverages maritime mobility data to construct a robust framework that offers precise predictions. We demonstrate the efficiency of our approach through an extensive experimental study using three real-world AIS datasets.
arXiv Detail & Related papers (2025-03-10T13:31:42Z)
Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights [50.89022445197919]
We propose a speech-specific risk taxonomy, covering 8 risk categories under hostility (malicious sarcasm and threats), malicious imitation (age, gender, ethnicity), and stereotypical biases (age, gender, ethnicity). Based on the taxonomy, we create a small-scale dataset for evaluating current LMMs' capability in detecting these categories of risk.
arXiv Detail & Related papers (2024-06-25T10:08:45Z)
Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond [58.39457881271146]
We introduce a novel framework of combinatorial multi-armed bandits (CMAB) with multivariant and probabilistically triggering arms (CMAB-MT). Compared with existing CMAB works, CMAB-MT not only enhances the modeling power but also allows improved results by leveraging distinct statistical properties for multivariant random variables. Our framework can include many important problems as applications, such as episodic reinforcement learning (RL) and probabilistic maximum coverage for goods distribution.
arXiv Detail & Related papers (2024-06-03T14:48:53Z)
BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models [52.46248487458641]
Predictive models often need to work with incomplete information in real-world tasks. Current large language models (LLMs) are insufficient for accurate estimations. We propose BIRD, a novel probabilistic inference framework.
arXiv Detail & Related papers (2024-04-18T20:17:23Z)
Global Tropical Cyclone Intensity Forecasting with Multi-modal Multi-scale Causal Autoregressive Model [22.715152977444742]
We propose a Multi-modal, multi-Scale Causal Autoregressive model (MSCAR) for global Tropical Cyclone intensity autoregressive forecasting. MSCAR combines causal relationships with large-scale multi-temporal data for global TC intensity autoregressive forecasting. We present the Satellite and ERA5-based Tropical Cyclone dataset (SETCD), which stands as the longest and most comprehensive global dataset related to variables.
arXiv Detail & Related papers (2024-02-16T15:26:33Z)
Residual Corrective Diffusion Modeling for Km-scale Atmospheric Downscaling [58.456404022536425]
State of the art for physical hazard prediction from weather and climate requires expensive km-scale numerical simulations driven by coarser resolution global inputs. Here, a generative diffusion architecture is explored for downscaling such global inputs to km-scale, as a cost-effective machine learning alternative. The model is trained to predict 2km data from a regional weather model over Taiwan, conditioned on a 25km global reanalysis.
arXiv Detail & Related papers (2023-09-24T19:57:22Z)
Multiscale Causal Structure Learning [26.66862801441497]
This paper presents a novel method, named Multiscale-Causal Structure Learning (MS-CASTLE), to estimate the robustness of linear causal structures. We studied the risk of global equity markets during the COVID-19 pandemic, illustrating how MS-CASTLE can extract meaningful information. We identified the stock markets that drove the risk during the considered period: Brazil, Canada and Italy.
arXiv Detail & Related papers (2022-07-16T11:47:32Z)
MUSE-VAE: Multi-Scale VAE for Environment-Aware Long Term Trajectory Prediction [28.438787700968703]
Conditional MUSE offers diverse and simultaneously more accurate predictions compared to the current state-of-the-art. We demonstrate these assertions through a comprehensive set of experiments on nuScenes and SDD benchmarks as well as PFSD, a new synthetic dataset.
arXiv Detail & Related papers (2022-01-18T18:40:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.