MarineEval: Assessing the Marine Intelligence of Vision-Language Models
- URL: http://arxiv.org/abs/2512.21126v1
- Date: Wed, 24 Dec 2025 11:57:50 GMT
- Title: MarineEval: Assessing the Marine Intelligence of Vision-Language Models
- Authors: Yuk-Kwan Wong, Tuan-An To, Jipeng Zhang, Ziqiang Zheng, Sai-Kit Yeung
- Abstract summary: We construct the first large-scale marine VLM dataset and benchmark called MarineEval, with 2,000 image-based question-answering pairs. We benchmark 17 existing VLMs on MarineEval and investigate the limitations of existing models in answering marine research questions.
- Score: 35.08637645476385
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We have witnessed promising progress led by large language models (LLMs) and, more recently, vision-language models (VLMs) in handling various queries as general-purpose assistants. VLMs, as a bridge between the visual world and the language corpus, receive both visual content and text-only user instructions to generate corresponding responses. Although VLMs have achieved great success in various fields, in this work we ask whether existing VLMs can act as domain experts and accurately answer marine questions, which demand significant domain expertise and pose domain-specific challenges and requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark, called MarineEval, with 2,000 image-based question-answering pairs. During dataset construction, we ensure the diversity and coverage of the data across 7 task dimensions and 20 capacity dimensions. Domain requirements are specifically integrated into the data construction and further verified by marine domain experts. We comprehensively benchmark 17 existing VLMs on MarineEval and investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer domain-specific questions, and there is still large room for performance improvement. We hope our new benchmark and observations will facilitate future research. Project Page: http://marineeval.hkustvgd.com/
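The abstract describes scoring 17 VLMs on 2,000 image-based question-answering pairs but does not include the evaluation harness itself. Below is a minimal, hypothetical sketch of how such a benchmark could be scored: it assumes a JSON file of multiple-choice items and a pluggable `answer_question` callable standing in for any evaluated VLM. The file layout, field names, and the trivial first-option baseline are illustrative assumptions, not MarineEval's actual format or protocol.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical record layout (MarineEval's real schema is not given in the abstract):
# {"image": "...", "question": "...", "options": ["...", ...],
#  "answer": "B", "task": "...", "capacity": "..."}

def first_option_baseline(image_path: str, question: str, options: list[str]) -> str:
    """Trivial stand-in for a VLM call: always answers option 'A'."""
    return "A"

def evaluate(benchmark_json: str,
             answer_question: Callable[[str, str, list[str]], str]) -> dict:
    """Compute overall and per-task accuracy for a multiple-choice VQA benchmark."""
    items = json.loads(Path(benchmark_json).read_text())
    per_task: dict[str, list[bool]] = {}
    for item in items:
        pred = answer_question(item["image"], item["question"], item["options"])
        correct = pred.strip().upper().startswith(item["answer"].strip().upper())
        per_task.setdefault(item.get("task", "all"), []).append(correct)
    report = {task: sum(flags) / len(flags) for task, flags in per_task.items()}
    total = sum(len(flags) for flags in per_task.values())
    report["overall"] = sum(sum(flags) for flags in per_task.values()) / total
    return report

if __name__ == "__main__":
    # Swap first_option_baseline for a real VLM call to benchmark an actual model.
    print(evaluate("marineeval_sample.json", first_option_baseline))
```

Grouping correctness flags by the task field mirrors the paper's idea of reporting results along multiple task dimensions; a per-capacity breakdown would follow the same pattern with a different grouping key.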
Related papers
- MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios? [35.75859316774549]
We present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions (perception, cognition, and planning) in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Our experiments reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios.
arXiv Detail & Related papers (2025-12-29T05:49:54Z) - UVLM: Benchmarking Video Language Model for Underwater World Understanding [11.475921633970977]
We introduce UVLM, a benchmark for underwater video observation. The dataset includes 419 classes of marine animals and various static plants and terrains. Experiments on two representative VidLMs demonstrate that fine-tuning VidLMs on UVLM significantly improves underwater world understanding.
arXiv Detail & Related papers (2025-07-03T07:08:38Z) - SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z) - MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark [20.642661835794975]
We introduce MME-Industry, a novel benchmark designed specifically for evaluating MLLMs in industrial settings. The benchmark encompasses 21 distinct domains, comprising 1,050 question-answer pairs with 50 questions per domain. We provide both Chinese and English versions of the benchmark, enabling comparative analysis of MLLMs' capabilities across these languages.
arXiv Detail & Related papers (2025-01-28T03:56:17Z) - EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity. We evaluated state-of-the-art MLLMs on EmbodiedEval and found that they fall significantly short of human-level performance on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z) - R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models [51.468732121824125]
Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems.
Existing evaluation tools only provide a few baselines and evaluate them on various domains without mining the depth of domain knowledge.
In this paper, we address the challenges of evaluating RALLMs by introducing the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAG workflows.
arXiv Detail & Related papers (2024-06-17T15:59:49Z) - VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large language model (MLLM). It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z) - Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z) - Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [31.71954519657729]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs. We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, and its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z) - MarineGPT: Unlocking Secrets of Ocean to the Public [32.17362940242431]
Large language models (LLMs) have proven to be powerful tools in promoting the user experience as an AI assistant.
We propose MarineGPT, the first vision-language model specially designed for the marine domain, unlocking the secrets of the ocean to the public.
arXiv Detail & Related papers (2023-10-20T15:45:39Z)