V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
- URL: http://arxiv.org/abs/2503.17736v1
- Date: Sat, 22 Mar 2025 11:30:46 GMT
- Title: V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
- Authors: Yiming Zhao, Yu Zeng, Yukun Qi, YaoYang Liu, Lin Chen, Zehui Chen, Xikun Bao, Jie Zhao, Feng Zhao
- Abstract summary: Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently. Current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language. We propose the Video Visual Prompt Benchmark (V2P-Bench), a benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios.
- Score: 17.038321383586037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently. However, current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: https://github.com/gaotiexinqu/V2P-Bench.
Related papers
- H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding [25.111988967973147]
Existing benchmarks for evaluating video understanding exhibit significant limitations in coverage, task diversity, and scene adaptability.
We propose a hierarchical and holistic video understanding benchmark designed to evaluate both general video and online streaming video comprehension.
This benchmark contributes three key features: extended video duration, comprehensive assessment tasks, and enriched video data.
arXiv Detail & Related papers (2025-03-31T12:32:51Z) - BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks.
Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content.
We introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z) - SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding [56.78088668917983]
We introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains.
We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs of 1,353 streaming videos.
Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding.
arXiv Detail & Related papers (2025-02-15T14:29:44Z) - HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding [52.696422425058245]
We build a large-scale hour-long long video benchmark, HLV-1K, designed to evaluate long video understanding models.
HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs.
We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks.
arXiv Detail & Related papers (2025-01-03T05:32:37Z) - Can LVLMs Describe Videos like Humans? A Five-in-One Video Annotations Benchmark for Better Human-Machine Comparison [15.363132825156477]
Video description serves as a fundamental task for evaluating video comprehension, requiring a deep understanding of spatial and temporal dynamics.
Current benchmarks for video comprehension have notable limitations, including short video durations, brief annotations, and reliance on a single annotator's perspective.
We propose a novel benchmark, FIOVA, designed to evaluate the differences between LVLMs and human understanding more comprehensively.
arXiv Detail & Related papers (2024-10-20T03:59:54Z) - TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [118.08008540513596]
Video-MME is the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis.
We extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models.
Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models.
arXiv Detail & Related papers (2024-05-31T17:59:47Z) - Enhancing Video Transformers for Action Understanding with VLM-aided Training [10.02739652443895]
We propose a framework that takes advantage of the complementary strengths of ViTs and VLMs.
The FTP framework adds processors that focus on specific aspects of human action in videos.
We achieve remarkable top-1 accuracy of 93.8% on Kinetics-400 and Something-Something V2, surpassing VideoMAEv2 by 2.8% and 2.6%, respectively.
arXiv Detail & Related papers (2024-03-24T12:55:50Z) - Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)