VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs
- URL: http://arxiv.org/abs/2502.16602v1
- Date: Sun, 23 Feb 2025 15:04:23 GMT
- Title: VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs
- Authors: Yiming Yang, Yangyang Guo, Hui Lu, Yan Wang,
- Abstract summary: This paper reveals a largely under-explored problem in existing video-involved LVLMs - language bias. We first collect a Video Language Bias Evaluation Benchmark, which is specifically designed to assess the language bias in video-involved LVLMs. We also propose Multi-branch Contrastive Decoding (MCD), introducing two expert branches to simultaneously counteract language bias.
- Score: 37.52094200472755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Large Vision-Language Models (LVLMs) have made significant strides across diverse multimodal tasks and benchmarks. This paper reveals a largely under-explored problem in existing video-involved LVLMs - language bias, where models tend to prioritize language over video and thus produce incorrect responses. To address this research gap, we first collect a Video Language Bias Evaluation Benchmark, which is specifically designed to assess language bias in video-involved LVLMs through two key tasks: ambiguous video contrast and interrogative question probing. Accordingly, we design accompanying evaluation metrics that aim to penalize LVLMs that are biased by language. In addition, we propose Multi-branch Contrastive Decoding (MCD), which introduces two expert branches to simultaneously counteract the language bias potentially generated by the amateur text-only branch. Our experiments demonstrate that i) existing video-involved LVLMs, including both proprietary and open-source models, are largely limited by the language bias problem; ii) our MCD can effectively mitigate this issue and maintain general-purpose capabilities in various video-involved LVLMs without any additional retraining or alteration to model architectures.
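The abstract describes MCD only at a high level, so the following is a minimal, hypothetical sketch of one multi-branch contrastive decoding step rather than the authors' implementation: two video-conditioned "expert" branches are combined and contrasted against a text-only "amateur" branch so that tokens which are likely even without watching the video are penalized. The averaging rule, the weights alpha and beta, and the plausibility cutoff are illustrative assumptions.

```python
import torch

def mcd_next_token_logits(expert_logits_1: torch.Tensor,
                          expert_logits_2: torch.Tensor,
                          amateur_logits: torch.Tensor,
                          alpha: float = 1.0,
                          beta: float = 0.1) -> torch.Tensor:
    """Hypothetical multi-branch contrastive decoding step (not the paper's exact formula).

    expert_logits_1 / expert_logits_2: next-token logits from two video-conditioned
    (expert) branches; amateur_logits: logits from the text-only (amateur) branch.
    """
    # Combine the two expert branches (simple average; the paper may weight them differently).
    expert = 0.5 * (expert_logits_1 + expert_logits_2)
    # Contrast against the amateur branch: boost tokens supported by the video,
    # suppress tokens the model would predict from text alone (language bias).
    contrastive = (1.0 + alpha) * expert - alpha * amateur_logits
    # Adaptive plausibility constraint, as in standard contrastive decoding:
    # only keep tokens the expert itself considers reasonably likely.
    expert_probs = torch.softmax(expert, dim=-1)
    cutoff = beta * expert_probs.max(dim=-1, keepdim=True).values
    return contrastive.masked_fill(expert_probs < cutoff, float("-inf"))
```

Greedy or sampled decoding would then proceed from these adjusted logits at each generation step, which is consistent with the abstract's claim that no retraining or architectural change is required.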
Related papers
- Video Summarization with Large Language Models [41.51242348081083]
We propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs)
Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (MLLM)
Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks.
arXiv Detail & Related papers (2025-04-15T13:56:14Z)
- Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning [15.877954360180468]
Training Multimodal Large Language Models (MLLMs) is resource-intensive and constrained by various training limitations.
We propose the Modular-based Visual Contrastive Decoding (MVCD) framework to overcome this obstacle.
Our framework leverages LLMs' In-Context Learning (ICL) capability and the proposed visual contrastive-example decoding (CED)
Our results show consistent improvements in model accuracy and explain the effective components of our decoding strategy.
arXiv Detail & Related papers (2025-02-17T12:47:00Z)
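The MVCD entry above names in-context learning and contrastive-example decoding without implementation detail. Below is a rough, hypothetical illustration of the visual-contrastive idea: decode the same model once with the genuine visual context and once with a contrastive (distractor) example, then favor tokens that depend on the real visual evidence. The `model(input_ids=..., visual_inputs=...)` call signature and the gamma weight are placeholders, not MVCD's actual API.

```python
import torch

def visual_contrastive_logits(model, input_ids, real_visual, distractor_visual,
                              gamma: float = 1.0) -> torch.Tensor:
    """Hypothetical contrastive-example decoding step (placeholder model interface)."""
    with torch.no_grad():
        # Forward pass conditioned on the genuine visual input.
        logits_real = model(input_ids=input_ids, visual_inputs=real_visual).logits[:, -1, :]
        # Forward pass conditioned on a contrastive example (e.g. an unrelated image).
        logits_ctr = model(input_ids=input_ids, visual_inputs=distractor_visual).logits[:, -1, :]
    # Amplify visually grounded evidence, down-weight visually independent predictions.
    return (1.0 + gamma) * logits_real - gamma * logits_ctr
```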
- Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models [7.213221003652941]
This paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs. We first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Our proposed method substantially outperforms the state-of-the-art MLLM-based and zero-shot models on several public datasets.
arXiv Detail & Related papers (2025-01-14T09:45:10Z)
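As a rough outline of the tuning-free Moment-GPT pipeline summarized above, the sketch below chains three frozen components: an LLM that rewrites the query, an MLLM-based span generator, and an MLLM-based span scorer. All three callables are hypothetical placeholders, not Moment-GPT's actual interfaces.

```python
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)

def zero_shot_moment_retrieval(video: object,
                               query: str,
                               rewrite_query: Callable[[str], str],
                               propose_spans: Callable[[object, str], List[Span]],
                               score_span: Callable[[object, Span, str], float]) -> Span:
    """Hypothetical zero-shot video moment retrieval pipeline (placeholder components)."""
    # 1) Let a frozen LLM correct and rephrase the query to reduce language bias.
    clean_query = rewrite_query(query)
    # 2) Let an MLLM propose candidate (start, end) spans for the rephrased query.
    candidates = propose_spans(video, clean_query)
    # 3) Keep the candidate span the MLLM scores as best matching the query.
    return max(candidates, key=lambda span: score_span(video, span, clean_query))
```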
- Fine-grained Video-Text Retrieval: A New Benchmark and Method [25.2967056489715]
We present FIBER, a FIne-grained BEnchmark for text to video Retrieval, containing 1,000 videos sourced from the FineAction dataset. Uniquely, our FIBER benchmark provides detailed human-annotated spatial and temporal annotations for each video. Experiment results show that our Video Large Language Encoder (VLLE) performs comparably to CLIP-based models on traditional benchmarks.
arXiv Detail & Related papers (2024-12-31T15:53:50Z)
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
- TempCompass: Do Video LLMs Really Understand Videos? [36.28973015469766]
Existing benchmarks fail to provide comprehensive feedback on the temporal perception ability of Video LLMs.
We propose the TempCompass benchmark, which introduces a diversity of high-quality temporal aspects and task formats.
arXiv Detail & Related papers (2024-03-01T12:02:19Z)
- Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
- VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality and Semantic Translator, which converts inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z)
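The VideoLLM entry above mentions a Modality and Semantic Translator without further detail. The fragment below is a deliberately simplified stand-in that conveys the general idea of projecting encoder features into the LLM's embedding space so a video becomes an ordinary token sequence; the single linear projection and the dimensions are illustrative assumptions, not VideoLLM's architecture.

```python
import torch
import torch.nn as nn

class SimpleModalityTranslator(nn.Module):
    """Illustrative stand-in for a modality-to-token translator (not VideoLLM's module)."""

    def __init__(self, visual_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Assumed: one linear layer mapping video-encoder features to the LLM embedding size.
        self.proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, visual_dim) from a video encoder.
        # Returns (batch, num_frames, llm_dim) pseudo-token embeddings, which would
        # be concatenated with text token embeddings before the frozen LLM.
        return self.proj(frame_feats)
```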
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study on more advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)