Hallucination Mitigation Prompts Long-term Video Understanding
- URL: http://arxiv.org/abs/2406.11333v1
- Date: Mon, 17 Jun 2024 08:44:03 GMT
- Title: Hallucination Mitigation Prompts Long-term Video Understanding
- Authors: Yiwei Sun, Zhihang Liu, Chuanbin Liu, Bowei Pu, Zhihan Zhang, Hongtao Xie
- Abstract summary: This paper constructs a comprehensive hallucination mitigation pipeline based on existing MLLMs.
We use the CLIP Score to guide the frame sampling process with questions, selecting key frames relevant to the question.
During the answer generation stage, we utilize chain-of-thought and in-context learning techniques to explicitly control the generation of answers.
- Score: 36.26790392889717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, multimodal large language models have made significant advancements in video understanding tasks. However, their ability to understand unprocessed long videos is very limited, primarily due to the difficulty in supporting the enormous memory overhead. Although existing methods achieve a balance between memory and information by aggregating frames, they inevitably introduce the severe hallucination issue. To address this issue, this paper constructs a comprehensive hallucination mitigation pipeline based on existing MLLMs. Specifically, we use the CLIP Score to guide the frame sampling process with questions, selecting key frames relevant to the question. Then, we inject question information into the queries of the image Q-former to obtain more important visual features. Finally, during the answer generation stage, we utilize chain-of-thought and in-context learning techniques to explicitly control the generation of answers. It is worth mentioning that for the breakpoint mode, we found that image understanding models achieved better results than video understanding models. Therefore, we aggregated the answers from both types of models using a comparison mechanism. Ultimately, we achieved 84.2% and 62.9% for the global and breakpoint modes respectively on the MovieChat dataset, surpassing the official baseline model by 29.1% and 24.1%. Moreover, the proposed method won third place in the CVPR LOVEU 2024 Long-Term Video Question Answering Challenge. The code is available at https://github.com/lntzm/CVPR24Track-LongVideo
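The question-guided frame sampling step can be illustrated with a short, self-contained sketch: score decoded frames against the question with CLIP and keep the top-k most relevant ones. The checkpoint name, the top-k value, and the assumption that frames are already decoded are illustrative choices, not the authors' exact configuration.

```python
# Question-guided key-frame selection via CLIP similarity (illustrative sketch).
# Frame decoding is assumed to happen elsewhere; model name and top_k are
# placeholder choices, not necessarily those used in the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def select_key_frames(frames: list[Image.Image], question: str, top_k: int = 8):
    """Return the top_k frames whose CLIP image embedding is most similar to the question."""
    text_inputs = processor(text=[question], return_tensors="pt", padding=True, truncation=True)
    text_emb = model.get_text_features(**text_inputs)            # (1, d)
    image_inputs = processor(images=frames, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)         # (N, d)

    # Cosine similarity between each frame and the question.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    scores = (image_emb @ text_emb.T).squeeze(-1)                # (N,)

    keep = scores.topk(min(top_k, len(frames))).indices.sort().values  # keep temporal order
    return [frames[i] for i in keep.tolist()], scores
```

The selected frames (kept in temporal order) would then be passed to the MLLM together with the question; the chain-of-thought/in-context prompting and the image-vs-video answer comparison are separate stages not shown here.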
Related papers
- LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding [41.9477837230283]
LongVideoBench is a question-answering benchmark that features video-language interleaved inputs up to an hour long.
Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes.
We formulate a novel video question-answering task termed referring reasoning.
arXiv Detail & Related papers (2024-07-22T16:00:55Z)
- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos [51.547065479762715]
We present a methodology tailored for comprehending videos of arbitrary lengths.
We also introduce the TVQA-long benchmark, designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content.
Our results indicate that our models achieve significant improvements in both long- and short-video understanding.
arXiv Detail & Related papers (2024-07-17T15:59:32Z)
- Encoding and Controlling Global Semantics for Long-form Video Question Answering [40.129800076300434]
We introduce a state space layer (SSL) into a multi-modal Transformer to efficiently integrate the global semantics of the video.
Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations.
To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks Ego-QA and MAD-QA featuring videos of considerably long length.
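A minimal PyTorch sketch of the gating idea described above, with the state space layer replaced by a simple mean-pooled global summary so that the controllable gate itself stays visible; all module names and dimensions are illustrative assumptions rather than the paper's implementation.

```python
# Gated injection of a global video summary into per-frame visual tokens (illustrative).
# The real method computes global semantics with a state space layer; a mean pool
# stands in here so the gating mechanism stays readable.
import torch
import torch.nn as nn

class GlobalSemanticsGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.global_proj = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, dim)
        global_summary = self.global_proj(visual_tokens.mean(dim=1, keepdim=True))  # (B, 1, D)
        global_summary = global_summary.expand_as(visual_tokens)
        # The gate decides, per token and channel, how much global context flows in.
        g = self.gate(torch.cat([visual_tokens, global_summary], dim=-1))
        return visual_tokens + g * global_summary

# Example: 64 visual tokens of 768-d features for a batch of 2 videos.
tokens = torch.randn(2, 64, 768)
gated = GlobalSemanticsGate(768)(tokens)   # same shape, globally contextualized
```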
arXiv Detail & Related papers (2024-05-30T06:10:10Z)
- Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering [11.244643114253773]
Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos.
We propose a novel weakly supervised framework that forces the LMMs to reason out answers using question-critical moments as visual inputs.
arXiv Detail & Related papers (2024-01-19T14:21:46Z)
- A Simple LLM Framework for Long-Range Video Question-Answering [63.50439701867275]
We present LLoVi, a language-based framework for long-range video question-answering (LVQA).
Our approach uses a frame/clip-level visual captioner coupled with a Large Language Model (GPT-3.5, GPT-4).
Our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain).
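The caption-then-reason recipe summarized above can be sketched in a few lines: caption each clip, concatenate the captions, and let an LLM answer. The `caption_clip` callable, the prompt wording, and the model choice are placeholders, not LLoVi's exact pipeline.

```python
# Caption-then-reason pipeline for long-range VideoQA (illustrative sketch).
# `caption_clip` is a placeholder for any frame/clip-level visual captioner.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_long_video(clips, question: str, caption_clip) -> str:
    # 1) Turn the long video into short textual descriptions, one per clip.
    captions = [f"[{i:03d}] {caption_clip(clip)}" for i, clip in enumerate(clips)]
    # 2) Let the LLM reason over the concatenated, chronologically ordered captions.
    prompt = (
        "You are given chronological captions of a long video.\n"
        + "\n".join(captions)
        + f"\n\nQuestion: {question}\nAnswer concisely based only on the captions."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the paper reports results with GPT-3.5 / GPT-4
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```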
arXiv Detail & Related papers (2023-12-28T18:58:01Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
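A rough sketch of the cascaded selection idea: score temporal segments against the question, keep only the top segments, and then attend over regions inside those segments. The scoring functions and tensor shapes are simplified assumptions, not MIST's actual modules.

```python
# Cascaded segment-then-region selection (illustrative sketch of the MIST idea).
# Instead of dense attention over all frames x regions, first pick the segments
# most relevant to the question, then attend over regions inside the kept segments.
import torch
import torch.nn.functional as F

def cascaded_select(region_feats: torch.Tensor, question_emb: torch.Tensor,
                    top_segments: int = 4) -> torch.Tensor:
    """
    region_feats: (num_segments, regions_per_segment, dim) visual features
    question_emb: (dim,) question embedding
    Returns question-conditioned features from the selected segments.
    """
    # 1) Segment selection: score each segment by its mean region feature.
    seg_feats = region_feats.mean(dim=1)                          # (S, D)
    seg_scores = seg_feats @ question_emb                         # (S,)
    keep = seg_scores.topk(min(top_segments, seg_feats.size(0))).indices
    selected = region_feats[keep]                                 # (K, R, D)

    # 2) Region attention within the kept segments only.
    attn = F.softmax(selected @ question_emb, dim=-1)             # (K, R)
    return (attn.unsqueeze(-1) * selected).sum(dim=1)             # (K, D)
```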
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
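A minimal sketch of the frame-selection gating component mentioned above: soft per-frame gates, computed from question-frame similarity, down-weight frames unlikely to help answer the question. The layer sizes and the similarity function are illustrative assumptions, not the paper's exact design.

```python
# Frame-selection gating (illustrative): softly re-weight frame features by their
# relevance to the question before they are fused with other sources.
import torch
import torch.nn as nn

class FrameSelectionGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, frame_feats: torch.Tensor, question_emb: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, dim); question_emb: (dim,)
        q = question_emb.unsqueeze(0).expand_as(frame_feats)                    # (T, D)
        gate = torch.sigmoid(self.score(torch.cat([frame_feats, q], dim=-1)))   # (T, 1)
        return gate * frame_feats   # relevant frames pass through, others are damped
```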
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.