From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
- URL: http://arxiv.org/abs/2503.20472v1
- Date: Wed, 26 Mar 2025 11:53:03 GMT
- Title: From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
- Authors: Yucheng Suo, Fan Ma, Linchao Zhu, Tianyi Wang, Fengyun Rao, Yi Yang
- Abstract summary: Multi-modal Large language models (MLLMs) can only process a finite number of frames in a single inference. We propose generating multiple predictions through visual context sampling, followed by a scoring mechanism to select the final prediction. Experiments show that this approach covers the correct answer for a high percentage of long video questions.
- Score: 51.3011761744484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal Large language models (MLLMs) show remarkable ability in video understanding. Nevertheless, understanding long videos remains challenging as the models can only process a finite number of frames in a single inference, potentially omitting crucial visual information. To address the challenge, we propose generating multiple predictions through visual context sampling, followed by a scoring mechanism to select the final prediction. Specifically, we devise a bin-wise sampling strategy that enables MLLMs to generate diverse answers based on various combinations of keyframes, thereby enriching the visual context. To determine the final prediction from the sampled answers, we employ a self-reward by linearly combining three scores: (1) a frequency score indicating the prevalence of each option, (2) a marginal confidence score reflecting the inter-intra sample certainty of MLLM predictions, and (3) a reasoning score for different question types, including clue-guided answering for global questions and temporal self-refocusing for local questions. The frequency score ensures robustness through majority correctness, the confidence-aligned score reflects prediction certainty, and the typed-reasoning score addresses cases with sparse key visual information using tailored strategies. Experiments show that this approach covers the correct answer for a high percentage of long video questions, and results on seven datasets show that our method improves the performance of three MLLMs.
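The two-stage procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bin sizes, the aggregation of the marginal confidence and typed-reasoning scores, and the unit score weights are all assumptions made here for clarity.

```python
import random
from collections import Counter

def binwise_sample(num_frames, num_bins, frames_per_bin, seed=None):
    """Split a video into equal temporal bins and draw keyframes from each,
    yielding one diverse frame combination per call (sketch of the paper's
    bin-wise sampling; the exact bin layout is an assumption)."""
    rng = random.Random(seed)
    bin_size = num_frames / num_bins
    combo = []
    for b in range(num_bins):
        lo, hi = int(b * bin_size), int((b + 1) * bin_size)
        combo.extend(rng.sample(range(lo, hi), min(frames_per_bin, hi - lo)))
    return sorted(combo)

def self_reward(samples, weights=(1.0, 1.0, 1.0)):
    """Pick the final answer by linearly combining the three scores named in
    the abstract. `samples` is a list of (answer, confidence, reasoning_score)
    triples from repeated inferences; averaging per option is a simplification
    of the paper's marginal-confidence and typed-reasoning scores."""
    w_freq, w_conf, w_reason = weights
    counts = Counter(a for a, _, _ in samples)
    n = len(samples)
    best, best_score = None, float("-inf")
    for option in counts:
        freq = counts[option] / n                        # (1) frequency score
        confs = [c for a, c, _ in samples if a == option]
        conf = sum(confs) / len(confs)                   # (2) confidence score
        reasons = [r for a, _, r in samples if a == option]
        reason = sum(reasons) / len(reasons)             # (3) reasoning score
        score = w_freq * freq + w_conf * conf + w_reason * reason
        if score > best_score:
            best, best_score = option, score
    return best
```

Each call to `binwise_sample` would feed one keyframe combination to the MLLM; the resulting answers and scores are then aggregated by `self_reward`.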
Related papers
- HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding [79.06209664703258]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
- Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding [18.027290155746112]
Temporal Search is a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long video understanding. It is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. It refines the model's focus by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos.
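The iterative zoom-in idea in this entry can be sketched as a confidence-guided interval refinement. This is an illustration of the general mechanism only; `confidence_fn`, the number of splits, and the recursion depth are stand-ins, not the paper's actual procedure.

```python
def temporal_zoom(confidence_fn, start, end, depth=3, splits=2):
    """Training-free interval refinement (sketch): at each step, split the
    current interval, score each sub-interval with the model's generation
    confidence, and recurse into the most confident one.
    `confidence_fn(a, b)` stands in for an MLLM confidence query."""
    for _ in range(depth):
        step = (end - start) / splits
        subs = [(start + i * step, start + (i + 1) * step) for i in range(splits)]
        start, end = max(subs, key=lambda iv: confidence_fn(*iv))
    return start, end
```

With a confidence function peaked around the moment that answers the question, the returned interval shrinks toward that moment after a few iterations.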
arXiv Detail & Related papers (2025-06-28T15:24:05Z)
- Prompts to Summaries: Zero-Shot Language-Guided Video Summarization [12.200609701777907]
We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer. It converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging. Our pipeline generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods.
arXiv Detail & Related papers (2025-06-12T15:23:11Z)
- Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings [14.065907685322097]
We present a comprehensive evaluation of the automated deception detection capabilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs). We assess the performance of both open-source and commercial LLMs on three distinct datasets: real-life trial interviews (RLTD), instructed deception in interpersonal scenarios (MU3D), and deceptive reviews (OpSpam). Our results show that fine-tuned LLMs achieve state-of-the-art performance on textual deception detection tasks, while LMMs struggle to fully leverage cross-modal cues.
arXiv Detail & Related papers (2025-06-11T06:12:50Z)
- MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in the QA task on our proposed AVHaystacks.
arXiv Detail & Related papers (2025-06-08T06:34:29Z)
- Adaptive Keyframe Sampling for Long Video Understanding [75.7837692594814]
This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS).
It inserts a plug-and-play module that aims to maximize the useful information carried by a fixed number of video tokens.
Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy by selecting informative frames.
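The goal of keyframe selection under a fixed token budget can be sketched as a greedy trade-off between relevance and temporal coverage. This is not AKS's actual algorithm; the relevance scores, the budget, and the minimum-gap heuristic are illustrative assumptions.

```python
def select_keyframes(relevance, budget, min_gap=1):
    """Greedy keyframe selection (sketch): pick the frames most relevant to
    the question, up to a fixed budget, while enforcing a minimum temporal
    gap so the selection also spreads across the video."""
    order = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    chosen = []
    for i in order:
        if len(chosen) == budget:
            break
        # Skip frames too close to an already-chosen one to preserve coverage.
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
    return sorted(chosen)
```

Raising `min_gap` trades per-frame relevance for broader temporal coverage, which is the tension any fixed-token-budget sampler has to balance.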
arXiv Detail & Related papers (2025-02-28T17:46:29Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly [44.31985939516153]
Multimodal Large Language Models (MLLMs) have displayed remarkable performance in multi-modal tasks.
We show that MLLMs often generate incorrect answers even when they understand the visual content.
We propose to enhance the model's focus on visual content during decoding by refining the text and visual prompt.
arXiv Detail & Related papers (2024-06-15T13:58:26Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering [11.244643114253773]
Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos.
We propose a novel weakly supervised framework to enforce the LMMs to reason out the answers with question-critical moments as visual inputs.
arXiv Detail & Related papers (2024-01-19T14:21:46Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Leveraging VLM-Based Pipelines to Annotate 3D Objects [68.51034848207355]
We propose an alternative algorithm to marginalize over factors such as the viewpoint that affect the VLM's response.
Instead of merging text-only responses, we utilize the VLM's joint image-text likelihoods.
We show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.
arXiv Detail & Related papers (2023-11-29T17:54:22Z)
- Interventional Video Grounding with Dual Contrastive Learning [16.0734337895897]
Video grounding aims to localize a moment from an untrimmed video for a given textual query.
We propose a novel paradigm from the perspective of causal inference to uncover the causality behind the model and data.
We also introduce a dual contrastive learning approach to better align the text and video.
arXiv Detail & Related papers (2021-06-21T12:11:28Z)
- Self-supervised pre-training and contrastive representation learning for multiple-choice video QA [39.78914328623504]
Video Question Answering (Video QA) requires fine-grained understanding of both video and language modalities to answer the given questions.
We propose novel training schemes for multiple-choice video question answering with a self-supervised pre-training stage and a supervised contrastive learning in the main stage as an auxiliary learning.
We evaluate our proposed model on highly competitive benchmark datasets related to multiple-choice video QA: TVQA, TVQA+, and DramaQA.
arXiv Detail & Related papers (2020-09-17T03:37:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.