MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
- URL: http://arxiv.org/abs/2602.22932v1
- Date: Thu, 26 Feb 2026 12:24:17 GMT
- Title: MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
- Authors: Wenhui Tan, Xiaoyi Yu, Jiaze Li, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Ruihua Song, Jian Luan
- Abstract summary: We present MLLM-Sampler Joint Evolution (MSJoE) for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process.
- Score: 25.20420111814606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and the sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query reasoning, frame sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline method.
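The abstract does not include an implementation, but the frame-selection stage it describes can be sketched concretely. Below is a minimal, hypothetical Python/PyTorch illustration: the sampler architecture (a small MLP over the query-frame similarity matrix), the top-k selection rule, and all dimensions are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightSampler(nn.Module):
    """Predicts per-frame sampling weights from a query-frame similarity matrix.

    The MLP architecture here is an assumption; the paper only says the
    sampler is 'lightweight' and operates on the similarity matrix.
    """
    def __init__(self, num_queries: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_queries, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, sim: torch.Tensor) -> torch.Tensor:
        # sim: (num_frames, num_queries) similarities between frozen-CLIP
        # frame embeddings and the reasoned query embeddings.
        return self.mlp(sim).squeeze(-1).softmax(dim=0)  # (num_frames,) weights

def select_keyframes(frame_emb, query_emb, sampler, k=8):
    # Query-frame similarity matrix from L2-normalized CLIP features.
    sim = F.normalize(frame_emb, dim=-1) @ F.normalize(query_emb, dim=-1).T
    weights = sampler(sim)
    idx = weights.topk(k).indices.sort().values  # keep temporal order
    return idx, weights

# Toy usage: 256 candidate frames, 4 reasoned queries, CLIP embedding dim 512.
frame_emb = torch.randn(256, 512)
query_emb = torch.randn(4, 512)
sampler = LightweightSampler(num_queries=4)
idx, w = select_keyframes(frame_emb, query_emb, sampler, k=8)
# idx holds the compact key-frame set that would be fed to the MLLM.
```

In the full method, these sampling weights would not be trained in isolation: the abstract states that the sampler and the MLLM are optimized jointly with reinforcement learning.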
Related papers
- RETLLM: Training and Data-Free MLLMs for Multimodal Information Retrieval [2.2125276321198677]
Multimodal information retrieval (MMIR) has gained attention for its flexibility in handling text, images, or mixed queries and candidates. Recent breakthroughs in multimodal large language models (MLLMs) boost MMIR performance by incorporating MLLM knowledge under the contrastive finetuning framework. We introduce a novel framework, RetLLM, designed to query MLLMs for MMIR in a training- and data-free manner.
arXiv Detail & Related papers (2026-02-25T10:31:32Z) - TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding [25.675553077419274]
Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision tasks, yet they still face challenges when processing long-duration inputs. We propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks and shows transferable ability across different cutting-edge Video-MLLMs.
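This summary does not give TSPO's actual objective, and MSJoE's abstract likewise leaves its RL recipe unspecified. As a generic illustration of what optimizing a temporal sampling policy with reinforcement learning can look like, here is a REINFORCE-style sketch with an answer-correctness reward; the policy form, reward, and update rule are all illustrative assumptions, not either paper's method.

```python
import torch
import torch.nn as nn

class SamplingPolicy(nn.Module):
    """Scores frames; sampled indices define the MLLM's visual input."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats: torch.Tensor):
        logits = self.score(frame_feats).squeeze(-1)  # (num_frames,)
        return torch.distributions.Categorical(logits=logits)

policy = SamplingPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reinforce_step(frame_feats, reward_fn, k=8):
    dist = policy(frame_feats)
    idx = dist.sample((k,))   # k frame indices (with replacement, for simplicity)
    reward = reward_fn(idx)   # e.g. 1.0 if the downstream MLLM answers correctly
    # REINFORCE: scale the log-probability of the sampled frames by the reward.
    loss = -dist.log_prob(idx).sum() * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```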
arXiv Detail & Related papers (2025-08-06T12:03:36Z) - Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders [62.58375366359421]
Long video understanding with Multimodal Large Language Models (MLLMs) remains a challenging problem. Traditional uniform sampling leads to selection of irrelevant content, while post-training MLLMs on thousands of frames imposes a substantial computational burden. We propose threading keyframes with narratives (Nar-KFC) to facilitate effective and efficient long video perception.
arXiv Detail & Related papers (2025-05-30T03:04:28Z) - BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content. We introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z) - Adaptive Keyframe Sampling for Long Video Understanding [75.7837692594814]
This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS), a plug-and-play module that aims to maximize the useful information carried by a fixed number of video tokens. Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy by selecting informative keyframes.
arXiv Detail & Related papers (2025-02-28T17:46:29Z) - InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. We develop a new version, InternVideo2.5, with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos. Experimental results demonstrate that this unique LRC design greatly improves the results of video MLLMs on mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z) - TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal content.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z) - VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos [67.78336281317347]
Long-form video understanding is complicated by the high redundancy of video data and the abundance of query-irrelevant information. We propose VideoTree, a training-free framework which builds a query-adaptive and hierarchical video representation for LLM reasoning over long-form videos.
arXiv Detail & Related papers (2024-05-29T15:49:09Z) - Elysium: Exploring Object-level Perception in Videos via MLLM [11.02937968639935]
We propose an end-to-end trainable MLLM that attempts to conduct object-level tasks in videos without requiring any additional plug-in or expert models.
arXiv Detail & Related papers (2024-03-25T09:17:15Z) - DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes [11.2645921649719]
Recent large vision-language models (LVLMs) for video understanding are primarily fine-tuned with various data scraped from online platforms. While current LVLMs are primarily trained on existing datasets in broad, general-purpose settings, adapting them to specific downstream scenarios remains challenging. We propose a three-stage framework named DreamFrame for automatically generating style-consistent keyframes and corresponding question-answer pairs.
arXiv Detail & Related papers (2024-03-03T07:43:39Z)