Efficient Motion-Aware Video MLLM
- URL: http://arxiv.org/abs/2503.13016v1
- Date: Mon, 17 Mar 2025 10:20:05 GMT
- Title: Efficient Motion-Aware Video MLLM
- Authors: Zijia Zhao, Yuqi Huo, Tongtian Yue, Longteng Guo, Haoyu Lu, Bingning Wang, Weipeng Chen, Jing Liu
- Abstract summary: We introduce EMA, an Efficient Motion-Aware video MLLM that utilizes compressed video structures as inputs. We propose a motion-aware GOP encoder that fuses spatial and motion information within a GOP unit in the compressed video stream, generating compact, informative visual tokens. We also introduce MotionBench, a benchmark for evaluating motion understanding across four motion types.
- Score: 17.383054449879985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most current video MLLMs rely on uniform frame sampling and image-level encoders, resulting in inefficient data processing and limited motion awareness. To address these challenges, we introduce EMA, an Efficient Motion-Aware video MLLM that utilizes compressed video structures as inputs. We propose a motion-aware GOP (Group of Pictures) encoder that fuses spatial and motion information within a GOP unit in the compressed video stream, generating compact, informative visual tokens. By integrating fewer but denser RGB frames with more but sparser motion vectors in this native slow-fast input architecture, our approach reduces redundancy and enhances motion representation. Additionally, we introduce MotionBench, a benchmark for evaluating motion understanding across four motion types: linear, curved, rotational, and contact-based. Experimental results show that EMA achieves state-of-the-art performance on both MotionBench and popular video question answering benchmarks, while reducing inference costs. Moreover, EMA demonstrates strong scalability, as evidenced by its competitive performance on long video understanding benchmarks.
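The slow-fast GOP input the abstract describes can be given a rough shape in code: a few dense RGB key frames carry spatial detail, many sparse motion-vector fields carry temporal detail, and the two are fused into one compact token set. The following is a minimal PyTorch sketch only; the module names, shapes, and cross-attention fusion are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class GOPEncoderSketch(nn.Module):
    """Hypothetical motion-aware GOP encoder: RGB tokens attend to MV tokens."""

    def __init__(self, dim=256):
        super().__init__()
        # Patchify the (few, dense) RGB key frames of a GOP.
        self.rgb_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Patchify the (many, sparse) 2-channel motion-vector fields.
        self.mv_encoder = nn.Conv2d(2, dim, kernel_size=16, stride=16)
        # Fuse: RGB tokens query the stacked motion tokens.
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, rgb, mvs):
        # rgb: (B, 3, H, W) key frame; mvs: (B, T, 2, H, W) motion vectors.
        B, T = mvs.shape[:2]
        q = self.rgb_encoder(rgb).flatten(2).transpose(1, 2)   # (B, N, dim)
        kv = self.mv_encoder(mvs.flatten(0, 1)).flatten(2)     # (B*T, dim, N)
        kv = kv.transpose(1, 2).reshape(B, -1, q.shape[-1])    # (B, T*N, dim)
        fused, _ = self.fuse(q, kv, kv)
        return q + fused  # compact visual tokens for one GOP

tokens = GOPEncoderSketch()(torch.randn(1, 3, 224, 224),
                            torch.randn(1, 8, 2, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 256]) -- one frame's worth of tokens
```

The point of such a structure is token economy: a whole GOP contributes only as many tokens as a single frame, with motion folded in through attention rather than through extra tokens.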
Related papers
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
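A hedged sketch of that placement follows, assuming a single temporal self-attention layer applied independently at each spatial position; STORM's actual module may differ.

```python
import torch
import torch.nn as nn

class TemporalEncoderSketch(nn.Module):
    """Hypothetical temporal mixer between a frozen image encoder and the LLM."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (B, T, N, D) per-frame tokens from the image encoder.
        B, T, N, D = frame_tokens.shape
        x = frame_tokens.permute(0, 2, 1, 3).reshape(B * N, T, D)
        x = self.layer(x)  # attend across time at each spatial position
        return x.reshape(B, N, T, D).permute(0, 2, 1, 3)  # back to (B, T, N, D)

out = TemporalEncoderSketch()(torch.randn(2, 16, 49, 768))
print(out.shape)  # torch.Size([2, 16, 49, 768])
```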
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent [58.09607975296408]
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation.
The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields.
We construct a subset of VBench to evaluate how well the motion information in the text aligns with that in the generated video; MotionAgent outperforms other advanced models on motion generation accuracy.
arXiv Detail & Related papers (2025-02-05T14:26:07Z)
- MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models [30.139277087078764]
MotionBench is an evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. It includes data collected from diverse sources, ensuring a broad representation of real-world video content. The benchmark aims to guide and motivate the development of more capable video understanding models.
arXiv Detail & Related papers (2025-01-06T11:57:38Z)
- MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations [85.85596165472663]
We build MotionBank, which comprises 13 video action datasets, 1.24M motion sequences, and 132.9M frames of natural and diverse human motions.
Our MotionBank is beneficial for general motion-related tasks of human motion generation, motion in-context generation, and motion understanding.
arXiv Detail & Related papers (2024-10-17T17:31:24Z)
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models [51.712700398020075]
We propose a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context.
This is realized by a two-stream SlowFast input design that lets Video LLMs aggregate features from sampled frames effectively, as sketched below.
Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
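The two-stream input can be sketched as pure token bookkeeping, which is consistent with the method being training-free: the slow path keeps full spatial tokens from a few frames, while the fast path keeps many frames but pools their tokens. Frame strides and pooling sizes below are illustrative assumptions, not SF-LLaVA's exact settings.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats, slow_stride=8, fast_pool=4):
    # frame_feats: (T, N, D) per-frame visual tokens, N = H*W patches.
    T, N, D = frame_feats.shape
    side = int(N ** 0.5)
    # Slow stream: every `slow_stride`-th frame, all spatial tokens kept.
    slow = frame_feats[::slow_stride].reshape(-1, D)
    # Fast stream: all frames, spatially pooled to far fewer tokens each.
    grid = frame_feats.transpose(1, 2).reshape(T, D, side, side)
    fast = F.avg_pool2d(grid, fast_pool).flatten(2).transpose(1, 2)
    return torch.cat([slow, fast.reshape(-1, D)], dim=0)  # (tokens, D)

toks = slowfast_tokens(torch.randn(32, 576, 1024))
print(toks.shape)  # 4*576 slow + 32*36 fast = 3456 tokens
```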
arXiv Detail & Related papers (2024-07-22T17:58:04Z)
- Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
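A rough sketch of frequency-domain motion regularization in this spirit: compare the motion signals of two videos in Fourier space, weighting low frequencies (whole-frame global motion) more heavily. The frame-difference motion proxy, the loss form, and the weighting below are all assumptions, not SMA's actual objective.

```python
import torch

def spectral_motion_loss(video_a, video_b, low_freq_weight=2.0):
    # video_*: (T, C, H, W); frame differences serve as a crude motion proxy.
    motion_a = video_a[1:] - video_a[:-1]
    motion_b = video_b[1:] - video_b[:-1]
    # Centered magnitude spectra of each motion frame.
    spec_a = torch.fft.fftshift(torch.fft.fft2(motion_a), dim=(-2, -1)).abs()
    spec_b = torch.fft.fftshift(torch.fft.fft2(motion_b), dim=(-2, -1)).abs()
    # Up-weight the low-frequency band at the center of the shifted spectrum,
    # which corresponds to whole-frame (global) motion.
    T, C, H, W = spec_a.shape
    weight = torch.ones(H, W)
    weight[H // 4: 3 * H // 4, W // 4: 3 * W // 4] = low_freq_weight
    return (weight * (spec_a - spec_b) ** 2).mean()

loss = spectral_motion_loss(torch.randn(8, 3, 64, 64),
                            torch.randn(8, 3, 64, 64))
print(loss.item())
```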
arXiv Detail & Related papers (2024-03-22T14:47:18Z)
- Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning [16.094271750354835]
Motion information is critical to a robust and generalized video representation.
Recent works have adopted frame difference as the source of motion information in video contrastive learning.
We present a framework capable of introducing well-aligned and significant motion information.
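Frame difference is attractive as a motion source because it is essentially free to compute. A minimal sketch of the general recipe, assuming a standard InfoNCE contrastive setup in which the RGB clip and its frame-difference view form a positive pair; the encoders are stubbed out with random embeddings.

```python
import torch
import torch.nn.functional as F

def frame_difference(clip):
    # clip: (B, T, C, H, W) -> (B, T-1, C, H, W) cheap motion view.
    return clip[:, 1:] - clip[:, :-1]

def info_nce(z_rgb, z_motion, temperature=0.1):
    z_rgb = F.normalize(z_rgb, dim=-1)
    z_motion = F.normalize(z_motion, dim=-1)
    logits = z_rgb @ z_motion.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(len(z_rgb))            # positives on the diagonal
    return F.cross_entropy(logits, labels)

clip = torch.randn(4, 8, 3, 32, 32)
motion = frame_difference(clip)
# Stand-in embeddings; a real setup would encode `clip` and `motion` here.
loss = info_nce(torch.randn(4, 128), torch.randn(4, 128))
print(motion.shape, loss.item())
```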
arXiv Detail & Related papers (2023-09-01T07:03:27Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from fully decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Representation Learning for Compressed Video Action Recognition via Attentive Cross-modal Interaction with Motion Enhancement [28.570085937225976]
This paper proposes a novel framework, the Attentive Cross-modal Interaction Network with Motion Enhancement (MEACI-Net). It follows a two-stream architecture: one stream for the RGB modality and the other for the motion modality.
Experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks demonstrate the effectiveness and efficiency of MEACI-Net.
arXiv Detail & Related papers (2022-05-07T06:26:49Z)
- MotionSqueeze: Neural Motion Feature Learning for Video Understanding [46.82376603090792]
Motion plays a crucial role in understanding videos and most state-of-the-art neural models for video classification incorporate motion information.
In this work, we replace external and heavy computation of optical flows with internal and light-weight learning of motion features.
We demonstrate that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost.
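The core of learning motion features internally, rather than precomputing optical flow, is correlating adjacent frames' feature maps over a small displacement window. The sketch below uses a hard argmax over the correlation volume; MotionSqueeze itself uses a learned, differentiable variant, so treat this as the underlying idea only.

```python
import torch
import torch.nn.functional as F

def local_correlation_motion(feat_t, feat_t1, max_disp=3):
    # feat_*: (B, C, H, W) features of two adjacent frames.
    B, C, H, W = feat_t.shape
    pad = F.pad(feat_t1, [max_disp] * 4)
    scores = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = pad[:, :, dy:dy + H, dx:dx + W]
            scores.append((feat_t * shifted).sum(1))   # (B, H, W) similarity
    corr = torch.stack(scores, dim=1)                  # (B, K*K, H, W)
    best = corr.argmax(dim=1)                          # best matching offset
    k = 2 * max_disp + 1
    flow_y = best // k - max_disp                      # decode displacement
    flow_x = best % k - max_disp
    return torch.stack([flow_x, flow_y], dim=1).float()  # (B, 2, H, W)

flow = local_correlation_motion(torch.randn(1, 64, 14, 14),
                                torch.randn(1, 64, 14, 14))
print(flow.shape)  # torch.Size([1, 2, 14, 14])
```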
arXiv Detail & Related papers (2020-07-20T08:30:14Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
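A minimal sketch of an asymmetric two-stream block in this spirit: motion features produce a spatial gate that modulates appearance features, while the motion stream passes through unchanged. The gating form below is an assumption, not MAT's exact design.

```python
import torch
import torch.nn as nn

class AsymmetricMotionAttention(nn.Module):
    """Hypothetical asymmetric block: motion gates appearance, not vice versa."""

    def __init__(self, channels):
        super().__init__()
        # Motion features produce a single-channel spatial saliency gate.
        self.gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, appearance, motion):
        # appearance, motion: (B, C, H, W) from the two encoder streams.
        attn = self.gate(motion)        # (B, 1, H, W) motion saliency
        enhanced = appearance * attn    # motion-gated appearance features
        # Asymmetry: appearance is updated, motion is left untouched.
        return appearance + enhanced, motion

app, mot = AsymmetricMotionAttention(64)(torch.randn(1, 64, 28, 28),
                                         torch.randn(1, 64, 28, 28))
print(app.shape, mot.shape)
```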
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.