Breaking the Encoder Barrier for Seamless Video-Language Understanding
- URL: http://arxiv.org/abs/2503.18422v1
- Date: Mon, 24 Mar 2025 08:06:39 GMT
- Title: Breaking the Encoder Barrier for Seamless Video-Language Understanding
- Authors: Handong Li, Yiyuan Zhang, Longteng Guo, Xiangyu Yue, Jing Liu
- Abstract summary: We propose ELVA, an encoder-free Video-LLM that directly models nuanced video-language interactions without relying on a vision encoder. With only 7M publicly available video-text pairs, ELVA achieves performance on par with encoder-based Video-LLMs while reducing FLOPs by up to 95% and inference latency by 92%.
- Score: 22.749949819082484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces resolution biases, and struggles to capture fine-grained multimodal interactions. To overcome these limitations, we propose ELVA, an encoder-free Video-LLM that directly models nuanced video-language interactions without relying on a vision encoder. ELVA employs token merging to construct a bottom-up hierarchical representation and incorporates a video guidance supervisor for direct spatiotemporal representation learning. Additionally, a hybrid-resolution mechanism strategically integrates high- and low-resolution frames as inputs to achieve an optimal balance between performance and efficiency. With only 7M publicly available video-text pairs, ELVA achieves performance on par with encoder-based Video-LLMs while reducing FLOPs by up to 95% and inference latency by 92%, offering a scalable and efficient solution for real-time video understanding.
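The abstract mentions token merging used to build a bottom-up hierarchical representation. The sketch below illustrates only the general idea, under loud assumptions: it merges fixed groups of adjacent tokens by plain averaging, which is a stand-in and not ELVA's actual merging rule (the abstract does not specify it). The names `merge_adjacent` and `build_hierarchy` and the `group` parameter are hypothetical.

```python
from typing import List

Token = List[float]


def merge_adjacent(tokens: List[Token], group: int = 2) -> List[Token]:
    """Average each run of `group` consecutive tokens into a single token.

    Assumption: simple mean pooling; a real model may use a learned or
    similarity-based merge instead.
    """
    merged = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        merged.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return merged


def build_hierarchy(tokens: List[Token], levels: int, group: int = 2) -> List[List[Token]]:
    """Return [level 0, level 1, ...], each level coarser than the one before."""
    hierarchy = [tokens]
    for _ in range(levels):
        if len(hierarchy[-1]) <= 1:
            break  # nothing left to merge
        hierarchy.append(merge_adjacent(hierarchy[-1], group))
    return hierarchy


if __name__ == "__main__":
    # Eight toy 2-dimensional "patch tokens".
    patches = [[float(i), float(i) * 2.0] for i in range(8)]
    for level, toks in enumerate(build_hierarchy(patches, levels=3)):
        print(level, len(toks))
```

Each level halves the token count here (8, 4, 2, 1), which shows in miniature why merging reduces the number of tokens the language model has to process at deeper stages.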
Related papers
- FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding [17.71123451197036]
The complexity of video data and contextual processing limitations still hinder long-video comprehension.
We propose FiLA-Video, a novel framework that integrates multiple frames into a single representation.
FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.
arXiv Detail & Related papers (2025-04-29T03:09:46Z)
- REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling.
Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions.
We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- EVEv2: Improved Baselines for Encoder-Free Vision-Language Models [72.07868838411474]
Existing encoder-free vision-language models (VLMs) are narrowing the performance gap with their encoder-based counterparts.
We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones.
We show that properly and hierarchically associating vision and language within a unified model reduces interference between modalities.
arXiv Detail & Related papers (2025-02-10T18:59:58Z)
- Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models [26.866184981409607]
Current video models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters).
Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders.
Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:56Z)
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding.
During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.
Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- LADDER: An Efficient Framework for Video Frame Interpolation [12.039193291203492]
Video Frame Interpolation (VFI) is a crucial technique in various applications such as slow-motion generation, frame rate conversion, and video frame restoration.
This paper introduces an efficient video frame interpolation framework that aims to strike a favorable balance between efficiency and quality.
arXiv Detail & Related papers (2024-04-17T06:47:17Z)
- VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels.
We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z)
- Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization [96.73647162960842]
Temporal action localization (TAL) is a fundamental yet challenging task in video understanding.
Existing TAL methods rely on pre-training a video encoder through action classification supervision.
We introduce a novel low-fidelity end-to-end (LoFi) video encoder pre-training method.
arXiv Detail & Related papers (2021-03-28T22:18:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.