TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding
- URL: http://arxiv.org/abs/2501.15513v1
- Date: Sun, 26 Jan 2025 13:10:12 GMT
- Title: TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding
- Authors: Xingjian Zhang, Xi Weng, Yihao Yue, Zhaoxin Fan, Wenjun Wu, Lei Huang
- Abstract summary: We present TinyLLaVA-Video, a video understanding model with no more than 4B parameters that processes video sequences in a simple manner. We validate the effectiveness of this framework through experiments, with the best model achieving performance comparable to certain existing 7B models. The code and training recipes are fully open source, with all components and training data publicly available.
- Score: 10.92767902813594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present TinyLLaVA-Video, a video understanding model with no more than 4B parameters that processes video sequences in a simple manner, without complex architectures, and supports both fps sampling and uniform frame sampling. Our model is modular and scalable, allowing training and inference with limited computational resources and enabling users to replace components as needed. We validate the effectiveness of this framework through experiments, with the best model achieving performance comparable to certain existing 7B models on multiple video understanding benchmarks. The code and training recipes are fully open source, with all components and training data publicly available. We hope this work can serve as a baseline for practitioners exploring small-scale multimodal models for video understanding. It is available at https://github.com/ZhangXJ199/TinyLLaVA-Video.
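The abstract notes support for both fps sampling and uniform frame sampling. As a rough illustration of the difference between the two strategies, here is a minimal sketch in Python; the function names and the fixed frame budget are illustrative assumptions, not TinyLLaVA-Video's actual API.

```python
# Minimal sketch of the two frame-sampling strategies mentioned in the abstract.
# Function names and the fixed frame budget are illustrative assumptions,
# not the actual TinyLLaVA-Video implementation.
import numpy as np


def uniform_sample(num_video_frames: int, num_samples: int = 16) -> np.ndarray:
    """Pick `num_samples` frame indices spread evenly across the whole video."""
    return np.linspace(0, num_video_frames - 1, num_samples).round().astype(int)


def fps_sample(num_video_frames: int, video_fps: float,
               target_fps: float = 1.0, max_samples: int = 16) -> np.ndarray:
    """Pick roughly one frame every 1/target_fps seconds, capped at max_samples."""
    step = max(int(round(video_fps / target_fps)), 1)
    indices = np.arange(0, num_video_frames, step)
    if len(indices) > max_samples:
        # Fall back to an even spread if the clip is long.
        indices = np.linspace(0, num_video_frames - 1, max_samples).round().astype(int)
    return indices


if __name__ == "__main__":
    # A 10-second clip at 30 fps (300 frames).
    print(uniform_sample(300))            # 16 evenly spaced indices
    print(fps_sample(300, video_fps=30))  # ~1 frame per second, capped at 16
```

Uniform sampling spends the frame budget evenly regardless of clip length, while fps sampling ties it to wall-clock time and only falls back to an even spread once the budget is exhausted.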
Related papers
- ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts [56.75723197779384]
ARC-Hunyuan-Video is a multimodal model that processes visual, audio, and textual signals end-to-end for structured comprehension. Our model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning.
arXiv Detail & Related papers (2025-07-28T15:52:36Z) - APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval [41.81696346270799]
Current multimodal large language models (MLLMs) struggle with hour-level video understanding. We propose Adaptive Pivot Visual information Retrieval (APVR), a training-free framework that hierarchically retrieves and retains sufficient and important visual information.
arXiv Detail & Related papers (2025-06-05T12:27:10Z) - An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes [85.00111442236499]
This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos (420s at 1 fps on average) thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline that employs a fixed partitioning strategy by up to 8.72 in accuracy.
arXiv Detail & Related papers (2025-04-21T17:57:21Z) - TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning [7.818698554631196]
We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources.
We present the small-scale video reasoning model TinyLLaVA-Video-R1.
arXiv Detail & Related papers (2025-04-13T16:32:49Z) - PAVE: Patching and Adapting Video Large Language Models [10.252884653843344]
We present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model; a minimal sketch of this idea appears after the list. PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models.
arXiv Detail & Related papers (2025-03-25T16:02:37Z) - Pretrained Image-Text Models are Secretly Video Captioners [38.66202065611397]
We find that an image-based model can be repurposed to outperform several specialised video captioning systems.
Our adapted model demonstrates top-tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD, and 3rd on VATEX.
From a resource optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning.
arXiv Detail & Related papers (2025-02-19T01:53:03Z) - Apollo: An Exploration of Video Understanding in Large Multimodal Models [65.06400672040836]
We present a study that helps uncover what effectively drives video understanding in Large Multimodal Models. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with a score of 55.1 on LongVideoBench. Apollo-7B is state-of-the-art among 7B LMMs, scoring 70.9 on MLVU and 63.3 on Video-MME.
arXiv Detail & Related papers (2024-12-13T18:53:24Z) - TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal content.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z) - Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z) - ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free, plug-and-play method for generative video models.
We transform a video model into a self-cascaded video diffusion model with the designed hidden state correction modules.
Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion. We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z) - Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple: we randomly drop input video patches and mask out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performance, comparable to some heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z) - Revealing Single Frame Bias for Video-and-Language Learning [115.01000652123882]
We show that a single-frame trained model can achieve better performance than existing methods that use multiple frames for training.
This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets.
We propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling.
arXiv Detail & Related papers (2022-06-07T16:28:30Z) - A strong baseline for image and video quality assessment [4.73466728067544]
We present a simple yet effective unified model for perceptual quality assessment of images and videos.
Our model achieves comparable performance using only one global feature derived from a backbone network.
Based on the proposed architecture, we release models well trained for three common real-world scenarios.
arXiv Detail & Related papers (2021-11-13T12:24:08Z)
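As referenced in the PAVE entry above, the sketch below illustrates the general idea of a lightweight side-channel adapter ("patch") added to a frozen base model. It is written from the abstract's one-sentence description only; the module structure, dimensions, and names are assumptions, not PAVE's actual design.

```python
# Minimal sketch of a lightweight adapter ("patch") fused into a frozen base model,
# in the spirit of the PAVE entry above. Structure, names, and sizes are
# illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class SideChannelPatch(nn.Module):
    """Small bottleneck adapter that fuses a side-channel signal into hidden states."""

    def __init__(self, hidden_dim: int, side_dim: int, bottleneck: int = 64):
        super().__init__()
        self.project_side = nn.Linear(side_dim, bottleneck)
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)
        self.act = nn.GELU()
        # Zero-init the up-projection so the patch starts as an identity residual.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor, side: torch.Tensor) -> torch.Tensor:
        # Residual update: the base model's hidden states pass through unchanged,
        # plus a small learned correction conditioned on the side-channel signal.
        update = self.up(self.act(self.down(hidden) + self.project_side(side)))
        return hidden + update


if __name__ == "__main__":
    patch = SideChannelPatch(hidden_dim=1024, side_dim=256)
    hidden = torch.randn(2, 16, 1024)  # (batch, tokens, hidden_dim) from a frozen LLM layer
    side = torch.randn(2, 16, 256)     # e.g. audio or 3D features aligned to the tokens
    out = patch(hidden, side)
    print(out.shape)                   # torch.Size([2, 16, 1024])
    # Only the patch's few parameters would be trained; the base model stays frozen.
    print(sum(p.numel() for p in patch.parameters()))
```

The residual bottleneck with a zero-initialized up-projection means the patched model initially behaves exactly like the base model, so only a small number of added parameters need to be trained for the downstream task.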