TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
- URL: http://arxiv.org/abs/2504.09641v1
- Date: Sun, 13 Apr 2025 16:32:49 GMT
- Title: TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
- Authors: Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang
- Abstract summary: We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources. We present the small-scale video reasoning model TinyLLaVA-Video-R1.
- Score: 7.818698554631196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing works are based on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. Therefore, we present the small-scale video reasoning model TinyLLaVA-Video-R1. Based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after using reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. It is available at https://github.com/ZhangXJ199/TinyLLaVA-Video-R1.
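The abstract does not detail the reward design used during reinforcement learning on general Video-QA data. As an illustrative sketch only (the `<think>`/`<answer>` tag convention, exact-match rule, and equal weights below are assumptions, not the authors' implementation), rule-based rewards for this kind of training are often composed of a format reward that checks for explicit thinking/answer tags plus an accuracy reward against the reference answer:

```python
import re

# Illustrative sketch of a rule-based reward for RL on Video-QA completions.
# The tag convention, exact-match rule, and 0.5/0.5 weights are assumptions
# for this example, not TinyLLaVA-Video-R1's actual implementation.

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps reasoning and answer in explicit tags."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the text inside <answer> matches the reference answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == reference.strip().lower() else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Equal weighting is an arbitrary choice for illustration.
    return 0.5 * format_reward(completion) + 0.5 * accuracy_reward(completion, reference)

if __name__ == "__main__":
    sample = "<think>The person lifts a cup and drinks from it.</think> <answer>drinking</answer>"
    print(total_reward(sample, "drinking"))  # -> 1.0
```

In GRPO-style training, several completions are typically sampled per question and such scalar rewards are normalized within the group to form advantages for the policy update.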
Related papers
- PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding [126.15907330726067]
We build a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding.
We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps.
arXiv Detail & Related papers (2025-04-17T17:59:56Z)
- Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
Embodied-R is a framework combining large-scale Vision-Language Models for perception and small-scale Language Models for reasoning.
After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models.
Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
arXiv Detail & Related papers (2025-04-17T06:16:11Z)
- Video-R1: Reinforcing Video Reasoning in MLLMs [27.99261687064233]
Video-R1 is the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models. We first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process.
arXiv Detail & Related papers (2025-03-27T17:59:51Z)
- TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding [10.92767902813594]
We present TinyLLaVA-Video, a video understanding model with no more than 4B parameters that processes video sequences in a simple manner. We validate the effectiveness of this framework through experiments, with the best model achieving performance comparable to certain existing 7B models. The code and training recipes are fully open source, with all components and training data publicly available.
arXiv Detail & Related papers (2025-01-26T13:10:12Z)
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos [119.35107657321902]
This work explores whether a deep generative model can learn complex knowledge solely from visual input. We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks.
arXiv Detail & Related papers (2025-01-16T18:59:10Z)
- Apollo: An Exploration of Video Understanding in Large Multimodal Models [65.06400672040836]
We present a study that helps uncover what effectively drives video understanding in Large Multimodal Models. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with a 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU and 63.3 on Video-MME.
arXiv Detail & Related papers (2024-12-13T18:53:24Z)
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
- Language Model Guided Interpretable Video Action Reasoning [32.999621421295416]
We present a new framework named Language-guided Interpretable Action Recognition (LaIAR).
LaIAR leverages knowledge from language models to enhance both the recognition capabilities and the interpretability of video models.
In essence, we redefine the problem of understanding video model decisions as a task of aligning video and language models.
arXiv Detail & Related papers (2024-04-02T02:31:13Z)
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward [118.65089648651308]
This paper introduces a novel framework that utilizes detailed video captions as a proxy for video content.
We show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video Question Answering (QA) tasks.
arXiv Detail & Related papers (2024-04-01T17:28:16Z)
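The DPO entry above describes turning a language-model reward into preference training. The sketch below shows only the standard DPO objective over chosen/rejected responses; forming the preference pairs from a caption-conditioned reward, as that paper describes, is not reproduced, and the value of beta is an assumption:

```python
import torch
import torch.nn.functional as F

# Standard DPO loss over per-sequence log-probabilities. How the
# (chosen, rejected) pairs are selected with a caption-based reward is
# outside this sketch; beta = 0.1 is an assumed value for illustration.

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (chosen log-ratio minus rejected log-ratio))."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy per-sequence log-probabilities for a single preference pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # smaller when the policy favors the chosen response more than the reference model does
```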