Apollo: An Exploration of Video Understanding in Large Multimodal Models
- URL: http://arxiv.org/abs/2412.10360v1
- Date: Fri, 13 Dec 2024 18:53:24 GMT
- Title: Apollo: An Exploration of Video Understanding in Large Multimodal Models
- Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
- Abstract summary: We present a study that helps uncover what effectively drives video understanding in Large Multimodal Models.
Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models, scoring 55.1 on LongVideoBench.
Apollo-7B is state-of-the-art among 7B LMMs, scoring 70.9 on MLVU and 63.3 on Video-MME.
- Abstract: Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explore many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrate that fps sampling during training is vastly preferable to uniform frame sampling, and identify which vision encoders are best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art among 7B LMMs, scoring 70.9 on MLVU and 63.3 on Video-MME.
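The abstract's headline training finding, fps sampling versus uniform frame sampling, comes down to whether the sampler fixes the frame count or the temporal density. A minimal sketch of the two strategies for intuition (function names and the `max_frames` cap are illustrative, not from the paper):

```python
import numpy as np

def uniform_sample(num_frames: int, num_samples: int) -> np.ndarray:
    """Pick a fixed number of evenly spaced frames, regardless of duration."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def fps_sample(num_frames: int, video_fps: float,
               target_fps: float = 1.0, max_frames: int | None = None) -> np.ndarray:
    """Pick frames at a fixed temporal rate, so short and long videos
    are sampled at the same density (optionally capped at max_frames)."""
    idx = np.arange(0, num_frames, video_fps / target_fps).round().astype(int)
    if max_frames is not None and len(idx) > max_frames:
        idx = idx[np.linspace(0, len(idx) - 1, max_frames).round().astype(int)]
    return idx

# A 60 s clip at 30 fps: uniform sampling of 8 frames spaces them ~8.6 s apart,
# while 1 fps sampling keeps one frame per second (60 frames).
print(uniform_sample(1800, 8))
print(fps_sample(1800, video_fps=30.0))
```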
Related papers
- A Benchmark for Crime Surveillance Video Analysis with Large Models
Anomaly analysis in surveillance videos is a crucial topic in computer vision.
In recent years, multimodal large language models (MLLMs) have outperformed task-specific models in various domains.
We propose a benchmark for crime surveillance video analysis with large models denoted as UCVL.
arXiv Detail & Related papers (2025-02-13T13:38:17Z)
- TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding
We present TinyLLaVA-Video, a video understanding model with at most 4B parameters that processes video sequences in a simple manner.
We validate the effectiveness of this framework through experiments, with the best model achieving performance comparable to certain existing 7B models.
The code and training recipes are fully open source, with all components and training data publicly available.
arXiv Detail & Related papers (2025-01-26T13:10:12Z)
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
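The summary does not describe how tokens are selected. One generic way to realize prompt-guided token reduction is to score visual tokens against the text-prompt embedding and keep only the top-scoring ones; the sketch below shows that general idea and is not necessarily Free Video-LLM's exact procedure (all names are hypothetical):

```python
import torch
import torch.nn.functional as F

def prune_visual_tokens(visual_tokens: torch.Tensor,     # (num_tokens, dim)
                        prompt_embedding: torch.Tensor,  # (dim,)
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the visual tokens most similar to the prompt embedding."""
    scores = F.cosine_similarity(visual_tokens, prompt_embedding.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    keep = scores.topk(k).indices.sort().values  # preserve temporal order
    return visual_tokens[keep]
```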
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
This study focuses on designing an efficient and effective model for long-term video understanding.
We propose to process videos in an online manner and store past video information in a memory bank.
Our model achieves state-of-the-art performance across multiple datasets.
arXiv Detail & Related papers (2024-04-08T17:59:24Z)
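A minimal sketch of the online memory-bank idea described above: frame features arrive one at a time, and when the bank exceeds a fixed capacity, the two most similar adjacent entries are merged by averaging. The merging rule here is a common compression heuristic and may differ from MA-LMM's exact mechanism:

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Fixed-capacity store of per-frame features, filled online."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.features: list[torch.Tensor] = []  # each entry has shape (dim,)

    def add(self, feat: torch.Tensor) -> None:
        self.features.append(feat)
        if len(self.features) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        # Merge the most similar adjacent pair, preserving temporal order.
        sims = torch.stack([F.cosine_similarity(a, b, dim=0)
                            for a, b in zip(self.features, self.features[1:])])
        i = int(sims.argmax())
        merged = (self.features[i] + self.features[i + 1]) / 2
        self.features[i:i + 2] = [merged]
```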
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
This paper introduces a novel framework that utilizes detailed video captions as a proxy for video content.
We show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video Question Answering (QA) tasks.
arXiv Detail & Related papers (2024-04-01T17:28:16Z)
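The caption-conditioned language-model reward described above is used to rank candidate answers into preference pairs, which are then optimized with the standard DPO objective. A sketch of that loss (the formula is the standard one from the DPO literature; how the pairs are scored is specific to the paper and not reproduced here):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer the chosen answer over the rejected one,
    measured relative to a frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```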
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.