MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
- URL: http://arxiv.org/abs/2404.03413v1
- Date: Thu, 4 Apr 2024 12:46:01 GMT
- Title: MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
- Authors: Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny
- Abstract summary: MiniGPT4-Video is a multimodal Large Language Model (LLM) designed specifically for video understanding.
MiniGPT4-Video considers not only visual content but also textual conversations, allowing the model to effectively answer queries involving both visual and text components.
- Score: 36.02433030551474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the complexities of videos. Building upon the success of MiniGPT-v2, which excelled in translating visual features into the LLM space for single images and achieved impressive results on various image-text benchmarks, this paper extends the model's capabilities to process a sequence of frames, enabling it to comprehend videos. MiniGPT4-video does not only consider visual content but also incorporates textual conversations, allowing the model to effectively answer queries involving both visual and text components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks respectively. Our models and code have been made publicly available here https://vision-cair.github.io/MiniGPT4-video/
Related papers
- Pretrained Image-Text Models are Secretly Video Captioners [38.66202065611397]
We find that an image-based model can be repurposed to outperform several specialised video captioning systems.
Our adapted model demonstrates top tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD, and 3rd on VATEX.
From a resource optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning.
arXiv Detail & Related papers (2025-02-19T01:53:03Z)
- TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding [10.92767902813594]
We present TinyLLaVA-Video, a video understanding model with no more than 4B parameters that processes video sequences in a simple manner.
We validate the effectiveness of this framework through experiments, the best model achieving performance comparable to certain existing 7B models.
The code and training recipes are fully open source, with all components and training data publicly available.
arXiv Detail & Related papers (2025-01-26T13:10:12Z)
- Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM [28.64108439552772]
We introduce a large-scale synthetic dataset created from proprietary models.
We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance.
Our proposed model achieves state-of-the-art results across various video tasks and shows impressive generalization.
arXiv Detail & Related papers (2024-12-12T18:20:41Z)
- TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multimodal content.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z)
- CogVLM2: Visual Language Models for Image and Video Understanding [69.361109860391]
We propose the CogVLM2 family, a new generation of visual language models for image and video understanding.
As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages.
As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction.
arXiv Detail & Related papers (2024-08-29T12:59:12Z)
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of 145 words, which is over 10x longer than most video-text datasets.
Vriptor, a model trained on Vript, is capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA).
We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2.
We propose using learnable tokens as a communication medium among these four frozen models.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [27.04277811443469]
Video-LLaVA learns from a mixed dataset of images and videos, with the two modalities mutually enhancing each other.
Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits.
arXiv Detail & Related papers (2023-11-16T10:59:44Z)
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [59.525108086957296]
Video-ChatGPT is a multimodal model that merges a video-adapted visual encoder with an LLM.
It is capable of understanding and generating detailed conversations about videos.
We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT.
arXiv Detail & Related papers (2023-06-08T17:59:56Z)