PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
- URL: http://arxiv.org/abs/2311.13435v2
- Date: Wed, 13 Dec 2023 17:24:10 GMT
- Title: PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
- Authors: Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan
- Abstract summary: We propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, integrating audio cues by transcribing them into text to enrich video-context understanding.
Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks.
- Score: 52.83065081926238
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extending image-based Large Multimodal Models (LMMs) to videos is challenging
due to the inherent complexity of video data. Recent approaches that extend
image-based LMMs to videos either lack grounding capabilities (e.g.,
VideoChat, Video-ChatGPT, Video-LLaMA) or do not use audio signals for
better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we
propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability,
integrating audio cues by transcribing them into text to enrich video-context
understanding. Our framework uses an off-the-shelf tracker and a novel
grounding module, enabling it to spatially localize objects in videos following
user instructions. We evaluate PG-Video-LLaVA using video-based generative and
question-answering benchmarks and introduce new benchmarks specifically
designed to measure prompt-based object grounding performance in videos.
Further, we propose using Vicuna instead of GPT-3.5 (as used in
Video-ChatGPT) for video-based conversation benchmarking, which ensures
reproducibility of results, a concern given the proprietary nature of
GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its
advantages to the video domain, delivering promising gains on video-based
conversation and grounding tasks. Project Page:
https://github.com/mbzuai-oryx/Video-LLaVA
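The abstract mentions benchmarks that measure prompt-based object grounding in videos but does not spell out the metric. As an illustration only, the sketch below computes a per-frame intersection-over-union (IoU) between predicted and reference boxes and averages it over a clip, which is one common way to score box-level grounding. The `(x1, y1, x2, y2)` box format and the function names `box_iou` and `mean_video_iou` are assumptions made for this example, not the paper's evaluation code.

```python
from typing import Sequence, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_video_iou(pred: Sequence[Box], gt: Sequence[Box]) -> float:
    """Average the per-frame IoU over frames that have a reference box."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain one box per frame")
    if not pred:
        return 0.0
    return sum(box_iou(p, g) for p, g in zip(pred, gt)) / len(pred)


if __name__ == "__main__":
    predicted = [(0, 0, 2, 2), (0, 0, 2, 2)]
    reference = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(mean_video_iou(predicted, reference))
```

A tracker-based pipeline would supply one predicted box per frame for the object named in the prompt; frames without a reference annotation would need to be filtered out before calling `mean_video_iou`.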
Related papers
- Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension [83.00346826110041]
Video-RAG is a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to help facilitate cross-modality alignment.
Our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.
arXiv Detail & Related papers (2024-11-20T07:44:34Z)
- VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos [58.765796160750504]
VideoGLaMM is a new model for fine-grained pixel-level grounding in videos based on user-provided textual inputs.
The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions.
Experimental results show that our model consistently outperforms existing approaches across all three tasks.
arXiv Detail & Related papers (2024-11-07T17:59:27Z)
- PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance [44.08446730529495]
We propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation.
Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short.
arXiv Detail & Related papers (2024-11-04T17:50:36Z)
- Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model [62.38322742493649]
We build a video VQA benchmark covering editing categories, i.e., effect, funny, meme, and game.
Most of the open-source video LMMs perform poorly on the benchmark, indicating a huge domain gap between edited short videos on social media and regular raw videos.
To improve the generalization ability of LMMs, we collect a training set for the proposed benchmark based on both Panda-70M/WebVid raw videos and small-scale TikTok/CapCut edited videos.
arXiv Detail & Related papers (2024-06-15T03:28:52Z)
- VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding [15.959757105308238]
Video LMMs rely on either image or video encoders to process visual inputs, each of which has its own limitations.
We introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling).
Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering.
arXiv Detail & Related papers (2024-06-13T17:59:59Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [59.525108086957296]
Video-ChatGPT is a multimodal model that merges a video-adapted visual encoder with an LLM.
It is capable of understanding and generating detailed conversations about videos.
We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT.
arXiv Detail & Related papers (2023-06-08T17:59:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.