Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- URL: http://arxiv.org/abs/2306.02858v4
- Date: Wed, 25 Oct 2023 06:23:31 GMT
- Title: Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- Authors: Hang Zhang, Xin Li, Lidong Bing
- Abstract summary: Video-LLaMA is a framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video.
Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual and audio encoders and the frozen LLMs.
We found Video-LLaMA shows the ability to perceive and comprehend video content and generate meaningful responses.
- Score: 61.80870130860662
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Video-LLaMA, a multi-modal framework that empowers Large Language
Models (LLMs) with the capability of understanding both visual and auditory
content in the video. Video-LLaMA bootstraps cross-modal training from the
frozen pre-trained visual and audio encoders and the frozen LLMs. Unlike
previous works that complement LLMs to process the visual or audio signals
only, Video-LLaMA enables video comprehension by tackling two challenges: (1)
capturing the temporal changes in visual scenes, (2) integrating audio-visual
signals. To address the first challenge, we propose a Video Q-former that
assembles frame-level features from a pre-trained image encoder into a
video-level representation and introduce a video-to-text generation task to
learn video-language correspondence. For the
second challenge, we leverage ImageBind, a universal embedding model aligning
multiple modalities, as the pre-trained audio encoder and introduce an Audio
Q-former on top of ImageBind to learn reasonable auditory query embeddings for
the LLM module. To align the output of both visual and audio encoders with
LLM's embedding space, we first train Video-LLaMA on massive
video/image-caption pairs and then tune our model with visual-instruction
datasets of moderate amount but higher quality. We found Video-LLaMA shows the
ability to perceive and comprehend video content and generate meaningful
responses grounded in the visual and auditory information presented in the
videos.
Related papers
- Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner [53.671484175063995]
Video-LLMs are pre-trained to process short videos, limiting their broader application for understanding longer video content.
We introduce an alternative video token rearrangement technique that circumvents limitations imposed by the fixed video encoder and alignment projector.
arXiv Detail & Related papers (2024-09-19T17:59:55Z)
- video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models [27.54879344983513]
video-SALMONN can understand not only visual frame sequences, audio events, and music, but also speech.
It demonstrates remarkable video comprehension and reasoning abilities on tasks that other audio-visual LLMs do not address.
arXiv Detail & Related papers (2024-06-22T01:36:11Z)
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs [55.82090875098132]
VideoLLaMA 2 is a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks.
VideoLLaMA 2 consistently achieves competitive results among open-source models and even approaches some proprietary models on several benchmarks.
arXiv Detail & Related papers (2024-06-11T17:22:23Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- VideoPoet: A Large Language Model for Zero-Shot Video Generation [78.57171527944774]
VideoPoet is a language model capable of synthesizing high-quality video with matching audio.
VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs.
arXiv Detail & Related papers (2023-12-21T18:46:41Z)
- Audio-Visual LLM for Video Understanding [25.963166809113005]
This paper presents Audio-Visual LLM, a Multimodal Large Language Model that takes both visual and auditory inputs for holistic video understanding.
We introduce a high-quality video instruction dataset, derived from GPT-4.
Experiments demonstrate that Audio-Visual LLM impressively achieves strong zero-shot results across a range of video understanding tasks.
arXiv Detail & Related papers (2023-12-11T02:50:46Z)
- Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models [25.660343393359565]
This paper proposes a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal large language models (LLMs).
FAVOR simultaneously perceives speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level.
An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.
arXiv Detail & Related papers (2023-10-09T17:00:20Z)
- VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z)