Valley: Video Assistant with Large Language model Enhanced abilitY
- URL: http://arxiv.org/abs/2306.07207v2
- Date: Sun, 8 Oct 2023 09:49:53 GMT
- Title: Valley: Video Assistant with Large Language model Enhanced abilitY
- Authors: Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu,
Tao Wang, Linmei Hu, Minghui Qiu, Zhongyu Wei
- Abstract summary: We introduce Valley, a Video Assistant with Large Language model Enhanced abilitY.
To empower Valley with video comprehension and instruction-following capabilities, we construct a video instruction dataset.
We employ ChatGPT to facilitate the construction of task-oriented conversation data.
- Score: 41.79449203718827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs), with their remarkable conversational
capabilities, have demonstrated impressive performance across various
applications and have emerged as formidable AI assistants. In view of this, an
intuitive question arises: can we harness the power of LLMs to build
multimodal AI assistants for visual applications? Recently, several multi-modal
models have been developed for this purpose. They typically pre-train an
adaptation module to align the semantics of the vision encoder and language
model, followed by fine-tuning on instruction-following data. However, despite
the success of this pipeline in image and language understanding, its
effectiveness in joint video and language understanding has not been widely
explored. In this paper, we aim to develop a novel multi-modal foundation model
capable of comprehending video, image, and language within a general framework.
To achieve this goal, we introduce Valley, a Video Assistant with Large
Language model Enhanced abilitY. Valley consists of an LLM, a temporal
modeling module, a visual encoder, and a simple projection module designed to
bridge visual and textual modes. To empower Valley with video comprehension and
instruction-following capabilities, we construct a video instruction dataset
and adopt a two-stage tuning procedure to train it. Specifically, we employ
ChatGPT to facilitate the construction of task-oriented conversation data
encompassing various tasks, including multi-shot captions, long video
descriptions, action recognition, causal relationship inference, etc.
Subsequently, we adopt a pre-training-then-instruction-tuning pipeline to align
visual and textual modalities and improve the instruction-following capability
of Valley. Qualitative experiments demonstrate that Valley has the potential to
function as a highly effective video assistant that can make complex video
understanding scenarios easy.
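
The architecture described in the abstract (a visual encoder, a temporal modeling module, a simple projection module, and an LLM) can be summarized with a minimal sketch. Everything concrete below is an illustrative assumption rather than the paper's actual implementation: the feature dimensions, mean pooling as the temporal module, and the HuggingFace-style `inputs_embeds` call are all placeholders.

```python
import torch
import torch.nn as nn

class ValleyStyleAssistant(nn.Module):
    """Minimal sketch of a Valley-style video assistant.

    Assumptions (not from the paper): a frozen CLIP-like visual encoder
    producing 1024-d frame features, mean pooling as the temporal modeling
    module, a single linear projection, and a generic causal LLM whose
    token embeddings are 4096-d.
    """

    def __init__(self, visual_encoder, llm, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.visual_encoder = visual_encoder      # e.g. a frozen ViT-style encoder
        self.temporal_pool = lambda x: x.mean(1)  # placeholder temporal module
        self.projection = nn.Linear(vis_dim, llm_dim)
        self.llm = llm                            # any causal LM accepting inputs_embeds

    def forward(self, frames, text_embeds):
        # frames: (batch, num_frames, C, H, W); text_embeds: (batch, seq, llm_dim)
        b, t = frames.shape[:2]
        feats = self.visual_encoder(frames.flatten(0, 1))        # (b*t, vis_dim)
        feats = feats.view(b, t, -1)                              # (b, t, vis_dim)
        video_token = self.projection(self.temporal_pool(feats))  # (b, llm_dim)
        # Prepend the projected video token to the text sequence so the LLM
        # attends over both modalities.
        inputs = torch.cat([video_token.unsqueeze(1), text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```

Under the two-stage recipe described above, a first stage would train only the projection on video-text pairs while the encoder and LLM stay frozen, and the second stage would instruction-tune the model on the ChatGPT-constructed conversation data.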
Related papers
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events (a minimal sketch of such a matching step is given after this list).
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
- VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality and Semantic Translator, which converts inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z)
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective for joint learning of these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling [7.737755720567113]
This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model.
We design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules.
We also propose a new pretraining task named Multiple Choice Modeling.
arXiv Detail & Related papers (2023-03-10T05:22:39Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
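
As a rough illustration of the Hungarian-matching step mentioned in the VidCoM entry above, the sketch below pairs decomposed instruction fragments with video-event embeddings by solving a linear assignment over a cosine-distance cost matrix. The embedding inputs, the cosine cost, and the function name are assumptions chosen for illustration only; the actual InsOVER algorithm is defined in the VidCoM paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instructions_to_events(instr_embs: np.ndarray, event_embs: np.ndarray):
    """Pair instruction fragments with video events via the Hungarian algorithm.

    instr_embs: (n_instr, d) embeddings of decomposed instruction fragments.
    event_embs: (n_events, d) embeddings of detected video events.
    Returns a list of (instruction_index, event_index) pairs.
    Cosine distance as the assignment cost is an illustrative assumption.
    """
    a = instr_embs / (np.linalg.norm(instr_embs, axis=1, keepdims=True) + 1e-8)
    b = event_embs / (np.linalg.norm(event_embs, axis=1, keepdims=True) + 1e-8)
    cost = 1.0 - a @ b.T                      # cosine-distance cost matrix
    rows, cols = linear_sum_assignment(cost)  # minimum-cost bipartite matching
    return list(zip(rows.tolist(), cols.tolist()))
```

With n instruction fragments and m events, the call returns min(n, m) one-to-one pairs that minimize the total cosine distance.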