Fugu-MT paper translation (abstract): Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Paper summary: Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

arxiv url: http://arxiv.org/abs/2306.02858v1
Date: Mon, 5 Jun 2023 13:17:27 GMT
Status: Translation complete
Last updated in system: 2023-06-06 15:02:11.342235
Title: Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Title (reference translation): Video-LLaMA: An Instruction-Tuned Audio-Visual Language Model for Video Understanding
Authors: Hang Zhang, Xin Li, Lidong Bing
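Abstract summary: Video-LLaMA is a framework that equips large language models (LLMs) with the ability to understand both the visual and auditory content of videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and a frozen LLM. Video-LLaMA demonstrates the ability to perceive and comprehend video content and to generate meaningful responses.
Reference score (independently computed attention score): 37.46602744829322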
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous vision-LLMs that focus on static image comprehension, such as MiniGPT-4 (Zhu et al., 2023) and LLaVA (Liu et al., 2023), Video-LLaMA tackles two challenges in video understanding: (1) capturing the temporal changes in visual scenes, and (2) integrating audio-visual signals. For the first challenge, we propose a Video Q-former to extend the pre-trained image encoder to a video encoder and introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind (Girdhar et al., 2023), which performs exceptionally well in aligning different modalities to a common embedding space, as the pre-trained audio encoder. We then introduce an Audio Q-former to learn auditory query tokens. To align the output of both the visual and audio encoders with the LLM's embedding space, we train Video-LLaMA on a large-scale vision caption dataset and a high-quality vision-instruction-tuning dataset. We find that Video-LLaMA is able to perceive and comprehend video content, generating meaningful responses that are grounded in the visual and auditory information present in the videos. This highlights the potential of Video-LLaMA as a promising prototype for audio-visual AI assistants. Our code, pre-trained model, and demo are available at https://github.com/DAMO-NLP-SG/Video-LLaMA.
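Abstract (reference translation): This paper proposes Video-LLaMA, a multi-modal framework that gives large language models (LLMs) the ability to understand both the visual and auditory content of videos. Video-LLaMA performs cross-modal training from frozen pre-trained visual and audio encoders and a frozen LLM. Unlike previous vision-LLMs that focus on static image understanding, such as MiniGPT-4 (Zhu et al., 2023) and LLaVA (Liu et al., 2023), Video-LLaMA addresses two challenges in video understanding. For the first challenge, we propose a Video Q-former that extends the pre-trained image encoder to a video encoder, together with a video-to-text generation task for learning video-language correspondence. For the second challenge, we use ImageBind (Girdhar et al., 2023) as the pre-trained audio encoder, which aligns different modalities to a common embedding space. We then introduce an Audio Q-former to learn auditory query tokens. To align the outputs of the visual and audio encoders with the LLM's embedding space, we train Video-LLaMA on a large-scale vision caption dataset and a high-quality vision-instruction-tuning dataset. Video-LLaMA demonstrates the ability to perceive and understand video content and generates meaningful responses grounded in the visual and auditory information in the videos. This highlights the potential of Video-LLaMA as a promising prototype for audio-visual AI assistants. Our code, pre-trained model, and demo are available at https://github.com/DAMO-NLP-SG/Video-LLaMA.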
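
Illustrative sketch (not the authors' code): the abstract describes a Q-former that turns features from a frozen encoder into query tokens aligned with the LLM's embedding space. The Python below is a minimal, dependency-free sketch of that general idea, assuming single-head cross-attention without learned key/value projections, additive per-frame position embeddings to carry frame order, and a single linear layer into the LLM space. All names (`video_to_llm_tokens`, `cross_attend`, the dimension constants) and all sizes are hypothetical, and random numbers stand in for real encoder outputs and trained weights.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys_values):
    """Single-head cross-attention: each query attends over all frame features."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, kv) / math.sqrt(dim) for kv in keys_values])
        outputs.append([
            sum(w * kv[i] for w, kv in zip(weights, keys_values))
            for i in range(dim)
        ])
    return outputs


def linear(vectors, weight):
    """Project each vector with an (out_dim x in_dim) weight matrix."""
    return [[dot(row, v) for row in weight] for v in vectors]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def video_to_llm_tokens(frame_features, position_emb, queries, projection):
    """Map per-frame features to a fixed number of LLM-space tokens."""
    # Assumption: frame order is injected by adding a position embedding per frame.
    frames = [
        [f + p for f, p in zip(feat, pos)]
        for feat, pos in zip(frame_features, position_emb)
    ]
    attended = cross_attend(queries, frames)
    return linear(attended, projection)


rng = random.Random(0)
NUM_FRAMES, FEAT_DIM, NUM_QUERIES, LLM_DIM = 8, 16, 4, 32  # hypothetical sizes

frame_features = random_matrix(NUM_FRAMES, FEAT_DIM, rng)  # stand-in for frozen encoder output
position_emb = random_matrix(NUM_FRAMES, FEAT_DIM, rng)    # would be trained in a real model
queries = random_matrix(NUM_QUERIES, FEAT_DIM, rng)        # would be trained query tokens
projection = random_matrix(LLM_DIM, FEAT_DIM, rng)         # would be a trained linear layer

video_tokens = video_to_llm_tokens(frame_features, position_emb, queries, projection)
print(len(video_tokens), len(video_tokens[0]))  # prints: 4 32
```

In this sketch the number of output tokens depends only on the number of queries, not on the number of frames, so clips of different lengths yield a fixed-size sequence that could be placed alongside text embeddings. The same shape of computation would apply to an audio branch, with audio segment features in place of frame features.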

List of related papers