Fugu-MT Paper Translation (Summary): GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

Paper summary: GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

arxiv url: http://arxiv.org/abs/2311.16511v1
Date: Sat, 25 Nov 2023 04:05:59 GMT
Status: Translation complete
Last updated in system: 2023-11-29 19:45:23.120152
Title: GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
Title (reference translation): GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
Authors: Zhanyu Wang, Longyue Wang, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou, Shuming Shi, Zhaopeng Tu
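Abstract summary: GPT4Video is a unified multi-model framework that enhances large language models with both video understanding and video generation capabilities. Specifically, the authors develop an instruction-following approach integrated with the stable diffusion generative model and demonstrate that it handles video generation scenarios effectively and safely.
Reference score (independently computed attention score): 103.56612788682973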
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: While the recent advances in Multimodal Large Language Models (MLLMs) constitute a significant leap forward in the field, these models are predominantly confined to the realm of input-side multimodal comprehension, lacking the capacity for multimodal content generation. To fill this gap, we present GPT4Video, a unified multi-model framework that empowers Large Language Models (LLMs) with the capability of both video understanding and generation. Specifically, we develop an instruction-following-based approach integrated with the stable diffusion generative model, which has been demonstrated to handle video generation scenarios effectively and securely. GPT4Video offers the following benefits: 1) It exhibits impressive capabilities in both video understanding and generation scenarios. For example, GPT4Video outperforms Valley by 11.8% on the Video Question Answering task, and surpasses NExT-GPT by 2.3% on the Text-to-Video generation task. 2) It endows the LLM/MLLM with video generation capabilities without requiring additional training parameters and can flexibly interface with a wide range of models to perform video generation. 3) It maintains a safe and healthy conversation not only on the output side but also on the input side in an end-to-end manner. Qualitative and quantitative experiments demonstrate that GPT4Video holds the potential to function as an effective, safe, and humanoid-like video assistant that can handle both video understanding and generation scenarios.
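Abstract (reference translation): Recent advances in Multimodal Large Language Models (MLLMs) mark a major step forward for the field, but these models are limited to input-side multimodal comprehension and lack the ability to generate multimodal content. To fill this gap, we propose GPT4Video, a unified multi-model framework that enhances Large Language Models (LLMs) with both video understanding and video generation capabilities. Specifically, we develop an instruction-following approach integrated with the stable diffusion generative model and demonstrate that it can handle video generation scenarios effectively and safely. GPT4Video offers the following advantages. 1) It shows impressive capability in both video understanding and generation scenarios. For example, GPT4Video outperforms Valley by 11.8% on the Video Question Answering task and surpasses NExT-GPT by 2.3% on the Text-to-Video generation task. 2) It equips the LLM/MLLM with video generation capabilities without requiring additional training parameters, and it can flexibly interface with a wide range of models to perform video generation. 3) It maintains a safe and healthy conversation end to end, on the input side as well as the output side. Qualitative and quantitative experiments demonstrate that GPT4Video has the potential to serve as an effective, safe, and humanoid-like video assistant that can handle both video understanding and generation scenarios.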

Related papers