Fugu-MT Paper Translation (Abstract): COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper overview: COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

arxiv url: http://arxiv.org/abs/2401.00849v1
Date: Mon, 1 Jan 2024 18:58:42 GMT
Status: Translation complete
System update date: 2024-01-03 15:35:18.367216
Title: COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Title (reference translation): COSMO: Contrastive Streamlined Multimodal Model with Interleaved Pre-Training
Authors: Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
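Abstract summary: Recent autoregressive vision-language models excel at text generation tasks but face challenges in alignment tasks. This work introduces a contrastive loss into text generation models and partitions the language model into dedicated text processing and adaptive multimodal data handling components. To address the limited availability of high-quality long-text video datasets, it also introduces VideoDatasetName, the first interleaved video-text dataset with comprehensive captions.
Reference score (independently computed attention measure): 119.03392147066093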
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to encompassing extended textual contexts is pivotal. Recent autoregressive vision-language models like \cite{flamingo, palme}, leveraging the long-context capability of Large Language Models, have excelled in few-shot text generation tasks but face challenges in alignment tasks. Addressing this gap, we introduce the contrastive loss into text generation models, presenting the COntrastive-Streamlined MultimOdal framework (COSMO), strategically partitioning the language model into dedicated unimodal text processing and adept multimodal data handling components. COSMO, our unified framework, merges unimodal and multimodal elements, enhancing model performance for tasks involving textual and visual data while notably reducing learnable parameters. However, these models demand extensive long-text datasets, yet the availability of high-quality long-text video datasets remains limited. To bridge this gap, this work introduces \VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions, marking a significant step forward. Demonstrating its impact, we illustrate how \VideoDatasetName{} enhances model performance in image-text tasks. With 34% learnable parameters and utilizing 72% of the available data, our model demonstrates significant superiority over OpenFlamingo~\cite{openflamingo}. For instance, in the 4-shot Flickr captioning task, performance notably improves from 57.2% to 65.\%. The contributions of COSMO and \VideoDatasetName{} are underscored by notable performance gains across 14 diverse downstream datasets encompassing both image-text and video-text tasks.
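Abstract (reference translation): In the evolution of vision-language pre-training, the shift from short-text comprehension to extended textual contexts is pivotal. Recent autoregressive vision-language models such as \cite{flamingo, palme} leverage the long-context capability of large language models and excel at few-shot text generation tasks, but they face challenges in alignment tasks. To address this gap, the work introduces a contrastive loss into text generation models and presents the COntrastive-Streamlined MultimOdal framework (COSMO), which partitions the language model into dedicated unimodal text processing and multimodal data handling components. As a unified framework, COSMO merges unimodal and multimodal elements, improving performance on tasks involving textual and visual data while notably reducing learnable parameters. These models require extensive long-text datasets, however, and high-quality long-text video datasets remain scarce. To bridge this gap, the work introduces \VideoDatasetName, the first interleaved video-text dataset featuring comprehensive captions. To demonstrate its impact, the authors show how \VideoDatasetName{} improves model performance on image-text tasks. With 34% of the learnable parameters and 72% of the available data, the model outperforms OpenFlamingo~\cite{openflamingo}. For example, on the 4-shot Flickr captioning task, performance improves from 57.2% to 65.\%. The contributions of COSMO and \VideoDatasetName{} are supported by notable performance gains across 14 downstream datasets covering both image-text and video-text tasks.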
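
The abstract describes adding a contrastive loss to a text generation model but does not give its exact form. As an illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE loss over paired image and text embeddings, added to a language-modeling loss with a scalar weight. The function names, the temperature of 0.07, and the weighting scheme are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss (assumed form): pair i of images and texts is the positive."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Cosine-similarity logits, image rows against text columns.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = diagonal_cross_entropy(logits)
    text_to_image = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


def combined_loss(lm_loss, image_embs, text_embs, weight=1.0):
    """Hypothetical total objective: language-modeling loss plus weighted contrastive loss."""
    return lm_loss + weight * contrastive_loss(image_embs, text_embs)
```

With correctly paired embeddings the contrastive term is close to zero, and it grows when the pairs are shuffled, which is the alignment signal the abstract refers to.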

Related papers