Fugu-MT 論文翻訳(概要): Lightweight In-Context Tuning for Multimodal Unified Models

論文の概要: Lightweight In-Context Tuning for Multimodal Unified Models

arxiv url: http://arxiv.org/abs/2310.05109v1
Date: Sun, 8 Oct 2023 10:47:24 GMT
ステータス: 翻訳完了
システム内更新日: 2023-10-12 12:23:51.382511
Title: Lightweight In-Context Tuning for Multimodal Unified Models
Title（参考訳）: マルチモーダル統一モデルのための軽量インコンテキストチューニング
Authors: Yixin Chen, Shuai Zhang, Boran Han, Jiaya Jia
Abstract要約: MultiModal In-conteXt Tuning (M$2$IXT)は、マルチモーダル統一モデルのICL機能を強化する軽量モジュールである。最大50Kのマルチモーダルデータをチューニングすると、M$2$IXTは数ショットのICL性能を大幅に向上させることができる。
参考スコア（独自算出の注目度）: 57.10831399642176
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: In-context learning (ICL) involves reasoning from given contextual examples. As more modalities comes, this procedure is becoming more challenging as the interleaved input modalities convolutes the understanding process. This is exemplified by the observation that multimodal models often struggle to effectively extrapolate from contextual examples to perform ICL. To address these challenges, we introduce MultiModal In-conteXt Tuning (M$^2$IXT), a lightweight module to enhance the ICL capabilities of multimodal unified models. The proposed M$^2$IXT module perceives an expandable context window to incorporate various labeled examples of multiple modalities (e.g., text, image, and coordinates). It can be prepended to various multimodal unified models (e.g., OFA, Unival, LLaVA) of different architectures and trained via a mixed-tasks strategy to enable rapid few-shot adaption on multiple tasks and datasets. When tuned on as little as 50K multimodal data, M$^2$IXT can boost the few-shot ICL performance significantly (e.g., 18\% relative increase for OFA), and obtained state-of-the-art results across an array of tasks including visual question answering, image captioning, visual grounding, and visual entailment, while being considerably small in terms of model parameters (e.g., $\sim$$20\times$ smaller than Flamingo or MMICL), highlighting the flexibility and effectiveness of M$^2$IXT as a multimodal in-context learner.
Abstract（参考訳）: In-context Learning (ICL) は、与えられた文脈の例から推論する。より多くのモダリティが現れるにつれて、この手順は、インターリーブされた入力モダリティが理解プロセスに畳み込み、より困難になってきている。これは、マルチモーダルモデルが、iclを実行するために文脈的な例から効果的に外挿するのに苦労しているという観察から示される。これらの課題に対処するために、マルチモーダル統一モデルのICL機能を強化する軽量モジュールであるMultiModal In-conteXt Tuning (M$^2$IXT)を導入する。提案されたM$^2$IXTモジュールは拡張可能なコンテキストウィンドウを認識し、複数のモード(テキスト、画像、座標など)のラベル付きサンプルを組み込む。異なるアーキテクチャの様々なマルチモーダル統一モデル(OFA、Unival、LLaVAなど)に事前適用可能であり、複数のタスクやデータセットに対する迅速な数発の適応を可能にする混合タスク戦略を通じて訓練される。 When tuned on as little as 50K multimodal data, M$^2$IXT can boost the few-shot ICL performance significantly (e.g., 18\% relative increase for OFA), and obtained state-of-the-art results across an array of tasks including visual question answering, image captioning, visual grounding, and visual entailment, while being considerably small in terms of model parameters (e.g., $\sim$$20\times$ smaller than Flamingo or MMICL), highlighting the flexibility and effectiveness of M$^2$IXT as a multimodal in-context learner.

関連論文リスト

$M^3EL$: A Multi-task Multi-topic Dataset for Multi-modal Entity Linking [11.334577756093923]
我々はデータセット構築パイプラインを提案し、MELのための大規模データセットであるM3EL$を発行する。 M3EL$には79,625のインスタンスが含まれ、9つの多様なマルチモーダルタスクと5つのトピックが含まれている。我々のデータセットはこれらの問題に効果的に対処し、$textitCLIP_textitND$モデルに$M3EL$を微調整すると精度が大幅に向上する。
論文参考訳（メタデータ） (2024-10-08T10:52:23Z)
M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
MLLM(Multimodal Large Language Models)は、幅広い領域にわたる顕著なパフォーマンスを示す。本研究では,MLLMの効率的な命令チューニングのための新しいMultimodal Prompt Tuning (M$2$PT) 手法を提案する。
論文参考訳（メタデータ） (2024-09-24T01:40:24Z)
What Makes Multimodal In-Context Learning Work? [58.48612721156335]
本稿では,M-ICL(Multimodal ICL)を大規模マルチモーダルモデルで検討するための枠組みを提案する。 M-ICLは主にテキスト駆動機構に依存しており、画像のモダリティからはほとんど影響を受けない。我々は、M-ICLのいくつかのバイアスと限界を特定し、デプロイメント前に考慮することを保証している。
論文参考訳（メタデータ） (2024-04-24T08:50:45Z)
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) は、異なるモデルに転送可能な視覚的プロンプトを生成するためのシンプルで効果的なアプローチである。本稿では,既存の視覚的プロンプト手法のクロスモデル特徴劣化問題に対処し,学習したプロンプトの伝達可能性を高めるための2つの戦略を提案する。
論文参考訳（メタデータ） (2024-04-17T09:39:07Z)
Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
マルチモーダルな特徴アライメントを実現するためのCLIP誘導型コントラスト学習型アーキテクチャを提案する。我々のモデルはタスク固有の外部知識を使わずに実装が簡単であり、そのため、他のマルチモーダルタスクに容易に移行できる。
論文参考訳（メタデータ） (2024-03-11T01:07:36Z)
LLMBind: A Unified Modality-Task Integration Framework [38.95771765322677]
多様なマルチモーダルタスクを統一する新しいフレームワークである textbfLLMBind を導入する。 LLMBindはMixture-of-Experts (MoE) Large Language Model (LLM)を利用してマルチモーダル入力を処理し、タスク固有のトークンを生成する。
論文参考訳（メタデータ） (2024-02-22T12:36:31Z)
Toward Robust Multimodal Learning using Multimodal Foundational Models [30.755818450393637]
マルチモーダル基礎モデルを用いたロバストなマルチモーダル学習に向けたTRMLを提案する。 TRMLは、欠落したモダリティを置き換えるために生成された仮想モダリティを使用する。またセマンティックマッチング学習モジュールを設計し、セマンティック空間の生成とモダリティの欠如を協調する。
論文参考訳（メタデータ） (2024-01-20T04:46:43Z)
Generative Multimodal Models are In-Context Learners [60.50927925426832]
我々は37億のパラメータを持つ生成的マルチモーダルモデルであるEmu2を紹介し、大規模マルチモーダルシーケンスで訓練する。 Emu2は、マルチモーダルなインコンテキスト学習能力を示し、オンザフライ推論を必要とするタスクを解決しようとさえしている。
論文参考訳（メタデータ） (2023-12-20T18:59:58Z)
MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [63.78384552789171]
本稿では,新しいマルチモーダル微調整パラダイムであるMMICTを紹介する。 M-Hub(Multi-Modal Hub)は,異なる入力や目的に応じて様々なマルチモーダル特徴をキャプチャするモジュールである。 M-Hubに基づいてMMICTは、MM-LLMがコンテキスト内視覚誘導されたテキスト特徴から学習し、その後、テキスト誘導された視覚特徴に基づいて条件付き出力を生成する。
論文参考訳（メタデータ） (2023-12-11T13:11:04Z)
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2は、マルチモーダル事前訓練のためのモジュール化された設計を備えた新しい統一パラダイムである。モダリティ協力のための共通普遍加群を共有し、モダリティの絡み合いを扱うために異なるモダリティ加群を切り離す。テキスト、画像、ビデオを含むすべてのモダリティの異なる理解タスクと生成タスクのために、異なるモジュールを選択することは柔軟です。
論文参考訳（メタデータ） (2023-02-01T12:40:03Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。