Fugu-MT Paper Translation (Summary): PILL: Plug Into LLM with Adapter Expert and Attention Gate

Paper summary: PILL: Plug Into LLM with Adapter Expert and Attention Gate

arxiv url: http://arxiv.org/abs/2311.02126v1
Date: Fri, 3 Nov 2023 09:31:10 GMT
Status: Translation complete
Last updated in system: 2023-11-07 19:25:11.316553
Title: PILL: Plug Into LLM with Adapter Expert and Attention Gate
Title (reference translation): PILL: Plug into an LLM with Adapter Experts and an Attention Gate
Authors: Fangyuan Zhang, Tingting Liang, Zhengyuan Wu, Yuyu Yin
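Abstract summary: We introduce a novel architecture called PILL: Plug Into LLM with adapter expert and attention gate. First, Mixture-of-Modality-Adapter-Expert handles different modalities independently. Second, Modality-Attention-Gating enables adaptive control of the contribution of modality tokens to the overall representation.
Reference score (independently computed attention measure): 11.956931222769128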
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Due to the remarkable capabilities of powerful Large Language Models (LLMs) in effectively following instructions, there has been a growing number of assistants in the community to assist humans. Recently, significant progress has been made in the development of Vision Language Models (VLMs), expanding the capabilities of LLMs and enabling them to execute more diverse instructions. However, it is foreseeable that models will likely need to handle tasks involving additional modalities such as speech, video, and others. This poses a particularly prominent challenge of dealing with the complexity of mixed modalities. To address this, we introduce a novel architecture called PILL: Plug Into LLM with adapter expert and attention gate to better decouple these complex modalities and leverage efficient fine-tuning. We introduce two modules: Firstly, utilizing Mixture-of-Modality-Adapter-Expert to independently handle different modalities, enabling better adaptation to downstream tasks while preserving the expressive capability of the original model. Secondly, by introducing Modality-Attention-Gating, which enables adaptive control of the contribution of modality tokens to the overall representation. In addition, we have made improvements to the Adapter to enhance its learning and expressive capabilities. Experimental results demonstrate that our approach exhibits competitive performance compared to other mainstream methods for modality fusion. For researchers interested in our work, we provide free access to the code and models at https://github.com/DsaltYfish/PILL.
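Abstract (reference translation): Because powerful Large Language Models (LLMs) can follow instructions effectively, a growing number of assistants in the community help humans. Recently, progress in Vision Language Models (VLMs) has expanded the capabilities of LLMs and enabled them to execute more diverse instructions. However, it is foreseeable that models will need to handle tasks involving additional modalities such as speech and video. This poses a particularly prominent challenge in dealing with the complexity of mixed modalities. To address this, we introduce a novel architecture called PILL: Plug Into LLM with adapter expert and attention gate, to better decouple these complex modalities and enable efficient fine-tuning. First, Mixture-of-Modality-Adapter-Expert handles different modalities independently, improving adaptation to downstream tasks while preserving the expressive capability of the original model. Second, Modality-Attention-Gating enables adaptive control of the contribution of modality tokens to the overall representation. In addition, we have improved the Adapter to enhance its learning and expressive capabilities. Experimental results show that our approach is competitive with other mainstream methods for modality fusion. For researchers interested in our work, we provide free access to the code and models at https://github.com/DsaltYfish/PILL.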
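The two modules named in the abstract can be illustrated with a small sketch. The abstract does not give the actual formulation, so everything below is an assumption made for illustration: the bottleneck adapter shape, the per-token routing by modality tag, the scalar sigmoid gate, and all class and function names are hypothetical. The authors' real implementation is in their repository.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class BottleneckAdapter:
    """Hypothetical adapter: down-project, ReLU, up-project, residual add."""

    def __init__(self, dim, hidden, rng):
        self.down = make_matrix(hidden, dim, rng)
        self.up = make_matrix(dim, hidden, rng)

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.down, x)]
        delta = matvec(self.up, h)
        return [xi + di for xi, di in zip(x, delta)]


class ModalityAdapterExperts:
    """One adapter per modality; each token goes to the adapter for its tag."""

    def __init__(self, dim, hidden, modalities, seed=0):
        rng = random.Random(seed)
        self.experts = {m: BottleneckAdapter(dim, hidden, rng) for m in modalities}

    def __call__(self, tokens, tags):
        return [self.experts[tag](tok) for tok, tag in zip(tokens, tags)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate_modality_tokens(tokens, tags, gate_weights, gated=("image",)):
    """Scale tokens of the gated modalities by a scalar gate in (0, 1).

    Assumption: the gate is a sigmoid of a linear function of the token.
    Tokens of other modalities (here, text) pass through unchanged.
    """
    out = []
    for tok, tag in zip(tokens, tags):
        if tag in gated:
            g = sigmoid(sum(w * t for w, t in zip(gate_weights, tok)))
            out.append([g * t for t in tok])
        else:
            out.append(list(tok))
    return out


# Usage: one text token and one image token of width 4.
tokens = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.4, 0.6, 0.8]]
tags = ["text", "image"]
experts = ModalityAdapterExperts(dim=4, hidden=2, modalities=["text", "image"])
adapted = experts(tokens, tags)
# Zero gate weights give a gate of exactly 0.5 for every gated token.
gated = gate_modality_tokens(adapted, tags, gate_weights=[0.0, 0.0, 0.0, 0.0])
```

In this sketch the text token leaves the gate untouched while the image token is scaled down, which is one simple way to read "adaptive control of the contribution of modality tokens"; a trained model would learn the gate weights rather than fix them at zero.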

Related papers

LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
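Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, which leads to a significant computational burden when expanding to new modalities. PathWeave is a flexible and scalable framework with Modal-Path sWitching and ExpAnsion abilities. PathWeave performs comparably to state-of-the-art MLLMs while reducing the parameter training burden by 98.73%.
Paper reference translation (metadata) (2024-10-26T13:19:57Z)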
EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
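EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA improves performance across multiple tasks by up to 9.3% and significantly improves robustness against hallucinations.
Paper reference translation (metadata) (2024-10-02T23:00:31Z)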
MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
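MoExtend is an effective framework that streamlines modality adaptation and extension for Mixture-of-Experts (MoE) models. MoExtend seamlessly integrates new experts into pre-trained MoE models, providing new knowledge without tuning the pre-trained model.
Paper reference translation (metadata) (2024-08-07T02:28:37Z)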
Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning [0.0]
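The integration of large language models (LLMs) with vision-language (VL) tasks is a transformative development in artificial intelligence. This paper proposes a new approach, called the Bottleneck Adapter, designed specifically to enhance the multimodal functions of these complex models. The approach connects the image encoder and the LLM with a lightweight adapter, without requiring a large, complex neural network.
Paper reference translation (metadata) (2024-07-25T06:59:15Z)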
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
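Transferable Visual Prompting (TVP) is a simple and effective approach for generating visual prompts that can be transferred to different models. This paper addresses the cross-model feature corruption problem of existing visual prompting methods and proposes two strategies to improve the transferability of the learned prompts.
Paper reference translation (metadata) (2024-04-17T09:39:07Z)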
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [74.31268379055201]
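mPLUG-Owl2 is a versatile multi-modal large language model. It effectively leverages modality collaboration to improve performance on both text and multi-modal tasks.
Paper reference translation (metadata) (2023-11-07T14:21:29Z)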
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
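mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities. The paradigm includes a two-stage method for aligning images and text, which learns visual knowledge with the assistance of the LLM. Experimental results show that the model outperforms existing multi-modal models.
Paper reference translation (metadata) (2023-04-27T13:27:01Z)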
eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
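This paper proposes directing effort toward efficient adaptation of existing models, and augmenting language models with perception. Existing approaches for adapting pretrained models to vision-language tasks rely on several key components that hinder their efficiency. By freezing more than 99% of the total parameters, training only one linear projection layer, and prepending only one trainable token, the approach (eP-ALM) significantly outperforms other baselines on VQA and Captioning.
Paper reference translation (metadata) (2023-03-20T19:20:34Z)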

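The related papers list is generated automatically from the titles and abstracts of papers on this site.

This is the information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from the use of this site (including all information and translations).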