Fugu-MT paper translation (abstract): LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation

Paper overview: LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation

arxiv url: http://arxiv.org/abs/2412.15188v1
Date: Thu, 19 Dec 2024 18:56:24 GMT
Status: translation complete
Last updated in system: 2024-12-20 18:44:16.267226
Title: LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
Authors: Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
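Abstract summary: LlamaFusion is a framework for giving pretrained text-only large language models (LLMs) multimodal generative capabilities. LlamaFusion improves image understanding by 20% and image generation by 3.6% while using only 50% of the FLOPs.
Reference score (independently computed attention score): 81.78257799283777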
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present LlamaFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LlamaFusion leverages Llama-3's existing weights for processing text autoregressively while introducing additional, parallel transformer modules for processing images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and training only the image-specific modules, LlamaFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that LlamaFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs while maintaining Llama-3's language capabilities. We also demonstrate that this framework can adapt existing vision-language models with multimodal generation ability. Overall, this framework not only leverages existing computational investments in text-only LLMs but also enables the parallel development of language and vision capabilities, presenting a promising direction for efficient multimodal model development.
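The routing described in the abstract can be illustrated with a toy, pure-Python sketch. It is not the authors' implementation: the names `ModalityModules`, `build_modules`, and `layer_forward` are hypothetical, the weights are random stand-ins, a single norm gain is reused for both sublayers, the attention is single-head and unmasked, and the diffusion objective is omitted entirely. It only shows the idea that each token uses its own modality's normalization, Q/K/V projections, and feedforward weights, while attention is computed over the whole mixed sequence.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rng, dim, scale=0.1):
    """Square matrix of small random weights (stand-in for learned weights)."""
    return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]


def rms_norm(x, gain):
    """Scale x to unit root-mean-square, then apply a per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + 1e-6)
    return [g * v / rms for g, v in zip(gain, x)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ModalityModules:
    """Hypothetical container: one modality's norm, Q/K/V, and feedforward."""

    def __init__(self, rng, dim, trainable):
        self.trainable = trainable  # False = frozen (text), True = trained (image)
        self.gain = [1.0] * dim     # simplification: one gain for both sublayers
        self.wq = rand_matrix(rng, dim)
        self.wk = rand_matrix(rng, dim)
        self.wv = rand_matrix(rng, dim)
        self.w_ff = rand_matrix(rng, dim)


def build_modules(rng, dim):
    """Text modules are marked frozen; image modules are marked trainable."""
    return {
        "text": ModalityModules(rng, dim, trainable=False),
        "image": ModalityModules(rng, dim, trainable=True),
    }


def layer_forward(tokens, modalities, modules):
    """tokens: list of vectors; modalities: parallel list of "text"/"image"."""
    dim = len(tokens[0])
    # 1) Route each token to its own modality's norm and Q/K/V projections.
    q, k, v = [], [], []
    for x, m in zip(tokens, modalities):
        mod = modules[m]
        h = rms_norm(x, mod.gain)
        q.append(matvec(mod.wq, h))
        k.append(matvec(mod.wk, h))
        v.append(matvec(mod.wv, h))
    out = []
    for i, (x, m) in enumerate(zip(tokens, modalities)):
        mod = modules[m]
        # 2) Shared self-attention: token i attends over the full mixed sequence.
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(dim) for kj in k]
        weights = softmax(scores)
        attended = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim)]
        h = [a + b for a, b in zip(x, attended)]  # residual connection
        # 3) Route back to the modality-specific feedforward (ReLU for brevity).
        ff = [max(0.0, z) for z in matvec(mod.w_ff, rms_norm(h, mod.gain))]
        out.append([a + b for a, b in zip(h, ff)])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    modules = build_modules(rng, dim=4)
    tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
    mixed = layer_forward(tokens, ["text", "image", "text"], modules)
    print(len(mixed), len(mixed[0]))
```

In this sketch a text-only sequence never touches the image modules, so changing only the image weights leaves text-only outputs unchanged. That is one way to read the abstract's claim that freezing the text-specific modules preserves the language capabilities of the original model.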

Related papers

ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy [14.703591553247948]
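ARMOR is a framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs). ARMOR extends existing MLLMs from three perspectives: model architecture, training data, and training algorithm. Experiments show that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities.
Paper reference translation (metadata) (2025-03-09T10:15:39Z)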
Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models [43.16111789538798]
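We construct parallel multilingual prompts that leverage the multilingual capabilities of large multimodal models (LMMs). Experiments on two LMMs across three benchmarks show that the proposed method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments.
Paper reference translation (metadata) (2025-01-13T06:41:23Z)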
Liquid: Language Models are Scalable Multi-modal Generators [112.71734051183726]
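Liquid is an autoregressive generation paradigm that seamlessly integrates visual understanding and generation. Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration with a single large language model. For the first time, Liquid reveals a scaling law for the performance drop that unified training of visual and language tasks inevitably brings.
Paper reference translation (metadata) (2024-12-05T16:48:16Z)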
Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
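Multimodal large language models (MLLMs), which combine LLMs with pretrained vision models, have recently demonstrated impressive performance across diverse vision-language tasks. However, they fall short in understanding contexts that involve multiple images. This paper proposes a two-phase paradigm, browse-and-concentrate, to enable deeper multimodal context fusion.
Paper reference translation (metadata) (2024-02-19T14:59:07Z)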
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
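CM3Leon is a decoder-only multimodal language model that can generate and infill both text and images. It is the first multimodal model trained with a recipe adapted from text-only language models. CM3Leon achieves state-of-the-art text-to-image generation performance with 5x less training compute than comparable methods.
Paper reference translation (metadata) (2023-09-05T21:27:27Z)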
Generating Images with Multimodal Language Models [78.6660334861137]
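This paper proposes a method for fusing frozen text-only large language models with pretrained image encoder and decoder models. The model demonstrates a range of multimodal capabilities, including image retrieval, novel image generation, and multimodal dialogue.
Paper reference translation (metadata) (2023-05-26T19:22:03Z)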
MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation [21.455774034659978]
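MultiFusion can express complex concepts with inputs that arbitrarily interleave multiple modalities and languages. MultiFusion leverages pretrained models and aligns them for integration into a cohesive system.
Paper reference translation (metadata) (2023-05-24T16:22:18Z)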
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
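mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multimodal abilities. The paradigm includes a two-stage method for aligning images and text, in which visual knowledge is learned with the help of the LLM. Experimental results show that the model outperforms existing multimodal models.
Paper reference translation (metadata) (2023-04-27T13:27:01Z)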

The related papers list is generated automatically from the titles and abstracts of papers on this site.
