Fugu-MT Paper Translation (Summary): InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following

Paper summary: InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following

arxiv url: http://arxiv.org/abs/2312.06738v4
Date: Thu, 17 Oct 2024 01:30:33 GMT
Status: Translation complete
Last updated in system: 2024-11-28 17:07:30.051702
Title: InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following
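Title (reference translation): InstructAny2Pix: Flexible Visual Editing with Multimodal Instructions
Authors: Shufan Li, Harkanwar Singh, Aditya Grover
Abstract summary: InstructAny2Pix is a flexible multi-modal instruction-following system that lets users edit an input image using instructions involving audio, images, and text. We demonstrate that the system can perform a series of instruction-guided editing tasks.
Reference score (independently computed attention score): 26.457571615782985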
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The ability to provide fine-grained control for generating and editing visual imagery has profound implications for computer vision and its applications. Previous works have explored extending controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions on the number and/or type of modality inputs used to express controllability. We propose InstructAny2Pix, a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks that facilitate this capability: a multi-modal encoder that encodes different modalities such as images and audio into a unified latent space, a diffusion model that learns to decode representations in this latent space into images, and a multi-modal LLM that can understand instructions involving multiple images and audio pieces and generate a conditional embedding of the desired output, which can be used by the diffusion decoder. Additionally, to facilitate training efficiency and improve generation quality, we include an additional refinement prior module that enhances the visual quality of LLM outputs. These designs are critical to the performance of our system. We demonstrate that our system can perform a series of novel instruction-guided editing tasks. The code is available at https://github.com/jacklishufan/InstructAny2Pix.git
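Abstract (reference translation): The ability to provide fine-grained control for generating and editing visual imagery has a major impact on computer vision and its applications. Previous work has explored extending controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions about the number and/or type of modality inputs used to express controllability. InstructAny2Pix is a flexible multi-modal instruction-following system that lets users edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks: a multi-modal encoder that encodes different modalities such as images and audio into a unified latent space, a diffusion model that learns to decode representations in this latent space into images, and a multi-modal LLM that understands instructions involving multiple images and audio pieces and generates a conditional embedding of the desired output. In addition, to improve training efficiency and generation quality, a prior module is added that enhances the visual quality of LLM outputs. These designs are critical to the performance of the system. We demonstrate that the system can perform a series of instruction-guided editing tasks. The code is available at https://github.com/jacklishufan/InstructAny2Pix.git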
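The data flow described in the abstract (shared encoder, multi-modal LLM producing a conditional embedding, refinement prior, diffusion decoder) can be illustrated with a toy sketch. This is not the authors' implementation: every name below (Item, encode, llm_condition, refine, decode, edit, LATENT_DIM) is hypothetical, and each stage is replaced by a trivial deterministic stand-in (hashing, averaging, normalization) so that only the order of the stages is shown.

```python
import hashlib
import math
from dataclasses import dataclass
from typing import List, Sequence

LATENT_DIM = 8  # hypothetical width of the shared latent space
Vector = List[float]


@dataclass(frozen=True)
class Item:
    """One input piece; modality is "image", "audio" or "text"."""
    modality: str
    payload: bytes


def normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def encode(item: Item) -> Vector:
    """Stand-in for the multi-modal encoder: hash any modality into one space."""
    digest = hashlib.sha256(item.modality.encode() + b":" + item.payload).digest()
    return normalize([b / 255.0 - 0.5 for b in digest[:LATENT_DIM]])


def llm_condition(source: Vector, instruction: Sequence[Vector]) -> Vector:
    """Stand-in for the multi-modal LLM: here just the mean of all embeddings."""
    vectors = [source, *instruction]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def refine(embedding: Vector) -> Vector:
    """Stand-in for the refinement prior: here only a re-normalization."""
    return normalize(embedding)


def decode(embedding: Vector, size: int = 4) -> List[List[int]]:
    """Stand-in for the diffusion decoder: map the embedding to a pixel grid."""
    return [
        [round((embedding[(row * size + col) % LATENT_DIM] + 1.0) / 2.0 * 255)
         for col in range(size)]
        for row in range(size)
    ]


def edit(source_image: Item, instruction: Sequence[Item]) -> List[List[int]]:
    """Run the four stages in the order given in the abstract."""
    source = encode(source_image)
    condition = llm_condition(source, [encode(piece) for piece in instruction])
    return decode(refine(condition))


if __name__ == "__main__":
    image = Item("image", b"photo-of-a-beach")
    pieces = [Item("text", b"make it match this sound"), Item("audio", b"rain")]
    for row in edit(image, pieces):
        print(row)
```

In the real system each stand-in would be a learned model; the sketch only makes explicit that the instruction may mix modalities freely because everything is mapped into one latent space before the LLM and decoder see it.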

Related papers

X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation [7.61087111021017]
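This paper proposes the X2I framework, which equips Diffusion Transformer (DiT) models with the ability to understand various modalities. X2I showed a performance degradation of less than 1% while gaining multimodal understanding capabilities.
Reference translation of paper (metadata) (2025-03-08T09:07:45Z)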
Improving Multi-modal Large Language Model through Boosting Vision Capabilities [54.344077285545005]
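We focus on improving visual understanding capabilities in order to strengthen vision-language models. We propose Arcana, a multimodal language model.
Reference translation of paper (metadata) (2024-10-17T16:36:38Z)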
TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
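TWIST & SCOUT is a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset called SCOUT. This dataset provides rich supervision signals that describe a step-by-step multimodal reasoning process.
Reference translation of paper (metadata) (2024-10-14T13:35:47Z)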
UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
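UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs. It excels in both text-to-image generation and zero-shot subject-driven synthesis.
Reference translation of paper (metadata) (2024-01-24T11:36:44Z)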
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation [59.24938416319019]
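InstructSeq is an instruction-conditioned multi-modal modeling framework. It unifies diverse vision tasks through flexible natural language control and the handling of both visual and textual data.
Reference translation of paper (metadata) (2023-11-30T18:59:51Z)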
Apollo: Zero-shot MultiModal Reasoning with Multiple Experts [14.359111652624899]
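We propose a modular framework that leverages the expertise of different foundation models across different modalities and domains. Our approach enables decentralized command execution and allows each model to both contribute to and benefit from the expertise of the other models. We demonstrate the method on a new task, audio-aware image captioning, in which an image and audio are given and the goal is to generate text that describes the image within the context of the provided audio.
Reference translation of paper (metadata) (2023-10-25T22:36:40Z)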
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
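CM3Leon is a decoder-only multimodal language model capable of generating and infilling both text and images. It is the first multimodal model trained with a recipe adapted from text-only language models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
Reference translation of paper (metadata) (2023-09-05T21:27:27Z)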
InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
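InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions. Our approach enables video manipulation guided by natural language directives and eliminates the need for per-example fine-tuning or inversion.
Reference translation of paper (metadata) (2023-05-21T03:28:13Z)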
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
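mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multimodal abilities. The training paradigm involves a two-stage method for aligning images and text, which learns visual knowledge with the assistance of the LLM. Experiments show that the model outperforms existing multimodal models.
Reference translation of paper (metadata) (2023-04-27T13:27:01Z)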
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
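This paper proposes a framework for generating text from multimodal inputs consisting of video plus text, speech, or audio. Experiments show that the approach, based on a single architecture, outperforms the state of the art on three video-based text-generation tasks.
Reference translation of paper (metadata) (2021-01-28T15:22:36Z)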

The related papers list is generated automatically from the titles and abstracts of papers on this site.
