Fugu-MT paper translation (summary): InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following

Paper summary: InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following

arxiv url: http://arxiv.org/abs/2312.06738v2
Date: Sat, 30 Dec 2023 23:04:37 GMT
Status: Translation complete
Last updated in system: 2024-01-03 19:27:48.397941
Title: InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following
Title (reference translation): InstructAny2Pix: Flexible Visual Editing via Multimodal Instructions
Authors: Shufan Li, Harkanwar Singh, Aditya Grover
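Abstract summary: InstructAny2Pix is a flexible multi-modal instruction-following system that lets users edit an input image using instructions involving audio, images, and text. The authors demonstrate that the system can perform multiple instruction-guided editing tasks.
Reference score (attention score computed independently by this site): 29.735659054029387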
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The ability to provide fine-grained control for generating and editing visual imagery has profound implications for computer vision and its applications. Previous works have explored extending controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions on the number and/or type of modality inputs used to express controllability. We propose InstructAny2Pix, a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks that facilitate this capability: a multi-modal encoder that encodes different modalities such as images and audio into a unified latent space, a diffusion model that learns to decode representations in this latent space into images, and a multi-modal LLM that can understand instructions involving multiple images and audio pieces and generate a conditional embedding of the desired output, which can be used by the diffusion decoder. Additionally, to facilitate training efficiency and improve generation quality, we include an additional refinement prior module that enhances the visual quality of LLM outputs. These designs are critical to the performance of our system. We demonstrate that our system can perform a series of novel instruction-guided editing tasks. The code is available at https://github.com/jacklishufan/InstructAny2Pix.git
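Abstract (reference translation): The ability to provide fine-grained control for generating and editing visual imagery has a major impact on computer vision and its applications. Previous work has explored extending controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions about the number and/or type of modality inputs used to express controllability. InstructAny2Pix is a flexible multi-modal instruction-following system that can edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks: a multi-modal encoder that encodes different modalities such as images and audio into a unified latent space, a diffusion model that learns to decode representations in this latent space into images, and a multi-modal LLM that understands instructions involving multiple images and audio pieces and generates a conditional embedding of the desired output. In addition, to improve training efficiency and generation quality, a refinement prior module is added to enhance the visual quality of the LLM outputs. These designs are critical to the performance of the system. The system is shown to perform a series of novel instruction-guided editing tasks. The code is available at https://github.com/jacklishufan/InstructAny2Pix.git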
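The data flow described in the abstract (encoder, multi-modal LLM, refinement prior, diffusion decoder) can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: every name here (`toy_embed`, `encode`, `toy_llm`, `refine`, `decode`, `edit`, `Reference`, `DIM`) is hypothetical, and each stage is replaced by a trivial deterministic stand-in (hashing, averaging, renormalizing) so that the example runs with the Python standard library alone.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Sequence

DIM = 8  # hypothetical latent size, chosen only for this toy example
Vector = List[float]


def _unit(vec: Vector) -> Vector:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = sum(x * x for x in vec) ** 0.5 or 1.0
    return [x / norm for x in vec]


def toy_embed(payload: bytes) -> Vector:
    """Hash raw bytes into a fixed-size unit vector (stand-in for a learned encoder)."""
    digest = hashlib.sha256(payload).digest()
    return _unit([b / 255.0 - 0.5 for b in digest[:DIM]])


@dataclass
class Reference:
    modality: str  # e.g. "image" or "audio"
    data: bytes    # raw content; a real system would hold pixels or a waveform


def encode(ref: Reference) -> Vector:
    """Stand-in for the multi-modal encoder: all modalities share one latent space."""
    return toy_embed(ref.modality.encode() + b":" + ref.data)


def toy_llm(instruction: str, refs: Sequence[Vector]) -> Vector:
    """Stand-in for the multi-modal LLM: combine the instruction and the
    reference embeddings into a single conditional embedding (here, a mean)."""
    vectors = [toy_embed(instruction.encode())] + list(refs)
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def refine(embedding: Vector) -> Vector:
    """Stand-in for the refinement prior applied to the LLM output
    (here it only renormalizes the embedding)."""
    return _unit(embedding)


def decode(embedding: Vector, size: int = 4) -> List[List[float]]:
    """Stand-in for the diffusion decoder: map the embedding to a size x size grid."""
    return [[embedding[(r * size + c) % DIM] for c in range(size)]
            for r in range(size)]


def edit(instruction: str, source_image: bytes,
         extra_refs: Sequence[Reference]) -> List[List[float]]:
    """Run the four stages in order: encode, LLM, refine, decode."""
    refs = [encode(Reference("image", source_image))]
    refs += [encode(r) for r in extra_refs]
    return decode(refine(toy_llm(instruction, refs)))


if __name__ == "__main__":
    result = edit("Add the mood of the audio to the image",
                  b"source-pixels", [Reference("audio", b"waveform")])
    print(len(result), "x", len(result[0]), "toy output grid")
```

The point of the sketch is the interface between stages: every input modality is mapped into the same vector space, the LLM stage emits one embedding in that space, and the decoder consumes only that embedding. Which learned models fill each role is not specified here and should be taken from the paper and its repository.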

Related papers

Omni-Video: Democratizing Unified Video Understanding and Generation [13.616454543808798]
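This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing. The key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that serve as the input to diffusion decoders. To fully unlock the potential of the unified video modeling system, several technical improvements are incorporated.
Paper reference translation (metadata) (2025-07-08T16:02:16Z)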
X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation [7.61087111021017]
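This paper proposes the X2I framework, which equips Diffusion Transformer (DiT) models with the ability to understand various modalities. X2I showed a performance degradation of less than 1% while gaining multimodal understanding capabilities.
Paper reference translation (metadata) (2025-03-08T09:07:45Z)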
Improving Multi-modal Large Language Model through Boosting Vision Capabilities [54.344077285545005]
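The work focuses on improving visual understanding capabilities to strengthen vision-language models. It proposes Arcana, a multimodal language model.
Paper reference translation (metadata) (2024-10-17T16:36:38Z)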
TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
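TWIST & SCOUT is a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, the authors generate a high-quality synthetic dataset called SCOUT. This dataset provides rich supervision signals that describe a step-by-step multimodal reasoning process.
Paper reference translation (metadata) (2024-10-14T13:35:47Z)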
UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
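UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs. It excels at both text-to-image generation and zero-shot subject-driven synthesis.
Paper reference translation (metadata) (2024-01-24T11:36:44Z)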
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation [59.24938416319019]
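InstructSeq is an instruction-conditioned multi-modal modeling framework. It unifies diverse vision tasks through flexible natural language control and the handling of both visual and textual data.
Paper reference translation (metadata) (2023-11-30T18:59:51Z)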
Apollo: Zero-shot MultiModal Reasoning with Multiple Experts [14.359111652624899]
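The authors propose a modular framework that leverages the expertise of different foundation models across different modalities and domains. The approach enables decentralized command execution and allows each model to both contribute to and benefit from the expertise of the other models. The method is demonstrated on a new task, audio-aware image captioning, in which an image and audio are given and the goal is to generate text that describes the image within the context of the provided audio.
Paper reference translation (metadata) (2023-10-25T22:36:40Z)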
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
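CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images. It is the first multi-modal model trained with a recipe adapted from text-only language models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
Paper reference translation (metadata) (2023-09-05T21:27:27Z)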
InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
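InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions. The approach enables video manipulation guided by natural language directives and eliminates the need for per-example fine-tuning or inversion.
Paper reference translation (metadata) (2023-05-21T03:28:13Z)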
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
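mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities. The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM. Experimental results show that the model outperforms existing multi-modal models.
Paper reference translation (metadata) (2023-04-27T13:27:01Z)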
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
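This paper proposes a framework for generating text from multimodal inputs consisting of video plus text, speech, or audio. Experiments show that the approach, based on a single architecture, outperforms the state of the art on three video-based text-generation tasks.
Paper reference translation (metadata) (2021-01-28T15:22:36Z)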

The related papers list is generated automatically from the titles and abstracts of papers on this site.

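This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and assumes no responsibility for any results arising from the use of this site (including all information and translations).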