Fugu-MT Paper Translation (Summary): DreamOmni2: Multimodal Instruction-based Editing and Generation

Paper summary: DreamOmni2: Multimodal Instruction-based Editing and Generation

arxiv url: http://arxiv.org/abs/2510.06679v1
Date: Wed, 08 Oct 2025 06:07:14 GMT
Status: Translation complete
Last updated in system: 2025-10-09 16:41:20.325428
Title: DreamOmni2: Multimodal Instruction-based Editing and Generation
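Title (reference translation): DreamOmni2: Multimodal Instruction-based Editing and Generation
Authors: Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
Abstract summary: We propose two new tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to cover both concrete and abstract concepts. The data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) applying the extraction model to create training data for multimodal instruction-based editing.
Reference score (independently computed attention metric): 77.997848231822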
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. In addition, we have proposed comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 has achieved impressive results. Models and codes will be released.
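Abstract (reference translation): Recent progress in instruction-based image editing and subject-driven generation has attracted considerable attention, but both tasks still fall short of practical user needs. Instruction-based editing depends on language instructions alone, which often cannot capture specific editing details, so reference images become necessary. Subject-driven generation, meanwhile, is restricted to combining concrete objects or people and overlooks broader, abstract concepts. To address these issues, we propose two new tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to cover both concrete and abstract concepts, which greatly improves their practical applicability. We introduce DreamOmni2 and tackle two main challenges: data creation and model framework design. The data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme that helps the model distinguish images and avoid pixel confusion. We also introduce joint training of the VLM with the generation/editing model to improve the handling of complex instructions. In addition, we propose comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 achieves impressive results. Models and code will be released.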
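
The index encoding and position encoding shift scheme is only named in the abstract, not specified. The following is a minimal sketch of one way such a scheme could look, assuming each input image is a grid of tokens, each token gets an integer image index, and each image's 2D coordinates are shifted by the cumulative height and width of the images before it. The names `TokenPosition` and `assign_positions` and the diagonal shift rule are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPosition:
    image_index: int  # which input image the token belongs to (index encoding)
    row: int          # row coordinate after the shift
    col: int          # column coordinate after the shift


def assign_positions(image_sizes):
    """Assign an image index and shifted 2D coordinates to every token.

    image_sizes: list of (height, width) in tokens, one pair per input image.

    Assumption: every image is shifted along both axes by the summed
    heights and widths of the images before it, so no two images share
    a coordinate. The paper's actual shift rule may differ.
    """
    positions = []
    row_offset, col_offset = 0, 0
    for image_index, (height, width) in enumerate(image_sizes):
        for r in range(height):
            for c in range(width):
                positions.append(
                    TokenPosition(image_index, row_offset + r, col_offset + c)
                )
        # Move the origin for the next image past this image's extent.
        row_offset += height
        col_offset += width
    return positions


if __name__ == "__main__":
    # Two reference images: a 2x2 token grid and a 2x3 token grid.
    for position in assign_positions([(2, 2), (2, 3)]):
        print(position)
```

Under this assumed rule, tokens from different images never share a (row, col) pair, which is one plausible reading of "avoid pixel confusion", while the separate image index lets the model tell which image a token came from.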

Related papers