Fugu-MT Paper Translation (Abstract): BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

Paper overview: BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

arxiv url: http://arxiv.org/abs/2305.14720v2
Date: Thu, 22 Jun 2023 02:36:06 GMT
Status: Translation complete
Last updated in system: 2023-06-23 17:13:45.860371
Title: BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
Authors: Dongxu Li, Junnan Li, Steven C.H. Hoi
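Abstract summary: BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder that is pre-trained to provide subject representation.
Reference score (site-computed attention score): 73.74570290836152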
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Code and models will be released at https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion. Project page at https://dxli94.github.io/BLIP-Diffusion-website/.

Related papers

Unified Multimodal Discrete Diffusion [78.48930545306654]
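Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches. We explore discrete diffusion models as a unified generative formulation in the joint text and image domain. We propose the Unified Multimodal Discrete Diffusion (UniDisc) model, which is capable of jointly understanding and generating text and images.
Paper reference translation (metadata) (2025-03-26T17:59:51Z)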
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think [38.258453761376586]
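This paper proposes Dream Engine, an efficient framework for arbitrary text-image interleaved control in image generation models. The approach uses a two-stage training paradigm consisting of text-image alignment and multimodal interleaved instruction tuning. The method proves effective, achieving an overall score of 0.69 on the GenEval benchmark.
Paper reference translation (metadata) (2025-02-27T15:08:39Z)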
ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer [40.32254040909614]
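We propose ACE, an All-round Creator and Editor for visual generation tasks. We first introduce a unified condition format called the Long-Context Condition Unit (LCU). We then propose a novel Transformer-based diffusion model that takes the LCU as input.
Paper reference translation (metadata) (2024-09-30T17:56:27Z)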
Generating Images with Multimodal Language Models [78.6660334861137]
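We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models. The model exhibits a suite of multimodal capabilities, including image retrieval, novel image generation, and multimodal dialogue.
Paper reference translation (metadata) (2023-05-26T19:22:03Z)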
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
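This work proposes to enhance prompt understanding in text-to-image diffusion models. The method uses a pre-trained large language model for grounded generation in a novel two-stage process. It significantly outperforms the base diffusion model and several strong baselines at accurately generating images.
Paper reference translation (metadata) (2023-05-23T03:59:06Z)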
In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
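We propose Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. We propose a vision-language prompt that can model a wide range of vision-language tasks, and a diffusion model that takes it as input. The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
Paper reference translation (metadata) (2023-05-01T23:03:37Z)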
GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation [143.81719619351335]
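Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions. The tight coupling between the text encoder and the image decoder in current T2I models makes them difficult to replace or upgrade. This paper proposes GlueGen, which applies a newly proposed GlueNet model.
Paper reference translation (metadata) (2023-03-17T15:37:07Z)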
eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
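Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. We train an ensemble of text-to-image diffusion models specialized for different synthesis stages. The ensemble of diffusion models, called eDiffi, improves text alignment while maintaining the same inference cost.
Paper reference translation (metadata) (2022-11-02T17:43:04Z)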

The related papers list is generated automatically from the titles and abstracts of papers on this site.
