Fugu-MT Paper Translation (Abstract): Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

Paper overview: Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

arxiv url: http://arxiv.org/abs/2603.09484v1
Date: Tue, 10 Mar 2026 10:39:24 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.235772
Title: Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion
Title (reference translation): Component-aware sketch-to-image generation using self-attention encoding and coordinate-preserving fusion
Authors: Ali Zia, Muhammad Umer Ramzan, Usman Ali, Muhammad Faheem, Abdelwahed Khamis, Shahnawaz Qureshi
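Abstract summary: Translating freehand sketches into photorealistic images is a fundamental challenge in image synthesis. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt to different sketch domains. This paper proposes a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges with a novel two-stage architecture.
Reference score (independently computed attention measure): 2.510998372750843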
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.
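Abstract (reference translation): Translating freehand sketches into photorealistic images is a fundamental challenge in image synthesis, particularly because of the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt to different sketch domains. This paper proposes a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures local semantic and structural features from component-wise sketch regions, and a Coordinate-Preserving Gated Fusion (CGF) module integrates them into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments on both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of the method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, with marked gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, it improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, together with higher efficiency and visual coherence across diverse domains, position the approach as a strong candidate for forensics, digital art restoration, and general sketch-based image synthesis.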
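
Illustrative sketch (not from the paper): the abstract says the CGF module integrates component-wise features into a coherent spatial layout, but gives no formula. The Python below is only a minimal guess at what "coordinate-preserving gated fusion" could look like in spirit. Each component patch is written back at its original (top, left) position and blended into the canvas through a sigmoid gate. The function name gated_fuse, the scalar gate_weight and gate_bias parameters, and the toy patches are all assumptions made for illustration; the actual module presumably uses learned gates over multi-channel feature maps.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(canvas_size, components, gate_weight=1.0, gate_bias=0.0):
    """Blend component patches into one canvas at their original coordinates.

    components: list of (top, left, patch), where patch is a 2D list of floats.
    gate_weight and gate_bias are hypothetical stand-ins for learned parameters.
    """
    height, width = canvas_size
    canvas = [[0.0] * width for _ in range(height)]
    for top, left, patch in components:
        for i, row in enumerate(patch):
            for j, value in enumerate(row):
                y, x = top + i, left + j
                if not (0 <= y < height and 0 <= x < width):
                    continue  # skip cells that fall outside the canvas
                # Gate in (0, 1): how much of the new value replaces the old one.
                gate = sigmoid(gate_weight * value + gate_bias)
                canvas[y][x] = gate * value + (1.0 - gate) * canvas[y][x]
    return canvas


# Toy example: two single-channel "component" patches on a 3x3 canvas.
eye = [[2.0, 2.0]]
mouth = [[4.0]]
fused = gated_fuse((3, 3), [(0, 0, eye), (2, 1, mouth)])
```

Cells not covered by any component stay at zero, and where patches overlap, the later component is blended with what is already on the canvas rather than overwriting it. That is one simple way to keep each part at its own location while still producing a single layout.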

Related papers