Fugu-MT Paper Translation (Summary): StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

Paper summary: StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

arxiv url: http://arxiv.org/abs/2604.21052v1
Date: Wed, 22 Apr 2026 19:52:35 GMT
Status: translation complete
Last updated in system: 2026-04-24 14:40:06.161279
Title: StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling
Title (reference translation): StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling
Authors: Liqi Jing, Dingming Zhang, Peinian Li, Lichen Zhu
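Abstract summary: We formulate style transfer as conditional discrete sequence modeling in a learned latent space. We introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history. StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity.
Reference score (independently computed attention score): 0.0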
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content-style-target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR's multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.
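The blended cross-attention described in the abstract can be illustrated with a small sketch. This is one possible reading of the abstract, not the authors' implementation: the target history supplies keys and values, style and content features each act as queries, and the two attention outputs are mixed by a scale-dependent coefficient. The function names and the linear coarse-to-fine schedule in `blend_coefficient` are assumptions made purely for illustration; the abstract does not specify how the coefficient is set.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


def blend_coefficient(scale_idx, num_scales):
    """Hypothetical schedule (an assumption, not from the paper): the style
    weight grows linearly from 0 at the coarsest scale to 1 at the finest."""
    if num_scales == 1:
        return 0.5
    return scale_idx / (num_scales - 1)


def blended_cross_attention(style_q, content_q, history, scale_idx, num_scales):
    """Style and content features query the target history (used as both keys
    and values); the two results are mixed by the scale-dependent coefficient."""
    alpha = blend_coefficient(scale_idx, num_scales)
    style_out = attend(style_q, history, history)
    content_out = attend(content_q, history, history)
    return [
        [alpha * s + (1.0 - alpha) * c for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(style_out, content_out)
    ]


# Toy usage: two history tokens, one style query and one content query.
history = [[1.0, 0.0], [0.0, 1.0]]
style_q = [[5.0, 0.0]]    # aligned with the first history token
content_q = [[0.0, 5.0]]  # aligned with the second history token
coarse = blended_cross_attention(style_q, content_q, history, 0, 3)
middle = blended_cross_attention(style_q, content_q, history, 1, 3)
fine = blended_cross_attention(style_q, content_q, history, 2, 3)
```

Under this assumed schedule the coarsest scale follows the content query only, the finest scale follows the style query only, and intermediate scales interpolate between them. A learned per-scale coefficient would slot into `blend_coefficient` without changing the rest of the sketch.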
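Abstract (reference translation): Building on the Visual Autoregressive Modeling (VAR) framework, we formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE, and a transformer autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features serve as queries that decide which aspects of that history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both content structure and style texture without breaking the autoregressive continuity of VAR. Starting from a pretrained VAR checkpoint, we train StyleVAR in two stages: supervised fine-tuning on a large triplet dataset of content-style-target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, using per-action normalization weighting to rebalance credit across VAR's multi-scale hierarchy. On three benchmarks covering in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage improves further on the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while preserving semantic structure, especially for landscapes and architectural scenes, whereas a generalization gap on internet images and difficulty with human faces point to the need for greater content diversity and stronger structural priors.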
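The reinforcement fine-tuning stage can be sketched in the same hedged spirit. GRPO scores each sample relative to the other samples drawn for the same input, standardizing rewards within the group. The abstract does not say how the per-action normalization weighting is defined, so the version below is an assumption: each scale receives equal total credit regardless of how many tokens it contains, which keeps the token-heavy fine scales from dominating. The reward is assumed to be higher-is-better (for example, a negated DreamSim distance), and clipping and KL regularization are omitted for brevity.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (the GRPO baseline)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def per_scale_weights(tokens_per_scale):
    """Assumed weighting (not from the paper): every scale gets the same total
    credit, so a token at scale k is weighted by 1 / (num_scales * n_k)."""
    num_scales = len(tokens_per_scale)
    return [1.0 / (num_scales * n) for n in tokens_per_scale]


def surrogate_objective(logprobs, rewards, tokens_per_scale):
    """Simplified policy-gradient surrogate to maximize.

    logprobs[i][k] is the list of token log-probabilities that sample i
    assigned to its own tokens at scale k.
    """
    advantages = group_relative_advantages(rewards)
    weights = per_scale_weights(tokens_per_scale)
    total = 0.0
    for adv, sample in zip(advantages, logprobs):
        for w, token_logprobs in zip(weights, sample):
            total += adv * w * sum(token_logprobs)
    return total / len(rewards)


# Toy usage: a group of two samples over two scales with 1 and 2 tokens.
tokens_per_scale = [1, 2]
logprobs = [
    [[-1.0], [-1.0, -1.0]],  # sample 0
    [[-2.0], [-2.0, -2.0]],  # sample 1
]
rewards = [1.0, 0.0]  # hypothetical perceptual rewards, higher is better
objective = surrogate_objective(logprobs, rewards, tokens_per_scale)
```

With these weights the contribution of each scale sums to the same total, so a scale with a single token carries as much credit as a scale with hundreds. Other normalizations, such as weighting by the square root of the token count, would be equally easy to substitute.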
