Fugu-MT Paper Translation (Summary): TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement

Paper summary: TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement

arxiv url: http://arxiv.org/abs/2510.16332v1
Date: Sat, 18 Oct 2025 03:36:26 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:38.951095
Title: TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement
Title (reference translation): TokenAR: Multiple Subject Generation via Autoregressive Token-Level Enhancement
Authors: Haiyue Sun, Qingdong He, Jinlong Peng, Peng Tang, Jiangning Zhang, Junwei Zhu, Xiaobin Hu, Shuicheng Yan
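Abstract summary: TokenAR is a simple but effective token-level enhancement mechanism for addressing the reference identity confusion problem. Instruct Token Injection serves as an extra visual feature container that injects detailed and complementary priors for reference tokens. The identity-token disentanglement strategy (ITD) explicitly guides token representations toward independently representing the features of each identity.
Reference score (attention score computed by this site): 87.82338951215131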
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive (AR) models have shown remarkable success in conditional image generation. However, existing AR approaches to multiple-reference generation struggle to decouple different reference identities. In this work, we propose the TokenAR framework, which focuses on a simple but effective token-level enhancement mechanism to address the reference identity confusion problem. This token-level enhancement consists of three parts: 1) Token Index Embedding clusters the token indices to better represent the same reference images; 2) Instruct Token Injection acts as an extra visual feature container that injects detailed and complementary priors for reference tokens; 3) the identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity. This token-enhancement framework significantly augments the capabilities of existing AR-based methods in conditional image generation, enabling good identity consistency while preserving high-quality background reconstruction. Driven by the goal of high quality and high diversity in multi-subject generation, we introduce the InstructAR Dataset, the first open-source, large-scale, multi-reference-input, open-domain image generation dataset. It includes 28K training pairs; each example has two reference subjects, a relative prompt, and a background with mask annotation, and the dataset is curated for training and evaluating multiple-reference image generation. Comprehensive experiments validate that our approach surpasses current state-of-the-art models on the multiple-reference image generation task. The implementation code and datasets will be made publicly available; code is available at https://github.com/lyrig/TokenAR
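Abstract (reference translation): Autoregressive (AR) models have achieved remarkable success in conditional image generation. However, these approaches to multiple-reference generation struggle to separate different reference identities. In this work, we propose the TokenAR framework, which focuses on a simple but effective token-level enhancement mechanism to address the reference identity confusion problem. This token-level enhancement consists of three parts: 1) Token Index Embedding clusters the token indices to better represent the same reference images; 2) Instruct Token Injection serves as an extra visual feature container that injects detailed and complementary priors for reference tokens; 3) the identity-token disentanglement strategy (ITD) explicitly guides token representations toward independently representing the features of each identity. This framework significantly strengthens existing AR-based methods in conditional image generation, achieving good identity consistency while preserving high-quality background reconstruction. The InstructAR Dataset is an open-domain image generation dataset containing 28K training pairs; each sample has two reference subjects, a relative prompt, and a background with mask annotation, and the dataset is curated for training and evaluating multiple-reference image generation. Comprehensive experiments validate that our approach surpasses current state-of-the-art models on the multiple-reference image generation task. The implementation code and datasets will be made public. For the code, see https://github.com/lyrig/TokenAR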
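
The abstract only names Token Index Embedding and does not specify how it is computed. As a rough illustration of the general idea (tokens from the same reference image share an index signal), the sketch below assumes an additive embedding looked up per reference image. The function name `add_token_index_embedding`, the additive form, and the fixed `index_table` values are all hypothetical and are not taken from the paper or the TokenAR repository.

```python
from typing import List, Sequence


def add_token_index_embedding(
    tokens: Sequence[Sequence[float]],
    ref_ids: Sequence[int],
    index_table: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Add a per-reference embedding vector to every token (illustrative assumption).

    tokens:      one feature vector per token, gathered from all reference images
    ref_ids:     for each token, the index of the reference image it came from
    index_table: one embedding vector per reference image (learned in a real model)
    """
    if len(tokens) != len(ref_ids):
        raise ValueError("one reference id is required per token")
    tagged = []
    for token, ref_id in zip(tokens, ref_ids):
        offset = index_table[ref_id]
        if len(offset) != len(token):
            raise ValueError("embedding and token dimensions must match")
        # Tokens from the same reference image receive the same offset.
        tagged.append([t + e for t, e in zip(token, offset)])
    return tagged


# Four tokens: the first two come from reference 0, the last two from reference 1.
tokens = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
ref_ids = [0, 0, 1, 1]
index_table = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical fixed values for the demo
tagged = add_token_index_embedding(tokens, ref_ids, index_table)
```

In this toy setup, tokens that share a reference id end up with identical offsets, while tokens from different references are pushed apart. That is one plausible reading of "clustering" token indices by reference image; the actual method may differ.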


Related papers