Fugu-MT Paper Translation (Abstract): BIGFix: Bidirectional Image Generation with Token Fixing

Paper summary: BIGFix: Bidirectional Image Generation with Token Fixing

arxiv url: http://arxiv.org/abs/2510.12231v1
Date: Tue, 14 Oct 2025 07:34:44 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.227368
Title: BIGFix: Bidirectional Image Generation with Token Fixing
Title (reference translation): BIGFix: Bidirectional Image Generation with Token Fixing
Authors: Victor Besnier, David Hurych, Andrei Bursuc, Eduardo Valle
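Abstract summary: We propose a method for self-correcting image generation by iteratively refining sampled tokens. We achieve this with a novel training scheme that injects random tokens into the context, improving robustness and enabling token fixing during sampling. We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets and on video generation with UCF-101 and NuScenes, showing substantial improvements across both modalities.
Reference score (attention score computed independently by this site): 21.40682276355247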
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent advances in image and video generation have raised significant interest from both academia and industry. A key challenge in this field is improving inference efficiency, as model size and the number of inference steps directly impact the commercial viability of generative models while also posing fundamental scientific challenges. A promising direction involves combining auto-regressive sequential token modeling with multi-token prediction per step, reducing inference time by up to an order of magnitude. However, predicting multiple tokens in parallel can introduce structural inconsistencies due to token incompatibilities, as capturing complex joint dependencies during training remains challenging. Traditionally, once tokens are sampled, there is no mechanism to backtrack and refine erroneous predictions. We propose a method for self-correcting image generation by iteratively refining sampled tokens. We achieve this with a novel training scheme that injects random tokens in the context, improving robustness and enabling token fixing during sampling. Our method preserves the efficiency benefits of parallel token prediction while significantly enhancing generation quality. We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets, as well as on video generation with UCF-101 and NuScenes, demonstrating substantial improvements across both modalities.
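Abstract (reference translation): Recent advances in image and video generation have attracted significant interest from both academia and industry. A key challenge in this field is improving inference efficiency, which depends on model size and the number of inference steps. A promising direction combines auto-regressive sequential token modeling with multi-token prediction per step, reducing inference time by up to an order of magnitude. However, predicting multiple tokens in parallel can introduce structural inconsistencies caused by incompatible tokens. Traditionally, once tokens are sampled, there is no mechanism to backtrack and refine erroneous predictions. We propose a method for self-correcting image generation by iteratively refining sampled tokens. We achieve this with a novel training scheme that injects random tokens into the context, improving robustness and enabling token fixing during sampling. Our method preserves the efficiency benefits of parallel token prediction while significantly improving generation quality. We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets and on video generation with UCF-101 and NuScenes, showing substantial improvements across both modalities.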
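
As a rough illustration of the random-token injection mentioned in the abstract, the sketch below corrupts a sequence of discrete image tokens by swapping some of them for random codebook entries. It is not the authors' implementation: the function name `inject_random_tokens` and the parameters `vocab_size` and `replace_prob` are hypothetical, and the abstract does not specify the actual corruption schedule, the model, or the training loss.

```python
import random


def inject_random_tokens(tokens, vocab_size, replace_prob, rng=None):
    """Return a corrupted copy of `tokens` and a mask of the replaced positions.

    Assumption (not from the paper): each position is replaced independently
    with probability `replace_prob` by a token drawn uniformly from the
    codebook. A drawn token may coincide with the original one; the mask
    still marks that position as replaced.
    """
    rng = rng or random.Random()
    corrupted, replaced = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.randrange(vocab_size))
            replaced.append(True)
        else:
            corrupted.append(tok)
            replaced.append(False)
    return corrupted, replaced


# Hypothetical usage: an 8-token context from a 1024-entry codebook.
tokens = [5, 17, 42, 8, 23, 99, 3, 64]
corrupted, replaced = inject_random_tokens(
    tokens, vocab_size=1024, replace_prob=0.25, rng=random.Random(0)
)
```

During training, a model could be given `corrupted` as context and asked to predict the original `tokens`, so that at sampling time it has learned to overwrite tokens that look wrong. This is one plausible reading of "token fixing", not a description confirmed by the abstract.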

Related papers