Fugu-MT Paper Translation (Abstract): Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

Paper summary: Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

arxiv url: http://arxiv.org/abs/2605.06870v2
Date: Tue, 12 May 2026 01:29:33 GMT
Status: Translation complete
Last updated in system: 2026-05-13 18:21:06.817082
Title: Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
Authors: Xinyu Zhao, Nikita Karagodin, Hamed Hassani, Sinan Hersek, Paul Pu Liang, Yury Polyanskiy
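Abstract summary: We show theoretically and empirically that dimensional collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. We propose a "warm-up phase" that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE Warm-Up successfully restores representation dimension.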
Reference score (independently computed attention score): 63.31488859236551
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While many approaches to improve VQ-VAE performance focus on codebook size and utilization, the effect of dimensional collapse, where trained VQ-VAE representations live in an extremely low-dimensional subspace (1-2% of full rank), remains unaddressed. We show theoretically and empirically that dimensional collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. Our analytic framework extends the sequential learning effect of Saxe et al. [2014] by introducing ideas from rate-distortion theory and explains how the latent collapse is caused by the VQ suppressing lower-variance directions. Our theory justifies a simple solution: a "warm-up phase" that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE Warm-Up successfully restores representation dimension, leading to lower reconstruction and perceptual loss at the same training budget. Across codebook sizes $K \in \{2^{10}, 2^{14}, 2^{16}\}$, AE Warm-Up raises VQGAN codebook effective dimension from 3-5 to 17-19 and reduces rFID by 17-35%; on WavTokenizer at $K \in \{2^{13}, 2^{14}\}$, it raises codebook dimension from 4 to 17-19 and improves PESQ by 11-14%. We empirically characterize how warm-up duration governs the achievable final loss. In agreement with experiment, our theoretical analysis predicts downstream performance as a function of warm-up length, enabling an adaptive criterion for switching from AE Warm-Up to VQ-VAE training.
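The warm-up idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `quantize` and `bottleneck`, the `warmup_steps` parameter, and the fixed step-count switch are assumptions made for illustration (the abstract mentions an adaptive switching criterion, which is not reproduced here). During warm-up the latent passes through unchanged, as in a plain autoencoder; afterwards each latent vector is replaced by its nearest codebook entry.

```python
import numpy as np

def quantize(z, codebook):
    """Replace each row of z with its nearest codebook vector (Euclidean distance)."""
    # Squared distances between every latent (N, D) and every code (K, D): shape (N, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

def bottleneck(z, codebook, step, warmup_steps):
    """Hypothetical bottleneck: identity during AE warm-up, VQ afterwards."""
    if step < warmup_steps:
        return z  # unquantized autoencoder phase (assumed fixed-length warm-up)
    z_q, _ = quantize(z, codebook)
    return z_q

# Toy data: two 2-D latents and a codebook with K = 2 entries.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, -0.2], [0.9, 0.8]])

print(bottleneck(z, codebook, step=10, warmup_steps=100))   # latents pass through unchanged
print(bottleneck(z, codebook, step=200, warmup_steps=100))  # latents snapped to nearest codes
```

A real VQ-VAE training loop would also need a gradient estimator through the quantizer (for example a straight-through estimator) and codebook update terms; those are omitted to keep the sketch focused on the phase switch.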


Related papers