Fugu-MT Paper Translation (Abstract): The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

Paper overview: The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

arxiv url: http://arxiv.org/abs/2604.03191v1
Date: Fri, 03 Apr 2026 17:06:31 GMT
Status: Translation complete
Last updated in system: 2026-04-06 17:20:24.547072
Title: The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
Title (reference translation): The Compression Gap: Why Discrete Tokenization Limits the Scaling of Vision-Language-Action Models
Authors: Takuya Shiba
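Abstract summary: Scaling VLA models by upgrading the vision encoder is expected to improve downstream manipulation performance. We show that this expectation fails when actions are represented as discrete tokens. In any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck.
Reference score (independently computed attention score): 0.0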
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it--regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.
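Abstract (reference translation): Scaling VLA models by upgrading the vision encoder is expected to improve downstream manipulation performance, as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and we explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it, however rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: (1) a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; (2) an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and (3) a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings indicate that scaling in Physical AI requires identifying where the information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.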
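
The bottleneck argument in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's method or data: it assumes each pipeline stage can be summarized by a single capacity in bits, uses the standard bound that n tokens from a codebook of size K carry at most n * log2(K) bits, and treats a continuous action head as unbounded. All function names and numbers are hypothetical.

```python
import math


def codebook_capacity_bits(codebook_size: int, tokens_per_action: int) -> float:
    """Upper bound on the information in a token sequence: n * log2(K) bits."""
    return tokens_per_action * math.log2(codebook_size)


def pipeline_bound_bits(encoder_bits: float, action_bits: float) -> float:
    """Information reaching the action is capped by the tightest stage
    (a consequence of the data processing inequality)."""
    return min(encoder_bits, action_bits)


def encoder_upgrade_gain(weak_bits: float, strong_bits: float, action_bits: float) -> float:
    """Change in the pipeline bound when only the encoder is upgraded."""
    return pipeline_bound_bits(strong_bits, action_bits) - pipeline_bound_bits(weak_bits, action_bits)


# Hypothetical capacities, chosen only to make the effect visible.
WEAK_ENCODER_BITS = 40.0
STRONG_ENCODER_BITS = 120.0

continuous_head = float("inf")  # simplification: no discrete bottleneck
small_codebook = codebook_capacity_bits(codebook_size=256, tokens_per_action=8)
large_codebook = codebook_capacity_bits(codebook_size=4096, tokens_per_action=8)

gain_continuous = encoder_upgrade_gain(WEAK_ENCODER_BITS, STRONG_ENCODER_BITS, continuous_head)
gain_small = encoder_upgrade_gain(WEAK_ENCODER_BITS, STRONG_ENCODER_BITS, small_codebook)
gain_large = encoder_upgrade_gain(WEAK_ENCODER_BITS, STRONG_ENCODER_BITS, large_codebook)

print(f"continuous head gain: {gain_continuous} bits")
print(f"small codebook gain:  {gain_small} bits")
print(f"large codebook gain:  {gain_large} bits")
```

Under these assumed numbers, the encoder upgrade passes through in full with the continuous head, is cut off by the small codebook, and is partly recovered when the codebook is enlarged. This mirrors the qualitative pattern the abstract describes, but it says nothing about the measured LIBERO results.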
