Fugu-MT Paper Translation (Abstract): Rethinking Implicit Spatial Representation in Visuomotor Policy Learning

Paper summary: Rethinking Implicit Spatial Representation in Visuomotor Policy Learning

arxiv url: http://arxiv.org/abs/2606.15232v1
Date: Sat, 13 Jun 2026 10:10:00 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:33.141499
Title: Rethinking Implicit Spatial Representation in Visuomotor Policy Learning
Authors: Xiangyu Chen, Yuxuan Hu, Chuhao Zhou, Jianfei Yang
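Abstract summary: Spatial softmax-based representations have been adopted in prior visuomotor policies, but their effectiveness and underlying mechanisms remain insufficiently understood. Do such implicit spatial representations provide effective and stable visual features for robotic manipulation? The authors propose PRISM, a visual encoder that preserves multiscale implicit spatial information through top-down cross-attention fusion.
Reference score (attention score computed by this site): 22.442124852908908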
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generative model-based imitation learning has become a widely adopted paradigm for robotic manipulation, where policy performance depends critically on the conditioned visual representations. Although spatial softmax-based representations have been adopted in prior visuomotor policies, their effectiveness and underlying mechanisms remain insufficiently understood. This work rethinks the use of spatial softmax pooling: do such implicit spatial representations provide effective and stable visual features for robotic manipulation? Through systematic studies of different pooling methods in visual encoders, we find that this pooling operation produces compact and stable spatial representations, which outperform feature-value representations, despite using substantially fewer dimensions. Complementary saliency analysis further suggests that these spatial representations guide the encoder to focus more consistently on task-relevant regions. However, this advantage is limited by a representation bottleneck in current visual encoders: repeated downsampling operations weaken fine-grained spatial information before the action-generation module can use it, especially under low-resolution observations. Motivated by these findings, we propose PRISM, a visual encoder that preserves multiscale implicit spatial information through top-down cross-attention fusion. Experiments across multiple tasks and policy backbones show consistent improvements. In particular, on the low-resolution, high-precision ToolHang task, PRISM shows clear gains, improving the average success rate from 5.0% to 13.4% while increasing parameters by only 15.4%. These results support the use of multiscale implicit spatial representations as an effective and efficient design principle for robotic manipulation.
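For readers unfamiliar with the pooling operation the abstract discusses, the following is a minimal NumPy sketch of a standard spatial softmax layer. It is not the authors' implementation: the function name `spatial_softmax`, the `temperature` parameter, and the [-1, 1] coordinate convention are assumptions made for illustration. The sketch shows why the representation is compact: each channel of a C x H x W feature map is reduced to one expected (x, y) location, so the output has 2C values, whereas flattening the raw feature values would give C*H*W.

```python
import numpy as np


def spatial_softmax(features, temperature=1.0):
    """Reduce a (C, H, W) feature map to (C, 2) expected (x, y) keypoints.

    Illustrative sketch only; names and the [-1, 1] coordinate range
    are assumptions, not taken from the paper.
    """
    c, h, w = features.shape

    # Softmax over all spatial positions, independently per channel.
    logits = features.reshape(c, h * w) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Pixel-centre coordinates normalised to [-1, 1].
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, h), np.linspace(-1.0, 1.0, w), indexing="ij"
    )
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)  # (H*W, 2)

    # Expected coordinate under each channel's spatial distribution.
    return weights @ coords  # (C, 2)


if __name__ == "__main__":
    fmap = np.zeros((2, 5, 5))
    fmap[0, 0, 4] = 50.0  # strong activation in the top-right corner
    fmap[1, 2, 2] = 50.0  # strong activation in the centre

    keypoints = spatial_softmax(fmap)
    print(keypoints.shape)  # (2, 2): two values per channel
    print(fmap.size)        # 50: size of the flattened feature values
```

In this toy input the first channel's keypoint lands near (1, -1) and the second near (0, 0), which is the sense in which the output encodes where a feature fires rather than how strongly.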

Related papers