Fugu-MT Paper Translation (Summary): HSC-VLA: Hierarchical Scene-Clearing for Robust Bimanual Manipulation in Dense Clutter

Paper summary: HSC-VLA: Hierarchical Scene-Clearing for Robust Bimanual Manipulation in Dense Clutter

arxiv url: http://arxiv.org/abs/2603.07484v1
Date: Sun, 08 Mar 2026 05:58:36 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:42.052476
Title: HSC-VLA: Hierarchical Scene-Clearing for Robust Bimanual Manipulation in Dense Clutter
Authors: Zhen Liu, Xinyu Ning, Zhe Hu, XinXin Xie, Yitong Liu, Zhongzhu Pu,
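Abstract summary: HSC-VLA is a hierarchical framework that decouples high-level visual-semantic reasoning from low-level, high-frequency sensorimotor execution. In experiments on densely cluttered supermarket shelves, HSC-VLA achieves 86.7% aggregate success under high-density clutter.
Reference score (independently computed attention metric): 8.30676926154535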
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern Vision-Language-Action models often suffer from critical instruction-following failures in high-density manipulation environments, where task-irrelevant visual clutter dilutes attention, corrupts grounding, and substantially degrades performance in complex long-horizon scenarios. To overcome the representation bottleneck of monolithic end-to-end architectures, we propose HSC-VLA, a hierarchical framework that decouples high-level visual-semantic reasoning from low-level, high-frequency sensorimotor execution through an explicit scene-clearing abstraction. HSC-VLA employs a high-level Brain to decompose long-horizon tasks and to generate task-specific scene masks that preserve task-relevant geometry while suppressing distractors. The filtered observations are then passed to a low-level Cerebellum, a diffusion-based policy that performs bimanual manipulation using only mask-filtered vision and proprioception. Extensive experiments on densely cluttered supermarket shelves demonstrate that HSC-VLA achieves 86.7% aggregate success under high-density clutter, surpassing the best monolithic baseline ($\pi_0$-Full FT at 34.3%) by 52.4 percentage points. HSC-VLA also exhibits strong long-horizon performance, reaching 72% on clutter sorting and 66% on restocking, demonstrating robustness and effective failure recovery in complex cluttered manipulation.
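Illustrative sketch (not from the paper): the abstract describes scene masks that keep task-relevant geometry and suppress distractors before observations reach the low-level policy. The toy Python below shows one simple way such a filter could work, assuming a binary per-pixel mask and blanking of masked-out pixels. The name `apply_scene_mask`, the list-of-tuples image layout, and the zero fill value are all hypothetical; the abstract does not specify how HSC-VLA represents or applies its masks.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[bool]]


def apply_scene_mask(image: Image, mask: Mask, fill: Pixel = (0, 0, 0)) -> Image:
    """Keep pixels where the mask is True and overwrite all others with `fill`.

    Hypothetical helper: a binary mask with zero fill is an assumption made
    for illustration, not the method described in the paper.
    """
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("mask must have the same height and width as the image")
    return [
        [px if keep else fill for px, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


# Toy 4x4 RGB frame; the 2x2 centre block stands in for the task-relevant object.
image = [[(r, c, 255) for c in range(4)] for r in range(4)]
mask = [[1 <= r <= 2 and 1 <= c <= 2 for c in range(4)] for r in range(4)]

filtered = apply_scene_mask(image, mask)
```

In this toy setup only the four centre pixels survive filtering. A real pipeline would more likely operate on array or tensor images and might blur or inpaint distractors rather than zero them.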

Related papers