Fugu-MT Paper Translation (Abstract): SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection

Paper summary: SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection

arxiv url: http://arxiv.org/abs/2605.11114v1
Date: Mon, 11 May 2026 18:23:04 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.351716
Title: SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection
Title (reference translation): SEVO: Semantic Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection
Authors: Tianchonghui Fang, Yuan Zhuang, Fei Miao
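Abstract summary: We present SEVO, a data-centric approach that improves cross-environment manipulation without modifying the policy architecture. We show that a diversified data collection protocol is the single most important factor for generalization. Our results demonstrate that principled observation design and environmental diversity during data collection, not model scaling, enable low-cost robots to operate reliably in everyday living environments.
Reference score (independently computed attention score): 10.91583588660094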
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) and imitation-learning policies trained via community toolchains on low-cost hardware frequently fail when deployed outside the training environment. Existing evaluations, including the original ACT and SmolVLA benchmarks, demonstrate high success rates under controlled, fixed backgrounds, yet community practitioners report near-zero transfer to new environments. We present SEVO (Semantic-Enhanced Virtual Observation), a data-centric approach that improves cross-environment manipulation robustness without modifying the policy architecture. SEVO transforms the raw RGB camera stream through three mechanisms: (1) body-fixed cameras whose combined fields of view cover the full manipulation workspace, (2) active red-spectrum illumination that physically normalizes object appearance, and (3) real-time YOLO segmentation overlay that provides a background-invariant semantic cue. Critically, we show that a diversified data collection protocol (systematically varying lighting, backgrounds, and distractors during teleoperation) is the single most important factor for generalization. We target transparent water bottles, objects that visually blend with their surroundings, and select a simple pick-and-place task to enable hundreds of controlled real-robot trials across two mobile platforms. The full pipeline achieves 95% grasp success with ACT and 83% with SmolVLA in the training environment, transferring to novel environments at 85% and 75%. Without SEVO, the same policies achieve only 75%/70% in training and collapse to 30-35% in novel environments. Our results demonstrate that principled observation design and environmental diversity during data collection, not model scaling, enable low-cost robots to operate reliably in everyday household environments.
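Abstract (reference translation): Vision-Language-Action (VLA) and imitation-learning policies trained via community toolchains frequently fail when deployed outside the training environment. Existing evaluations, including the ACT and SmolVLA benchmarks, show high success rates under controlled, fixed backgrounds, yet community practitioners report near-zero transfer to new environments. We present SEVO (Semantic-Enhanced Virtual Observation), a data-centric approach. SEVO transforms the raw RGB camera stream through three mechanisms: (1) body-fixed cameras whose combined fields of view cover the workspace, (2) active red-spectrum illumination that physically normalizes object appearance, and (3) a real-time YOLO segmentation overlay that provides a background-invariant segmentation cue. Importantly, we show that a diversified data collection protocol (varying lighting, backgrounds, and distractors during teleoperation) is the single most important factor for generalization. We target transparent water bottles, objects that blend visually with their surroundings, and select a simple pick-and-place task to enable hundreds of controlled real-robot trials on two mobile platforms. The full pipeline achieves 95% with ACT and 83% with SmolVLA, transferring to new environments at 85% and 75%. Without SEVO, the same policies achieve only 75%/70% in training and collapse to 30-35% in new environments. This work demonstrates that principled observation design and environmental diversity during data collection, not model scaling, enable low-cost robots to operate reliably in everyday living environments.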
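Illustrative sketch (not from the paper): the abstract's third mechanism, a segmentation overlay on the RGB stream, could look roughly like the alpha blend below. It assumes a binary mask is already available from some segmentation model; the function name `overlay_mask`, the tint color, and the blend weight `alpha` are hypothetical choices, and the authors' actual overlay may differ.

```python
import numpy as np

def overlay_mask(frame, mask, color=(255, 0, 0), alpha=0.5):
    """Alpha-blend `color` onto the pixels of `frame` selected by `mask`.

    frame: HxWx3 uint8 RGB image.
    mask:  HxW boolean (or 0/1) array, assumed to come from a segmentation model.
    color and alpha are illustrative defaults, not values from the paper.
    """
    if frame.shape[:2] != mask.shape:
        raise ValueError("mask must match the frame's height and width")
    out = frame.astype(np.float32)
    tint = np.asarray(color, dtype=np.float32)
    selected = mask.astype(bool)
    # Blend only the masked pixels; background pixels are left unchanged.
    out[selected] = (1.0 - alpha) * out[selected] + alpha * tint
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

# Usage: tint a 2x2 object region in a uniform gray 4x4 frame.
frame = np.full((4, 4, 3), 100, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
blended = overlay_mask(frame, mask, color=(200, 0, 0), alpha=0.5)
```

Because only masked pixels change, the cue stays attached to the object regardless of what the background looks like, which is the property the abstract calls background-invariant.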


Related papers