Fugu-MT Paper Translation (Summary): Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception

Paper summary: Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception

arxiv url: http://arxiv.org/abs/2511.15279v1
Date: Wed, 19 Nov 2025 09:42:08 GMT
Status: Translation complete
System update date: 2025-11-20 15:51:28.737082
Title: Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception
Title (reference translation): Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception
Authors: Jiashu Yang, Yifan Han, Yucheng Xie, Ning Guo, Wenzhao Lian,
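Abstract summary: Existing vision models and fixed RGB-D camera systems cannot reconcile wide-area coverage with fine-grained detail acquisition. This work proposes EyeVLA, a robotic eyeball for active visual perception.
Reference score (independently computed attention measure): 8.542874528320004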
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using the 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision language action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently performs instructed scenes in real-world environments and actively acquires more accurate visual information through instruction-driven actions of rotation and zoom, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed and spatially rich, large-scale embodied data, and actively acquires highly informative visual observations for downstream embodied tasks.
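Abstract (reference translation): The goal is not to passively process static images but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems cannot reconcile wide-area coverage with fine-grained detail acquisition, which severely limits their effectiveness in open-world robotics applications. To address this, we propose EyeVLA, a robotic eyeball for active visual perception. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that have strong open-world understanding, enabling joint modeling of vision, language, and action within a single autoregressive sequence. By using 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, the open-world scene understanding capability of the VLM is transferred to a vision-language-action (VLA) policy using only minimal real-world data. Experiments show that the system efficiently handles instructed scenes in real-world environments and actively acquires more accurate visual information through instruction-driven rotation and zoom actions, yielding strong environmental perception. EyeVLA introduces a new robotic vision system that leverages detailed, spatially rich, large-scale embodied data and actively acquires highly informative visual observations for downstream embodied tasks.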
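
The abstract says that EyeVLA discretizes rotation and zoom actions into action tokens that a VLM can emit in the same sequence as text. It does not describe the binning scheme, so the following is only a minimal sketch of one common approach (uniform binning per axis), not the authors' implementation. The axis names, value ranges, bin counts, and the `<pan_12>`-style token format are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionAxis:
    """One continuous control axis mapped onto a fixed number of uniform bins."""
    name: str     # hypothetical axis name used in the token text
    low: float    # assumed lower limit of the axis
    high: float   # assumed upper limit of the axis
    bins: int     # assumed number of discrete tokens for this axis

    def encode(self, value: float) -> str:
        """Clip the value into range and return its bin as a token string."""
        clipped = min(max(value, self.low), self.high)
        frac = (clipped - self.low) / (self.high - self.low)
        index = min(int(frac * self.bins), self.bins - 1)
        return f"<{self.name}_{index}>"

    def decode(self, token: str) -> float:
        """Return the center of the bin named by the token."""
        name, index = token.strip("<>").rsplit("_", 1)
        if name != self.name:
            raise ValueError(f"token {token!r} does not belong to axis {self.name!r}")
        width = (self.high - self.low) / self.bins
        return self.low + (int(index) + 0.5) * width


# Hypothetical limits: pan and tilt in degrees, zoom as a magnification factor.
AXES = (
    ActionAxis("pan", -90.0, 90.0, 180),
    ActionAxis("tilt", -45.0, 45.0, 90),
    ActionAxis("zoom", 1.0, 10.0, 90),
)


def encode_action(pan: float, tilt: float, zoom: float) -> list[str]:
    """Turn one continuous (pan, tilt, zoom) command into three action tokens."""
    return [axis.encode(v) for axis, v in zip(AXES, (pan, tilt, zoom))]


def decode_action(tokens: list[str]) -> tuple[float, ...]:
    """Map three action tokens back to approximate continuous values."""
    return tuple(axis.decode(t) for axis, t in zip(AXES, tokens))
```

Under these assumptions, tokens such as these would be added to the VLM vocabulary so that a predicted token triple can be decoded into a camera command. Decoding returns bin centers, so the round trip is accurate to within half a bin width per axis.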

Related papers