Fugu-MT Paper Translation (Summary): PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

Paper summary: PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

arxiv url: http://arxiv.org/abs/2512.16793v2
Date: Wed, 04 Feb 2026 11:53:52 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.411675
Title: PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
Authors: Xiaopeng Lin, Shijie Lian, Bin Yu, Ruoqi Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Yurun Jin, Yukun Shi, Jiyan He, Cong Huang, Bojun Cheng, Kai Chen
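Abstract summary: The Egocentric2Embodiment Translation Pipeline converts raw human egocentric videos into multi-level, schema-driven embodiment supervision. Training on the E2E-3M dataset yields PhysBrain, an egocentric-aware embodied brain.
Reference score (attention score computed independently by this site): 19.558594034613996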
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. Vision Language Models (VLMs) are essential to Vision-Language-Action (VLA) systems, but the reliance on third-person training data creates a viewpoint gap for humanoid robots. Collecting massive robot-centric data is an ideal but impractical solution due to cost and diversity constraints. Conversely, human egocentric videos offer a highly scalable data source with rich interaction context, yet the embodiment mismatch prevents the direct application. To bridge this gap, we propose an Egocentric2Embodiment Translation Pipeline that transforms raw human egocentric videos into multi-level, schema-driven embodiment supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. An egocentric-aware embodied brain, termed PhysBrain, is obtained by training on the E2E-3M dataset. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning. It provides an egocentric-aware initialization that enables more sample-efficient VLA fine-tuning and higher success rates, demonstrating effective transfer from human egocentric supervision to downstream robot control.

Related papers