Fugu-MT paper translation (abstract): Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video

Paper overview: Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video

arxiv url: http://arxiv.org/abs/2603.13912v1
Date: Sat, 14 Mar 2026 12:00:55 GMT
Status: translation complete
System update date: 2026-03-17 16:19:35.484659
Title: Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
Title (reference translation): Toward Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
Authors: Yuting Tan, Xilong Cheng, Yunxiao Qin, Zhengnan Li, Jingjing Zhang
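Abstract summary: This work proposes a unified vision Transformer framework for learning stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects". EgoViT achieves a +8.0% CorLoc improvement in unsupervised object discovery and a +4.8% mIoU improvement in semantic segmentation.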
Reference score (attention score computed independently by this site): 8.642846048553041
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans develop visual intelligence through perceiving and interacting with their environment, a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle where initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and exhibits robustness to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves a +8.0% CorLoc improvement in unsupervised object discovery and a +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.
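Abstract (reference translation): Humans develop visual intelligence by perceiving and interacting with their environment, a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how stable object representations can be learned from continuous, uncurated first-person video without relying on manual annotation. This setting raises the challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. EgoViT is a unified vision Transformer framework for learning stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This produces a virtuous cycle in which initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person video and is robust to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves a +8.0% CorLoc improvement in unsupervised object discovery and a +4.8% mIoU improvement in semantic segmentation, showing its potential to lay a foundation for robust visual abstraction in embodied intelligence.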
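
Note on the reported metrics: the abstract quotes gains in CorLoc and mIoU but does not describe the evaluation protocol. The sketch below shows how these two metrics are commonly computed, assuming one predicted box per image and an IoU threshold of 0.5 for CorLoc. It is not taken from the paper, and the function names (`box_iou`, `corloc`, `mean_iou`) are hypothetical.

```python
# Illustrative sketch only: common definitions of CorLoc and mIoU,
# not the evaluation code used for EgoViT.

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(pred_boxes, gt_boxes_per_image, threshold=0.5):
    """Fraction of images whose single predicted box overlaps at least one
    ground-truth box with IoU >= threshold (0.5 is an assumed default)."""
    if not pred_boxes:
        return 0.0
    hits = sum(
        any(box_iou(p, g) >= threshold for g in gts)
        for p, gts in zip(pred_boxes, gt_boxes_per_image)
    )
    return hits / len(pred_boxes)


def mean_iou(pred_labels, gt_labels, num_classes):
    """Mean per-class IoU over flat per-pixel label sequences.
    Classes absent from both prediction and ground truth are skipped."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and g == c for p, g in zip(pred_labels, gt_labels))
        union = sum(p == c or g == c for p, g in zip(pred_labels, gt_labels))
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Under these definitions, a "+8.0% CorLoc improvement" would mean more images with a correctly localized object, and a "+4.8% mIoU improvement" would mean better average per-class overlap. The abstract does not state whether these figures are absolute percentage points or relative gains.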


Related papers