Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision
- URL: http://arxiv.org/abs/2506.06253v1
- Date: Fri, 06 Jun 2025 17:25:48 GMT
- Title: Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision
- Authors: Yuping He, Yifei Huang, Guo Chen, Lidong Lu, Baoqi Pei, Jilan Xu, Tong Lu, Yoichi Sato
- Abstract summary: Perceiving the world from both egocentric (first-person) and exocentric (third-person) perspectives is fundamental to human cognition. In this survey, we provide a review of video understanding from both exocentric and egocentric viewpoints.
- Score: 35.766320269860245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Perceiving the world from both egocentric (first-person) and exocentric (third-person) perspectives is fundamental to human cognition, enabling rich and complementary understanding of dynamic environments. In recent years, allowing the machines to leverage the synergistic potential of these dual perspectives has emerged as a compelling research direction in video understanding. In this survey, we provide a comprehensive review of video understanding from both exocentric and egocentric viewpoints. We begin by highlighting the practical applications of integrating egocentric and exocentric techniques, envisioning their potential collaboration across domains. We then identify key research tasks to realize these applications. Next, we systematically organize and review recent advancements into three main research directions: (1) leveraging egocentric data to enhance exocentric understanding, (2) utilizing exocentric data to improve egocentric analysis, and (3) joint learning frameworks that unify both perspectives. For each direction, we analyze a diverse set of tasks and relevant works. Additionally, we discuss benchmark datasets that support research in both perspectives, evaluating their scope, diversity, and applicability. Finally, we discuss limitations in current works and propose promising future research directions. By synthesizing insights from both perspectives, our goal is to inspire advancements in video understanding and artificial intelligence, bringing machines closer to perceiving the world in a human-like manner. A GitHub repo of related works can be found at https://github.com/ayiyayi/Awesome-Egocentric-and-Exocentric-Vision.
Related papers
- Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions [110.43343503158306]
This paper embeds the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. Under this setting, we present InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data. We establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis.
arXiv Detail & Related papers (2025-08-06T17:46:23Z) - EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs [33.35844258541633]
EgoExoBench is the first benchmark for egocentric-exocentric video understanding and reasoning. It comprises over 7,300 question-answer pairs spanning eleven sub-tasks organized into three core challenges: semantic alignment, viewpoint association, and temporal reasoning. We evaluate 13 state-of-the-art MLLMs and find that while these models excel on single-view tasks, they struggle to align semantics across perspectives, accurately associate views, and infer temporal dynamics in the ego-exo context.
arXiv Detail & Related papers (2025-07-24T12:14:49Z) - Is Tracking really more challenging in First Person Egocentric Vision? [10.025424391350027]
Recent research has benchmarked state-of-the-art methods and concluded that first person egocentric vision presents challenges. Many of the challenging characteristics attributed to egocentric vision are also present in third person videos of human-object activities. This raises a critical question: how much of the observed performance drop stems from the unique first person viewpoint versus the domain of human-object activities?
arXiv Detail & Related papers (2025-07-21T19:25:50Z) - Challenges and Trends in Egocentric Vision: A Survey [11.593894126370724]
Egocentric vision captures visual and multimodal data through cameras or sensors worn on the human body. This paper provides a comprehensive survey of the research on egocentric vision understanding. By summarizing the latest advancements, we anticipate the broad applications of egocentric vision technologies in fields such as augmented reality, virtual reality, and embodied intelligence.
arXiv Detail & Related papers (2025-03-19T14:51:27Z) - EgoLife: Towards Egocentric Life Assistant [60.51196061794498]
We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. We conduct a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. We introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks.
arXiv Detail & Related papers (2025-03-05T18:54:16Z) - EgoMe: A New Dataset and Challenge for Following Me via Egocentric View in Real World [12.699670048897085]
In human imitation learning, the imitator typically takes the egocentric view as a benchmark, naturally transferring behaviors observed from an exocentric view to their own. We introduce EgoMe, which aims to follow the process of human imitation learning via the imitator's egocentric view in the real world. Our dataset includes 7902 paired exo-ego videos spanning diverse daily behaviors in various real-world scenarios.
arXiv Detail & Related papers (2025-01-31T11:48:22Z) - Egocentric and Exocentric Methods: A Short Survey [25.41820386246096]
Egocentric vision captures the scene from the point of view of the camera wearer. Exocentric vision captures the overall scene context. Jointly modeling ego and exo views is crucial to developing next-generation AI agents.
arXiv Detail & Related papers (2024-10-27T22:38:51Z) - Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z) - Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z) - EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question-answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show significant gaps between these models and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z) - Vision-Based Manipulators Need to Also See from Their Hands [58.398637422321976]
We study how the choice of visual perspective affects learning and generalization in the context of physical manipulation from raw sensor observations.
We find that a hand-centric (eye-in-hand) perspective affords reduced observability, but it consistently improves training efficiency and out-of-distribution generalization.
arXiv Detail & Related papers (2022-03-15T18:46:18Z)
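Several of the works above (for example, the view-invariant temporal-alignment and ego-exo transfer entries) learn a shared embedding space in which time-synchronized egocentric and exocentric clips map close together. The snippet below is a minimal, generic sketch of one such cross-view contrastive objective in PyTorch; the `ViewEncoder` module, the `cross_view_infonce` loss, and all dimensions are illustrative assumptions, not the implementation of AE2, EMBED, or any other paper listed here.

```python
# Minimal sketch of a cross-view (ego-exo) contrastive embedding objective.
# All module names and dimensions are illustrative, not taken from any paper above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewEncoder(nn.Module):
    """Projects per-clip features from one viewpoint into a shared embedding space."""
    def __init__(self, in_dim: int = 512, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so that dot products are cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def cross_view_infonce(ego: torch.Tensor, exo: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over temporally paired ego/exo clip embeddings.

    ego, exo: (B, D) embeddings of time-synchronized clips; row i of each
    tensor is assumed to come from the same moment seen from both views.
    """
    logits = ego @ exo.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(ego.size(0), device=ego.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random clip features standing in for backbone outputs.
ego_enc, exo_enc = ViewEncoder(), ViewEncoder()
ego_feats = torch.randn(8, 512)   # batch of 8 egocentric clips
exo_feats = torch.randn(8, 512)   # the 8 matching exocentric clips
loss = cross_view_infonce(ego_enc(ego_feats), exo_enc(exo_feats))
loss.backward()
```

In practice, the paired clips would come from temporally aligned ego-exo recordings such as those in the datasets listed above, and the random features would be replaced by the output of a video backbone.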