Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
- URL: http://arxiv.org/abs/2510.22672v2
- Date: Tue, 28 Oct 2025 08:39:14 GMT
- Title: Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
- Authors: Anna Deichler, Jonas Beskow
- Abstract summary: We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions.
- Score: 5.723697351415207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
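The abstract does not describe a release format or loading API; purely as an illustration of how one synchronized, annotated referential expression could be represented, here is a minimal Python sketch (all class and field names are assumptions, not the dataset's actual schema):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GazeSample:
    t: float                                # seconds on the shared recording clock
    direction: Tuple[float, float, float]   # gaze ray in the glasses' frame

@dataclass
class ReferentialExpression:
    utterance: str        # spoken referring expression (hypothetical example: "the red pepper by the bowl")
    t_start: float        # utterance onset (s)
    t_end: float          # utterance offset (s)
    target_label: str     # annotated referent (an ingredient)
    ego_video: str        # path to the Project Aria egocentric clip
    exo_video: str        # path to the stationary-camera clip
    target_xyz: Tuple[float, float, float]  # referent location in the 3D reconstruction
    gaze: List[GazeSample] = field(default_factory=list)

    def gaze_during_utterance(self) -> List[GazeSample]:
        """Gaze samples that fall inside the spoken expression."""
        return [g for g in self.gaze if self.t_start <= g.t <= self.t_end]
```

Masking the ego, exo, or 3D fields of such a record would mirror the 2D vs. 3D and ego vs. exo comparisons the benchmark is designed to support.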
Related papers
- MultiEgo: A Multi-View Egocentric Video Dataset for 4D Scene Reconstruction [23.428989479526336]
We present MultiEgo, the first multi-view egocentric dataset for 4D dynamic scene reconstruction. The dataset comprises five canonical social interaction scenes: meetings, performances, and a presentation. Experimental validation demonstrates the practical utility and effectiveness of our dataset for free-viewpoint video (FVV) applications.
arXiv Detail & Related papers (2025-12-12T05:54:19Z)
- IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants [7.869752673792282]
IndEgo is a multimodal egocentric and exocentric dataset addressing common industrial tasks. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings. A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks.
arXiv Detail & Related papers (2025-11-24T20:45:17Z)
- EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting [108.15136508964011]
EgoSplat is a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding. EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets.
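The summary above does not detail EgoSplat's query mechanism, but language-embedded splatting methods generally attach a language feature to each Gaussian and score it against the embedding of a text query. A minimal numpy sketch of that general pattern (the cosine-similarity scoring and threshold are assumptions, not EgoSplat's actual pipeline):

```python
import numpy as np

def open_vocab_select(gaussian_feats: np.ndarray,
                      text_embedding: np.ndarray,
                      threshold: float = 0.25) -> np.ndarray:
    """Boolean mask over Gaussians whose language feature matches a text query.

    gaussian_feats: (N, D) per-Gaussian language features
    text_embedding: (D,) query embedding, e.g. from a CLIP text encoder
    """
    g = gaussian_feats / np.linalg.norm(gaussian_feats, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    scores = g @ t               # cosine similarity per Gaussian
    return scores > threshold    # Gaussians to keep for this query
```

Rendering only the selected Gaussians then yields an open-vocabulary localization or segmentation of the scene.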
arXiv Detail & Related papers (2025-03-14T12:21:26Z)
- MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans [4.098892268127572]
We present a novel dataset captured using a VR headset to record conversations between participants within a physics simulator (AI2-THOR).
Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information within referential settings.
arXiv Detail & Related papers (2024-09-30T21:51:30Z)
- Grounding 3D Scene Affordance From Egocentric Interactions [52.5827242925951]
Grounding 3D scene affordance aims to locate interactive regions in 3D environments.
We introduce a novel task: grounding 3D scene affordance from egocentric interactions.
arXiv Detail & Related papers (2024-09-29T10:46:19Z)
- Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild [66.34146236875822]
The Nymeria dataset is a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices.
It contains 1,200 recordings of 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399 km.
The motion-language descriptions provide 310.5K sentences in 8.64M words from a vocabulary of 6,545 words.
arXiv Detail & Related papers (2024-06-14T10:23:53Z)
- Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos [66.46812056962567]
Exocentric-to-egocentric cross-view translation aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective.
We propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation and pixel-level hallucination.
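The abstract names the two stages but not their interfaces; the sketch below illustrates such a decoupled design in generic form (the function signatures and the intermediate layout representation are assumptions, not the paper's architecture):

```python
from typing import Callable
import numpy as np

def exo_to_ego(exo_frame: np.ndarray,
               structure_net: Callable[[np.ndarray], np.ndarray],
               hallucination_net: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    # Stage 1: map the third-person frame to a high-level egocentric
    # layout (e.g. predicted hand/object structure in the actor's view).
    ego_structure = structure_net(exo_frame)
    # Stage 2: hallucinate egocentric pixels conditioned on that layout.
    return hallucination_net(ego_structure)
```

Decoupling in this way lets the structure stage absorb the large viewpoint change, so the pixel stage only has to synthesize appearance.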
arXiv Detail & Related papers (2024-03-11T01:00:00Z)
- 3D Human Pose Perception from Egocentric Stereo Videos [67.9563319914377]
We propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation.
Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting.
We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.
arXiv Detail & Related papers (2023-12-30T21:21:54Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
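The summary does not spell out the alignment objective; a common way to relate two views of the same action in time is dynamic time warping over per-frame embeddings, sketched below as a generic illustration (not AE2's actual loss):

```python
import numpy as np

def dtw_cost(ego_emb: np.ndarray, exo_emb: np.ndarray) -> float:
    """Dynamic-time-warping cost between two per-frame embedding sequences.

    ego_emb: (T1, D) frame features of the egocentric video
    exo_emb: (T2, D) frame features of the exocentric video
    Returns the minimal cumulative distance along a monotonic alignment
    path; a lower cost means better temporal correspondence.
    """
    t1, t2 = len(ego_emb), len(exo_emb)
    dist = np.linalg.norm(ego_emb[:, None, :] - exo_emb[None, :, :], axis=-1)
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # skip an ego frame
                                                 acc[i, j - 1],      # skip an exo frame
                                                 acc[i - 1, j - 1])  # match frames
    return float(acc[t1, t2])
```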
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- Enhancing Egocentric 3D Pose Estimation with Third Person Views [37.9683439632693]
We propose a novel approach to enhance the 3D body pose estimation of a person computed from videos captured from a single wearable camera.
We introduce First2Third-Pose, a new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-person perspectives.
Experimental results demonstrate that the joint multi-view embedding space learned with our dataset is useful for extracting discriminative features from arbitrary single-view egocentric videos.
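The training objective behind the joint embedding space is not given in the summary; a standard way to learn one from paired ego/exo clips is a symmetric InfoNCE-style contrastive loss, shown here as a generic illustration rather than the paper's method:

```python
import numpy as np

def info_nce(ego: np.ndarray, exo: np.ndarray, temp: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of paired clip embeddings.

    ego, exo: (B, D) L2-normalized embeddings; row i of each matrix is the
    same clip seen from the two viewpoints, so the diagonal holds positives.
    """
    logits = ego @ exo.T / temp                       # cross-view similarities

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(logp)))

    return 0.5 * (xent(logits) + xent(logits.T))
```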
arXiv Detail & Related papers (2022-01-06T11:42:01Z)