4D Attention: Comprehensive Framework for Spatio-Temporal Gaze Mapping
- URL: http://arxiv.org/abs/2107.03606v1
- Date: Thu, 8 Jul 2021 04:55:18 GMT
- Title: 4D Attention: Comprehensive Framework for Spatio-Temporal Gaze Mapping
- Authors: Shuji Oishi, Kenji Koide, Masashi Yokozuka, Atsuhiko Banno
- Abstract summary: This study presents a framework for capturing human attention in the spatio-temporal domain using eye-tracking glasses.
We estimate the glasses pose by leveraging a loose coupling of direct visual localization and Inertial Measurement Unit (IMU) values.
By installing reconstruction components into our framework, dynamic objects not captured in the 3D environment map are instantiated based on the input images.
- Score: 4.215251065887861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study presents a framework for capturing human attention in the
spatio-temporal domain using eye-tracking glasses. Attention mapping is a key
technology for human perceptual activity analysis or Human-Robot Interaction
(HRI) to support human visual cognition; however, measuring human attention in
dynamic environments is challenging owing to the difficulty in localizing the
subject and dealing with moving objects. To address this, we present a
comprehensive framework, 4D Attention, for unified gaze mapping onto static and
dynamic objects. Specifically, we estimate the glasses pose by leveraging a
loose coupling of direct visual localization and Inertial Measurement Unit
(IMU) values. Further, by installing reconstruction components into our
framework, dynamic objects not captured in the 3D environment map are
instantiated based on the input images. Finally, a scene rendering component
synthesizes a first-person view with identification (ID) textures and performs
direct 2D-3D gaze association. Quantitative evaluations showed the
effectiveness of our framework. Additionally, we demonstrated the applications
of 4D Attention through experiments in real situations.
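To make the gaze-mapping step concrete, the sketch below illustrates one way the ID-texture association described in the abstract could work: each object is rendered into a first-person "ID image" whose pixel values encode object identities, and a 2D gaze point is linked to a 3D object by reading the ID under that pixel. This is a minimal, hypothetical Python sketch, not the authors' implementation; the ID encoding, the image layout, and the `map_gaze_to_object` / `accumulate_attention` helpers are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical sketch of ID-texture-based 2D-3D gaze association (not the paper's code).
# Assumptions: object IDs are stored as integers in a rendered "ID image" aligned with
# the eye tracker's scene camera, and gaze arrives as normalized (u, v) coordinates.

def map_gaze_to_object(id_image: np.ndarray, gaze_uv: tuple) -> int:
    """Return the object ID under the gaze point, or -1 for background."""
    h, w = id_image.shape[:2]
    u, v = gaze_uv
    # Clamp the gaze point to valid pixel coordinates.
    x = int(np.clip(u * (w - 1), 0, w - 1))
    y = int(np.clip(v * (h - 1), 0, h - 1))
    object_id = int(id_image[y, x])
    return object_id if object_id > 0 else -1

def accumulate_attention(id_images, gaze_points, dt: float) -> dict:
    """Accumulate per-object dwell time over a sequence of frames."""
    dwell = {}
    for id_image, gaze_uv in zip(id_images, gaze_points):
        obj = map_gaze_to_object(id_image, gaze_uv)
        if obj >= 0:
            dwell[obj] = dwell.get(obj, 0.0) + dt
    return dwell

# Example: a 4x4 ID image in which object 7 occupies the right half.
demo = np.zeros((4, 4), dtype=np.int32)
demo[:, 2:] = 7
print(map_gaze_to_object(demo, (0.9, 0.5)))  # -> 7
```

In the actual framework, the ID image would be produced by the scene rendering component that synthesizes the first-person view, and the dwell-time accumulation would feed the spatio-temporal attention map; here both are stubbed with NumPy arrays only to show the data flow.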
Related papers
- Reconstructing 4D Spatial Intelligence: A Survey [57.8684548664209]
Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision.
We present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence.
arXiv Detail & Related papers (2025-07-28T17:59:02Z)
- Object Concepts Emerge from Motion [24.73461163778215]
We propose a biologically inspired framework for learning object-centric visual representations in an unsupervised manner.
Our key insight is that motion boundaries serve as a strong signal for object-level grouping.
Our framework is fully label-free and does not rely on camera calibration, making it scalable to large-scale unstructured video data.
arXiv Detail & Related papers (2025-05-27T18:09:02Z)
- Object Learning and Robust 3D Reconstruction [7.092348056331202]
We discuss architectural designs and training methods for a neural network to dissect an image into objects of interest without supervision.
FlowCapsules uses motion as a cue for the objects of interest in 2D scenarios.
We leverage the geometric consistency of scenes in 3D to detect the inconsistent dynamic objects.
arXiv Detail & Related papers (2025-04-22T21:48:31Z)
- Bringing Objects to Life: 4D generation from 3D objects [31.533802484121182]
We introduce a method for animating user-provided 3D objects by conditioning on textual prompts to guide 4D generation.
Our method achieves up to threefold improvements in identity preservation measured using LPIPS scores.
arXiv Detail & Related papers (2024-12-29T10:12:01Z)
- Grounding 3D Scene Affordance From Egocentric Interactions [52.5827242925951]
Grounding 3D scene affordance aims to locate interactive regions in 3D environments.
We introduce a novel task: grounding 3D scene affordance from egocentric interactions.
arXiv Detail & Related papers (2024-09-29T10:46:19Z)
- Interaction-Driven Active 3D Reconstruction with Object Interiors [17.48872400701787]
We introduce an active 3D reconstruction method which integrates visual perception, robot-object interaction, and 3D scanning.
Our method operates fully automatically by a Fetch robot with built-in RGBD sensors.
arXiv Detail & Related papers (2023-10-23T08:44:38Z)
- ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding [67.21613160846299]
Embodied Reference Understanding (ERU) is first designed to address this concern.
A new dataset called ScanERU is constructed to evaluate the effectiveness of this idea.
arXiv Detail & Related papers (2023-03-23T11:36:14Z)
- Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region feature of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z)
- 3D Object Aided Self-Supervised Monocular Depth Estimation [5.579605877061333]
We propose a new method to address dynamic object movements through monocular 3D object detection.
Specifically, we first detect 3D objects in the images and build the per-pixel correspondence of the dynamic pixels with the detected object pose.
In this way, the depth of every pixel can be learned via a meaningful geometry model.
arXiv Detail & Related papers (2022-12-04T08:52:33Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation [76.58256020932312]
Estimating the motion of the camera together with the 3D structure of the scene from a monocular vision system is a complex task.
We present a self-supervised learning framework for 3D object motion field estimation from monocular videos.
arXiv Detail & Related papers (2021-10-13T16:45:01Z)
- Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is to reason about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z)
- Kinematic 3D Object Detection in Monocular Video [123.7119180923524]
We propose a novel method for monocular video-based 3D object detection which carefully leverages kinematic motion to improve precision of 3D localization.
We achieve state-of-the-art performance on monocular 3D object detection and the Bird's Eye View tasks within the KITTI self-driving dataset.
arXiv Detail & Related papers (2020-07-19T01:15:12Z)
- 3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans [27.747241700017728]
We present a unified representation for actionable spatial perception: 3D Dynamic Scene Graphs.
3D Dynamic Scene Graphs can have a profound impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction.
arXiv Detail & Related papers (2020-02-15T00:46:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.