Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection
- URL: http://arxiv.org/abs/2504.19271v1
- Date: Sun, 27 Apr 2025 14:59:13 GMT
- Title: Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection
- Authors: Athul M. Mathew, Arshad Ali Khan, Thariq Khalid, Faroq AL-Tam, Riad Souissi
- Abstract summary: We propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Gaze target detection (GTD) is the task of predicting where a person in an image is looking. This is a challenging task, as it requires the ability to understand the relationship between the person's head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We then extract a depth-infused saliency module map, which highlights the most salient (attention-grabbing) regions in the image for the subject under consideration. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target. We quantitatively evaluated our method, including an ablation analysis, on three publicly available datasets, namely VideoAttentionTarget, GazeFollow and GOO-Real, and showed that it outperforms other state-of-the-art methods. This suggests that our method is a promising new approach for GTD.
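The abstract describes a four-cue pipeline: a monocular depth estimate lifts the 2D image toward 3D, a depth-infused saliency map highlights likely targets, and face and depth modalities are fused with the scene to localize the gaze target. The snippet below is a minimal sketch, assuming a PyTorch setup, of how such a multi-modal fusion head could be wired up; the module names, layer sizes, and heatmap resolution are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch of a multi-modal gaze-target fusion head (illustrative
# assumptions only, not the paper's actual architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeTargetFusion(nn.Module):
    def __init__(self, feat_dim=128, out_size=64):
        super().__init__()
        def enc(in_ch):
            # Tiny stand-in encoder; a real system would use a pretrained backbone.
            return nn.Sequential(nn.Conv2d(in_ch, feat_dim, 7, stride=4, padding=3), nn.ReLU())
        self.scene_enc = enc(3)   # RGB scene
        self.depth_enc = enc(1)   # monocular depth map
        self.sal_enc = enc(1)     # depth-infused saliency map
        self.face_enc = nn.Sequential(enc(3), nn.AdaptiveAvgPool2d(1), nn.Flatten())  # head crop
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 1, 1),
        )
        self.out_size = out_size

    def forward(self, scene, depth, saliency, face):
        s, d, a = self.scene_enc(scene), self.depth_enc(depth), self.sal_enc(saliency)
        f = self.face_enc(face)                              # (B, C) global gaze cue from the face
        f = f[:, :, None, None].expand(-1, -1, s.shape[2], s.shape[3])
        heatmap = self.fuse(torch.cat([s, d, a, f], dim=1))  # fuse all extracted modalities
        return F.interpolate(heatmap, size=(self.out_size, self.out_size),
                             mode="bilinear", align_corners=False)

# Toy usage: the peak of the predicted heatmap is the gaze target.
model = GazeTargetFusion()
out = model(torch.randn(1, 3, 224, 224),   # scene RGB
            torch.randn(1, 1, 224, 224),   # depth from a monocular estimator
            torch.randn(1, 1, 224, 224),   # depth-infused saliency
            torch.randn(1, 3, 96, 96))     # face crop of the subject
print(out.shape)  # torch.Size([1, 1, 64, 64])
```

In this sketch the face branch is collapsed to a global vector and broadcast over the scene grid, a common way to condition a scene heatmap on person-specific cues; the paper's actual conditioning and saliency modules are described in the abstract and full text.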
Related papers
- Upper-Body Pose-based Gaze Estimation for Privacy-Preserving 3D Gaze Target Detection [19.478147736434394]
Existing approaches heavily rely on analyzing the person's appearance, primarily focusing on their face to predict the gaze target.
This paper presents a novel approach by utilizing the person's upper-body pose and available depth maps to extract a 3D gaze direction.
We demonstrate state-of-the-art results on the most comprehensive publicly accessible 3D gaze target detection dataset.
arXiv Detail & Related papers (2024-09-26T14:35:06Z)
- What You See Is What You Detect: Towards better Object Densification in 3D detection [2.3436632098950456]
The widely used full-shape completion approach actually leads to a higher error upper bound, especially for faraway objects and small objects like pedestrians.
We introduce a visible part completion method that requires only 11.3% of the prediction points that previous methods generate.
To recover the dense representation, we propose a mesh-deformation-based method to augment the point set associated with visible foreground objects.
arXiv Detail & Related papers (2023-10-27T01:46:37Z)
- Multimodal Across Domains Gaze Target Detection [18.41238482101682]
This paper addresses the gaze target detection problem in single images captured from the third-person perspective.
We present a multimodal deep architecture to infer where a person in a scene is looking.
arXiv Detail & Related papers (2022-08-23T09:09:00Z)
- Probabilistic and Geometric Depth: Detecting Objects in Perspective [78.00922683083776]
3D object detection is an important capability needed in various practical applications such as driver assistance systems.
Monocular 3D detection, as an economical solution compared to conventional settings relying on binocular vision or LiDAR, has drawn increasing attention recently but still yields unsatisfactory results.
This paper first presents a systematic study on this problem and observes that the current monocular 3D detection problem can be simplified as an instance depth estimation problem.
arXiv Detail & Related papers (2021-07-29T16:30:33Z)
- Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised (a generic worked example of the underlying pinhole relation is sketched after this list).
Our method remarkably improves the detection performance of the state-of-the-art monocular-based method by 2.80% on the moderate test setting, without extra data.
arXiv Detail & Related papers (2021-07-29T12:30:39Z)
- Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection [86.25022248968908]
We learn context- and depth-aware feature representation to solve the problem of monocular 3D object detection.
We show state-of-the-art results among the monocular-based approaches on the KITTI benchmark dataset.
arXiv Detail & Related papers (2021-03-30T16:20:24Z)
- Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images [69.5662419067878]
Grounding referring expressions in RGBD images has been an emerging field.
We present a novel task of 3D visual grounding in single-view RGBD image where the referred objects are often only partially scanned due to occlusion.
Our approach first fuses the language and the visual features at the bottom level to generate a heatmap that localizes the relevant regions in the RGBD image.
Then our approach conducts an adaptive feature learning based on the heatmap and performs the object-level matching with another visio-linguistic fusion to finally ground the referred object.
arXiv Detail & Related papers (2021-03-14T11:18:50Z)
- Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce a symmetric gated fusion strategy to exploit the extracted multi-modal features effectively (a generic gated-fusion sketch follows after this list).
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
- Cross-Modality 3D Object Detection [63.29935886648709]
We present a novel two-stage multi-modal fusion network for 3D object detection.
The whole architecture facilitates two-stage fusion.
Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
arXiv Detail & Related papers (2020-08-16T11:01:20Z)
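As flagged in the geometry-guided depth entry above, projective modeling of depth in its simplest form reduces to the pinhole-camera relation between an object's physical height, its image height, and its depth. The small example below illustrates that generic relation only, not the referenced paper's exact formulation.

```python
# Pinhole-camera depth from projective geometry (generic illustration,
# not the referenced paper's exact formulation).
# An object of physical height H metres spanning h pixels in an image taken
# with focal length f (in pixels) lies at depth z = f * H / h.
def depth_from_height(focal_px: float, height_m: float, height_px: float) -> float:
    return focal_px * height_m / height_px

# Example: a 1.6 m pedestrian spanning 120 px under a 720 px focal length
# is roughly 9.6 m away.
print(depth_from_height(720.0, 1.6, 120.0))  # 9.6
```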
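Several entries above, ACMNet in particular, fuse features from multiple modalities with gating. The sketch below shows a generic symmetric gated fusion of two feature maps; the layer choices are assumptions for illustration and do not reproduce ACMNet's actual design.

```python
# Generic symmetric gated fusion of two modality feature maps (e.g. RGB and
# depth features). Illustrative only; not the ACMNet implementation.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Each gate is computed from both modalities, so the two branches
        # are treated symmetrically and re-weight each other.
        self.gate_a = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.gate_b = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, feat_a, feat_b):
        joint = torch.cat([feat_a, feat_b], dim=1)
        return self.gate_a(joint) * feat_a + self.gate_b(joint) * feat_b

rgb_feat = torch.randn(1, 64, 32, 32)
depth_feat = torch.randn(1, 64, 32, 32)
print(GatedFusion(64)(rgb_feat, depth_feat).shape)  # torch.Size([1, 64, 32, 32])
```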
This list is automatically generated from the titles and abstracts of the papers on this site.