Multi-view Gaze Target Estimation
- URL: http://arxiv.org/abs/2508.05857v1
- Date: Thu, 07 Aug 2025 21:15:15 GMT
- Title: Multi-view Gaze Target Estimation
- Authors: Qiaomu Miao, Vivek Raju Golani, Jingyi Xu, Progga Paromita Dutta, Minh Hoai, Dimitris Samaras
- Abstract summary: This paper presents a method that utilizes multiple camera views for the gaze target estimation (GTE) task. The approach integrates information from different camera views to improve accuracy and expand applicability. The paper introduces a multi-view dataset for developing and evaluating multi-view GTE methods.
- Score: 42.72151752119855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a method that utilizes multiple camera views for the gaze target estimation (GTE) task. The approach integrates information from different camera views to improve accuracy and expand applicability, addressing limitations in existing single-view methods that face challenges such as face occlusion, target ambiguity, and out-of-view targets. Our method processes a pair of camera views as input, incorporating a Head Information Aggregation (HIA) module for leveraging head information from both views for more accurate gaze estimation, an Uncertainty-based Gaze Selection (UGS) module for identifying the most reliable gaze output, and an Epipolar-based Scene Attention (ESA) module for cross-view background information sharing. This approach significantly outperforms single-view baselines, especially when the second camera provides a clear view of the person's face. Additionally, our method can estimate the gaze target in the first view using the image of the person in the second view only, a capability not possessed by single-view GTE methods. Furthermore, the paper introduces a multi-view dataset for developing and evaluating multi-view GTE methods. Data and code are available at https://www3.cs.stonybrook.edu/~cvl/multiview_gte.html
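The epipolar constraint behind the ESA module and the selection rule behind UGS can be illustrated with a minimal sketch. Here the fundamental matrix `F`, the function names, and the uncertainty scores are illustrative assumptions; this is not the authors' implementation, only the standard geometry and selection logic the abstract alludes to:

```python
import numpy as np

def epipolar_line(F, x):
    """Epipolar line l' = F @ [u, v, 1] in view 2 for pixel x = (u, v) in view 1.

    An ESA-style module would restrict cross-view attention to scene features
    near this line. F is a hypothetical 3x3 fundamental matrix relating the
    two cameras; the line is normalized so point_line_distance returns a
    distance in pixels.
    """
    a, b, c = np.asarray(F, dtype=float) @ np.array([x[0], x[1], 1.0])
    return np.array([a, b, c]) / np.hypot(a, b)

def point_line_distance(line, x):
    """Signed pixel distance from point x = (u, v) to a normalized line."""
    return line[0] * x[0] + line[1] * x[1] + line[2]

def select_gaze(gaze_a, sigma_a, gaze_b, sigma_b):
    """UGS-style selection: keep the gaze output with the lower predicted
    uncertainty (sigma). The paper's uncertainty estimates are learned;
    this shows only the selection rule."""
    return gaze_a if sigma_a <= sigma_b else gaze_b
```

For a rectified stereo pair (pure horizontal translation), `F = [[0,0,0],[0,0,-1],[0,1,0]]` maps a pixel to the horizontal line at the same image row, so candidate attention targets in the second view lie along that row.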
Related papers
- Egocentric Gaze Estimation via Neck-Mounted Camera [27.513961366278455]
This paper introduces neck-mounted view gaze estimation, a new task that estimates user gaze from the neck-mounted camera perspective. We collect the first dataset for this task, consisting of approximately 4 hours of video collected from 8 participants during everyday activities. We propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach that jointly trains head-view and neck-view models.
arXiv Detail & Related papers (2026-02-12T07:41:27Z) - MVTOP: Multi-View Transformer-based Object Pose-Estimation [4.485458895311131]
We present MVTOP, a novel transformer-based method for multi-view rigid object pose estimation. Our method can resolve pose ambiguities that would be impossible to solve with a single view or with a post-processing of single-view poses.
arXiv Detail & Related papers (2025-08-05T09:21:14Z) - Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos [66.1935609072708]
LangView is a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. During inference, our model takes as input only a multi-view video--no language or camera poses--and returns the best viewpoint to watch at each timestep.
arXiv Detail & Related papers (2024-11-13T16:31:08Z) - Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval [85.73149096516543]
We address the choice of viewpoint during sketch creation in Fine-Grained Sketch-Based Image Retrieval (FG-SBIR).
A pilot study highlights the system's struggle when query-sketches differ in viewpoint from target instances.
To reconcile this, we advocate for a view-aware system, seamlessly accommodating both view-agnostic and view-specific tasks.
arXiv Detail & Related papers (2024-07-01T21:20:44Z) - UVAGaze: Unsupervised 1-to-2 Views Adaptation for Gaze Estimation [10.412375913640224]
We propose a novel 1-view-to-2-views (1-to-2 views) adaptation solution for gaze estimation.
Our method adapts a traditional single-view gaze estimator for flexibly placed dual cameras.
Experiments show that a single-view estimator, when adapted for dual views, can achieve much higher accuracy, especially in cross-dataset settings.
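Accuracy in this line of gaze-estimation work is typically reported as mean angular error between predicted and ground-truth gaze directions. A minimal sketch of that standard metric (a general evaluation convention, not code from UVAGaze):

```python
import numpy as np

def angular_error_deg(g_pred, g_true):
    """Angle in degrees between predicted and ground-truth 3D gaze
    direction vectors -- the standard gaze-estimation error metric."""
    g_pred = np.asarray(g_pred, dtype=float)
    g_true = np.asarray(g_true, dtype=float)
    cos = np.dot(g_pred, g_true) / (np.linalg.norm(g_pred) * np.linalg.norm(g_true))
    # clip guards against floating-point values slightly outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```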
arXiv Detail & Related papers (2023-12-25T08:13:28Z) - A Simple Baseline for Supervised Surround-view Depth Estimation [25.81521612343612]
We propose S3Depth, a Simple Baseline for Supervised Surround-view Depth Estimation.
We employ a global-to-local feature extraction module which combines CNN with transformer layers for enriched representations.
Our method achieves superior performance over existing state-of-the-art methods on both DDAD and nuScenes datasets.
arXiv Detail & Related papers (2023-03-14T10:06:19Z) - Learning to Select Camera Views: Efficient Multiview Understanding at Few Glances [59.34619548026885]
We propose a view selection approach that analyzes the target object or scenario from given views and selects the next best view for processing.
Our approach features a reinforcement learning based camera selection module, MVSelect, that not only selects views but also facilitates joint training with the task network.
arXiv Detail & Related papers (2023-03-10T18:59:10Z) - SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation [101.55622133406446]
We propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.
In experiments, our method achieves state-of-the-art performance on challenging multi-camera depth estimation datasets.
arXiv Detail & Related papers (2022-04-07T17:58:47Z) - Learning Implicit 3D Representations of Dressed Humans from Sparse Views [31.584157304372425]
We propose an end-to-end approach that learns an implicit 3D representation of dressed humans from sparse camera views.
In the experiments, we show the proposed approach outperforms the state of the art on standard data both quantitatively and qualitatively.
arXiv Detail & Related papers (2021-04-16T10:20:26Z) - Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.