A Modular Multimodal Architecture for Gaze Target Prediction: Application to Privacy-Sensitive Settings
- URL: http://arxiv.org/abs/2307.05158v1
- Date: Tue, 11 Jul 2023 10:30:33 GMT
- Title: A Modular Multimodal Architecture for Gaze Target Prediction: Application to Privacy-Sensitive Settings
- Authors: Anshul Gupta, Samy Tafasca, Jean-Marc Odobez
- Abstract summary: We propose a modular multimodal architecture that combines multimodal cues using an attention mechanism.
The architecture can naturally be exploited in privacy-sensitive situations such as surveillance and health, where personally identifiable information cannot be released.
- Score: 18.885623017619988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting where a person is looking is a complex task: it requires
understanding not only the person's gaze and the scene content, but also the 3D scene
structure and the person's situation (are they manipulating an object? interacting
with or observing others? attentive?) to detect obstructions in the line of sight or
apply attention priors that humans typically have when observing others. In
this paper, we hypothesize that identifying and leveraging such priors can be
better achieved through the exploitation of explicitly derived multimodal cues
such as depth and pose. We thus propose a modular multimodal architecture
that combines these cues using an attention mechanism. The architecture
can naturally be exploited in privacy-sensitive situations such as surveillance
and health, where personally identifiable information cannot be released. We
perform extensive experiments on the GazeFollow and VideoAttentionTarget public
datasets, obtaining state-of-the-art performance and demonstrating very
competitive results in the privacy-sensitive setting.
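The abstract describes the fusion step only at a high level. Below is a minimal PyTorch sketch of attention-weighted fusion over per-modality embeddings; the module name, shapes, and the three-cue setup are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Hypothetical sketch: fuse per-modality embeddings (e.g. image,
    depth, pose) with a softmax attention over the modalities."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one attention logit per modality

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_modalities, dim), one embedding per cue.
        logits = self.scorer(feats).squeeze(-1)            # (batch, M)
        weights = torch.softmax(logits, dim=-1)            # attention over cues
        return (weights.unsqueeze(-1) * feats).sum(dim=1)  # (batch, dim)

# Usage: three cues (image, depth, pose) already encoded to 256-d vectors.
fusion = ModalityFusion(dim=256)
cues = torch.randn(4, 3, 256)
print(fusion(cues).shape)  # torch.Size([4, 256])
```

Because each cue enters through its own encoder and receives its own weight, a modality carrying identifiable information (such as the RGB face crop) can in principle be withheld at inference, which is one way a modular design of this kind can support the privacy-sensitive setting.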
Related papers
- Doxing via the Lens: Revealing Privacy Leakage in Image Geolocation for Agentic Multi-Modal Large Reasoning Model [31.245820767782792]
ChatGPT o3 can predict user locations with high precision, achieving street-level accuracy (within one mile) in 60% of cases.
Our findings highlight an urgent need for privacy-aware development for agentic multi-modal large reasoning models.
arXiv Detail & Related papers (2025-04-27T22:26:45Z)
- Upper-Body Pose-based Gaze Estimation for Privacy-Preserving 3D Gaze Target Detection [19.478147736434394]
Existing approaches heavily rely on analyzing the person's appearance, primarily focusing on their face to predict the gaze target.
This paper presents a novel approach by utilizing the person's upper-body pose and available depth maps to extract a 3D gaze direction.
We demonstrate state-of-the-art results on the most comprehensive publicly accessible 3D gaze target detection dataset.
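As a rough illustration of how a 3D gaze direction could be derived from pose and depth alone, here is a purely geometric sketch; the keypoint names, intrinsics, and the shoulder-normal heuristic are assumptions for illustration, not the paper's (learned) method:

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) to camera-space 3D using the depth map."""
    z = depth[int(v), int(u)]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def coarse_gaze_direction(kpts, depth, K):
    """Illustrative proxy: the body-facing normal of the shoulder line,
    signed so it points from the mid-shoulder toward the nose.
    kpts: dict of pixel coords for 'nose', 'l_shoulder', 'r_shoulder'."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    nose = backproject(*kpts["nose"], depth, fx, fy, cx, cy)
    ls = backproject(*kpts["l_shoulder"], depth, fx, fy, cx, cy)
    rs = backproject(*kpts["r_shoulder"], depth, fx, fy, cx, cy)
    facing = np.cross(rs - ls, np.array([0.0, -1.0, 0.0]))  # camera y is down
    mid = 0.5 * (ls + rs)
    if np.dot(facing, nose - mid) < 0:  # orient toward the head
        facing = -facing
    return facing / (np.linalg.norm(facing) + 1e-9)

# Toy usage with flat depth and simple intrinsics (all values illustrative).
K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])
depth = np.ones((480, 640))
kpts = {"nose": (320, 200), "l_shoulder": (290, 260), "r_shoulder": (350, 260)}
print(coarse_gaze_direction(kpts, depth, K))  # unit vector, here [0, 0, -1]
```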
arXiv Detail & Related papers (2024-09-26T14:35:06Z)
- Modeling User Preferences via Brain-Computer Interfacing [54.3727087164445]
We use Brain-Computer Interfacing technology to infer users' preferences, their attentional correlates towards visual content, and their associations with affective experience.
We link these to relevant applications, such as information retrieval, personalized steering of generative models, and crowdsourcing population estimates of affective experiences.
arXiv Detail & Related papers (2024-05-15T20:41:46Z)
- Generalizable Person Search on Open-world User-Generated Video Content [93.72028298712118]
Person search is a challenging task that involves retrieving individuals from a large set of uncropped scene images.
Existing person search applications are mostly trained and deployed in the same-origin scenarios.
We propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios.
arXiv Detail & Related papers (2023-10-16T04:59:50Z)
- Human-centric Behavior Description in Videos: New Benchmark and Model [37.96539992056626]
We construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals.
Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos.
arXiv Detail & Related papers (2023-10-04T15:31:02Z)
- MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain [23.598727613908853]
We present MECCANO, a dataset of egocentric videos for studying human behavior understanding in industrial-like settings.
The multimodality is characterized by the presence of gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset.
The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first person view.
arXiv Detail & Related papers (2022-09-19T00:52:42Z)
- OPOM: Customized Invisible Cloak towards Face Privacy Protection [58.07786010689529]
We investigate the face privacy protection from a technology standpoint based on a new type of customized cloak.
We propose a new method, named one person one mask (OPOM), to generate person-specific (class-wise) universal masks.
The effectiveness of the proposed method is evaluated on both common and celebrity datasets.
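A hedged sketch of the general idea behind such person-specific universal masks follows: one perturbation, shared across all of a person's images, optimized to push a face embedder away from that person's identity prototype. The function names, optimizer, and cosine loss are assumptions, not the OPOM algorithm itself; `embedder` stands for any network mapping face images to identity vectors:

```python
import torch
import torch.nn.functional as F

def learn_universal_mask(embedder, images, steps=100, eps=8 / 255, lr=1 / 255):
    """Illustrative OPOM-inspired sketch: one mask per person, reused
    on every image, bounded by eps to stay visually imperceptible."""
    with torch.no_grad():
        proto = F.normalize(embedder(images).mean(0, keepdim=True), dim=-1)
    delta = torch.zeros_like(images[:1], requires_grad=True)  # shared mask
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        emb = F.normalize(embedder((images + delta).clamp(0, 1)), dim=-1)
        loss = F.cosine_similarity(emb, proto).mean()  # similarity to self
        opt.zero_grad(); loss.backward(); opt.step()   # minimize the match
        with torch.no_grad():
            delta.clamp_(-eps, eps)                    # keep the cloak small
    return delta.detach()
```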
arXiv Detail & Related papers (2022-05-24T11:29:37Z)
- GIMO: Gaze-Informed Human Motion Prediction in Context [75.52839760700833]
We propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze.
Our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects.
To realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches.
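One plausible reading of "bidirectional communication" is a pair of cross-attention blocks, sketched below with assumed names, token shapes, and dimensions; this is an illustration of the pattern, not the GIMO architecture itself:

```python
import torch
import torch.nn as nn

class BidirectionalBranchTalk(nn.Module):
    """Assumed-shape sketch: gaze and motion branches exchange
    information through two cross-attention blocks."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.g_from_m = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.m_from_g = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, gaze, motion):
        # gaze: (B, Tg, dim) gaze tokens; motion: (B, Tm, dim) pose tokens.
        g_upd, _ = self.g_from_m(gaze, motion, motion)  # gaze reads motion
        m_upd, _ = self.m_from_g(motion, gaze, gaze)    # motion reads gaze
        return gaze + g_upd, motion + m_upd             # residual updates

talk = BidirectionalBranchTalk()
g, m = talk(torch.randn(2, 10, 128), torch.randn(2, 30, 128))
print(g.shape, m.shape)  # torch.Size([2, 10, 128]) torch.Size([2, 30, 128])
```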
arXiv Detail & Related papers (2022-04-20T13:17:39Z)
- SPAct: Self-supervised Privacy Preservation for Action Recognition [73.79886509500409]
Existing approaches for mitigating privacy leakage in action recognition require privacy labels along with the action labels from the video dataset.
Recent developments of self-supervised learning (SSL) have unleashed the untapped potential of the unlabeled data.
We present a novel training framework which removes privacy information from input video in a self-supervised manner without requiring privacy labels.
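The summary implies a minimax setup: an anonymizer preserves action cues while defeating a self-supervised privacy branch. A schematic training step under that reading is sketched below; every name is an assumption, and `privacy_ssl` stands for any module returning a self-supervised loss (e.g. a contrastive one), not the authors' code:

```python
import torch.nn.functional as F

def minimax_step(anonymizer, action_head, privacy_ssl, clip, action_label,
                 opt_anon, opt_priv, w=0.5):
    """Schematic SPAct-style step: privacy cues are suppressed using
    only a self-supervised loss, so no privacy labels are required."""
    # 1) Train the privacy branch to solve its SSL task on anonymized
    #    frames (detach() freezes the anonymizer for this half).
    ssl_loss = privacy_ssl(anonymizer(clip).detach())
    opt_priv.zero_grad(); ssl_loss.backward(); opt_priv.step()
    # 2) Train the anonymizer: keep the action recognizable while
    #    maximizing the privacy branch's self-supervised loss.
    anon = anonymizer(clip)
    act_loss = F.cross_entropy(action_head(anon), action_label)
    opt_anon.zero_grad()
    (act_loss - w * privacy_ssl(anon)).backward()
    opt_anon.step()
```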
arXiv Detail & Related papers (2022-03-29T02:56:40Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
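For intuition, here is a sketch of one common way to encode multi-view consistency: triangulate a person's point from all views by direct linear transformation (DLT), then penalize each view's reprojection error. The exact form is an assumption, not necessarily the paper's loss:

```python
import torch

def multiview_consistency_loss(points2d, proj_mats):
    """Assumed-form consistency term: DLT-triangulate one point from all
    views, then penalize per-view reprojection error (differentiable, so
    it can supervise a detector without labels).
    points2d: (V, 2) float detections; proj_mats: (V, 3, 4) cameras."""
    rows = []
    for i in range(points2d.shape[0]):
        P = proj_mats[i]
        u, v = points2d[i]
        rows.append(u * P[2] - P[0])   # two DLT equations per view
        rows.append(v * P[2] - P[1])
    _, _, Vh = torch.linalg.svd(torch.stack(rows))
    X = Vh[-1] / Vh[-1][3]             # homogeneous 3D point, w = 1
    reproj = proj_mats @ X             # (V, 3) projections
    reproj = reproj[:, :2] / reproj[:, 2:3]
    return ((reproj - points2d) ** 2).sum(-1).mean()
```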
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- Detecting Attended Visual Targets in Video [25.64146711657225]
We introduce a new annotated dataset, VideoAttentionTarget, containing complex and dynamic patterns of real-world gaze behavior.
Our experiments show that our model can effectively infer dynamic attention in videos.
We obtain the first results for automatically classifying clinically-relevant gaze behavior without wearable cameras or eye trackers.
arXiv Detail & Related papers (2020-03-05T09:29:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.