Multimodal Across Domains Gaze Target Detection
- URL: http://arxiv.org/abs/2208.10822v1
- Date: Tue, 23 Aug 2022 09:09:00 GMT
- Title: Multimodal Across Domains Gaze Target Detection
- Authors: Francesco Tonini and Cigdem Beyan and Elisa Ricci
- Abstract summary: This paper addresses the gaze target detection problem in single images captured from the third-person perspective.
We present a multimodal deep architecture to infer where a person in a scene is looking.
- Score: 18.41238482101682
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper addresses the gaze target detection problem in single images
captured from the third-person perspective. We present a multimodal deep
architecture to infer where a person in a scene is looking. This spatial model
is trained on head images of the person-of-interest, the scene, and depth maps
that provide rich contextual information. Unlike several prior works, our model
requires no supervision of gaze angles and does not rely on head orientation
information or the location of the person-of-interest's eyes. Extensive
experiments demonstrate the strong performance of our method on multiple
benchmark datasets. We also investigate several variations of our method,
obtained by altering the joint learning of the multimodal data; some of these
variations also outperform prior art. For the first time, we examine domain
adaptation for gaze target detection and equip our multimodal network to
effectively handle the domain gap across datasets. The code of the proposed
method is
available at
https://github.com/francescotonini/multimodal-across-domains-gaze-target-detection.
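Neither the abstract nor this listing spells out the architecture, so the following is a minimal sketch of the general idea only: three convolutional branches (scene RGB, depth map, head crop) whose features are fused into a gaze heatmap, plus a gradient-reversal domain classifier as one standard way to bridge a domain gap. The backbone choice, layer sizes, fusion strategy, and the `GazeTargetNet` name are illustrative assumptions, not the authors' exact design; see the linked repository for the real implementation.

```python
# Hedged sketch of a multimodal gaze target detector with a
# gradient-reversal domain classifier; all design choices are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


def resnet_trunk(in_channels: int) -> nn.Sequential:
    """ResNet-18 feature extractor adapted to `in_channels` input planes."""
    net = models.resnet18(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                          padding=3, bias=False)
    return nn.Sequential(*list(net.children())[:-1])  # drop the FC classifier


class GradReverse(torch.autograd.Function):
    """Gradient reversal layer, the standard adversarial domain-adaptation
    trick (Ganin & Lempitsky, 2015); identity forward, negated gradient back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class GazeTargetNet(nn.Module):
    def __init__(self, heatmap_size: int = 64):
        super().__init__()
        self.scene_branch = resnet_trunk(3)   # full scene, RGB
        self.depth_branch = resnet_trunk(1)   # monocular depth map
        self.head_branch = resnet_trunk(3)    # crop around the person's head
        self.fuse = nn.Sequential(
            nn.Linear(512 * 3, 512), nn.ReLU(inplace=True),
            nn.Linear(512, heatmap_size * heatmap_size),
        )
        # Domain classifier fed through gradient reversal, pushing the fused
        # features toward domain invariance during joint training.
        self.domain_head = nn.Linear(512 * 3, 2)
        self.heatmap_size = heatmap_size

    def forward(self, scene, depth, head, lam: float = 1.0):
        feats = torch.cat([
            self.scene_branch(scene).flatten(1),
            self.depth_branch(depth).flatten(1),
            self.head_branch(head).flatten(1),
        ], dim=1)
        # Dense heatmap over the scene; its argmax is the gaze target.
        heatmap = self.fuse(feats).view(-1, 1, self.heatmap_size,
                                        self.heatmap_size)
        domain_logits = self.domain_head(GradReverse.apply(feats, lam))
        return heatmap, domain_logits
```

A training loop along these lines would supervise `heatmap` against ground-truth gaze heatmaps on the labeled source data and `domain_logits` with a binary source/target cross-entropy; this mirrors the classic gradient-reversal setup and is not necessarily the paper's exact adaptation strategy.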
Related papers
- Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
- Can Deep Network Balance Copy-Move Forgery Detection and Distinguishment? [3.7311680121118345]
Copy-move forgery detection is a crucial research area within digital image forensics.
Recent years have witnessed an increased interest in distinguishing between the original and duplicated objects in copy-move forgeries.
We propose an innovative method that employs the transformer architecture in an end-to-end deep neural network.
arXiv Detail & Related papers (2023-05-17T14:35:56Z)
- Active Gaze Control for Foveal Scene Exploration [124.11737060344052]
We propose a methodology to emulate how humans and robots with foveal cameras would explore a scene.
The proposed method achieves an increase in detection F1-score of 2-3 percentage points for the same number of gaze shifts.
arXiv Detail & Related papers (2022-08-24T14:59:28Z)
- Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition [24.406654146411682]
Vision Transformer (ViT) is the research base for this paper.
Our goal is to leverage ViT's patch tokens and self-attention mechanism to mine rich instances in multi-label images.
We propose a weakly supervised object localization-based approach to extract multi-scale local features.
arXiv Detail & Related papers (2022-04-22T14:38:40Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting [53.28477676794658]
Unsupervised motion retargeting in videos has seen substantial advancements through the use of deep neural networks.
We introduce JOKR - a JOint Keypoint Representation that handles both the source and target videos, without requiring any object prior or data collection.
We evaluate our method both qualitatively and quantitatively, and demonstrate that our method handles various cross-domain scenarios, such as different animals, different flowers, and humans.
arXiv Detail & Related papers (2021-06-17T17:32:32Z)
- Translate to Adapt: RGB-D Scene Recognition across Domains [18.40373730109694]
In this work, we highlight a possibly severe domain-shift issue within multi-modality scene recognition datasets.
We present a method based on self-supervised inter-modality translation able to adapt across different camera domains.
arXiv Detail & Related papers (2021-03-26T18:20:29Z)
- Six-channel Image Representation for Cross-domain Object Detection [17.854940064699985]
Deep learning models are data-driven, and their excellent performance depends on abundant and diverse datasets.
Image-to-image translation techniques are employed to generate fake data of specific scenes for training the models.
We propose to combine the original 3-channel images with their corresponding GAN-generated fake images to form 6-channel representations of the dataset (sketched below).
arXiv Detail & Related papers (2021-01-03T04:50:03Z)
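As a rough illustration of the six-channel idea above (the generator here is a hypothetical pre-trained image-to-image translation model, not one named by the paper), the representation is a simple channel-wise concatenation:

```python
import torch

def six_channel(rgb: torch.Tensor, generator: torch.nn.Module) -> torch.Tensor:
    """Stack an RGB batch (B, 3, H, W) with its GAN-translated counterpart
    along the channel axis, yielding a (B, 6, H, W) tensor."""
    with torch.no_grad():
        fake = generator(rgb)  # hypothetical pre-trained image-to-image GAN
    return torch.cat([rgb, fake], dim=1)
```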
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- A Review of Single-Source Deep Unsupervised Visual Domain Adaptation [81.07994783143533]
Large-scale labeled training datasets have enabled deep neural networks to excel across a wide range of benchmark vision tasks.
In many applications, it is prohibitively expensive and time-consuming to obtain large quantities of labeled data.
To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled or unlabeled target domain.
arXiv Detail & Related papers (2020-09-01T00:06:50Z)