Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model
- URL: http://arxiv.org/abs/2103.15438v1
- Date: Mon, 29 Mar 2021 09:09:39 GMT
- Title: Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model
- Authors: Yufan Liu, Minglang Qiao, Mai Xu, Bing Li, Weiming Hu, Ali Borji
- Abstract summary: We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
- Score: 96.24038430433885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, video streams have occupied a large proportion of Internet
traffic, most of which contain human faces. Hence, it is necessary to predict
saliency on multiple-face videos, which can provide attention cues for many
content-based applications. However, most multiple-face saliency prediction
works consider only visual information and ignore audio, which is inconsistent
with naturalistic viewing scenarios. Several behavioral studies have
established that sound influences human attention, especially during speech
turn-taking in multiple-face videos. In this paper, we thoroughly investigate
such influences
by establishing a large-scale eye-tracking database of Multiple-face Video in
Visual-Audio condition (MVVA). Inspired by the findings of our investigation,
we propose a novel multi-modal video saliency model consisting of three
branches: visual, audio and face. The visual branch takes the RGB frames as the
input and encodes them into visual feature maps. The audio and face branches
encode the audio signal and multiple cropped faces, respectively. A fusion
module is introduced to integrate the information from three modalities, and to
generate the final saliency map. Experimental results show that the proposed
method outperforms 11 state-of-the-art saliency prediction works, and its
predictions align more closely with human multi-modal attention.
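As a concrete illustration of the three-branch design described in the abstract, below is a minimal PyTorch sketch. The branch depths, channel widths, spectrogram input, face-embedding averaging, and concatenation-based fusion are all illustrative assumptions; the abstract does not specify the actual layer configuration.

```python
# Minimal sketch of a three-branch visual-audio-face saliency model.
# All layer choices are illustrative guesses, not the authors' architecture.
import torch
import torch.nn as nn

class VisualAudioFaceSaliency(nn.Module):
    def __init__(self, feat_ch: int = 64):
        super().__init__()
        # Visual branch: encodes RGB frames into spatial feature maps.
        self.visual = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # Audio branch: encodes a spectrogram (1 x T x F) into a global embedding.
        self.audio = nn.Sequential(
            nn.Conv2d(1, feat_ch, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Face branch: encodes cropped face patches; embeddings are averaged
        # over the variable number of faces per frame (an assumption here).
        self.face = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fusion module: broadcast audio/face embeddings over the visual
        # feature map, concatenate, and predict a one-channel saliency map.
        self.fusion = nn.Sequential(
            nn.Conv2d(3 * feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, 1, 1),
        )

    def forward(self, frames, spectrogram, faces):
        # frames: (B, 3, H, W); spectrogram: (B, 1, T, F);
        # faces: (B, N, 3, h, w) with N cropped faces per frame.
        v = self.visual(frames)                      # (B, C, H, W)
        a = self.audio(spectrogram)                  # (B, C)
        B, N = faces.shape[:2]
        f = self.face(faces.flatten(0, 1))           # (B*N, C)
        f = f.view(B, N, -1).mean(dim=1)             # (B, C)
        H, W = v.shape[-2:]
        a_map = a[:, :, None, None].expand(-1, -1, H, W)
        f_map = f[:, :, None, None].expand(-1, -1, H, W)
        fused = torch.cat([v, a_map, f_map], dim=1)  # (B, 3C, H, W)
        return torch.sigmoid(self.fusion(fused))     # per-pixel saliency
```

Calling the model with a batch of frames, a spectrogram window, and N face crops per frame yields a saliency map at the visual feature resolution; in practice each branch would be a deeper pretrained encoder operating on temporal feature volumes rather than single frames.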
Related papers
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking face generation focuses mainly on lip-syncing conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z)
- FaceFormer: Speech-Driven 3D Facial Animation with Transformers [46.8780140220063]
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data.
We propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes.
arXiv Detail & Related papers (2021-12-10T04:21:59Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Audio-visual Representation Learning for Anomaly Events Detection in Crowds [119.72951028190586]
This paper exploits multi-modal learning to model audio and visual signals simultaneously.
We conduct experiments on the SHADE dataset, a synthetic audio-visual dataset of surveillance scenes.
We find that introducing audio signals effectively improves anomaly event detection and outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-10-28T02:42:48Z)
- TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation [35.93516937521393]
We introduce TriBERT -- a transformer-based architecture inspired by ViLBERT.
TriBERT enables contextual feature learning across three modalities: vision, pose, and audio.
We show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks.
arXiv Detail & Related papers (2021-10-26T04:50:42Z)
- APES: Audiovisual Person Search in Untrimmed Video [87.4124877066541]
We present the Audiovisual Person Search dataset (APES).
APES contains over 1.9K identities labeled across 36 hours of video.
A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity.
arXiv Detail & Related papers (2021-06-03T08:16:42Z)
- Multi Modal Adaptive Normalization for Audio to Video Generation [18.812696623555855]
We propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking-person video of arbitrary length from two inputs: an audio signal and a single image of the person.
The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor, and class activation map [58] based layers to learn the movements of expressive facial components.
arXiv Detail & Related papers (2020-12-14T07:39:45Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperforms existing work by 4.99% in terms of viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)