UniCon: Unified Context Network for Robust Active Speaker Detection
- URL: http://arxiv.org/abs/2108.02607v1
- Date: Thu, 5 Aug 2021 13:25:44 GMT
- Title: UniCon: Unified Context Network for Robust Active Speaker Detection
- Authors: Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu,
Shiguang Shan, Xilin Chen
- Abstract summary: We introduce a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD).
Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information.
A thorough ablation study is performed on several challenging ASD benchmarks under different settings.
- Score: 111.90529347692723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new efficient framework, the Unified Context Network (UniCon),
for robust active speaker detection (ASD). Traditional methods for ASD usually
operate on each candidate's pre-cropped face track separately and do not
sufficiently consider the relationships among the candidates. This potentially
limits performance, especially in challenging scenarios with low-resolution
faces, multiple candidates, etc. Our solution is a novel, unified framework
that focuses on jointly modeling multiple types of contextual information:
spatial context to indicate the position and scale of each candidate's face,
relational context to capture the visual relationships among the candidates and
contrast audio-visual affinities with each other, and temporal context to
aggregate long-term information and smooth out local uncertainties. Based on
such information, our model optimizes all candidates in a unified process for
robust and reliable ASD. A thorough ablation study is performed on several
challenging ASD benchmarks under different settings. In particular, our method
outperforms the state-of-the-art by a large margin of about 15% mean Average
Precision (mAP) absolute on two challenging subsets: one with three candidate
speakers, and the other with faces smaller than 64 pixels. Together, our UniCon
achieves 92.0% mAP on the AVA-ActiveSpeaker validation set, surpassing 90% for
the first time on this challenging dataset at the time of submission. Project
website: https://unicon-asd.github.io/.
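The three context types named in the abstract can be illustrated with a toy sketch. This is not the UniCon implementation: the function names, the box encoding, the softmax contrast, and the moving-average smoothing are all assumptions chosen only to make the ideas concrete (the paper's model learns these components rather than hand-coding them).

```python
import math

# Hypothetical helpers for illustration only; none of these come from the paper.

def spatial_context(box, frame_w, frame_h):
    """Encode a face box (x1, y1, x2, y2) in pixels as a normalized
    center position and scale: (cx, cy, w, h), each in [0, 1]."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / frame_w
    cy = (y1 + y2) / 2 / frame_h
    w = (x2 - x1) / frame_w
    h = (y2 - y1) / frame_h
    return (cx, cy, w, h)

def contrast_affinities(affinities):
    """Softmax over the candidates in one frame, so each candidate's
    audio-visual affinity is scored relative to the others."""
    m = max(affinities)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in affinities]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_smooth(scores, window=3):
    """Centered moving average over one candidate's per-frame scores,
    with the window truncated at the sequence boundaries."""
    half = window // 2
    out = []
    for t in range(len(scores)):
        lo = max(0, t - half)
        hi = min(len(scores), t + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out
```

For example, `contrast_affinities` applied to the raw affinities of three candidates in a frame yields scores that sum to 1, so a weak affinity can still win if the other candidates are weaker, and `temporal_smooth` spreads an isolated spike across neighboring frames.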
Related papers
- ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- A Light Weight Model for Active Speaker Detection [7.253335671577093]
We construct a lightweight active speaker detection architecture by reducing input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying gated recurrent unit (GRU) with low computational complexity for cross-modal modeling.
Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%).
Our framework also performs well on the Columbia dataset showing good robustness.
arXiv Detail & Related papers (2023-03-08T08:40:56Z)
- Global-Local Context Network for Person Search [125.51080862575326]
Person search aims to jointly localize and identify a query person from natural, uncropped images.
We exploit rich context information globally and locally surrounding the target person, which we refer to as scene and group context, respectively.
We propose a unified global-local context network (GLCNet) with the intuitive aim of feature enhancement.
arXiv Detail & Related papers (2021-12-05T07:38:53Z)
- Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association [94.7030305679589]
We propose a novel framework to jointly address the above-mentioned issues.
We introduce a global loss into the modality alignment process.
The proposed method outperforms the previous methods in multiple settings.
arXiv Detail & Related papers (2021-03-12T14:10:48Z)
- InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring [38.13420293700949]
We propose a new model, named InstanceRefer, to achieve a superior 3D visual grounding on point clouds.
Our model first filters instances from panoptic segmentation on point clouds to obtain a small number of candidates.
Experiments confirm that our InstanceRefer outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-03-01T16:59:27Z)
- A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments [16.91453126121351]
Speaker verification (SV) systems should be robust to short speech segments, especially in noisy and reverberant environments.
To meet these two requirements, we introduce feature pyramid module (FPM)-based multi-scale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD).
We combine SV, VAD, and SE models in a unified deep learning framework and jointly train the entire network in an end-to-end manner.
arXiv Detail & Related papers (2020-10-06T04:51:45Z)
- Symbiotic Adversarial Learning for Attribute-based Person Search [86.7506832053208]
We present a symbiotic adversarial learning framework, called SAL. Two GANs sit at the base of the framework in a symbiotic learning scheme.
Specifically, two different types of generative adversarial networks learn collaboratively throughout the training process.
arXiv Detail & Related papers (2020-07-19T07:24:45Z)
- Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention [13.883985850789443]
Keyword spotting (KWS) and speaker verification (SV) have been studied independently, but the acoustic and speaker domains are complementary.
We propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information.
arXiv Detail & Related papers (2020-05-08T05:58:46Z)