UniCon: Unified Context Network for Robust Active Speaker Detection
- URL: http://arxiv.org/abs/2108.02607v1
- Date: Thu, 5 Aug 2021 13:25:44 GMT
- Title: UniCon: Unified Context Network for Robust Active Speaker Detection
- Authors: Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu,
Shiguang Shan, Xilin Chen
- Abstract summary: We introduce a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD).
Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information.
A thorough ablation study is performed on several challenging ASD benchmarks under different settings.
- Score: 111.90529347692723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new efficient framework, the Unified Context Network (UniCon),
for robust active speaker detection (ASD). Traditional methods for ASD usually
operate on each candidate's pre-cropped face track separately and do not
sufficiently consider the relationships among the candidates. This potentially
limits performance, especially in challenging scenarios with low-resolution
faces, multiple candidates, etc. Our solution is a novel, unified framework
that focuses on jointly modeling multiple types of contextual information:
spatial context to indicate the position and scale of each candidate's face,
relational context to capture the visual relationships among the candidates and
contrast audio-visual affinities with each other, and temporal context to
aggregate long-term information and smooth out local uncertainties. Based on
such information, our model optimizes all candidates in a unified process for
robust and reliable ASD. A thorough ablation study is performed on several
challenging ASD benchmarks under different settings. In particular, our method
outperforms the state-of-the-art by a large margin of about 15% mean Average
Precision (mAP) absolute on two challenging subsets: one with three candidate
speakers, and the other with faces smaller than 64 pixels. Together, our UniCon
achieves 92.0% mAP on the AVA-ActiveSpeaker validation set, surpassing 90% for
the first time on this challenging dataset at the time of submission. Project
website: https://unicon-asd.github.io/.
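To make the three kinds of context concrete, here is a minimal PyTorch sketch of this style of joint modeling: spatial context enters as an embedding of each face's position and scale, relational context as attention across candidates within a frame, and temporal context as a recurrent pass over each candidate's sequence. All names, module choices, and dimensions below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of UniCon-style joint context modeling (illustrative only).
import torch
import torch.nn as nn

class JointContextSketch(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.spatial_embed = nn.Linear(4, dim)  # (cx, cy, w, h) -> spatial context
        self.relation = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, face_feats, audio_feats, boxes):
        # face_feats:  (T, N, D) visual features for N candidate faces over T frames
        # audio_feats: (T, D) one shared audio feature per frame
        # boxes:       (T, N, 4) normalized face boxes encoding position and scale
        x = face_feats + self.spatial_embed(boxes)  # inject spatial context
        x = x + audio_feats.unsqueeze(1)            # fuse the shared audio stream
        # Relational context: candidates in the same frame attend to each other,
        # letting audio-visual affinities be contrasted across speakers.
        x, _ = self.relation(x, x, x)
        # Temporal context: aggregate each candidate's sequence over time.
        x, _ = self.temporal(x.transpose(0, 1))     # (N, T, D)
        return self.classifier(x).squeeze(-1)       # (N, T) speaking logits

model = JointContextSketch()
logits = model(torch.randn(50, 3, 128), torch.randn(50, 128), torch.rand(50, 3, 4))
```

The point of the sketch is only the information flow: all candidates are scored in one unified pass rather than track by track.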
Related papers
- Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- A Light Weight Model for Active Speaker Detection [7.253335671577093]
We construct a lightweight active speaker detection architecture by reducing input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying a gated recurrent unit (GRU) with low computational complexity for cross-modal modeling (see the sketch after this entry).
Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%).
Our framework also performs well on the Columbia dataset, showing good robustness.
arXiv Detail & Related papers (2023-03-08T08:40:56Z)
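For the split-convolution recipe in the entry above, a hedged PyTorch sketch: the 3D spatiotemporal convolution is factorized into a 2D spatial convolution followed by a 1D temporal convolution, and a small GRU handles cross-modal modeling over the concatenated audio-visual features. Channel sizes, the audio front end, and fusion by concatenation are assumptions, not the paper's exact design.

```python
# Hedged sketch: factorized (2D + 1D) convolutions plus a GRU (illustrative only).
import torch
import torch.nn as nn

class SplitConvVisual(nn.Module):
    def __init__(self, c_in=1, c_out=64):
        super().__init__()
        # Spatial (2D) part: Conv3d with temporal kernel size 1.
        self.spatial = nn.Conv3d(c_in, c_out, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal (1D) part: Conv3d with spatial kernel size 1.
        self.temporal = nn.Conv3d(c_out, c_out, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep time, pool space

    def forward(self, x):                 # x: (B, C, T, H, W) face crops
        x = torch.relu(self.spatial(x))
        x = torch.relu(self.temporal(x))
        return self.pool(x).flatten(2).transpose(1, 2)  # (B, T, c_out)

class LightASDSketch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.visual = SplitConvVisual(c_out=dim)
        self.audio = nn.Linear(40, dim)   # assumed 40-dim filterbank frames
        self.gru = nn.GRU(2 * dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, faces, mels):       # faces: (B,1,T,H,W); mels: (B,T,40)
        v = self.visual(faces)
        a = self.audio(mels)
        h, _ = self.gru(torch.cat([v, a], dim=-1))  # cross-modal modeling over time
        return self.head(h).squeeze(-1)   # (B, T) per-frame speaking logits

logits = LightASDSketch()(torch.randn(2, 1, 25, 112, 112), torch.randn(2, 25, 40))
```

The factorization is the standard way such splits cut compute: a (3, 3, 3) kernel costs roughly 27 multiplies per position, while the (1, 3, 3) + (3, 1, 1) pair costs about 12.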
- LoCoNet: Long-Short Context Network for Active Speaker Detection [18.06037779826666]
Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video.
We propose LoCoNet, a simple yet effective Long-Short Context Network.
LoCoNet achieves state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2023-01-19T18:54:43Z)
- Global-Local Context Network for Person Search [125.51080862575326]
Person search aims to jointly localize and identify a query person from natural, uncropped images.
We exploit rich context information globally and locally surrounding the target person, which we refer to as scene and group context, respectively.
We propose a unified global-local context network (GLCNet) with the intuitive aim of feature enhancement.
arXiv Detail & Related papers (2021-12-05T07:38:53Z)
- Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association [94.7030305679589]
We propose a novel framework to jointly address the above-mentioned issues.
We introduce a global loss into the modality alignment process.
The proposed method outperforms the previous methods in multiple settings.
arXiv Detail & Related papers (2021-03-12T14:10:48Z)
- InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring [38.13420293700949]
We propose a new model, named InstanceRefer, to achieve a superior 3D visual grounding on point clouds.
Our model first filters instances from panoptic segmentation on point clouds to obtain a small number of candidates.
Experiments confirm that our InstanceRefer outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-03-01T16:59:27Z)
- A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments [16.91453126121351]
A speaker verification (SV) system should be robust to short speech segments, especially in noisy and reverberant environments.
To meet these two requirements, we introduce feature pyramid module (FPM)-based multi-scale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD). (A multi-scale aggregation sketch follows this entry.)
We combine SV, VAD, and SE models in a unified deep learning framework and jointly train the entire network in an end-to-end manner.
arXiv Detail & Related papers (2020-10-06T04:51:45Z)
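The feature-pyramid-based multi-scale aggregation in the entry above lends itself to a short sketch: features from several encoder stages are projected by 1x1 lateral convolutions, upsampled to a common temporal resolution, and aggregated before pooling into a speaker embedding. The stage shapes, lateral width, and concatenation-based aggregation are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of feature-pyramid-style multi-scale aggregation (illustrative only).
import torch
import torch.nn as nn

class MSASketch(nn.Module):
    def __init__(self, stage_channels=(32, 64, 128), emb_dim=192):
        super().__init__()
        # Lateral 1x1 convs project each encoder stage to a common width,
        # mirroring a feature pyramid module (FPM).
        self.laterals = nn.ModuleList(nn.Conv1d(c, 64, 1) for c in stage_channels)
        self.embed = nn.Linear(64 * len(stage_channels), emb_dim)

    def forward(self, stages):
        # stages: list of (B, C_i, T_i) feature maps at decreasing time resolution
        target_len = stages[0].shape[-1]
        feats = []
        for lat, f in zip(self.laterals, stages):
            f = lat(f)
            # Upsample coarser stages to the finest temporal resolution.
            f = nn.functional.interpolate(f, size=target_len, mode="nearest")
            feats.append(f)
        x = torch.cat(feats, dim=1)          # aggregate across scales
        x = x.mean(dim=-1)                   # temporal average pooling
        return self.embed(x)                 # (B, emb_dim) speaker embedding

stages = [torch.randn(2, 32, 200), torch.randn(2, 64, 100), torch.randn(2, 128, 50)]
emb = MSASketch()(stages)                    # -> (2, 192)
```

Aggregating shallow and deep stages this way is what makes short segments workable: the embedding is not forced to rely on the deepest, most temporally compressed features alone.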
- Symbiotic Adversarial Learning for Attribute-based Person Search [86.7506832053208]
We present a symbiotic adversarial learning framework, called SAL. Two GANs sit at the base of the framework in a symbiotic learning scheme.
Specifically, two different types of generative adversarial networks learn collaboratively throughout the training process.
arXiv Detail & Related papers (2020-07-19T07:24:45Z)
- Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention [13.883985850789443]
Keyword spotting (KWS) and speaker verification (SV) have been studied independently, but the acoustic and speaker domains are complementary.
We propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information.
arXiv Detail & Related papers (2020-05-08T05:58:46Z)