Rethinking Audio-visual Synchronization for Active Speaker Detection
- URL: http://arxiv.org/abs/2206.10421v1
- Date: Tue, 21 Jun 2022 14:19:06 GMT
- Title: Rethinking Audio-visual Synchronization for Active Speaker Detection
- Authors: Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, Changshui Zhang
- Abstract summary: Existing research on active speaker detection (ASD) does not agree on the definition of active speakers.
We propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue.
Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
- Score: 62.95962896690992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Active speaker detection (ASD) systems are important modules for analyzing
multi-talker conversations. They aim to detect which speakers, if any, are
talking in a visual scene at any given time. Existing research on ASD does not
agree on the definition of active speakers. We clarify the definition in this
work and require synchronization between the audio and visual speaking
activities. This clarification of the definition is motivated by our extensive
experiments, through which we discover that existing ASD methods fail to model
audio-visual synchronization and often classify unsynchronized
videos as active speaking. To address this problem, we propose a cross-modal
contrastive learning strategy and apply positional encoding in attention
modules for supervised ASD models to leverage the synchronization cue.
Experimental results suggest that our model can successfully detect
unsynchronized speaking as not speaking, addressing the limitation of current
models.
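The abstract names two ingredients, a cross-modal contrastive objective and positional encoding inside the attention modules, without giving implementation details. The following is a minimal PyTorch sketch of how those two pieces could look; the shapes, hyperparameters, and names (`cross_modal_nce_loss`, `CrossModalAttention`) are our own assumptions, not the authors' code.
```python
# Minimal sketch (not the authors' code): a symmetric InfoNCE-style loss that
# treats time-aligned audio/visual embeddings of the same track as positives,
# plus sinusoidal positional encoding added before a cross-attention block so
# the model can reason about temporal alignment.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def sinusoidal_positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Standard Transformer sinusoidal encoding, shape (seq_len, dim)."""
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


def cross_modal_nce_loss(audio_emb: torch.Tensor,
                         visual_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, visual_emb: (batch, dim) clip-level embeddings.
    Synchronized audio/visual pairs (same batch index) are positives;
    all other pairings within the batch act as negatives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                  # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric InfoNCE: audio-to-visual and visual-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class CrossModalAttention(nn.Module):
    """Visual queries attend over audio keys/values, with positional encoding
    added so that temporal order (and hence synchronization) matters."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_seq: torch.Tensor, audio_seq: torch.Tensor) -> torch.Tensor:
        # visual_seq: (batch, T, dim), audio_seq: (batch, T, dim)
        pe = sinusoidal_positional_encoding(visual_seq.size(1), visual_seq.size(2))
        pe = pe.to(visual_seq.device).unsqueeze(0)
        q = visual_seq + pe
        kv = audio_seq + pe
        fused, _ = self.attn(q, kv, kv)
        return fused  # would be fed to the ASD classification head
```
In training, such a contrastive term would presumably be added to the usual frame-level speaking/not-speaking classification loss, so that audio and visual streams that are not synchronized are pushed apart in the shared embedding space.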
Related papers
- Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection [37.28070242751129]
Active speaker detection in videos addresses the task of associating a source face, visible in the video frames, with the underlying speech in the audio modality.
We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection.
arXiv Detail & Related papers (2022-12-01T14:46:00Z)
- A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection [9.914246432182873]
We show that an end-to-end model performs at least as well as a considerably larger two-step system under various noise conditions.
In experiments involving over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task.
arXiv Detail & Related papers (2022-05-11T15:55:31Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with a co-attention mechanism to learn generic cross-modal representations from unlabelled videos (a rough sketch of a co-attention block is given after this list).
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
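For the "Look, Listen, and Attend" entry above, the co-attention idea can be pictured as two attention passes in opposite directions: audio features attend over visual features and vice versa. The block below is a rough, self-contained PyTorch illustration with assumed dimensions and module names; the actual architecture in the cited paper may differ.
```python
# Rough sketch of a co-attention block for audio-visual representation
# learning: each modality attends over the other, and the two attended
# streams are returned for a downstream (e.g. self-supervised) objective.
# Dimensions and names are assumptions, not taken from the cited paper.
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (batch, Ta, dim), visual: (batch, Tv, dim)
        a_att, _ = self.audio_to_visual(audio, visual, visual)   # audio queries visual
        v_att, _ = self.visual_to_audio(visual, audio, audio)    # visual queries audio
        audio_out = self.norm_a(audio + a_att)    # residual + norm per modality
        visual_out = self.norm_v(visual + v_att)
        return audio_out, visual_out


# Example usage with random features standing in for audio/visual encoders.
if __name__ == "__main__":
    block = CoAttentionBlock()
    audio = torch.randn(2, 40, 128)   # e.g. 40 audio frames
    visual = torch.randn(2, 25, 128)  # e.g. 25 video frames
    a_out, v_out = block(audio, visual)
    print(a_out.shape, v_out.shape)   # torch.Size([2, 40, 128]) torch.Size([2, 25, 128])
```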