A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection
- URL: http://arxiv.org/abs/2205.05684v1
- Date: Wed, 11 May 2022 15:55:31 GMT
- Title: A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection
- Authors: Otavio Braga, Olivier Siohan
- Abstract summary: In experiments involving over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. We then show that an end-to-end model performs at least as well as a considerably larger two-step system under various noise conditions.
- Score: 9.914246432182873
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, until recently it was traditionally studied in isolation, under the assumption that the video of a single speaking face matches the audio, while selecting the active speaker at inference time when multiple people are on screen was set aside as a separate problem. As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments involving over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Second, we show under closer scrutiny that an end-to-end model performs at least as well as a considerably larger two-step system that uses a hard decision boundary, under various noise conditions and numbers of parallel face tracks.
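
To make the contrast between the two designs concrete, the sketch below illustrates soft, attention-based selection over competing face tracks versus a hard argmax selection. It is a minimal illustration with made-up tensor shapes and a plain dot-product score; the encoders, fusion layers, and attention architecture of the actual model are not reproduced here.

```python
# Minimal sketch: soft attention over competing face tracks vs. a hard
# argmax selection. Shapes, the dot-product score, and the fusion step
# are illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, K, T, D = 2, 4, 50, 256          # batch, face tracks, frames, feature dim

audio = torch.randn(B, T, D)        # audio encoder output (acts as the query)
faces = torch.randn(B, K, T, D)     # one visual encoder output per face track

# Frame-level compatibility score between the audio and each face track.
scores = torch.einsum('btd,bktd->btk', audio, faces) / D ** 0.5

# End-to-end style: a softmax gives a convex combination of the tracks,
# keeping speaker selection differentiable and trainable with the ASR loss.
weights = F.softmax(scores, dim=-1)                      # (B, T, K)
soft_visual = torch.einsum('btk,bktd->btd', weights, faces)

# Two-step style: commit to one track per frame with a hard decision boundary.
hard_idx = scores.argmax(dim=-1)                         # (B, T)
hard_visual = faces.gather(
    1, hard_idx[:, None, :, None].expand(-1, 1, -1, D)).squeeze(1)

# Either visual stream would then be fused with the audio features for ASR;
# the attention weights can also be read off as an active speaker score.
print(weights.shape, soft_visual.shape, hard_visual.shape)
```

The soft path keeps the whole pipeline differentiable, so the selection weights can be trained jointly with the ASR objective, while the hard path mirrors the two-step baseline's non-differentiable decision boundary.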
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection [37.28070242751129]
Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality.
We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection.
arXiv Detail & Related papers (2022-12-01T14:46:00Z)
- Rethinking Audio-visual Synchronization for Active Speaker Detection [62.95962896690992]
Existing research on active speaker detection (ASD) does not agree on the definition of active speakers.
We propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue.
Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
arXiv Detail & Related papers (2022-06-21T14:19:06Z)
- Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection [9.914246432182873]
In noisy conditions, automatic speech recognition can benefit from the addition of visual signals coming from a video of the speaker's face.
Active speaker detection involves selecting at each moment in time which of the visible faces corresponds to the audio.
Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces.
This work closes the remaining gap in active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss.
arXiv Detail & Related papers (2022-05-10T23:03:19Z)
- Smoothing Dialogue States for Open Conversational Machine Reading [70.83783364292438]
We propose an effective gating strategy by smoothing the two dialogue states in only one decoder and bridge decision making and question generation.
Experiments on the OR-ShARC dataset show the effectiveness of our method, which achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-08-28T08:04:28Z)
- The Right to Talk: An Audio-Visual Transformer Approach [27.71444773878775]
This work introduces a new Audio-Visual Transformer approach to the problem of localization and highlighting the main speaker in both audio and visual channels of a multi-speaker conversation video in the wild.
To the best of our knowledge, it is one of the first studies that is able to automatically localize and highlight the main speaker in both visual and audio channels in multi-speaker conversation videos.
arXiv Detail & Related papers (2021-08-06T18:04:24Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on the Librispeech dataset -- a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Audio-visual Speech Separation with Adversarially Disentangled Visual Representation [23.38624506211003]
Speech separation aims to separate individual voice from an audio mixture of multiple simultaneous talkers.
In our model, we use the face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.