The Right to Talk: An Audio-Visual Transformer Approach
- URL: http://arxiv.org/abs/2108.03256v1
- Date: Fri, 6 Aug 2021 18:04:24 GMT
- Title: The Right to Talk: An Audio-Visual Transformer Approach
- Authors: Thanh-Dat Truong, Chi Nhan Duong, The De Vu, Hoang Anh Pham, Bhiksha
Raj, Ngan Le, Khoa Luu
- Abstract summary: This work introduces a new Audio-Visual Transformer approach to localizing and highlighting the main speaker in both the audio and visual channels of multi-speaker conversation videos in the wild.
To the best of our knowledge, it is one of the first studies able to automatically localize and highlight the main speaker in both the visual and audio channels of multi-speaker conversation videos.
- Score: 27.71444773878775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Turn-taking has played an essential role in structuring the regulation of a conversation.
Identifying the main speaker (who is properly taking his/her turn of speaking) and the
interrupters (who are interrupting or reacting to the main speaker's utterances) remains a
challenging task. Although some prior methods have partially addressed this task, several
limitations remain. Firstly, directly associating audio and visual features may limit the
correlations that can be extracted, since the two modalities differ. Secondly, the relationships
across temporal segments, which help maintain the consistency of localization, separation, and
conversation context, are not effectively exploited. Finally, the interactions between speakers,
which usually contain the tracking and anticipatory decisions about the transition to a new
speaker, are often ignored. Therefore, this work introduces a new Audio-Visual Transformer
approach to the problem of localizing and highlighting the main speaker in both the audio and
visual channels of a multi-speaker conversation video in the wild. The proposed method exploits
different types of correlations present in both visual and audio signals. The temporal
audio-visual relationships across spatial-temporal space are anticipated and optimized via the
self-attention mechanism in a Transformer structure. Moreover, a newly collected dataset is
introduced for main speaker detection. To the best of our knowledge, this is one of the first
studies able to automatically localize and highlight the main speaker in both the visual and
audio channels of multi-speaker conversation videos.
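To make the self-attention idea above concrete, the following is a minimal, illustrative PyTorch sketch of joint audio-visual self-attention: audio and visual token sequences are tagged with learned modality embeddings, fused by a Transformer encoder, and a per-visual-token score marks likely main-speaker regions. The module name, dimensions, and prediction head are assumptions for illustration, not the authors' architecture.

```python
# Illustrative sketch only: joint audio-visual self-attention in the spirit of
# the abstract. Shapes, dimensions, and the scoring head are assumptions.
import torch
import torch.nn as nn


class AudioVisualAttention(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        # Learned embeddings that tell the encoder which modality a token came from.
        self.audio_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.visual_type = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        # Per-visual-token score: how likely each spatio-temporal region is the main speaker.
        self.head = nn.Linear(dim, 1)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens: (batch, T_audio, dim); visual_tokens: (batch, T_visual, dim)
        tokens = torch.cat(
            [audio_tokens + self.audio_type, visual_tokens + self.visual_type], dim=1
        )
        fused = self.encoder(tokens)  # joint self-attention across both modalities
        visual_out = fused[:, audio_tokens.size(1):]  # keep only the visual positions
        return torch.sigmoid(self.head(visual_out)).squeeze(-1)  # (batch, T_visual)


if __name__ == "__main__":
    model = AudioVisualAttention()
    scores = model(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
    print(scores.shape)  # torch.Size([2, 30])
```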
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008] (2024-10-14)
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
- Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization [25.213694510527436] (2024-08-22)
Most existing speaker diarization systems rely exclusively on unimodal acoustic information.
We propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization.
Our approach consistently outperforms state-of-the-art speaker diarization methods.
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [98.34889301515412] (2023-11-14)
We develop the Qwen-Audio model and address the limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types.
Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning.
We further develop Qwen-Audio-Chat, which accepts diverse audio and text inputs, enabling multi-turn dialogues and supporting various audio-centric scenarios.
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378] (2023-09-19)
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
- Egocentric Auditory Attention Localization in Conversations [25.736198724595486] (2023-03-28)
We propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention.
Our approach leverages features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset.
- A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection [9.914246432182873] (2022-05-11)
In experiments involving over 50,000 hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task.
We show that an end-to-end model performs at least as well as a considerably larger two-step system under various noise conditions.
- A Real-time Speaker Diarization System Based on Spatial Spectrum [14.189768987932364] (2021-07-20)
We propose a novel systematic approach to tackle several long-standing challenges in speaker diarization tasks.
First, a differential directional microphone array-based approach is exploited to capture the target speakers' voices in far-field adverse environments.
Second, an online speaker-location joint clustering approach is proposed to keep track of speaker locations.
Third, an instant speaker number detector is developed to trigger the mechanism that separates overlapped speech.
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556] (2021-04-05)
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on a multi-talker dataset derived from Librispeech and present encouraging results.
- A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958] (2021-01-24)
Speaker diarization is the task of labeling audio or video recordings with classes corresponding to speaker identity.
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249] (2020-10-27)
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with a reconstruction loss only, without any disentanglement of content and speaker information.
- End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification [45.38809571153867] (2020-02-24)
We propose End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results.
By feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations.
A minimal sketch of this multi-label formulation appears after this list.
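As noted in the EEND entry above, here is a minimal, hedged sketch of the multi-label formulation: a network emits per-frame, per-speaker activity probabilities and is trained with a permutation-invariant binary cross-entropy loss. The encoder choice, feature dimension, and fixed speaker count are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of an EEND-style multi-label diarization setup: each frame gets
# an independent speech/non-speech decision per speaker. Encoder and sizes are
# assumptions for illustration only.
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLabelDiarizer(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, num_speakers=2):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_speakers)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) acoustic features
        out, _ = self.rnn(feats)
        return self.head(out)  # (batch, frames, num_speakers) activity logits


def pit_bce_loss(logits, labels):
    """Permutation-invariant BCE: score every speaker ordering, keep the best per utterance."""
    num_speakers = logits.size(-1)
    per_perm = []
    for perm in itertools.permutations(range(num_speakers)):
        permuted = labels[..., list(perm)]
        loss = F.binary_cross_entropy_with_logits(logits, permuted, reduction="none")
        per_perm.append(loss.mean(dim=(1, 2)))  # one loss per utterance
    return torch.stack(per_perm).min(dim=0).values.mean()


if __name__ == "__main__":
    model = MultiLabelDiarizer()
    feats = torch.randn(4, 200, 80)                     # batch of feature sequences
    labels = torch.randint(0, 2, (4, 200, 2)).float()   # per-frame speaker activity
    loss = pit_bce_loss(model(feats), labels)
    loss.backward()
```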