Using Active Speaker Faces for Diarization in TV shows
- URL: http://arxiv.org/abs/2203.15961v1
- Date: Wed, 30 Mar 2022 00:37:19 GMT
- Title: Using Active Speaker Faces for Diarization in TV shows
- Authors: Rahul Sharma and Shrikanth Narayanan
- Abstract summary: We perform face clustering on the active speaker faces and show superior speaker diarization performance compared to the state-of-the-art audio-based diarization methods.
We also observe that a moderately well-performing active speaker system could outperform the audio-based diarization systems.
- Score: 37.28070242751129
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker diarization is one of the critical components of computational media
intelligence as it enables a character-level analysis of story portrayals and
media content understanding. Automated audio-based speaker diarization of
entertainment media poses challenges due to the diverse acoustic conditions
present in media content, be it background music, overlapping speakers, or
sound effects. At the same time, speaking faces in the visual modality provide
complementary information and are not prone to the errors seen in the audio
modality. In this paper, we address the problem of speaker diarization in TV
shows using the active speaker faces. We perform face clustering on the active
speaker faces and show superior speaker diarization performance compared to the
state-of-the-art audio-based diarization methods. We additionally report a
systematic analysis of the impact of active speaker face detection quality on
the diarization performance. We also observe that a moderately well-performing
active speaker system could outperform the audio-based diarization systems.
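As a rough illustration of the clustering step described in the abstract, the sketch below groups precomputed active-speaker face embeddings into speaker identities and maps them to diarization segments. The embedding model, distance threshold, and segment format are assumptions for illustration, not details from the paper.

```python
# Minimal sketch of diarization via face clustering (not the authors' exact
# pipeline): active-speaker face crops are assumed to be already detected and
# embedded by a pretrained face-recognition encoder; each cluster of
# embeddings then becomes one speaker's set of speech segments.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize_from_faces(face_embeddings, timestamps, distance_threshold=0.7):
    """face_embeddings: (N, D) L2-normalized embeddings of active-speaker faces.
    timestamps: list of (start, end) seconds for each face's speech segment.
    Returns a list of (start, end, speaker_id) diarization records."""
    clustering = AgglomerativeClustering(
        n_clusters=None,                      # infer the number of speakers
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    labels = clustering.fit_predict(face_embeddings)
    return [(s, e, int(spk)) for (s, e), spk in zip(timestamps, labels)]

# Usage with random stand-in embeddings:
emb = np.random.randn(10, 512)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
segments = diarize_from_faces(emb, [(i, i + 1.0) for i in range(10)])
```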
Related papers
- Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
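A minimal sketch of how pairwise constraints from semantic cues might be folded into a diarization pipeline, assuming a precomputed acoustic similarity matrix between speech segments; the clamping rule and the clustering choice are illustrative, not necessarily this paper's exact formulation.

```python
# Hedged sketch of pairwise constraints propagation: semantic cues yield
# must-link / cannot-link pairs of segments, which clamp entries of the
# acoustic similarity matrix before spectral clustering.
import numpy as np
from sklearn.cluster import SpectralClustering

def apply_constraints(similarity, must_link, cannot_link):
    """similarity: (N, N) symmetric affinity between speech segments.
    must_link / cannot_link: lists of (i, j) index pairs from semantic cues."""
    S = similarity.copy()
    for i, j in must_link:
        S[i, j] = S[j, i] = 1.0   # force segments toward the same speaker
    for i, j in cannot_link:
        S[i, j] = S[j, i] = 0.0   # force segments apart
    return S

S = np.clip(np.random.rand(8, 8), 0, 1); S = (S + S.T) / 2
np.fill_diagonal(S, 1.0)
S = apply_constraints(S, must_link=[(0, 1)], cannot_link=[(0, 7)])
labels = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(S)
```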
- Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z)
- Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection [37.28070242751129]
Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality.
We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection.
arXiv Detail & Related papers (2022-12-01T14:46:00Z)
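One plausible reading of combining cross-modal identity association with audio-visual activity, sketched under the assumption of a shared speech-face embedding space; the linear weighting scheme is an illustrative guess, not the paper's method.

```python
# Rough sketch: score candidate face tracks for a speech segment by mixing a
# cross-modal identity term with an audio-visual activity term. The shared
# embedding space and the alpha weighting are assumptions for illustration.
import numpy as np

def rank_active_speaker_faces(speech_emb, face_embs, av_activity, alpha=0.5):
    """speech_emb: (D,) speaker embedding of the speech segment.
    face_embs: (K, D) identity embeddings of candidate face tracks.
    av_activity: (K,) in [0, 1], how strongly each face co-varies with audio."""
    identity = face_embs @ speech_emb / (
        np.linalg.norm(face_embs, axis=1) * np.linalg.norm(speech_emb) + 1e-8)
    score = alpha * identity + (1 - alpha) * av_activity
    return np.argsort(-score)  # best-matching face track first
```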
- Unsupervised active speaker detection in media content using cross-modal information [37.28070242751129]
We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies.
We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task.
We show competitive performance to state-of-the-art fully supervised methods.
arXiv Detail & Related papers (2022-09-24T00:51:38Z)
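The speech-face assignment formulation lends itself to a simple matching sketch. Assuming speech and face embeddings live in (or are mapped to) a shared space, a one-to-one assignment can be computed with the Hungarian algorithm; both the shared space and the one-to-one constraint are assumptions for illustration.

```python
# Minimal sketch of active speaker detection cast as speech-face assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_speech_to_faces(speech_embs, face_embs):
    """speech_embs: (S, D) one embedding per speech segment.
    face_embs: (F, D) one embedding per visible face track.
    Returns (segment_index, face_index) pairs maximizing total similarity."""
    sim = speech_embs @ face_embs.T            # similarity matrix
    rows, cols = linear_sum_assignment(-sim)   # Hungarian algorithm, maximize
    return list(zip(rows.tolist(), cols.tolist()))
```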
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
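A toy sketch of the "predict codec codes, then re-synthesize" idea described above; all module names, dimensions, and the argmax decoding are assumptions for illustration, not the paper's architecture.

```python
# Toy sketch: a network maps fused audio-visual features to discrete codebook
# indices of a neural speech codec; the codec's decoder (not shown) would then
# re-synthesize clean speech from these codes.
import torch
import torch.nn as nn

class AVCodePredictor(nn.Module):
    def __init__(self, av_dim=512, codebook_size=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(av_dim, 512), nn.ReLU(), nn.Linear(512, codebook_size))

    def forward(self, av_features):          # (batch, time, av_dim)
        logits = self.net(av_features)       # (batch, time, codebook_size)
        return logits.argmax(dim=-1)         # discrete code indices

codes = AVCodePredictor()(torch.randn(2, 100, 512))  # -> (2, 100) indices
```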
- PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppressing the background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
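A hedged sketch of a perceptual, feature-level loss in the spirit of the summary above: enhanced and clean speech are compared through a frozen, pretrained speaker network rather than in the signal domain. The added signal-loss term and its weighting are illustrative assumptions.

```python
# Sketch of a perceptual loss for speaker-oriented enhancement: compare
# enhanced and clean waveforms through a frozen speaker-embedding network.
import torch
import torch.nn.functional as F

def perceptual_speaker_loss(enhanced, clean, speaker_net, w=0.5):
    """enhanced, clean: (batch, samples) waveforms.
    speaker_net: frozen pretrained model, waveform -> speaker embedding."""
    with torch.no_grad():
        target_emb = speaker_net(clean)       # reference embedding
    emb = speaker_net(enhanced)               # gradients flow to the enhancer
    perceptual = 1 - F.cosine_similarity(emb, target_emb, dim=-1).mean()
    signal = F.l1_loss(enhanced, clean)       # low-level reconstruction term
    return w * perceptual + (1 - w) * signal  # illustrative combination
```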
- The Right to Talk: An Audio-Visual Transformer Approach [27.71444773878775]
This work introduces a new Audio-Visual Transformer approach to the problem of localizing and highlighting the main speaker in both audio and visual channels of a multi-speaker conversation video in the wild.
To the best of our knowledge, it is one of the first studies that is able to automatically localize and highlight the main speaker in both visual and audio channels in multi-speaker conversation videos.
arXiv Detail & Related papers (2021-08-06T18:04:24Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
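A minimal sketch of attention-style sound-source localization consistent with the summary above: dot products between an audio embedding and each spatial position of a visual feature map yield a heatmap over the frame. The shapes and the softmax normalization are assumptions.

```python
# Sketch: audio-visual correspondence as spatial attention over CNN features.
import torch
import torch.nn.functional as F

def localization_heatmap(audio_emb, visual_map):
    """audio_emb: (batch, D); visual_map: (batch, D, H, W) visual features.
    Returns (batch, H, W) attention over spatial positions."""
    b, d, h, w = visual_map.shape
    sim = torch.einsum("bd,bdhw->bhw", audio_emb, visual_map)  # dot products
    return F.softmax(sim.view(b, -1), dim=-1).view(b, h, w)

heat = localization_heatmap(torch.randn(2, 256), torch.randn(2, 256, 14, 14))
```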
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
- Cross modal video representations for weakly supervised active speaker localization [39.67239953795999]
A cross-modal neural network for learning visual representations is presented.
We present a weakly supervised system for the task of localizing active speakers in movie content.
We also demonstrate state-of-the-art performance for the task of voice activity detection in an audio-visual framework.
arXiv Detail & Related papers (2020-03-09T18:50:50Z)
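A rough sketch of one way cross-modal weak supervision can work for the task above: speech labels derived from the audio act as frame-level targets for a visual speaking-score map, pooled MIL-style over space. This is an illustrative scheme, not necessarily the paper's training objective.

```python
# Sketch of weak cross-modal supervision: audio-side voice activity provides
# the only label; the visual map is trained via multiple-instance max-pooling.
import torch
import torch.nn.functional as F

def weak_av_loss(visual_logits, audio_has_speech):
    """visual_logits: (batch, H, W) per-location speaking scores for a frame.
    audio_has_speech: (batch,) 1.0 if the audio indicates speech is present."""
    frame_logit = visual_logits.flatten(1).max(dim=1).values  # MIL pooling
    return F.binary_cross_entropy_with_logits(frame_logit, audio_has_speech)

loss = weak_av_loss(torch.randn(4, 14, 14), torch.tensor([1.0, 0.0, 1.0, 1.0]))
```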
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.