Target conversation extraction: Source separation using turn-taking dynamics
- URL: http://arxiv.org/abs/2407.11277v2
- Date: Mon, 29 Jul 2024 19:35:36 GMT
- Title: Target conversation extraction: Source separation using turn-taking dynamics
- Authors: Tuochao Chen, Qirui Wang, Bohan Wu, Malek Itani, Sefik Emre Eskimez, Takuya Yoshioka, Shyamnath Gollakota,
- Abstract summary: We introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants.
Using neural networks, we show the feasibility of our approach on English and Mandarin conversation datasets.
In the presence of interfering speakers, our results show an 8.19 dB improvement in signal-to-noise ratio for 2-speaker conversations and a 7.92 dB improvement for 2-4-speaker conversations.
- Score: 23.189364779538757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extracting the speech of participants in a conversation amidst interfering speakers and noise presents a challenging problem. In this paper, we introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants. To accomplish this, we propose leveraging temporal patterns inherent in human conversations, particularly turn-taking dynamics, which uniquely characterize speakers engaged in conversation and distinguish them from interfering speakers and noise. Using neural networks, we show the feasibility of our approach on English and Mandarin conversation datasets. In the presence of interfering speakers, our results show an 8.19 dB improvement in signal-to-noise ratio for 2-speaker conversations and a 7.92 dB improvement for 2-4-speaker conversations. Code, dataset available at https://github.com/chentuochao/Target-Conversation-Extraction.
Related papers
- Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z) - Exploring Speaker-Related Information in Spoken Language Understanding
for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z) - Joining the Conversation: Towards Language Acquisition for Ad Hoc Team
Play [1.370633147306388]
We propose and consider the problem of cooperative language acquisition as a particular form of the ad hoc team play problem.
We present a probabilistic model for inferring a speaker's intentions and a listener's semantics from observing communications between a team of language-users.
arXiv Detail & Related papers (2023-05-20T16:59:27Z) - Question-Interlocutor Scope Realized Graph Modeling over Key Utterances
for Dialogue Reading Comprehension [61.55950233402972]
We propose a new key utterances extracting method for dialogue reading comprehension.
It performs prediction on the unit formed by several contiguous utterances, which can realize more answer-contained utterances.
As a graph constructed on the text of utterances, we then propose Question-Interlocutor Scope Realized Graph (QuISG) modeling.
arXiv Detail & Related papers (2022-10-26T04:00:42Z) - Speaker Extraction with Co-Speech Gestures Cue [79.91394239104908]
We explore the use of co-speech gestures sequence, e.g. hand and body movements, as the speaker cue for speaker extraction.
We propose two networks using the co-speech gestures cue to perform attentive listening on the target speaker.
The experimental results show that the co-speech gestures cue is informative in associating the target speaker, and the quality of the extracted speech shows significant improvements over the unprocessed mixture speech.
arXiv Detail & Related papers (2022-03-31T06:48:52Z) - Enhanced Speaker-aware Multi-party Multi-turn Dialogue Comprehension [43.352833140317486]
Multi-party multi-turn dialogue comprehension brings unprecedented challenges.
Most existing methods deal with dialogue contexts as plain texts.
We propose an enhanced speaker-aware model with masking attention and heterogeneous graph networks.
arXiv Detail & Related papers (2021-09-09T07:12:22Z) - Self- and Pseudo-self-supervised Prediction of Speaker and Key-utterance
for Multi-party Dialogue Reading Comprehension [46.69961067676279]
Multi-party dialogue machine reading comprehension (MRC) brings tremendous challenge since it involves multiple speakers at one dialogue.
Previous models focus on how to incorporate speaker information flows using complex graph-based modules.
In this paper, we design two labour-free self- and pseudo-self-supervised prediction tasks on speaker and key-utterance to implicitly model the speaker information flows.
arXiv Detail & Related papers (2021-09-08T16:51:41Z) - The Right to Talk: An Audio-Visual Transformer Approach [27.71444773878775]
This work introduces a new Audio-Visual Transformer approach to the problem of localization and highlighting the main speaker in both audio and visual channels of a multi-speaker conversation video in the wild.
To the best of our knowledge, it is one of the first studies that is able to automatically localize and highlight the main speaker in both visual and audio channels in multi-speaker conversation videos.
arXiv Detail & Related papers (2021-08-06T18:04:24Z) - Filling the Gap of Utterance-aware and Speaker-aware Representation for
Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In the existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) as encoder represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z) - Active Speakers in Context [88.22935329360618]
Current methods for active speak er detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z) - Improving speaker discrimination of target speech extraction with
time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.