Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction
- URL: http://arxiv.org/abs/2310.07284v3
- Date: Sun, 15 Oct 2023 03:58:29 GMT
- Title: Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction
- Authors: Xiang Hao, Jibin Wu, Jianwei Yu, Chenglin Xu, Kay Chen Tan
- Abstract summary: This study investigates the integration of natural language descriptions to enhance the feasibility, controllability, and performance of existing target speaker extraction models.
We propose a model named LLM-TSE, wherein a large language model (LLM) extracts useful semantic cues from the user's typed text input.
Our experimental results demonstrate competitive performance when only text-based cues are presented, the effectiveness of using input text as a task selector, and a new state-of-the-art when combining text-based cues with pre-registered cues.
- Score: 39.985710814952625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans possess an extraordinary ability to selectively focus on the sound
source of interest amidst complex acoustic environments, commonly referred to
as cocktail party scenarios. In an attempt to replicate this remarkable
auditory attention capability in machines, target speaker extraction (TSE)
models have been developed. These models leverage the pre-registered cues of
the target speaker to extract the sound source of interest. However, the
effectiveness of these models is hindered in real-world scenarios where
pre-registered cues are unreliable or altogether absent. To address this
limitation, this study investigates the integration of natural language descriptions to
enhance the feasibility, controllability, and performance of existing TSE
models. Specifically, we propose a model named LLM-TSE, wherein a large
language model (LLM) extracts useful semantic cues from the user's typed text
input. These cues can serve as independent extraction cues, as task selectors
to control the TSE process, or as complements to the pre-registered cues. Our experimental
results demonstrate competitive performance when only text-based cues are
presented, the effectiveness of using input text as a task selector, and a new
state-of-the-art when combining text-based cues with pre-registered cues. To
our knowledge, this is the first study to successfully incorporate LLMs to
guide target speaker extraction, which can be a cornerstone for cocktail party
problem research.
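
As a rough illustration of the cue-conditioning idea described in the abstract, the sketch below (PyTorch) fuses a text-cue embedding, such as one produced by an LLM, with an optional pre-registered speaker embedding to condition a mask-based extractor. The class name, the FiLM-like multiplicative fusion, and all layer sizes are illustrative assumptions, not the published LLM-TSE architecture.

```python
# Minimal sketch of text-guided target speaker extraction (TSE).
# Assumptions: text_cue is an embedding obtained elsewhere (e.g., from an
# LLM); spk_cue is an optional pre-registered speaker embedding.
import torch
import torch.nn as nn


class CueConditionedExtractor(nn.Module):
    def __init__(self, feat_dim=256, cue_dim=512):
        super().__init__()
        self.encoder = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        # Fuse the text cue with the (optional) pre-registered speaker cue.
        self.fuse = nn.Linear(2 * cue_dim, feat_dim)
        self.mask_net = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=1),
            nn.Sigmoid(),
        )
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture, text_cue, spk_cue=None):
        # mixture: (B, 1, T); text_cue / spk_cue: (B, cue_dim)
        if spk_cue is None:  # text-only mode mentioned in the abstract
            spk_cue = torch.zeros_like(text_cue)
        cond = self.fuse(torch.cat([text_cue, spk_cue], dim=-1))  # (B, F)
        feats = self.encoder(mixture)                             # (B, F, T')
        feats = feats * cond.unsqueeze(-1)  # FiLM-like conditioning
        mask = self.mask_net(feats)         # per-feature soft mask
        return self.decoder(feats * mask)   # back to waveform, (B, 1, T)


# Usage with random tensors standing in for an LLM text embedding and a
# speaker-enrollment embedding.
model = CueConditionedExtractor()
mix = torch.randn(2, 1, 16000)
text_emb = torch.randn(2, 512)
est_text_only = model(mix, text_emb)                   # text cue alone
est_combined = model(mix, text_emb, torch.randn(2, 512))  # both cues
```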
Related papers
- SIG: Speaker Identification in Literature via Prompt-Based Generation [13.042070464592374]
We propose a generation-based method that verbalizes the task and quotation input based on designed prompt templates.
The prediction can either come from direct generation by the model, or be determined by the highest generation probability of each speaker candidate.
We perform both cross-domain evaluation and in-domain evaluation on PDNC, the largest dataset of this task.
arXiv Detail & Related papers (2023-12-22T10:29:18Z) - Furnishing Sound Event Detection with Language Model Abilities [11.435984426303419]
We propose an elegant method that aligns audio features and text features to accomplish sound event classification and temporal location.
The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding representations of the text and audio, and a decoupled language decoder.
arXiv Detail & Related papers (2023-08-22T15:59:06Z) - Zero-shot text-to-speech synthesis conditioned using self-supervised
speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches.
arXiv Detail & Related papers (2023-04-24T10:15:58Z) - STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z) - An Exploration of Prompt Tuning on Generative Spoken Language Model for
Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z) - End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z) - A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - Speak or Chat with Me: End-to-End Spoken Language Understanding System
with Flexible Inputs [21.658650440278063]
We propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both.
Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance.
arXiv Detail & Related papers (2021-04-07T20:48:08Z) - Self-supervised Text-independent Speaker Verification using Prototypical
Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning; a generic sketch of this training setup appears after this list.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)