Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction
- URL: http://arxiv.org/abs/2310.07284v3
- Date: Sun, 15 Oct 2023 03:58:29 GMT
- Title: Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction
- Authors: Xiang Hao, Jibin Wu, Jianwei Yu, Chenglin Xu, Kay Chen Tan
- Abstract summary: This study investigates the integration of natural language descriptions to enhance the feasibility, controllability, and performance of existing target speaker extraction models.
We propose a model named LLM-TSE, wherein a large language model (LLM) extracts useful semantic cues from the user's typed text input.
Our experimental results demonstrate competitive performance when only text-based cues are presented, the effectiveness of using input text as a task selector, and a new state-of-the-art when combining text-based cues with pre-registered cues.
- Score: 39.985710814952625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans possess an extraordinary ability to selectively focus on the sound
source of interest amidst complex acoustic environments, commonly referred to
as cocktail party scenarios. In an attempt to replicate this remarkable
auditory attention capability in machines, target speaker extraction (TSE)
models have been developed. These models leverage the pre-registered cues of
the target speaker to extract the sound source of interest. However, the
effectiveness of these models is hindered in real-world scenarios where
pre-registered cues are unreliable or altogether absent. To address this
limitation, this study investigates the integration of natural language descriptions to
enhance the feasibility, controllability, and performance of existing TSE
models. Specifically, we propose a model named LLM-TSE, wherein a large
language model (LLM) extracts useful semantic cues from the user's typed text
input. These cues can serve as independent extraction cues, as task selectors
to control the TSE process, or as complements to the pre-registered cues. Our experimental
results demonstrate competitive performance when only text-based cues are
presented, the effectiveness of using input text as a task selector, and a new
state-of-the-art when combining text-based cues with pre-registered cues. To
our knowledge, this is the first study to successfully incorporate LLMs to
guide target speaker extraction, which can be a cornerstone for cocktail party
problem research.
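
As a rough illustration of the cue-conditioning idea described in the abstract, the sketch below (PyTorch) fuses a text-cue embedding, such as one produced by an LLM, with an optional pre-registered speaker embedding to condition a mask-based extractor. The class name, the FiLM-like multiplicative fusion, and all layer sizes are illustrative assumptions, not the published LLM-TSE architecture.

```python
# Minimal sketch of text-guided target speaker extraction (TSE).
# Assumptions: text_cue is an embedding obtained elsewhere (e.g., from an
# LLM); spk_cue is an optional pre-registered speaker embedding.
import torch
import torch.nn as nn


class CueConditionedExtractor(nn.Module):
    def __init__(self, feat_dim=256, cue_dim=512):
        super().__init__()
        self.encoder = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        # Fuse the text cue with the (optional) pre-registered speaker cue.
        self.fuse = nn.Linear(2 * cue_dim, feat_dim)
        self.mask_net = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=1),
            nn.Sigmoid(),
        )
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture, text_cue, spk_cue=None):
        # mixture: (B, 1, T); text_cue / spk_cue: (B, cue_dim)
        if spk_cue is None:  # text-only mode mentioned in the abstract
            spk_cue = torch.zeros_like(text_cue)
        cond = self.fuse(torch.cat([text_cue, spk_cue], dim=-1))  # (B, F)
        feats = self.encoder(mixture)                             # (B, F, T')
        feats = feats * cond.unsqueeze(-1)  # FiLM-like conditioning
        mask = self.mask_net(feats)         # per-feature soft mask
        return self.decoder(feats * mask)   # back to waveform, (B, 1, T)


# Usage with random tensors standing in for an LLM text embedding and a
# speaker-enrollment embedding.
model = CueConditionedExtractor()
mix = torch.randn(2, 1, 16000)
text_emb = torch.randn(2, 512)
est_text_only = model(mix, text_emb)                   # text cue alone
est_combined = model(mix, text_emb, torch.randn(2, 512))  # both cues
```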
Related papers
- SIG: Speaker Identification in Literature via Prompt-Based Generation [13.042070464592374]
We propose a generation-based method that verbalizes the task and quotation input based on designed prompt templates.
The prediction can either come from direct generation by the model, or be determined by the highest generation probability of each speaker candidate.
We perform both cross-domain evaluation and in-domain evaluation on PDNC, the largest dataset of this task.
arXiv Detail & Related papers (2023-12-22T10:29:18Z) - Furnishing Sound Event Detection with Language Model Abilities [11.435984426303419]
We propose an elegant method that aligns audio features and text features to accomplish sound event classification and temporal location.
The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding representations of the text and audio, and a decoupled language decoder.
arXiv Detail & Related papers (2023-08-22T15:59:06Z) - Zero-shot text-to-speech synthesis conditioned using self-supervised
speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches.
arXiv Detail & Related papers (2023-04-24T10:15:58Z) - STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z) - An Exploration of Prompt Tuning on Generative Spoken Language Model for
Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z) - End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z) - A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - Speak or Chat with Me: End-to-End Spoken Language Understanding System
with Flexible Inputs [21.658650440278063]
We propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both.
Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance.
arXiv Detail & Related papers (2021-04-07T20:48:08Z) - Self-supervised Text-independent Speaker Verification using Prototypical
Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning; a generic sketch of this training setup appears after this list.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)