Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction
- URL: http://arxiv.org/abs/2506.09792v2
- Date: Sun, 15 Jun 2025 08:24:31 GMT
- Title: Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction
- Authors: Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li
- Abstract summary: We explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for AV-TSE. In this study, we propose incorporating the linguistic constraints from PSLMs or PLMs for the AV-TSE model as additional supervision signals. Without any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility.
- Score: 87.49303116989708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual target speaker extraction (AV-TSE) models primarily rely on target visual cues to isolate the target speaker's voice from others. We know that humans leverage linguistic knowledge, such as syntax and semantics, to support speech perception. Inspired by this, we explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for AV-TSE. In this study, we propose incorporating the linguistic constraints from PSLMs or PLMs for the AV-TSE model as additional supervision signals. Without introducing any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility. Furthermore, we evaluate our method in multi-language settings and visual cue-impaired scenarios and show robust performance gains.
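Because the linguistic constraints act only as additional supervision at training time, the idea can be illustrated with a short, hedged sketch. The PyTorch code below is an assumption-laden illustration, not the paper's implementation: `pslm` stands for a frozen pre-trained speech(-language) encoder that maps waveforms to frame-level features, `av_tse_model` for an arbitrary AV-TSE extractor, and the SI-SDR plus cosine-distance combination with weight `lam` is an illustrative choice; the paper's exact losses, models, and weighting may differ.

```python
# Minimal training-objective sketch (illustrative; the paper's formulation may differ).
# A frozen pre-trained speech-language model (PSLM) provides a "linguistic constraint":
# features of the extracted speech are pulled toward features of the clean target speech.
# Because the PSLM is used only during training, inference cost is unchanged.
import torch
import torch.nn.functional as F

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR between estimated and reference waveforms (B, T)."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    proj = (torch.sum(est * ref, dim=-1, keepdim=True)
            / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    si_sdr = 10 * torch.log10((proj.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))
    return -si_sdr.mean()

def linguistic_constraint_loss(pslm, est_wav, ref_wav):
    """Cosine distance between frozen-PSLM features of estimated and clean target speech.
    `pslm` is a hypothetical callable returning (B, T', D) frame-level features."""
    with torch.no_grad():              # reference features need no gradient
        ref_feat = pslm(ref_wav)
    est_feat = pslm(est_wav)           # gradients flow back into the extractor via est_wav
    return 1.0 - F.cosine_similarity(est_feat, ref_feat, dim=-1).mean()

def training_step(av_tse_model, pslm, mixture, visual_cues, target_wav, lam=0.1):
    """Signal-level loss plus the PSLM-based constraint (weight `lam` is illustrative)."""
    est_wav = av_tse_model(mixture, visual_cues)   # hypothetical AV-TSE interface
    return si_sdr_loss(est_wav, target_wav) + lam * linguistic_constraint_loss(pslm, est_wav, target_wav)
```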
Related papers
- Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition [34.08564665311891]
directional-SpeechLlama is a novel approach that leverages the microphone array of smart glasses to achieve directional speech recognition. Experimental results show that our proposed directional-SpeechLlama effectively captures the relationship between textual cues and spatial audio.
arXiv Detail & Related papers (2025-06-17T20:49:41Z) - Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement [36.136070412464214]
Speech enhancement (SE) aims to improve the quality and intelligibility of speech in noisy environments. Recent studies have shown that incorporating visual cues in audio signal processing can enhance SE performance. We propose a novel multi-modal learning framework, termed DLAV-SE, which leverages a diffusion-based model integrating audio, visual, and linguistic information.
arXiv Detail & Related papers (2025-01-23T04:36:29Z) - Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy that processes the input with discretized visual speech units.
We set a new state-of-the-art in multilingual VSR by achieving performance comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z) - Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z) - SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z) - Supervised Acoustic Embeddings And Their Transferability Across Languages [2.28438857884398]
In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise.
Self-supervised pre-training has been proposed as a way to improve both supervised and unsupervised speech recognition.
arXiv Detail & Related papers (2023-01-03T09:37:24Z) - VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z) - Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158]
We propose two neural network models for personalized speech enhancement (PSE) that achieve superior performance to the previously proposed VoiceFilter.
We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
arXiv Detail & Related papers (2021-10-18T21:21:23Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training (an illustrative sketch of this mechanism appears after this list).
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
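For the VQMIVC entry above, the named mechanism (a vector-quantization bottleneck for content encoding plus a mutual-information-style penalty that decorrelates content from speaker information) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the quantizer follows the standard VQ-VAE straight-through recipe, and a simple cosine decorrelation proxy stands in for the paper's actual MI estimator; the content and speaker encoders and the decoder are assumed to exist elsewhere.

```python
# Illustrative sketch of a VQ content bottleneck plus a decorrelation penalty
# (a crude stand-in for the MI term; VQMIVC itself uses a proper MI estimator).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Standard VQ-VAE-style quantizer with a straight-through estimator."""
    def __init__(self, num_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                   # z: (B, T, D) content features
        d = (z.pow(2).sum(-1, keepdim=True)                 # squared distances to all codes
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        idx = d.argmin(dim=-1)                              # nearest code per frame
        z_q = self.codebook(idx)                            # (B, T, D) quantized content
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                        # straight-through gradient
        return z_q, vq_loss

def decorrelation_penalty(content_seq, speaker_emb):
    """Penalize similarity between time-pooled content and the speaker embedding.
    Assumes matching dimensions; only a rough proxy for a mutual-information term."""
    content = content_seq.mean(dim=1)                       # (B, D)
    return F.cosine_similarity(content, speaker_emb, dim=-1).abs().mean()

# Usage sketch (hypothetical encoders and reconstruction loss):
#   z_q, vq_loss = vq(content_encoder(wav))
#   total = recon_loss + vq_loss + lam * decorrelation_penalty(z_q, speaker_encoder(wav))
```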