VCSE: Time-Domain Visual-Contextual Speaker Extraction Network
- URL: http://arxiv.org/abs/2210.06177v1
- Date: Sun, 9 Oct 2022 12:29:38 GMT
- Title: VCSE: Time-Domain Visual-Contextual Speaker Extraction Network
- Authors: Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang
- Abstract summary: We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
- Score: 54.67547526785552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker extraction seeks to extract the target speech in a multi-talker
scenario given an auxiliary reference. Such reference can be auditory, i.e., a
pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic
sequence. References in different modalities provide distinct and complementary
information that could be fused to form top-down attention on the target
speaker. Previous studies have introduced visual and contextual modalities in a
single model. In this paper, we propose a two-stage time-domain
visual-contextual speaker extraction network named VCSE, which incorporates
visual and self-enrolled contextual cues stage by stage to take full advantage
of every modality. In the first stage, we pre-extract a target speech with
visual cues and estimate the underlying phonetic sequence. In the second stage,
we refine the pre-extracted target speech with the self-enrolled contextual
cues. Experimental results on the real-world Lip Reading Sentences 3 (LRS3)
database demonstrate that our proposed VCSE network consistently outperforms
other state-of-the-art baselines.
Related papers
- Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction [13.5641621193917]
In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance.
Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production.
We introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements.
arXiv Detail & Related papers (2024-04-19T09:08:44Z) - Audio-Visual Neural Syntax Acquisition [91.14892278795892]
We study phrase structure induction from visually-grounded speech.
We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text.
arXiv Detail & Related papers (2023-10-11T16:54:57Z) - VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model)
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z) - ESSumm: Extractive Speech Summarization from Untranscribed Meeting [7.309214379395552]
We propose a novel architecture for direct extractive speech-to-speech summarization, ESSumm.
We leverage the off-the-shelf self-supervised convolutional neural network to extract the deep speech features from raw audio.
Our approach automatically predicts the optimal sequence of speech segments that capture the key information with a target summary length.
arXiv Detail & Related papers (2022-09-14T20:13:15Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z) - LiRA: Learning Visual Speech Representations from Audio through
Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA)
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.