ConceptBeam: Concept Driven Target Speech Extraction
- URL: http://arxiv.org/abs/2207.11964v1
- Date: Mon, 25 Jul 2022 08:06:07 GMT
- Title: ConceptBeam: Concept Driven Target Speech Extraction
- Authors: Yasunori Ohishi, Marc Delcroix, Tsubasa Ochiai, Shoko Araki, Daiki
Takeuchi, Daisuke Niizumi, Akisato Kimura, Noboru Harada, and Kunio Kashino
- Abstract summary: We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam.
In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space.
We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept.
- Score: 69.85003619274295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel framework for target speech extraction based on semantic
information, called ConceptBeam. Target speech extraction means extracting the
speech of a target speaker from a mixture. Typical approaches exploit
properties of audio signals, such as harmonic structure and direction of
arrival. In contrast, ConceptBeam tackles the problem with
semantic clues. Specifically, we extract the speech of speakers speaking about
a concept, i.e., a topic of interest, using a concept specifier such as an
image or speech. Solving this novel problem would open the door to innovative
applications such as listening systems that focus on a particular topic
discussed in a conversation. Unlike keywords, concepts are abstract notions,
making it challenging to directly represent a target concept. In our scheme, a
concept is encoded as a semantic embedding by mapping the concept specifier to
a shared embedding space. This modality-independent space can be built by means
of deep metric learning using paired data consisting of images and their spoken
captions. We use it to bridge modality-dependent information, i.e., the speech
segments in the mixture, and the specified, modality-independent concept. To
demonstrate our scheme, we performed experiments using a set of images associated
with spoken captions. That is, we generated speech mixtures from these spoken
captions and used the images or speech signals as the concept specifiers. We
then extracted the target speech using the acoustic characteristics of the
identified segments. We compare ConceptBeam with two methods: one based on
keywords obtained from recognition systems and another based on sound source
separation. We show that ConceptBeam clearly outperforms the baseline methods
and effectively extracts speech based on the semantic representation.
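To make the pipeline described in the abstract concrete, below is a minimal sketch of its two ingredients: a shared embedding space trained by deep metric learning on paired images and spoken captions, and concept-driven scoring of mixture segments. The encoders, dimensions, triplet margin, and softmax temperature are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two ConceptBeam ingredients described above (PyTorch).
# Encoders, dimensions, and losses are illustrative placeholders (assumptions),
# not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptEncoder(nn.Module):
    """Maps a modality-specific feature vector into the shared concept space."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings

image_enc = ConceptEncoder(in_dim=2048)   # e.g. pooled CNN features of an image
speech_enc = ConceptEncoder(in_dim=512)   # e.g. pooled features of a speech segment

# 1) Deep metric learning on paired images and spoken captions:
#    pull matching (image, caption) pairs together, push mismatched pairs apart.
def triplet_metric_loss(img_feat, pos_speech_feat, neg_speech_feat, margin=0.2):
    a, p, n = image_enc(img_feat), speech_enc(pos_speech_feat), speech_enc(neg_speech_feat)
    pos_sim = (a * p).sum(-1)   # cosine similarity (embeddings are normalized)
    neg_sim = (a * n).sum(-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

# 2) Concept-driven segment selection: score each mixture segment against the
#    concept embedding. Here the specifier is an image feature; a speech
#    specifier would go through speech_enc instead.
def segment_relevance(concept_specifier_feat, mixture_segment_feats):
    c = image_enc(concept_specifier_feat)       # (emb_dim,) concept embedding
    s = speech_enc(mixture_segment_feats)       # (num_segments, emb_dim)
    return torch.softmax(s @ c / 0.07, dim=0)   # soft weights over segments

# Toy usage with random features
loss = triplet_metric_loss(torch.randn(8, 2048), torch.randn(8, 512), torch.randn(8, 512))
weights = segment_relevance(torch.randn(2048), torch.randn(10, 512))
print(loss.item(), weights.shape)
```

In the actual system, the high-scoring segments would then supply the acoustic characteristics used to extract the target speech, as the abstract describes.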
Related papers
- Disentangling Textual and Acoustic Features of Neural Speech Representations [23.486891834252535]
We build upon the Information Bottleneck principle to propose a disentanglement framework for complex speech representations.
We apply our framework to emotion recognition and speaker identification downstream tasks.
(arXiv 2024-10-03)
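The entry above builds on the Information Bottleneck principle. As a generic illustration of how that principle is often operationalized, here is a variational-IB-style objective (task loss plus a weighted KL compression term); the encoder, dimensions, and the beta weight are assumptions, and this is not the paper's disentanglement framework.

```python
# Hedged sketch of a generic variational Information Bottleneck objective,
# not the paper's actual framework (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBEncoder(nn.Module):
    def __init__(self, in_dim=768, z_dim=64, num_classes=7):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)
        self.clf = nn.Linear(z_dim, num_classes)   # e.g. emotion classes (assumed)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.clf(z), mu, logvar

def vib_loss(logits, labels, mu, logvar, beta=1e-3):
    task = F.cross_entropy(logits, labels)                              # keep task-relevant info
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()  # compress the rest
    return task + beta * kl

model = VIBEncoder()
logits, mu, logvar = model(torch.randn(4, 768))
print(vib_loss(logits, torch.tensor([0, 1, 2, 3]), mu, logvar))
```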
- Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction [13.5641621193917]
In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance.
Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production.
We introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements.
(arXiv 2024-04-19)
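A common realization of a contrastive semantic matching loss like the one mentioned above is a symmetric InfoNCE objective between paired speech and lip-movement embeddings. The sketch below shows that generic form; the embedding dimensionality and temperature are assumptions, not the paper's settings.

```python
# Hedged sketch of a generic InfoNCE-style contrastive matching loss between
# generated-speech and lip-movement semantic embeddings (PyTorch); an
# illustration, not the paper's implementation.
import torch
import torch.nn.functional as F

def contrastive_semantic_matching(speech_emb, lip_emb, temperature=0.07):
    """speech_emb, lip_emb: (batch, dim) semantic embeddings of paired segments."""
    s = F.normalize(speech_emb, dim=-1)
    v = F.normalize(lip_emb, dim=-1)
    logits = s @ v.t() / temperature          # pairwise similarities
    targets = torch.arange(s.size(0))         # the matching pair is the positive
    # symmetric loss: speech-to-lip and lip-to-speech
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = contrastive_semantic_matching(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```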
- Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis [16.497022070614236]
This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker.
A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm.
(arXiv 2024-02-11)
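As a rough illustration of how an embedding could be derived from phonemes and their durations, the sketch below pools a (phoneme, duration) sequence with a small recurrent network; the architecture is an assumption for illustration, not the proposed speech-rhythm model.

```python
# Hedged sketch: a rhythm-style embedding from phoneme IDs and durations (PyTorch).
import torch
import torch.nn as nn

class RhythmEmbedder(nn.Module):
    def __init__(self, num_phonemes=50, emb_dim=32, out_dim=64):
        super().__init__()
        self.phoneme_emb = nn.Embedding(num_phonemes, emb_dim)
        self.rnn = nn.GRU(emb_dim + 1, out_dim, batch_first=True)

    def forward(self, phoneme_ids, durations):
        # phoneme_ids: (batch, T) int; durations: (batch, T) in seconds
        x = torch.cat([self.phoneme_emb(phoneme_ids), durations.unsqueeze(-1)], dim=-1)
        _, h = self.rnn(x)
        return h.squeeze(0)   # (batch, out_dim) speaker rhythm embedding

emb = RhythmEmbedder()(torch.randint(0, 50, (2, 12)), torch.rand(2, 12))
print(emb.shape)  # torch.Size([2, 64])
```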
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
(arXiv 2023-09-19)
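One generic way to use pairwise constraints like those mentioned above is to clamp the segment-similarity matrix with must-link and cannot-link pairs before clustering, as sketched below; the clamping rule and the agglomerative clustering step are illustrative assumptions, not the paper's joint propagation method.

```python
# Hedged sketch: inject must-link / cannot-link constraints into a diarization
# affinity matrix before clustering (NumPy/SciPy); a generic illustration only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def apply_constraints(sim, must_link, cannot_link):
    sim = sim.copy()
    for i, j in must_link:
        sim[i, j] = sim[j, i] = 1.0    # same speaker according to semantics
    for i, j in cannot_link:
        sim[i, j] = sim[j, i] = 0.0    # different speakers according to semantics
    return sim

rng = np.random.default_rng(0)
m = rng.random((6, 6))
sim = (m + m.T) / 2                    # symmetric toy affinities between 6 segments
np.fill_diagonal(sim, 1.0)
sim = apply_constraints(sim, must_link=[(0, 1)], cannot_link=[(0, 5)])

dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)
labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")
print(labels)   # speaker label per segment
```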
- Visual Concepts Tokenization [65.61987357146997]
We propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image as a set of disentangled visual concept tokens.
To obtain these concept tokens, we use only cross-attention to extract visual information from the image tokens layer by layer, without self-attention between concept tokens.
We further propose a Concept Disentangling Loss to facilitate that different concept tokens represent independent visual concepts.
(arXiv 2022-05-20)
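The VCT summary above describes concept tokens that query image tokens with cross-attention only. The sketch below shows that pattern with learnable concept tokens and stacked cross-attention layers; the token count, width, and residual update are assumptions, and the Concept Disentangling Loss is omitted.

```python
# Hedged sketch of cross-attention-only concept tokenization (PyTorch);
# an illustration of the idea, not the VCT architecture.
import torch
import torch.nn as nn

class CrossAttentionTokenizer(nn.Module):
    def __init__(self, num_concepts=8, dim=64, num_layers=3):
        super().__init__()
        self.concept_tokens = nn.Parameter(torch.randn(num_concepts, dim))
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=4, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, image_tokens):
        # image_tokens: (batch, num_patches, dim), e.g. patch embeddings of an image
        b = image_tokens.size(0)
        concepts = self.concept_tokens.unsqueeze(0).expand(b, -1, -1)
        for attn in self.layers:
            # queries are concept tokens; keys/values are image tokens (no self-attention)
            update, _ = attn(concepts, image_tokens, image_tokens)
            concepts = concepts + update
        return concepts   # (batch, num_concepts, dim) concept tokens

tokens = CrossAttentionTokenizer()(torch.randn(2, 49, 64))
print(tokens.shape)   # torch.Size([2, 8, 64])
```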
- Speech2Video: Cross-Modal Distillation for Speech to Video Generation [21.757776580641902]
Speech-to-video generation can enable interesting applications in the entertainment, customer service, and human-computer interaction industries.
The challenge mainly lies in disentangling the distinct visual attributes from audio signals.
We propose a light-weight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs.
(arXiv 2021-07-10)
- Leveraging Acoustic and Linguistic Embeddings from Pretrained Speech and Language Models for Intent Classification [81.80311855996584]
We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on ATIS and Fluent speech corpus, respectively.
(arXiv 2021-02-15)
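A minimal way to combine acoustic embeddings from a pretrained recognizer with linguistic embeddings from a pretrained language model, as the entry above describes, is concatenation followed by a small classifier; the dimensions and intent count in the sketch are placeholders, not the paper's configuration.

```python
# Hedged sketch of acoustic + linguistic feature fusion for intent classification
# (PyTorch); a simplified illustration, not the paper's architecture.
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    def __init__(self, acoustic_dim=512, linguistic_dim=768, num_intents=26):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(acoustic_dim + linguistic_dim, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, num_intents),
        )

    def forward(self, acoustic_emb, linguistic_emb):
        # Concatenate utterance-level embeddings from both pretrained models.
        return self.fusion(torch.cat([acoustic_emb, linguistic_emb], dim=-1))

logits = IntentClassifier()(torch.randn(4, 512), torch.randn(4, 768))
print(logits.shape)   # torch.Size([4, 26])
```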
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
(arXiv 2020-08-21)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
(arXiv 2020-06-12)
- FaceFilter: Audio-visual speech separation using still images [41.97445146257419]
This paper aims to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network.
Unlike previous works that used lip movement on video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker.
(arXiv 2020-05-14)
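FaceFilter conditions separation on a single face image rather than lip movements or enrollment audio. The sketch below illustrates that kind of conditioning by injecting a face embedding into a mask estimator over the mixture spectrogram; the network shape is an assumption, not the paper's model.

```python
# Hedged sketch of face-image-conditioned mask estimation for speech separation
# (PyTorch); a simplified illustration, not FaceFilter's network.
import torch
import torch.nn as nn

class FaceConditionedMasker(nn.Module):
    def __init__(self, n_freq=257, face_dim=512, hidden=256):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, hidden)
        self.rnn = nn.GRU(n_freq + hidden, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mixture_spec, face_emb):
        # mixture_spec: (batch, frames, n_freq) magnitude spectrogram
        # face_emb: (batch, face_dim) embedding of the target speaker's face image
        cond = self.face_proj(face_emb).unsqueeze(1).expand(-1, mixture_spec.size(1), -1)
        h, _ = self.rnn(torch.cat([mixture_spec, cond], dim=-1))
        return self.mask(h) * mixture_spec   # estimated target magnitude

est = FaceConditionedMasker()(torch.randn(2, 100, 257).abs(), torch.randn(2, 512))
print(est.shape)   # torch.Size([2, 100, 257])
```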
This list is automatically generated from the titles and abstracts of the papers on this site.