Analysis of Joint Speech-Text Embeddings for Semantic Matching
- URL: http://arxiv.org/abs/2204.01235v1
- Date: Mon, 4 Apr 2022 04:50:32 GMT
- Title: Analysis of Joint Speech-Text Embeddings for Semantic Matching
- Authors: Muhammad Huzaifah and Ivan Kukanov
- Abstract summary: We study a joint speech-text embedding space trained for semantic matching by minimizing the distance between paired utterance and transcription inputs.
We extend our method to incorporate automatic speech recognition through both pretraining and multitask scenarios.
- Score: 3.6423306784901235
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Embeddings play an important role in many recent end-to-end solutions for
language processing problems involving more than one data modality. Although
there has been some effort to understand the properties of single-modality
embedding spaces, particularly that of text, their cross-modal counterparts are
less understood. In this work, we study a joint speech-text embedding space
trained for semantic matching by minimizing the distance between paired
utterance and transcription inputs. This was done through dual encoders in a
teacher-student model setup, with a pretrained language model acting as the
teacher and a transformer-based speech encoder as the student. We extend our
method to incorporate automatic speech recognition through both pretraining and
multitask scenarios, and find that both approaches improve semantic matching.
Multiple techniques were utilized to analyze and evaluate cross-modal semantic
alignment of the embeddings: a quantitative retrieval accuracy metric,
zero-shot classification to investigate generalizability, and probing of the
encoders to observe the extent of knowledge transfer from one modality to
another.
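
As a rough, hypothetical sketch (not the authors' released code), the PyTorch snippet below illustrates the setup the abstract describes: a transformer-based speech encoder (the student) is trained to match embeddings produced by a frozen pretrained language model (the teacher) by minimizing the distance between paired utterance and transcription embeddings, and cross-modal retrieval accuracy is measured as the fraction of utterances whose nearest text embedding is their own transcription. The model dimensions, mean pooling, and MSE distance are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a dual-encoder, teacher-student setup for speech-text
# semantic matching. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class SpeechStudent(nn.Module):
    """Transformer-based speech encoder mapping a frame sequence to one vector."""

    def __init__(self, feat_dim=80, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        hidden = self.encoder(self.proj(feats))
        return hidden.mean(dim=1)              # mean-pool frames -> (batch, d_model)


def semantic_matching_loss(speech_emb, text_emb):
    """Distance between paired utterance and transcription embeddings (MSE here)."""
    return nn.functional.mse_loss(speech_emb, text_emb)


@torch.no_grad()
def retrieval_accuracy(speech_emb, text_emb):
    """Speech-to-text retrieval: fraction of utterances whose nearest text
    embedding (by cosine similarity) is their own transcription."""
    s = nn.functional.normalize(speech_emb, dim=-1)
    t = nn.functional.normalize(text_emb, dim=-1)
    pred = (s @ t.T).argmax(dim=-1)
    return (pred == torch.arange(len(s))).float().mean().item()


# Toy training step with random tensors standing in for real data; in practice
# text_emb would come from a frozen pretrained language model (the teacher).
student = SpeechStudent()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
feats = torch.randn(8, 200, 80)      # 8 utterances of 200 log-mel frames each
text_emb = torch.randn(8, 256)       # teacher embeddings of the transcriptions

optimizer.zero_grad()
speech_emb = student(feats)
loss = semantic_matching_loss(speech_emb, text_emb)
loss.backward()
optimizer.step()
print(f"loss={loss.item():.4f}  "
      f"retrieval acc={retrieval_accuracy(speech_emb.detach(), text_emb):.2f}")
```
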
Related papers
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embedding for multiple talkers.
We achieve acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z)
- Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer [78.35816158511523]
We present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification.
We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC.
arXiv Detail & Related papers (2024-04-26T07:30:32Z)
- Learning Disentangled Speech Representations [0.412484724941528]
SynSpeech is a novel large-scale synthetic speech dataset designed to enable research on disentangled speech representations.
We present a framework to evaluate disentangled representation learning techniques, applying both linear probing and established supervised disentanglement metrics.
We find that SynSpeech facilitates benchmarking across a range of factors, achieving promising disentanglement of simpler features like gender and speaking style, while highlighting challenges in isolating complex attributes like speaker identity.
arXiv Detail & Related papers (2023-11-04T04:54:17Z)
- Topic-DPR: Topic-based Prompts for Dense Passage Retrieval [6.265789210037749]
We present Topic-DPR, a dense passage retrieval model that uses topic-based prompts.
We introduce a novel positive and negative sampling strategy, leveraging semi-structured data to boost dense retrieval efficiency.
arXiv Detail & Related papers (2023-10-10T13:45:24Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Coherence and Diversity through Noise: Self-Supervised Paraphrase Generation via Structure-Aware Denoising [5.682665111938764]
We propose SCANING, an unsupervised framework for paraphrasing via controlled noise injection.
We focus on the novel task of paraphrasing algebraic word problems, which has practical applications in online pedagogy.
We demonstrate SCANING considerably improves performance in terms of both semantic preservation and producing diverse paraphrases.
arXiv Detail & Related papers (2023-02-06T13:50:57Z)
- Towards Generalized Models for Task-oriented Dialogue Modeling on Spoken Conversations [22.894541507068933]
This paper presents our approach to building generalized models for the Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations Challenge of DSTC-10.
We employ extensive data augmentation strategies on written data, including artificial error injection and round-trip text-speech transformation.
Our approach ranks third on the objective evaluation and second on the final official human evaluation.
arXiv Detail & Related papers (2022-03-08T12:26:57Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Metaphor Detection using Deep Contextualized Word Embeddings [0.0]
We present an end-to-end method composed of deep contextualized word embeddings, bidirectional LSTMs, and a multi-head attention mechanism.
Our method requires only the raw text sequences as input features to detect the metaphoricity of a phrase.
arXiv Detail & Related papers (2020-09-26T11:00:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.