Analysis of Joint Speech-Text Embeddings for Semantic Matching
- URL: http://arxiv.org/abs/2204.01235v1
- Date: Mon, 4 Apr 2022 04:50:32 GMT
- Title: Analysis of Joint Speech-Text Embeddings for Semantic Matching
- Authors: Muhammad Huzaifah and Ivan Kukanov
- Abstract summary: We study a joint speech-text embedding space trained for semantic matching by minimizing the distance between paired utterance and transcription inputs.
We extend our method to incorporate automatic speech recognition through both pretraining and multitask scenarios.
- Score: 3.6423306784901235
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Embeddings play an important role in many recent end-to-end solutions for
language processing problems involving more than one data modality. Although
there has been some effort to understand the properties of single-modality
embedding spaces, particularly that of text, their cross-modal counterparts are
less understood. In this work, we study a joint speech-text embedding space
trained for semantic matching by minimizing the distance between paired
utterance and transcription inputs. This was done through dual encoders in a
teacher-student model setup, with a pretrained language model acting as the
teacher and a transformer-based speech encoder as the student. We extend our
method to incorporate automatic speech recognition through both pretraining and
multitask scenarios and find that both approaches improve semantic matching.
Multiple techniques were utilized to analyze and evaluate cross-modal semantic
alignment of the embeddings: a quantitative retrieval accuracy metric,
zero-shot classification to investigate generalizability, and probing of the
encoders to observe the extent of knowledge transfer from one modality to
another.
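To make the training setup concrete, here is a minimal sketch (not the authors' code) of the objective described in the abstract: the distance between paired utterance and transcription embeddings is minimized, with a frozen pretrained language model acting as the teacher and a transformer-based speech encoder as the student. The module names, dimensions, and the choice of mean pooling and MSE distance below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    """Stand-in transformer speech encoder (the student)."""
    def __init__(self, feat_dim: int = 80, d_model: int = 768, n_layers: int = 4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) acoustic features, e.g. log-mel filterbanks
        hidden = self.encoder(self.proj(feats))
        return hidden.mean(dim=1)  # mean-pool over frames -> one utterance-level embedding

def semantic_matching_loss(speech_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Distance between paired utterance and transcription embeddings.
    The teacher (text) embedding is detached so only the student receives gradients."""
    return F.mse_loss(speech_emb, text_emb.detach())

# Toy usage with one batch of 4 utterance/transcription pairs.
student = SpeechEncoder()
speech_feats = torch.randn(4, 200, 80)    # placeholder acoustic features
teacher_text_emb = torch.randn(4, 768)    # would come from a frozen pretrained language model
loss = semantic_matching_loss(student(speech_feats), teacher_text_emb)
loss.backward()
```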
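The quantitative retrieval accuracy metric mentioned in the abstract can be sketched as below, assuming a top-1 speech-to-text nearest-neighbour formulation over cosine similarity; this is an assumed formulation for illustration, not the paper's exact evaluation code.

```python
import torch
import torch.nn.functional as F

def retrieval_accuracy(speech_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Top-1 speech-to-text retrieval accuracy for (N, d) row-aligned embedding pairs."""
    speech = F.normalize(speech_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    sims = speech @ text.T                      # (N, N) cosine-similarity matrix
    predicted = sims.argmax(dim=-1)             # nearest transcription for each utterance
    targets = torch.arange(speech_emb.size(0))  # the ground-truth pairing is the diagonal
    return (predicted == targets).float().mean().item()

# Example: identical embeddings on both sides retrieve perfectly (accuracy 1.0).
emb = torch.randn(8, 768)
print(retrieval_accuracy(emb, emb))
```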
Related papers
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embedding for multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z)
- Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer [78.35816158511523]
We present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification.
We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC.
arXiv Detail & Related papers (2024-04-26T07:30:32Z)
- Topic-DPR: Topic-based Prompts for Dense Passage Retrieval [6.265789210037749]
We present Topic-DPR, a dense passage retrieval model that uses topic-based prompts.
We introduce a novel positive and negative sampling strategy, leveraging semi-structured data to boost dense retrieval efficiency.
arXiv Detail & Related papers (2023-10-10T13:45:24Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Coherence and Diversity through Noise: Self-Supervised Paraphrase Generation via Structure-Aware Denoising [5.682665111938764]
We propose SCANING, an unsupervised framework for paraphrasing via controlled noise injection.
We focus on the novel task of paraphrasing algebraic word problems, which has practical applications in online pedagogy.
We demonstrate SCANING considerably improves performance in terms of both semantic preservation and producing diverse paraphrases.
arXiv Detail & Related papers (2023-02-06T13:50:57Z)
- Pre-trained Sentence Embeddings for Implicit Discourse Relation Classification [26.973476248983477]
Implicit discourse relations bind smaller linguistic units into coherent texts.
We explore the utility of pre-trained sentence embeddings as base representations in a neural network for implicit discourse relation sense classification.
arXiv Detail & Related papers (2022-10-20T04:17:03Z)
- Towards Generalized Models for Task-oriented Dialogue Modeling on Spoken Conversations [22.894541507068933]
This paper presents our approach to build generalized models for the Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations Challenge of DSTC-10.
We employ extensive data augmentation strategies on written data, including artificial error injection and round-trip text-speech transformation.
Our approach ranks third on the objective evaluation and second on the final official human evaluation.
arXiv Detail & Related papers (2022-03-08T12:26:57Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Metaphor Detection using Deep Contextualized Word Embeddings [0.0]
We present an end-to-end method composed of deep contextualized word embeddings, bidirectional LSTMs, and a multi-head attention mechanism.
Our method requires only the raw text sequences as input features to detect the metaphoricity of a phrase.
arXiv Detail & Related papers (2020-09-26T11:00:35Z)