STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning
- URL: http://arxiv.org/abs/2011.11387v1
- Date: Mon, 23 Nov 2020 13:29:16 GMT
- Title: STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning
- Authors: Prakamya Mishra
- Abstract summary: We present a novel multi-modal deep neural network architecture that uses speech and text entanglement for learning spoken-word representations.
STEPs-RL is trained in a supervised manner to predict the phonetic sequence of a target spoken-word.
Latent representations produced by our model were able to predict the target phonetic sequences with an accuracy of 89.47%.
- Score: 2.28438857884398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a novel multi-modal deep neural network
architecture that uses speech and text entanglement for learning phonetically
sound spoken-word representations. STEPs-RL is trained in a supervised manner
to predict the phonetic sequence of a target spoken-word from the speech and
text of its contextual spoken words, so that the model encodes meaningful
latent representations. Unlike existing work, we use text along with speech
for auditory representation learning, capturing semantic and syntactic
information alongside acoustic and temporal information. The
latent representations produced by our model not only predicted the target
phonetic sequences with an accuracy of 89.47% but also achieved results
competitive with the textual word representation models Word2Vec and FastText
(trained on textual transcripts) when evaluated on four widely used word
similarity benchmark datasets. In addition, investigation of the generated
vector space also demonstrated the capability of the proposed model to capture
the phonetic structure of the spoken-words. To the best of our knowledge, no
existing work uses speech and text entanglement for learning spoken-word
representations, which makes this work the first of its kind.
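The abstract reports that the learned vectors were scored on word similarity benchmarks against Word2Vec and FastText. A minimal sketch of how such an evaluation is typically done: cosine similarity between embedding pairs, then Spearman rank correlation against human ratings. All vectors and ratings below are invented toy values, not STEPs-RL outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(xs):
    """1-based ranks with average rank for ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# toy embeddings and human similarity ratings for three word pairs
emb = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.85, 0.15, 0.05],
    "car": [0.0, 0.9, 0.3],
    "truck": [0.1, 0.8, 0.4],
}
pairs = [("cat", "dog", 9.5), ("car", "truck", 8.9), ("cat", "car", 1.2)]
model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho = spearman(model_scores, human_scores)  # the number benchmarks report
```

Here the toy embeddings rank the pairs in the same order as the human ratings, so the correlation is perfect; real benchmark scores sit well below 1.0.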
Related papers
- Few-Shot Spoken Language Understanding via Joint Speech-Text Models [18.193191170754744]
Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations.
We leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks.
By employing a pre-trained speech-text model, we find that models fine-tuned on text can be effectively transferred to speech testing data.
arXiv Detail & Related papers (2023-10-09T17:59:21Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model)
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
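The masked prediction of unified tokens described above can be sketched in miniature: mask a fraction of positions in a discrete token stream and keep the original ids as targets, so a loss is computed only at masked positions. The token ids, MASK sentinel, and masking rate below are illustrative placeholders, not VATLM's actual values.

```python
import random

MASK = -1  # sentinel id; a real vocabulary would reserve a dedicated token

def mask_tokens(tokens, rate, rng):
    """Replace roughly `rate` of the positions with MASK.
    Returns (inputs, targets): targets holds the original id at masked
    positions and None elsewhere, so the loss is computed only there."""
    inputs, targets = [], []
    for t in tokens:
        if rng.random() < rate:
            inputs.append(MASK)
            targets.append(t)
        else:
            inputs.append(t)
            targets.append(None)
    return inputs, targets

rng = random.Random(0)
stream = [5, 7, 7, 2, 9, 5]  # toy unified token ids
inputs, targets = mask_tokens(stream, rate=0.5, rng=rng)
# the model is trained to predict targets[i] wherever inputs[i] == MASK
```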
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR [8.832255053182283]
We present a novel approach to predicting the user's intent (whether or not the user is speaking to the device) directly from acoustic and textual information encoded in subword tokens.
We show that our approach is highly accurate, mitigating 93.3% of unintended user audio from invoking the smart assistant while maintaining a 99% true positive rate.
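The operating point quoted above ("mitigates X% of unintended audio at a 99% true positive rate") can be illustrated with simple threshold arithmetic: pick the score cutoff that keeps the target fraction of intended utterances, then measure what fraction of unintended ones fall below it. The scores below are synthetic, not from the paper.

```python
def threshold_at_tpr(intended_scores, tpr):
    """Score threshold that keeps the top `tpr` fraction of intended audio."""
    s = sorted(intended_scores, reverse=True)
    k = max(1, int(len(s) * tpr))
    return s[k - 1]

def mitigation_rate(unintended_scores, thr):
    """Fraction of unintended audio scoring below the threshold (rejected)."""
    return sum(1 for x in unintended_scores if x < thr) / len(unintended_scores)

intended = [0.99, 0.95, 0.92, 0.90, 0.88, 0.85, 0.80, 0.75, 0.70, 0.60]
unintended = [0.10, 0.20, 0.65, 0.72]
thr = threshold_at_tpr(intended, 0.9)    # keep 9 of the 10 intended clips
rate = mitigation_rate(unintended, thr)  # fraction of unintended rejected
```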
arXiv Detail & Related papers (2022-10-21T17:45:00Z) - VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
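A discrete tokenizer of the kind the SpeechLM summary mentions can be sketched as nearest-neighbour quantization: each continuous frame feature is mapped to the index of its closest codebook entry, yielding token ids that both modalities can share. The codebook here is hand-picked for illustration, not learned.

```python
def quantize(frames, codebook):
    """Map each continuous frame to the index of its nearest codebook entry."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda i: d2(f, codebook[i]))
            for f in frames]

# hand-picked 2-D codebook standing in for a learned one (e.g. k-means centroids)
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
tokens = quantize([[0.1, 0.1], [0.9, 0.2], [0.2, 0.8]], codebook)
```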
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic
Word Embeddings [19.195728241989702]
We propose a multi-task learning model that incorporates top-down lexical knowledge into the training procedure of acoustic word embeddings.
We experiment with three languages and demonstrate that incorporating lexical knowledge improves the embedding space discriminability.
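Embedding-space discriminability of the sort measured above is commonly trained with a triplet-style objective, optionally mixed with a lexical (top-down) term. The sketch below is a generic illustration of that pattern, not this paper's loss; the margin and mixing weight are placeholder values.

```python
def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss: the anchor should sit closer to the positive (same word)
    than to the negative (different word) by at least `margin`."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return max(0.0, d2(anchor, positive) - d2(anchor, negative) + margin)

def multitask_loss(acoustic, lexical, alpha=0.7):
    """Weighted sum of an acoustic (form) term and a lexical (meaning) term."""
    return alpha * acoustic + (1 - alpha) * lexical
```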
arXiv Detail & Related papers (2022-09-14T13:33:04Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Unified Speech-Text Pre-training for Speech Translation and Recognition [113.31415771943162]
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition.
The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning.
It achieves between 1.7 and 2.3 BLEU improvement above the state of the art on the MuST-C speech translation dataset.
arXiv Detail & Related papers (2022-04-11T20:59:51Z) - Can phones, syllables, and words emerge as side-products of
cross-situational audiovisual learning? -- A computational investigation [2.28438857884398]
We study the so-called latent language hypothesis (LLH).
LLH connects linguistic representation learning to general predictive processing within and across sensory modalities.
We explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning.
arXiv Detail & Related papers (2021-09-29T05:49:46Z) - Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z) - Contextualized Spoken Word Representations from Convolutional
Autoencoders [2.28438857884398]
This paper proposes a Convolutional Autoencoder based neural architecture to model syntactically and semantically adequate contextualized representations of varying length spoken words.
The proposed model was able to demonstrate its robustness when compared to the other two language-based models.
arXiv Detail & Related papers (2020-07-06T16:48:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.