Speech Recognition Transformers: Topological-lingualism Perspective
- URL: http://arxiv.org/abs/2408.14991v1
- Date: Tue, 27 Aug 2024 12:15:43 GMT
- Title: Speech Recognition Transformers: Topological-lingualism Perspective
- Authors: Shruti Singh, Muskaan Singh, Virender Kadyan
- Abstract summary: The paper presents a comprehensive survey of transformer techniques oriented toward the speech modality.
The main contents of this survey include (1) the background of traditional ASR, the end-to-end transformer ecosystem, and speech transformers, and (2) foundational models in speech via the lingualism paradigm.
- Score: 5.874509965718588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have evolved with great success in various artificial intelligence tasks. Thanks to the recent prevalence of self-attention mechanisms, which capture long-term dependencies, phenomenal outcomes have been produced in speech processing and recognition tasks. The paper presents a comprehensive survey of transformer techniques oriented toward the speech modality. The main contents of this survey include (1) the background of traditional ASR, the end-to-end transformer ecosystem, and speech transformers; (2) foundational models in speech via the lingualism paradigm, i.e., monolingual, bilingual, multilingual, and cross-lingual; (3) datasets and languages, acoustic features, architectures, decoding, and evaluation metrics from a specific topological lingualism perspective; and (4) popular speech transformer toolkits for building end-to-end ASR systems. Finally, we highlight open challenges and potential research directions for the community to conduct further research in this domain.
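As a rough illustration of the self-attention mechanism the abstract refers to, the NumPy sketch below shows scaled dot-product attention: every acoustic frame attends to every other frame in a single step, which is how transformers capture long-range dependencies. The shapes, matrix names, and sizes are illustrative assumptions, not taken from any specific ASR system.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of frame embeddings.

    X:          (T, d_model) acoustic frame embeddings
    Wq, Wk, Wv: (d_model, d_k) projection matrices (illustrative shapes)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T): every frame scores every frame
    weights = softmax(scores, axis=-1)        # long-range dependencies in one step
    return weights @ V                        # (T, d_k) context-mixed representations

# toy usage: 50 frames of 64-dim features, one 32-dim attention head
rng = np.random.default_rng(0)
T, d_model, d_k = 50, 64, 32
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)   # shape (50, 32)
```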
Related papers
- MulliVC: Multi-lingual Voice Conversion With Cycle Consistency [75.59590240034261]
MulliVC is a novel voice conversion system that converts only timbre while keeping the original content and source-language prosody, without requiring multi-lingual paired data.
Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts.
arXiv Detail & Related papers (2024-08-08T18:12:51Z) - Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce the Generative Pre-trained Speech Transformer (GPST).
GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z) - TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z) - Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a listener's response: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
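The listener-gesture quantization mentioned above relies on the standard VQ-VAE vector-quantization step. The sketch below shows only that nearest-codebook lookup; the codebook size, latent dimensions, and variable names are illustrative assumptions rather than the paper's actual model.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    z:        (T, d) continuous latents, e.g. per-frame gesture features
    codebook: (K, d) learned embedding vectors
    Returns discrete token ids (T,) and the quantized vectors (T, d).
    """
    # squared Euclidean distance between every latent and every code
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    ids = d2.argmin(axis=1)                                     # discrete tokens
    return ids, codebook[ids]

# toy usage: 20 frames of 8-dim latents, 512-entry codebook (illustrative sizes)
rng = np.random.default_rng(0)
z = rng.normal(size=(20, 8))
codebook = rng.normal(size=(512, 8))
tokens, z_q = vector_quantize(z, codebook)
# an autoregressive transformer would then be trained to predict `tokens`
# from the speaker's words, as the abstract describes
```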
arXiv Detail & Related papers (2023-08-21T17:59:02Z) - Transformers in Speech Processing: A Survey [4.984401393225283]
Transformers have gained prominence across various speech-related domains, including automatic speech recognition, speech synthesis, speech translation, speech paralinguistics, speech enhancement, spoken dialogue systems, and multimodal applications.
We present a comprehensive survey that aims to bridge research studies from diverse subfields within speech technology.
arXiv Detail & Related papers (2023-03-21T06:00:39Z) - SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing [17.128885611538486]
Paralinguistic speech processing is important in addressing many issues, such as sentiment and neurocognitive disorder analyses.
We consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing.
SpeechFormer++ is evaluated on the speech emotion recognition (IEMOCAP & MELD), depression classification (DAIC-WOZ) and Alzheimer's disease detection (Pitt) tasks.
arXiv Detail & Related papers (2023-02-27T11:48:54Z) - Probing Speech Emotion Recognition Transformers for Linguistic Knowledge [7.81884995637243]
We investigate the extent to which linguistic information is exploited during speech emotion recognition fine-tuning.
We synthesise prosodically neutral speech utterances while varying the sentiment of the text.
Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers.
arXiv Detail & Related papers (2022-04-01T12:47:45Z) - Multi-View Self-Attention Based Transformer for Speaker Recognition [33.21173007319178]
The Transformer model is widely used for speech processing tasks such as speaker recognition.
We propose a novel multi-view self-attention mechanism for the speaker Transformer.
We show that the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
arXiv Detail & Related papers (2021-10-11T07:03:23Z) - Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification [66.62686601948455]
We exploit a transformer distillation method specifically designed for knowledge distillation from a transformer-based language model to a transformer-based speech model.
We achieve an intent classification accuracy of 99.10% and 88.79% on the Fluent Speech corpus and the ATIS database, respectively.
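The distillation setup described above is typically trained with a soft-label objective combined with the usual cross-entropy on gold intents. Below is a minimal NumPy sketch of that standard (Hinton-style) distillation loss; the temperature, weighting, shapes, and names are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label knowledge distillation combined with standard cross-entropy.

    student_logits: (B, C) intent logits from the speech transformer (student)
    teacher_logits: (B, C) intent logits from the BERT text model (teacher)
    labels:         (B,)   gold intent ids
    """
    p_teacher = softmax(teacher_logits / T)                 # softened teacher distribution
    log_p_student = np.log(softmax(student_logits / T) + 1e-12)
    kd = -(p_teacher * log_p_student).sum(axis=1).mean() * T * T   # soft-target term
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * kd + (1.0 - alpha) * ce

# toy usage with 4 utterances and 6 intent classes (illustrative sizes)
rng = np.random.default_rng(0)
loss = distillation_loss(rng.normal(size=(4, 6)), rng.normal(size=(4, 6)),
                         labels=np.array([0, 3, 2, 5]))
```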
arXiv Detail & Related papers (2021-08-05T13:08:13Z) - On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP).
arXiv Detail & Related papers (2021-07-20T13:30:23Z) - Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.