SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic
Speech Processing
- URL: http://arxiv.org/abs/2302.14638v1
- Date: Mon, 27 Feb 2023 11:48:54 GMT
- Title: SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic
Speech Processing
- Authors: Weidong Chen, Xiaofen Xing, Xiangmin Xu, Jianxin Pang, Lan Du
- Abstract summary: Paralinguistic speech processing is important in addressing many issues, such as sentiment and neurocognitive disorder analyses.
We consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing.
SpeechFormer++ is evaluated on the speech emotion recognition (IEMOCAP & MELD), depression classification (DAIC-WOZ) and Alzheimer's disease detection (Pitt) tasks.
- Score: 17.128885611538486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Paralinguistic speech processing is important in addressing many issues, such
as sentiment and neurocognitive disorder analyses. Recently, the Transformer has
achieved remarkable success in natural language processing and has been
successfully adapted to speech. However, previous Transformer-based works in the
speech field have not incorporated the structural properties of speech, leaving
the full potential of the Transformer unexplored. In this paper, we consider the
characteristics of speech and propose a general structure-based framework,
called SpeechFormer++, for paralinguistic speech processing. More concretely,
following the component relationship in the speech signal, we design a unit
encoder to model the intra- and inter-unit information (i.e., frames, phones,
and words) efficiently. According to the hierarchical relationship, we utilize
merging blocks to generate features at different granularities, which is
consistent with the structural pattern in the speech signal. Moreover, a word
encoder is introduced to integrate word-grained features into each unit
encoder, which effectively balances fine-grained and coarse-grained
information. SpeechFormer++ is evaluated on the speech emotion recognition
(IEMOCAP & MELD), depression classification (DAIC-WOZ) and Alzheimer's disease
detection (Pitt) tasks. The results show that SpeechFormer++ outperforms the
standard Transformer while greatly reducing the computational cost.
Furthermore, it delivers superior results compared to the state-of-the-art
approaches.
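
To make the hierarchy concrete, the sketch below illustrates the frame-to-phone-to-word pattern the abstract describes: self-attention restricted to unit-sized windows stands in for the unit encoders, and average pooling stands in for the merging blocks. This is a minimal illustration only; the module names, window sizes, and merge ratios are assumptions, and the actual SpeechFormer++ additionally derives unit spans from typical frame/phone/word durations and feeds word-grained features back into each unit encoder, which this sketch omits.

```python
import torch
import torch.nn as nn

class UnitEncoder(nn.Module):
    """Self-attention restricted to non-overlapping windows, standing in
    for SpeechFormer++'s intra-/inter-unit modeling (the window size is a
    placeholder for frame/phone/word unit spans)."""
    def __init__(self, dim: int, num_heads: int, window: int):
        super().__init__()
        self.window = window
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        pad = (-t) % self.window                  # pad time axis to a
        x = nn.functional.pad(x, (0, 0, 0, pad))  # multiple of the window
        x = x.view(-1, self.window, d)            # fold windows into batch
        x = self.layer(x)                         # attention within each unit
        return x.view(b, t + pad, d)[:, :t]

class MergeBlock(nn.Module):
    """Average-pool neighbouring tokens to reach a coarser granularity."""
    def __init__(self, scale: int):
        super().__init__()
        self.pool = nn.AvgPool1d(scale, stride=scale, ceil_mode=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(x.transpose(1, 2)).transpose(1, 2)

class HierarchicalSpeechClassifier(nn.Module):
    """Frame -> phone -> word stages with merging blocks in between."""
    def __init__(self, dim: int = 128, num_classes: int = 4):
        super().__init__()
        self.stages = nn.ModuleList(
            [UnitEncoder(dim, num_heads=4, window=8) for _ in range(3)])
        self.merges = nn.ModuleList([MergeBlock(4), MergeBlock(4)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) acoustic features
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i < len(self.merges):
                x = self.merges[i](x)    # frames -> phones -> words
        return self.head(x.mean(dim=1))  # utterance-level prediction

model = HierarchicalSpeechClassifier()
logits = model(torch.randn(2, 300, 128))  # e.g. 300 frames of 128-d features
print(logits.shape)                       # torch.Size([2, 4])
```

Restricting attention to short windows is also where the efficiency claim comes from: the attention cost grows with the window size rather than with the square of the full sequence length.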
Related papers
- dMel: Speech Tokenization made Simple [19.169460770473908]
We show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel).
Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework.
arXiv Detail & Related papers (2024-07-22T17:51:53Z)
- Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation [23.757896930482342]
This work explores the selection process through a study of downstream tasks.
Units that perform well in resynthesis performance do not necessarily correlate with those that enhance translation efficacy.
arXiv Detail & Related papers (2024-07-08T08:53:26Z)
- Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce the Generative Pre-trained Speech Transformer (GPST).
GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.