AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation
- URL: http://arxiv.org/abs/2503.14185v1
- Date: Tue, 18 Mar 2025 11:59:27 GMT
- Title: AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation
- Authors: Wuwei Huang, Dexin Wang, Deyi Xiong
- Abstract summary: We show the benefits of varying acoustic states according to decoder hidden states. We propose an adaptive speech-to-text translation model that is able to dynamically adapt acoustic states in the decoder. Experiment results on two widely-used datasets show that the proposed method significantly outperforms state-of-the-art neural speech translation models.
- Score: 44.76424642509807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In end-to-end speech translation, acoustic representations learned by the encoder are usually fixed and static, from the perspective of the decoder, which is not desirable for dealing with the cross-modal and cross-lingual challenge in speech translation. In this paper, we show the benefits of varying acoustic states according to decoder hidden states and propose an adaptive speech-to-text translation model that is able to dynamically adapt acoustic states in the decoder. We concatenate the acoustic state and target word embedding sequence and feed the concatenated sequence into subsequent blocks in the decoder. In order to model the deep interaction between acoustic states and target hidden states, a speech-text mixed attention sublayer is introduced to replace the conventional cross-attention network. Experiment results on two widely-used datasets show that the proposed method significantly outperforms state-of-the-art neural speech translation models.
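Taken at face value, the abstract describes a decoder block in which the acoustic states and the target-side states are concatenated along the sequence dimension, and a single speech-text mixed attention sublayer (replacing the usual cross-attention) models their interaction, so the acoustic states seen by deeper blocks are re-computed conditioned on the decoder states. A minimal PyTorch sketch of that idea follows; the class name, hyperparameters, and the omission of causal masking are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a "speech-text mixed attention" decoder block: acoustic
# states and target states are concatenated and one attention sublayer models
# their deep interaction, replacing the conventional cross-attention network.
# Names and sizes are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn


class MixedAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, acoustic: torch.Tensor, target: torch.Tensor):
        """acoustic: (B, S, d), target: (B, T, d) -> adapted (B, S, d), (B, T, d)."""
        mixed = torch.cat([acoustic, target], dim=1)          # (B, S + T, d)
        # Joint attention over the concatenation lets target positions re-weight
        # ("adapt") the acoustic states block by block. A causal mask over the
        # target positions would be needed for training; it is omitted here.
        out, _ = self.attn(mixed, mixed, mixed, need_weights=False)
        mixed = self.norm1(mixed + out)
        mixed = self.norm2(mixed + self.ffn(mixed))
        s = acoustic.size(1)
        return mixed[:, :s], mixed[:, s:]
```

Stacking several such blocks mirrors the "feed the concatenated sequence into subsequent blocks" behaviour described in the abstract.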
Related papers
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks [28.440232737011453]
We propose a solution that combines the Transducer with an Attention-based Encoder-Decoder (TAED) for speech-to-text tasks.
The new method leverages the AED's strength in non-monotonic sequence-to-sequence learning while retaining the Transducer's streaming property.
We evaluate the proposed approach on the MuST-C dataset, and the findings demonstrate that TAED performs significantly better than the Transducer for offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks.
arXiv Detail & Related papers (2023-05-04T18:34:50Z)
- Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition [29.1423215212174]
The recent emergence of the joint CTC-Attention model has brought significant improvements to automatic speech recognition (ASR).
We propose a linguistic-enhanced Transformer, which introduces refined CTC information to the decoder during the training process.
Experiments on the AISHELL-1 speech corpus show that the character error rate (CER) is reduced by up to 7% relative.
arXiv Detail & Related papers (2022-10-25T08:12:59Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
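The "pseudo language" can be pictured as a compact discrete transcript derived from the audio itself (for example, by clustering frame-level features and collapsing repeats), which then serves as the target of the pseudo speech-recognition pre-training task. The sketch below illustrates one such recipe under stated assumptions; the feature source, clustering setup, and scikit-learn usage are not claimed to match Wav2Seq's actual pipeline.

```python
# Illustrative recipe for deriving "pseudo language" targets from speech:
# cluster frame-level features into discrete units and collapse repeats,
# yielding a compact transcript for a pseudo speech-recognition task.
# Assumptions: corpus-level k-means and MFCC-like features (not Wav2Seq's exact setup).
import numpy as np
from sklearn.cluster import KMeans


def fit_units(corpus_features: np.ndarray, n_units: int = 100) -> KMeans:
    """corpus_features: (n_frames_total, feat_dim) features pooled over a corpus."""
    return KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(corpus_features)


def pseudo_transcript(utt_features: np.ndarray, km: KMeans) -> list[int]:
    """Map one utterance's frames to unit ids and run-length deduplicate them."""
    units = km.predict(utt_features)
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
```

An encoder-decoder model pre-trained to map audio to such pseudo transcripts can then be fine-tuned on real ASR or ST labels, either as its main pre-training or as the low-cost second stage mentioned above.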
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model [34.061441900912136]
We argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly.
We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers.
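One way to read "joint state posteriors" is that the network's output layer covers pairs of states (one per speaker) rather than a separate distribution per speaker, so uncertainty about which speaker a stretch of audio belongs to shows up as probability mass spread over pairs. The sketch below is a hypothetical illustration of such a head; the two-speaker restriction and the state count are assumptions, not the paper's configuration.

```python
# Hypothetical sketch of an acoustic-model head predicting joint state
# posteriors for two overlapping speakers: one distribution over state pairs
# per frame rather than two independent per-speaker distributions.
import torch
import torch.nn as nn


class JointPosteriorHead(nn.Module):
    def __init__(self, d_model: int = 512, n_states: int = 96):
        super().__init__()
        self.n_states = n_states
        self.proj = nn.Linear(d_model, n_states * n_states)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (B, T, d_model) -> (B, T, n_states, n_states) joint posteriors."""
        b, t, _ = frames.shape
        logits = self.proj(frames).view(b, t, self.n_states * self.n_states)
        # Normalising over the joint space lets the model spread probability mass
        # across speaker-state pairs when the attribution of speech is uncertain.
        return logits.softmax(dim=-1).view(b, t, self.n_states, self.n_states)
```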
arXiv Detail & Related papers (2021-10-31T09:28:04Z)
- SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing [77.4527868307914]
We propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets.
To align the textual and speech information into a unified semantic space, we propose a cross-modal vector quantization method with random mixing-up to bridge speech and text.
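The cross-modal vector quantization with random mixing-up can be pictured as snapping encoder states from either modality onto a shared codebook and randomly substituting a fraction of the continuous states with their code vectors, nudging speech and text toward the same discrete semantic space. The sketch below is an assumption-labelled illustration of that idea, not SpeechT5's actual implementation.

```python
# Illustrative sketch of cross-modal vector quantization with random mixing-up:
# encoder states (speech or text) are matched to a shared codebook and a random
# subset of positions is replaced by the selected code vectors.
import torch
import torch.nn as nn


class SharedCodebookMixup(nn.Module):
    def __init__(self, num_codes: int = 1024, d_model: int = 768, mix_prob: float = 0.15):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, d_model)
        self.mix_prob = mix_prob

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        """states: (B, T, d_model) from either the speech or the text branch."""
        # Nearest code (L2 distance) in the codebook shared by both modalities.
        codes = self.codebook.weight.expand(states.size(0), -1, -1)
        nearest = torch.cdist(states, codes).argmin(dim=-1)       # (B, T)
        quantized = self.codebook(nearest)                        # (B, T, d_model)
        # Randomly mix quantized codes into the continuous states ("mixing-up").
        mask = torch.rand(states.shape[:2], device=states.device) < self.mix_prob
        return torch.where(mask.unsqueeze(-1), quantized, states)
```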
arXiv Detail & Related papers (2021-10-14T07:59:27Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
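In relative positional encoding, the attention score between a query frame t and a key frame s is biased by a learned embedding of the offset t - s rather than by absolute positions, which is what lets the model cope with the highly variable lengths and alignments of speech. A Transformer-XL-style formulation is sketched below as an assumed illustration; it is not claimed to be the paper's exact variant.

```python
# Assumed Transformer-XL-style sketch of relative positional attention scores:
# the positional contribution depends only on the query-key offset (t - s).
import torch


def relative_attention_scores(q: torch.Tensor, k: torch.Tensor,
                              rel_emb: torch.Tensor) -> torch.Tensor:
    """q, k: (B, T, d); rel_emb: (2*T - 1, d) embeddings for offsets -(T-1)..(T-1)."""
    T, d = q.size(1), q.size(-1)
    content = torch.einsum("btd,bsd->bts", q, k)            # content-content term
    position = torch.einsum("btd,rd->btr", q, rel_emb)      # content-position term
    # Pick, for every (query t, key s), the score of offset t - s.
    rows = torch.arange(T, device=q.device).unsqueeze(1)
    offsets = rows - torch.arange(T, device=q.device).unsqueeze(0) + (T - 1)
    position = position[:, rows, offsets]                   # (B, T, T)
    return (content + position) / d ** 0.5
```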
arXiv Detail & Related papers (2020-05-20T09:53:06Z)