Related papers: Transformer-based language modeling and decoding for conversational speech recognition

Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters [7.865191493201841]
Control over speech characteristics, such as pitch, duration, and speech rate, remains a significant challenge in the field of voice conversion. We propose a convolutional neural network-based approach that aims to provide means for modifying fundamental frequency (F0), phoneme sequences, intensity, and speaker identity. The results suggest that the proposed method offers substantial flexibility, while maintaining high intelligibility and speaker similarity.
arXiv Detail & Related papers (2025-07-07T09:36:00Z)
Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion [16.19865417052239]
Discl-VC is a novel zero-shot voice conversion framework. It disentangles content and prosody information from self-supervised speech representations. It synthesizes the target speaker's voice through in-context learning.
arXiv Detail & Related papers (2025-05-30T07:04:23Z)
AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation [44.76424642509807]
We show the benefits of varying acoustic states according to decoder hidden states. We propose an adaptive speech-to-text translation model that is able to dynamically adapt acoustic states in the decoder. Experimental results on two widely used datasets show that the proposed method significantly outperforms state-of-the-art neural speech translation models.
arXiv Detail & Related papers (2025-03-18T11:59:27Z)
Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition [110.8431434620642]
We introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. We discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
arXiv Detail & Related papers (2024-09-15T16:32:49Z)
Stateful Memory-Augmented Transformers for Efficient Dialogue Modeling [69.31802246621963]
We propose a novel memory-augmented transformer that is compatible with existing pre-trained encoder-decoder models. By incorporating a separate memory module alongside the pre-trained transformer, the model can effectively interchange information between the memory states and the current input context.
arXiv Detail & Related papers (2022-09-15T22:37:22Z)
Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism [20.782319059183173]
We propose to explicitly model the inter-sentential information in a Transformer-based end-to-end architecture for conversational speech recognition. We show the effectiveness of our proposed method on several open-source dialogue corpora, where it consistently improves performance over utterance-level Transformer-based ASR models.
arXiv Detail & Related papers (2022-07-02T17:17:47Z)
Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task. This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [80.54244087314025]
We show that better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in a vision Transformer encoder network. Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
arXiv Detail & Related papers (2021-12-04T04:53:35Z)
Multi-View Self-Attention Based Transformer for Speaker Recognition [33.21173007319178]
The Transformer model is widely used for speech processing tasks such as speaker recognition. We propose a novel multi-view self-attention mechanism for the speaker Transformer. We show that the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
arXiv Detail & Related papers (2021-10-11T07:03:23Z)
Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification [66.62686601948455]
We exploit the scope of the transformer distillation method that is specifically designed for knowledge distillation from a transformer based language model to a transformer based speech model. We achieve an intent classification accuracy of 99.10% and 88.79% for Fluent speech corpus and ATIS database, respectively.
arXiv Detail & Related papers (2021-08-05T13:08:13Z)
Streaming Simultaneous Speech Translation with Augmented Memory Transformer [29.248366441276662]
Transformer-based models have achieved state-of-the-art performance on speech translation tasks. We propose an end-to-end transformer-based sequence-to-sequence model, equipped with an augmented memory transformer encoder.
arXiv Detail & Related papers (2020-10-30T18:28:42Z)
Investigation of Speaker-adaptation methods in Transformer based ASR [8.637110868126548]
This paper explores different ways of incorporating speaker information at the encoder input while training a transformer-based model to improve its speech recognition performance. We present speaker information in the form of speaker embeddings for each of the speakers. We obtain improvements in the word error rate over the baseline through our approach of integrating speaker embeddings into the model.
arXiv Detail & Related papers (2020-08-07T16:09:03Z)
Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer. As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.