Alignment-Free Training for Transducer-based Multi-Talker ASR
- URL: http://arxiv.org/abs/2409.20301v1
- Date: Mon, 30 Sep 2024 13:58:11 GMT
- Title: Alignment-Free Training for Transducer-based Multi-Talker ASR
- Authors: Takafumi Moriya, Shota Horiguchi, Marc Delcroix, Ryo Masumura, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Masato Mimura,
- Abstract summary: Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation.
We propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture.
- Score: 55.1234384771616
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers' transcriptions into a single output stream. The first approach is computationally expensive, particularly due to the need for multiple encoder processing. In contrast, the second approach involves a complex label generation process, requiring accurate timestamps of all words spoken by all speakers in the mixture, obtained from an external ASR system. In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. The target labels are created by appending a prompt token corresponding to each speaker at the beginning of the transcription, reflecting the order of each speaker's appearance in the mixtures. Thus, MT-RNNT-AFT can be trained without relying on accurate alignments, and it can recognize all speakers' speech with just one round of encoder processing. Experiments show that MT-RNNT-AFT achieves performance comparable to that of the state-of-the-art alternatives, while greatly simplifying the training process.
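The prompt-token labeling described in the abstract lends itself to a simple data-preparation step. Below is a minimal sketch, based only on the abstract, of how such alignment-free targets could be assembled: the prompt-token names (`<spk1>`, `<spk2>`, ...), the input structure, and the use of utterance onsets to determine each speaker's order of appearance are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical construction of alignment-free targets for MT-RNNT-AFT-style
# training: each speaker's transcription is prefixed with a prompt token that
# encodes the order in which that speaker first appears in the mixture.
# No word-level timestamps (and hence no external alignment) are required.

def build_aft_targets(utterances, tokenizer):
    """utterances: list of dicts with 'onset' (seconds) and 'text', one per
    speaker in the mixture. Returns one label sequence per speaker."""
    ordered = sorted(utterances, key=lambda u: u["onset"])  # appearance order
    targets = []
    for rank, utt in enumerate(ordered, start=1):
        prompt = f"<spk{rank}>"                 # assumed prompt-token naming
        targets.append([prompt] + tokenizer(utt["text"]))
    return targets

# Two overlapping speakers; only the relative onsets matter.
mix = [
    {"onset": 1.2, "text": "good morning everyone"},
    {"onset": 0.4, "text": "let's get started"},
]
print(build_aft_targets(mix, tokenizer=str.split))
# [['<spk1>', "let's", 'get', 'started'], ['<spk2>', 'good', 'morning', 'everyone']]
```

With targets in this form, the mixture needs only one round of encoder processing, and each prompted label sequence can presumably be trained with the standard RNNT loss.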
Related papers
- Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation [51.399695200838586]
We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on it, es, de -> en demonstrate the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
arXiv Detail & Related papers (2023-10-23T11:00:27Z)
- Locality-Aware Generalizable Implicit Neural Representation [54.93702310461174]
Generalizable implicit neural representation (INR) enables a single continuous function to represent multiple data instances.
We propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder.
Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks.
arXiv Detail & Related papers (2023-10-09T11:26:58Z)
- Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose an ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
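The cross-modal ST/MT entry above hinges on hard parameter sharing, i.e. both tasks updating the same backbone weights, with a modality-specific pre-processing step narrowing the speech-text gap. A rough sketch of that idea follows; the module sizes, the speech adapter design, and the omission of attention masks are illustrative assumptions, not the paper's configuration.

```python
# Hard parameter sharing for ST/MT: one shared Transformer backbone, with
# cheap modality-specific pre-processing in front of it. Both tasks below
# update exactly the same backbone parameters.
import torch
import torch.nn as nn

class SharedSTMTModel(nn.Module):
    def __init__(self, vocab_size=8000, d_model=256, n_heads=4, n_layers=6, feat_dim=80):
        super().__init__()
        # Modality-specific pre-processing (not shared).
        self.speech_adapter = nn.Sequential(          # speech frames -> d_model
            nn.Linear(feat_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)   # source text -> d_model
        # Hard-shared backbone used by both ST and MT.
        self.backbone = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt_tokens, modality):
        if modality == "speech":          # ST task: filterbank frames as input
            src_repr = self.speech_adapter(src)
        else:                             # MT task: source-language token ids
            src_repr = self.text_embed(src)
        # Causal/padding masks are omitted here for brevity.
        dec = self.backbone(src_repr, self.tgt_embed(tgt_tokens))
        return self.out_proj(dec)

model = SharedSTMTModel()
st_logits = model(torch.randn(2, 120, 80), torch.randint(0, 8000, (2, 10)), "speech")
mt_logits = model(torch.randint(0, 8000, (2, 30)), torch.randint(0, 8000, (2, 10)), "text")
```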
- Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks [28.440232737011453]
We propose a solution that combines the Transducer and the Attention-based Encoder-Decoder (TAED) for speech-to-text tasks.
The new method leverages the AED's strength in non-monotonic sequence-to-sequence learning while retaining the Transducer's streaming property.
We evaluate the proposed approach on the MuST-C dataset; the findings demonstrate that TAED performs significantly better than the Transducer on offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks.
arXiv Detail & Related papers (2023-05-04T18:34:50Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
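The Wav2Seq entry above induces a pseudo language from unlabeled speech. One common way to realize such a compact discrete representation, sketched here as an assumption rather than the paper's exact recipe, is to quantize frame-level features with k-means and collapse consecutive repeated units.

```python
# Hypothetical pseudo-language induction: quantize frame features into
# discrete units, then compress them into a compact pseudo transcript that an
# encoder-decoder model can be pre-trained to "recognize".
import numpy as np
from sklearn.cluster import KMeans

def induce_pseudo_language(feature_seqs, n_units=100, seed=0):
    """feature_seqs: list of (frames, dim) arrays from a self-supervised
    speech encoder. Returns one pseudo-token sequence per utterance."""
    all_frames = np.concatenate(feature_seqs, axis=0)
    km = KMeans(n_clusters=n_units, random_state=seed, n_init=10).fit(all_frames)
    pseudo_texts = []
    for feats in feature_seqs:
        units = km.predict(feats)
        # Collapse consecutive repeats so the pseudo transcript is compact.
        compact = [int(units[0])] + [int(u) for i, u in enumerate(units[1:], 1)
                                     if u != units[i - 1]]
        pseudo_texts.append(compact)
    return pseudo_texts

rng = np.random.default_rng(0)
feats = [rng.normal(size=(200, 64)), rng.normal(size=(150, 64))]
print(induce_pseudo_language(feats, n_units=8)[0][:10])
```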
- End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning [28.516240952627076]
We propose a semi-supervised learning method for building end-to-end rich transcription-style automatic speech recognition (RT-ASR) systems.
The key process in our method is to convert a common transcription-style dataset into a pseudo-rich transcription-style dataset.
Our experiments on spontaneous ASR tasks showed the effectiveness of the proposed method.
arXiv Detail & Related papers (2021-07-07T12:52:49Z)
- Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders [30.160261563657947]
Speech-to-translation data is scarce, so pre-training is promising for end-to-end Speech Translation.
We propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation.
Our encoder starts by processing the acoustic sequence as usual, but later behaves more like an MT encoder, producing a global representation of the input sequence.
arXiv Detail & Related papers (2021-05-12T16:09:53Z)
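The SATE entry above stacks an MT-style encoder on top of an acoustic encoder. A very rough sketch of that stacking follows; the layer counts, dimensions, and plain Transformer blocks are illustrative assumptions rather than the paper's exact architecture.

```python
# Hypothetical stacked acoustic-and-textual encoder: a lower stack encodes the
# speech frames like an ASR acoustic encoder, and an upper, MT-like stack
# (which could be initialized from a pre-trained MT encoder) builds a more
# global representation on top of it.
import torch
import torch.nn as nn

class StackedSTEncoder(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_heads=4,
                 acoustic_layers=6, textual_layers=6):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.acoustic_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            acoustic_layers,
        )
        self.textual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            textual_layers,
        )

    def forward(self, speech_feats):              # (batch, frames, feat_dim)
        h = self.acoustic_encoder(self.proj(speech_feats))
        return self.textual_encoder(h)            # global, MT-like representation

enc = StackedSTEncoder()
out = enc(torch.randn(2, 120, 80))                # -> (2, 120, 256)
```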
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
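The streaming transformer entry above relies on time-restricted self-attention, where each frame attends only to a bounded window of past and future frames, which limits the look-ahead needed for streaming. Below is a minimal sketch of such a mask; the window sizes are illustrative assumptions, not the paper's settings.

```python
# Time-restricted self-attention mask: frame t may attend only to frames in
# [t - left_context, t + right_context].
import torch

def time_restricted_mask(n_frames, left_context=40, right_context=2):
    """Boolean mask of shape (n_frames, n_frames); True marks positions that
    must NOT be attended to (the convention used by PyTorch attention)."""
    idx = torch.arange(n_frames)
    rel = idx[None, :] - idx[:, None]             # key index minus query index
    allowed = (rel >= -left_context) & (rel <= right_context)
    return ~allowed

# The mask can be passed to nn.MultiheadAttention / TransformerEncoder via
# `attn_mask`, so the encoder only sees a short future window per frame.
mask = time_restricted_mask(6, left_context=2, right_context=1)
attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(1, 6, 16)
out, _ = attn(x, x, x, attn_mask=mask)
```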