Leveraging Timestamp Information for Serialized Joint Streaming
Recognition and Translation
- URL: http://arxiv.org/abs/2310.14806v1
- Date: Mon, 23 Oct 2023 11:00:27 GMT
- Title: Leveraging Timestamp Information for Serialized Joint Streaming
Recognition and Translation
- Authors: Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Naoyuki Kanda, Jinyu
Li, Yashesh Gaur
- Abstract summary: We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on {it,es,de}->en prove the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
- Score: 51.399695200838586
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The growing need for instant spoken language transcription and translation is
driven by increased global communication and cross-lingual interactions. This
has made offering translations in multiple languages essential for user
applications. Traditional approaches to automatic speech recognition (ASR) and
speech translation (ST) have often relied on separate systems, leading to
inefficient use of computational resources and increased synchronization
complexity in real time. In this paper, we propose a streaming
Transformer-Transducer (T-T) model able to jointly produce many-to-one and
one-to-many transcription and translation using a single decoder. We introduce
a novel method for joint token-level serialized output training based on
timestamp information to effectively produce ASR and ST outputs in the
streaming setting. Experiments on {it,es,de}->en prove the effectiveness of our
approach, enabling the generation of one-to-many joint outputs with a single
decoder for the first time.
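As an illustration of the core idea, here is a minimal sketch of timestamp-based token-level serialization: subword tokens from the transcription and translation streams are merged in timestamp order, and a tag is emitted at every stream switch so that one decoder can generate all outputs. The Token class, tag format, and helper names are illustrative assumptions, not the paper's actual implementation.

from dataclasses import dataclass

@dataclass
class Token:
    text: str     # subword unit
    time: float   # aligned emission timestamp in seconds
    stream: str   # output stream, e.g. "asr" or "st:en"

def serialize_by_timestamp(asr_tokens, st_tokens):
    """Interleave ASR and ST tokens into one training target by timestamp.
    Tokens from all streams are merged in time order, and a tag is inserted
    whenever the stream changes, so a single decoder can emit both the
    transcription and the translation as one serialized sequence."""
    merged = sorted(asr_tokens + st_tokens, key=lambda t: t.time)
    target, prev = [], None
    for tok in merged:
        if tok.stream != prev:                  # stream switch -> emit a tag
            target.append(f"<{tok.stream}>")
            prev = tok.stream
        target.append(tok.text)
    return target

# Toy usage: Italian ASR tokens and English ST tokens with aligned times.
asr = [Token("ciao", 0.3, "asr"), Token("mondo", 0.8, "asr")]
st = [Token("hello", 0.5, "st:en"), Token("world", 1.0, "st:en")]
print(serialize_by_timestamp(asr, st))
# ['<asr>', 'ciao', '<st:en>', 'hello', '<asr>', 'mondo', '<st:en>', 'world']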
Related papers
- Alignment-Free Training for Transducer-based Multi-Talker ASR [55.1234384771616]
Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation.
We propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture.
arXiv Detail & Related papers (2024-09-30T13:58:11Z)
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
- Back Translation for Speech-to-text Translation Without Transcripts [11.13240570688547]
We develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data.
To ease the challenges posed by short-to-long generation and one-to-many mapping, we introduce self-supervised discrete units.
With our synthetic ST data, we achieve an average boost of 2.3 BLEU on MuST-C En-De, En-Fr, and En-Es datasets.
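A hedged sketch of the data flow this describes: monolingual target text is mapped to source-side self-supervised discrete units, which are then vocoded into speech, yielding pseudo (speech, target text) training pairs. The function names and stand-in models below are hypothetical placeholders, not BT4ST's actual components.

def synthesize_pseudo_st(target_sentences, text_to_units, units_to_speech):
    """Yield pseudo ST pairs (source speech, target text) synthesized from
    target-language monolingual text. Predicting discrete units first eases
    the short-to-long, one-to-many mapping from text to raw speech."""
    for text in target_sentences:
        units = text_to_units(text)        # target text -> discrete speech units
        speech = units_to_speech(units)    # discrete units -> waveform samples
        yield speech, text

# Toy usage with stub models (real systems would be trained networks).
pairs = list(synthesize_pseudo_st(
    ["hello world"],
    text_to_units=lambda s: [7, 3, 9],          # stub unit predictor
    units_to_speech=lambda u: [0.0] * len(u)))  # stub vocoder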
arXiv Detail & Related papers (2023-05-15T15:12:40Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR [21.622039537743607]
Simultaneous speech-to-text translation is useful in many scenarios.
Recent efforts attempt to translate the source speech directly into target text simultaneously, but this is much harder because it combines two separate tasks in a single model.
We propose a new paradigm with the advantages of both cascaded and end-to-end approaches.
arXiv Detail & Related papers (2021-06-11T23:22:37Z)
- Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation [71.54816893482457]
We introduce the dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST).
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST).
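A minimal PyTorch sketch of that layout, assuming a shared encoder and two task-specific decoders; dimensions, layer counts, and the omission of the paper's cross-decoder interaction are simplifications for illustration.

import torch
import torch.nn as nn

class DualDecoderTransformer(nn.Module):
    """Sketch of a shared speech encoder with one decoder per task.
    Both decoders attend to the same encoder memory; the cross-decoder
    interaction proposed in the paper is omitted for brevity."""

    def __init__(self, d_model=256, nhead=4, vocab_asr=1000, vocab_st=1000):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=2)
        self.asr_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers=2)
        self.st_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers=2)
        self.asr_out = nn.Linear(d_model, vocab_asr)
        self.st_out = nn.Linear(d_model, vocab_st)

    def forward(self, speech_feats, asr_prev, st_prev):
        # speech_feats / asr_prev / st_prev: already-embedded tensors of
        # shape (batch, length, d_model)
        memory = self.encoder(speech_feats)        # shared acoustic encoding
        asr_h = self.asr_decoder(asr_prev, memory) # ASR branch
        st_h = self.st_decoder(st_prev, memory)    # ST branch
        return self.asr_out(asr_h), self.st_out(st_h)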
arXiv Detail & Related papers (2020-11-02T04:59:50Z)
- Streaming Simultaneous Speech Translation with Augmented Memory Transformer [29.248366441276662]
Transformer-based models have achieved state-of-the-art performance on speech translation tasks.
We propose an end-to-end transformer-based sequence-to-sequence model, equipped with an augmented memory transformer encoder.
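A simplified sketch of the augmented-memory idea: the encoder processes audio in fixed-size segments, and each segment attends over a bank of memory vectors summarizing earlier segments, so a streaming encoder keeps long-range context at bounded per-step cost. The mean-pooled memory slots and all dimensions are illustrative simplifications, not the paper's exact layer.

import torch
import torch.nn as nn

class AugmentedMemorySketch(nn.Module):
    """Toy augmented-memory encoder layer: segment-wise self-attention
    over [memory bank, current segment], with each processed segment
    compressed into one memory slot for future segments."""

    def __init__(self, d_model=256, nhead=4, segment_size=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.segment_size = segment_size

    def forward(self, x):                      # x: (batch, time, d_model)
        memory, outputs = [], []               # memory: summaries of past segments
        for start in range(0, x.size(1), self.segment_size):
            seg = x[:, start:start + self.segment_size]
            # keys/values: memory bank followed by the current segment
            kv = torch.cat(memory + [seg], dim=1) if memory else seg
            out, _ = self.attn(seg, kv, kv)
            outputs.append(out)
            # summarize this segment (mean-pooled) into one memory slot
            memory.append(seg.mean(dim=1, keepdim=True))
        return torch.cat(outputs, dim=1)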
arXiv Detail & Related papers (2020-10-30T18:28:42Z)
- SimulEval: An Evaluation Toolkit for Simultaneous Translation [59.02724214432792]
Simultaneous translation, for both text and speech, targets real-time, low-latency scenarios.
SimulEval is an easy-to-use and general evaluation toolkit for both simultaneous text and speech translation.
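SimulEval scores a system on both translation quality and latency; Average Lagging (Ma et al., 2019) is one of the latency metrics it reports. Below is a minimal re-implementation of AL for intuition only; it is not SimulEval's code, and delays are counted in source tokens.

def average_lagging(delays, src_len, tgt_len):
    """Average Lagging: delays[t-1] is the number of source tokens read
    before emitting target token t. Averages how far emission lags behind
    an ideal policy that reads src_len/tgt_len source tokens per output."""
    gamma = tgt_len / src_len  # target tokens produced per source token
    # tau: first target index whose emission saw the full source
    tau = next((t for t, g in enumerate(delays, 1) if g >= src_len), len(delays))
    return sum(g - (t - 1) / gamma for t, g in enumerate(delays[:tau], 1)) / tau

# Toy usage: a wait-2 policy on a 5-token source and 5-token target.
print(average_lagging([2, 3, 4, 5, 5], src_len=5, tgt_len=5))  # 2.0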
arXiv Detail & Related papers (2020-07-31T17:44:41Z)