ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit
- URL: http://arxiv.org/abs/2304.04596v3
- Date: Thu, 6 Jul 2023 20:07:49 GMT
- Title: ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit
- Authors: Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng,
Siddharth Dalmia, Peter Pol\'ak, Patrick Fernandes, Dan Berrebbi, Tomoki
Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino,
Shinji Watanabe
- Abstract summary: ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit.
This paper describes the overall design, example models for each task, and performance benchmarking behind ESPnet-ST-v2.
- Score: 61.52122386938913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by
the broadening interests of the spoken language translation community.
ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2)
simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech
translation (S2ST) -- each task is supported with a wide variety of approaches,
differentiating ESPnet-ST-v2 from other open source spoken language translation
toolkits. This toolkit offers state-of-the-art architectures such as
transducers, hybrid CTC/attention, multi-decoders with searchable
intermediates, time-synchronous blockwise CTC/attention, Translatotron models,
and direct discrete unit models. In this paper, we describe the overall design,
example models for each task, and performance benchmarking behind ESPnet-ST-v2,
which is publicly available at https://github.com/espnet/espnet.
Related papers
- Towards Real-World Streaming Speech Translation for Code-Switched Speech [7.81154319203032]
Code-switching (CS) is a common phenomenon in communication and can be challenging in many Natural Language Processing (NLP) settings.
We focus on two essential yet unexplored areas for real-world CS speech translation: streaming settings and translation to a third language.
arXiv Detail & Related papers (2023-10-19T11:15:02Z) - Enhancing Speech-to-Speech Translation with Multiple TTS Targets [62.18395387305803]
We analyze the effect of changing synthesized target speech for direct S2ST models.
We propose a multi-task framework that jointly optimized the S2ST system with multiple targets from different TTS systems.
arXiv Detail & Related papers (2023-04-10T14:33:33Z) - End-to-End Speech Translation for Code Switched Speech [13.97982457879585]
Code switching (CS) refers to the phenomenon of interchangeably using words and phrases from different languages.
We focus on CS in the context of English/Spanish conversations for the task of speech translation (ST), generating and evaluating both transcript and translation.
We show that our ST architectures, and especially our bidirectional end-to-end architecture, perform well on CS speech, even when no CS training data is used.
arXiv Detail & Related papers (2022-04-11T13:25:30Z) - ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet [95.39817519115394]
ESPnet-SLU is a project inside end-to-end speech processing toolkit, ESPnet.
It is designed for quick development of spoken language understanding in a single framework.
arXiv Detail & Related papers (2021-11-29T17:05:49Z) - ESPnet2-TTS: Extending the Edge of TTS Research [62.92178873052468]
ESPnet2-TTS is an end-to-end text-to-speech (E2E-TTS) toolkit.
New features include: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling.
arXiv Detail & Related papers (2021-10-15T03:27:45Z) - Dual-decoder Transformer for Joint Automatic Speech Recognition and
Multilingual Speech Translation [71.54816893482457]
We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST)
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST)
arXiv Detail & Related papers (2020-11-02T04:59:50Z) - ESPnet-ST: All-in-One Speech Translation Toolkit [57.76342114226599]
ESPnet-ST is a new project inside end-to-end speech processing toolkit, ESPnet.
It implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.
We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines.
arXiv Detail & Related papers (2020-04-21T18:38:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.