Align, Write, Re-order: Explainable End-to-End Speech Translation via
Operation Sequence Generation
- URL: http://arxiv.org/abs/2211.05967v1
- Date: Fri, 11 Nov 2022 02:29:28 GMT
- Title: Align, Write, Re-order: Explainable End-to-End Speech Translation via
Operation Sequence Generation
- Authors: Motoi Omachi, Brian Yan, Siddharth Dalmia, Yuya Fujita, Shinji
Watanabe
- Abstract summary: We propose to generate ST tokens out-of-order while remembering how to re-order them later.
We examine two variants of such operation sequences which enable generation of monotonic transcriptions and non-monotonic translations.
- Score: 37.48971774827332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The black-box nature of end-to-end speech translation (E2E ST) systems makes
it difficult to understand how source language inputs are being mapped to the
target language. To solve this problem, we would like to simultaneously
generate automatic speech recognition (ASR) and ST predictions such that each
source language word is explicitly mapped to a target language word. A major
challenge arises from the fact that translation is a non-monotonic sequence
transduction task due to word ordering differences between languages -- this
clashes with the monotonic nature of ASR. Therefore, we propose to generate ST
tokens out-of-order while remembering how to re-order them later. We achieve
this by predicting a sequence of tuples consisting of a source word, the
corresponding target words, and post-editing operations dictating the correct
insertion points for the target words. We examine two variants of such operation
sequences which enable generation of monotonic transcriptions and non-monotonic
translations from the same speech input simultaneously. We apply our approach
to offline and real-time streaming models, demonstrating that we can provide
explainable translations without sacrificing quality or latency. In fact, the
delayed re-ordering ability of our approach improves performance during
streaming. As an added benefit, our method performs ASR and ST simultaneously,
making it faster than using two separate systems to perform these tasks.
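To make the operation-sequence idea concrete, the sketch below decodes a hypothetical sequence of (source word, target words, insertion point) tuples into a monotonic ASR transcript and a re-ordered translation. The tuple layout and the `decode_operation_sequence` helper are illustrative assumptions based on the abstract, not the paper's exact formulation.

```python
# Hypothetical decoder for an operation sequence: each tuple pairs a
# source word with its target words and an index saying where the target
# words should be spliced into the partial translation. A sketch of the
# idea, not the authors' implementation.

def decode_operation_sequence(ops):
    transcript = []   # monotonic: source words are always appended
    translation = []  # non-monotonic: target words are inserted by index
    for src_word, tgt_words, insert_at in ops:
        transcript.append(src_word)
        translation[insert_at:insert_at] = tgt_words
    return " ".join(transcript), " ".join(translation)

# Toy English -> German example where the object moves before the verb:
ops = [
    ("I",    ["ich"],     0),
    ("have", ["habe"],    1),
    ("seen", ["gesehen"], 2),
    ("him",  ["ihn"],     2),  # inserted before "gesehen", re-ordering the ST output
]
asr, st = decode_operation_sequence(ops)
print(asr)  # I have seen him
print(st)   # ich habe ihn gesehen
```

Appending on the ASR side while splicing on the ST side is what lets a single left-to-right decoder stay monotonic for transcription yet emit re-ordered translations.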
Related papers
- A Data-Driven Representation for Sign Language Production [26.520016084139964]
Sign Language Production aims to automatically translate spoken language sentences into continuous sequences of sign language.
Current state-of-the-art approaches rely on scarce linguistic resources.
This paper introduces an innovative solution by transforming the continuous pose generation problem into a discrete sequence generation problem.
arXiv Detail & Related papers (2024-04-17T15:52:38Z) - Gujarati-English Code-Switching Speech Recognition using ensemble
prediction of spoken language [29.058108207186816]
We propose two methods of introducing language-specific parameters and explainability in the multi-head attention mechanism.
Despite being unable to reduce WER significantly, our method shows promise in predicting the correct language from just spoken data.
arXiv Detail & Related papers (2024-03-12T18:21:20Z) - Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims to translate source language speech into target language text without generating intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
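As a rough illustration of the shared discrete vocabulary idea in this entry, the sketch below snaps continuous speech and text encoder states to their nearest entry in one shared codebook, so both modalities land in the same discrete space. The codebook size, dimensionality, and nearest-neighbor quantizer are assumptions for illustration; DCMA's actual model and training objective are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 256))  # one shared codebook: 512 codes, 256-dim

def quantize(states: np.ndarray) -> np.ndarray:
    """Map continuous encoder states (T x 256) to nearest shared-code indices."""
    dists = ((states[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

speech_states = rng.normal(size=(120, 256))  # stand-in for speech encoder output
text_states = rng.normal(size=(15, 256))     # stand-in for text encoder output
print(quantize(speech_states)[:8])  # both modalities index the same vocabulary
print(quantize(text_states)[:8])
```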
arXiv Detail & Related papers (2022-10-18T03:06:47Z) - Code-Switching without Switching: Language Agnostic End-to-End Speech
Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
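For intuition about the pseudo-language step in this entry, here is a minimal sketch that clusters frame-level speech features with k-means and collapses consecutive repeats into a compact discrete token sequence. The feature source, number of units, and deduplication step are assumptions; Wav2Seq's actual pipeline may differ in its feature extractor and subword handling.

```python
import numpy as np
from sklearn.cluster import KMeans

def induce_pseudo_tokens(frame_features: np.ndarray, n_units: int = 25) -> list:
    """Turn frame-level speech features into a compact pseudo-token sequence."""
    km = KMeans(n_clusters=n_units, n_init=10, random_state=0)
    units = km.fit_predict(frame_features)  # one discrete unit per frame
    # Collapse consecutive repeats, e.g. [3, 3, 7, 7, 1] -> ["u3", "u7", "u1"]
    return [f"u{u}" for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

feats = np.random.randn(200, 39)  # random stand-in for real acoustic features
print(induce_pseudo_tokens(feats)[:10])
```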
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z) - AlloST: Low-resource Speech Translation without Source Transcription [17.53382405899421]
We propose a learning framework that utilizes a language-independent universal phone recognizer.
The framework is based on an attention-based sequence-to-sequence model.
Experiments conducted on the Fisher Spanish-English and Taigi-Mandarin drama corpora show that our method outperforms the Conformer-based baseline.
arXiv Detail & Related papers (2021-05-01T05:30:18Z) - Learning to Count Words in Fluent Speech enables Online Speech
Recognition [10.74796391075403]
We introduce Taris, a Transformer-based online speech recognition system aided by an auxiliary task of incremental word counting.
Experiments performed on the LRS2, LibriSpeech, and Aishell-1 datasets show that the online system performs comparably to the offline one with a dynamic algorithmic delay of 5 segments.
arXiv Detail & Related papers (2020-06-08T20:49:39Z) - Neural Syntactic Preordering for Controlled Paraphrase Generation [57.5316011554622]
Our work uses syntactic transformations to softly "reorder" the source sentence and guide our neural paraphrasing model.
First, given an input sentence, we derive a set of feasible syntactic rearrangements using an encoder-decoder model.
Next, we use each proposed rearrangement to produce a sequence of position embeddings, which encourages our final encoder-decoder paraphrase model to attend to the source words in a particular order.
arXiv Detail & Related papers (2020-05-05T09:02:25Z)