Consistent Transcription and Translation of Speech
- URL: http://arxiv.org/abs/2007.12741v2
- Date: Fri, 28 Aug 2020 07:57:57 GMT
- Title: Consistent Transcription and Translation of Speech
- Authors: Matthias Sperber, Hendra Setiawan, Christian Gollan, Udhyakumar
Nallasamy, Matthias Paulik
- Abstract summary: We explore the task of jointly transcribing and translating speech.
While high accuracy of both transcript and translation is crucial, even highly accurate systems can suffer from inconsistencies between the two outputs.
We find that direct models are poorly suited to the joint transcription/translation task, but that end-to-end models that feature a coupled inference procedure are able to achieve strong consistency.
- Score: 13.652411093089947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The conventional paradigm in speech translation starts with a speech
recognition step to generate transcripts, followed by a translation step with
the automatic transcripts as input. To address various shortcomings of this
paradigm, recent work explores end-to-end trainable direct models that
translate without transcribing. However, transcripts remain an indispensable
output in many practical applications, which often display them to users
alongside the translations.
We make this common requirement explicit and explore the task of jointly
transcribing and translating speech. While high accuracy of both transcript
and translation is crucial, even highly accurate systems can suffer from
inconsistencies between the two outputs that degrade the user experience. We
introduce a methodology to evaluate consistency and compare several modeling
approaches, including the traditional cascaded approach and end-to-end models.
We find that direct models are poorly suited to the joint
transcription/translation task, but that end-to-end models that feature a
coupled inference procedure are able to achieve strong consistency. We further
introduce simple techniques for directly optimizing for consistency, and
analyze the resulting trade-offs between consistency, transcription accuracy,
and translation accuracy.
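The "coupled inference procedure" credited with strong consistency can be pictured as a single beam search over joint (transcript, translation) hypotheses, so the two outputs are scored and pruned together. The sketch below is a minimal illustration under that assumption, not the authors' implementation; `JointHyp`, `step_scores`, and the joint-extension interface are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field

# Minimal sketch of coupled inference: one beam search whose hypotheses
# extend transcript and translation together, so every pruning decision
# sees both outputs at once. `step_scores` stands in for a real joint
# model's next-token distribution (an assumption, not the paper's model).

@dataclass(order=True)
class JointHyp:
    score: float
    transcript: tuple = field(compare=False, default=())
    translation: tuple = field(compare=False, default=())

def joint_beam_search(step_scores, max_len: int, beam_size: int = 4) -> JointHyp:
    """step_scores(transcript, translation) -> iterable of
    (src_token, tgt_token, logprob) joint extensions."""
    beam = [JointHyp(0.0)]
    for _ in range(max_len):
        candidates = []
        for hyp in beam:
            for src_tok, tgt_tok, lp in step_scores(hyp.transcript, hyp.translation):
                candidates.append(JointHyp(
                    hyp.score + lp,
                    hyp.transcript + (src_tok,),
                    hyp.translation + (tgt_tok,),
                ))
        # Keep only the best joint (transcript, translation) pairs.
        beam = heapq.nlargest(beam_size, candidates)
    return max(beam)
```

A cascade, by contrast, commits to a single transcript before the translation model ever runs, which is one plausible source of the inconsistencies the paper measures.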
Related papers
- Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model [0.0]
OpenAI's Whisper Automated Speech Recognition model excels in generalizing across diverse datasets and domains.
We propose a method to enhance transcription accuracy without explicit fine-tuning or altering model parameters; a hypothetical decoding-time sketch follows below.
arXiv Detail & Related papers (2024-10-24T01:58:11Z)
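One way to picture biasing without touching model parameters is decoding-time logit boosting toward a custom vocabulary. The sketch below is a generic, hypothetical illustration; it does not use Whisper's actual API and is not necessarily the paper's method.

```python
# Hypothetical sketch of shallow contextual biasing: boost the
# log-probability of any token that starts or continues a phrase from a
# domain vocabulary, leaving the model itself unchanged. `bias_weight`
# and word-level tokens are illustrative simplifications.

def bias_logits(logits: dict, prefix: list, vocab_phrases: list,
                bias_weight: float = 2.0) -> dict:
    """logits: next-token -> logprob; prefix: tokens decoded so far."""
    boosted = dict(logits)
    for phrase in vocab_phrases:
        tokens = phrase.split()
        for k in range(len(tokens)):
            # Does the tail of the decoded prefix match the first k phrase tokens?
            if k <= len(prefix) and (k == 0 or prefix[-k:] == tokens[:k]):
                if tokens[k] in boosted:
                    boosted[tokens[k]] += bias_weight
    return boosted

# Example: with an empty prefix, the first token of each phrase is boosted.
print(bias_logits({"aspirin": -5.0, "the": -1.0}, [], ["aspirin dosage"]))
```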
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascaded fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel few-shot prompting approach that decomposes the translation process into a sequence of word-chunk translations.
We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ points across the examined languages; a sketch of the decomposition follows below.
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
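DecoMT's two-stage idea, as the summary describes it, can be sketched as: translate fixed-size word chunks independently with few-shot prompts, then let the model smooth the stitched draft. `llm` below stands for any prompt-to-completion callable, and the Hindi-Marathi few-shot pair is an illustrative assumption, not taken from the paper.

```python
# Rough sketch of decomposed prompting for MT between related languages.

FEWSHOT = "Hindi: मेरा नाम जॉन है ||| Marathi: माझे नाव जॉन आहे\n"

def chunks(words, size):
    return [words[i:i + size] for i in range(0, len(words), size)]

def decomposed_translate(llm, source: str, chunk_size: int = 3) -> str:
    # Stage 1: translate each word chunk independently with a few-shot prompt.
    pieces = [llm(FEWSHOT + f"Hindi: {' '.join(ch)} ||| Marathi:").strip()
              for ch in chunks(source.split(), chunk_size)]
    # Stage 2: ask the model to turn the concatenated chunks into fluent output.
    draft = " ".join(pieces)
    return llm(f"Source (Hindi): {source}\n"
               f"Draft (Marathi): {draft}\n"
               f"Fluent Marathi translation:").strip()
```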
- Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features [13.44542301438426]
We propose a direct speech-to-speech translation model which can be trained without any textual annotation or content information.
Experiments on Mandarin-Cantonese speech translation demonstrate the feasibility of the proposed approach.
arXiv Detail & Related papers (2022-12-12T10:03:10Z)
- Principled Paraphrase Generation with Parallel Corpora [52.78059089341062]
We formalize the implicit similarity function induced by round-trip Machine Translation.
We show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation.
We design an alternative similarity metric that mitigates this issue; a formalization of the round-trip similarity follows below.
arXiv Detail & Related papers (2022-05-24T17:22:42Z)
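The implicit round-trip similarity can plausibly be written as marginalizing over pivot translations; the notation below is illustrative, not necessarily the paper's.

```latex
% Round-trip MT similarity between source-language sentences x and y,
% marginalizing over pivot translations z (illustrative notation):
\[
  \mathrm{sim}(x, y) \;=\; \sum_{z} P(z \mid x)\, P(y \mid z)
\]
% Failure mode the summary points to: non-paraphrases x, y that share a
% single ambiguous translation z with P(z|x) \approx P(y|z) \approx 1
% still receive sim(x, y) \approx 1.
```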
- STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation [37.51435498386953]
We propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate the discrepancy between speech and text representations.
Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy; a sketch of the mixup follows below.
arXiv Detail & Related papers (2022-03-20T01:49:53Z)
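Speech-text manifold mixup can be pictured as swapping a random fraction of speech-embedding positions for aligned text embeddings before a shared encoder. The position-by-position alignment and the swap ratio below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

# Illustrative sketch: given frame-level speech embeddings already aligned
# to token-level text embeddings (the alignment itself is assumed), replace
# a random fraction of positions with text embeddings so the shared encoder
# sees mixed-modality sequences during training.

def manifold_mixup(speech_emb: np.ndarray, text_emb: np.ndarray,
                   mix_ratio: float = 0.3, rng=None) -> np.ndarray:
    """speech_emb, text_emb: (seq_len, dim) arrays aligned position-wise."""
    assert speech_emb.shape == text_emb.shape
    rng = rng or np.random.default_rng()
    mask = rng.random(speech_emb.shape[0]) < mix_ratio  # positions to swap
    mixed = speech_emb.copy()
    mixed[mask] = text_emb[mask]
    return mixed
```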
- Streaming Models for Joint Speech Recognition and Translation [11.657994715914748]
We develop an end-to-end streaming ST model based on a re-translation approach and compare it against standard cascading approaches.
We also introduce a novel inference method for the joint case, interleaving transcript and translation during generation and removing the need for separate decoders; a sketch of the interleaving follows below.
arXiv Detail & Related papers (2021-01-22T15:16:54Z)
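Interleaved generation can be pictured as one decoder emitting a single tagged token stream in which transcript and translation segments alternate. The `<asr>`/`<st>` tag scheme below is a hypothetical illustration of the idea, not the paper's exact format.

```python
# Split one interleaved decoder output back into its two surfaces.

def split_interleaved(tokens):
    """tokens like ['<asr>', 'hello', '<st>', 'hallo', '<asr>', 'world', ...]"""
    transcript, translation, target = [], [], None
    for tok in tokens:
        if tok == "<asr>":
            target = transcript
        elif tok == "<st>":
            target = translation
        elif target is not None:
            target.append(tok)
    return " ".join(transcript), " ".join(translation)

print(split_interleaved(
    ["<asr>", "hello", "<st>", "hallo", "<asr>", "world", "<st>", "welt"]))
# -> ('hello world', 'hallo welt')
```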
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Textual Supervision for Visually Grounded Spoken Language Understanding [51.93744335044475]
Visually-grounded models of spoken language understanding extract semantic information directly from speech.
This is useful for low-resource languages, where transcriptions can be expensive or impossible to obtain.
Recent work showed that these models can be improved if transcriptions are available at training time.
arXiv Detail & Related papers (2020-10-06T15:16:23Z)
- A Probabilistic Formulation of Unsupervised Text Style Transfer [128.80213211598752]
We present a deep generative model for unsupervised text style transfer that unifies previously proposed non-generative techniques.
By hypothesizing a parallel latent sequence that generates each observed sequence, our model learns to transform sequences from one domain to another in a completely unsupervised fashion; a sketch of the variational objective follows below.
arXiv Detail & Related papers (2020-02-10T16:20:49Z)
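The latent-parallel-sequence view admits a standard variational treatment. The objective below is a hedged sketch in illustrative notation, assuming each observed sequence x in one domain is generated from a latent sequence y in the other domain; it is not copied from the paper.

```latex
% p(x | y): transduction (style-transfer) model; p(y): language-model prior
% over the other domain; q(y | x): inference network. Illustrative notation.
\[
  \log p(x) \;\ge\;
  \mathbb{E}_{q(y \mid x)}\bigl[\log p(x \mid y)\bigr]
  \;-\; \mathrm{KL}\bigl(q(y \mid x)\,\|\,p(y)\bigr)
\]
% Maximizing this ELBO in both directions yields back-translation-style
% training as approximate inference, with no parallel data required.
```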
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.