Enhancing End-to-End Conversational Speech Translation Through Target
Language Context Utilization
- URL: http://arxiv.org/abs/2309.15686v1
- Date: Wed, 27 Sep 2023 14:32:30 GMT
- Title: Enhancing End-to-End Conversational Speech Translation Through Target
Language Context Utilization
- Authors: Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe,
Sanjeev Khudanpur
- Abstract summary: We introduce target language context in E2E-ST, enhancing coherence and overcoming memory constraints of extended audio segments.
Our proposed contextual E2E-ST outperforms the isolated utterance-based E2E-ST approach.
- Score: 73.85027121522295
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Incorporating longer context has been shown to benefit machine translation,
but the inclusion of context in end-to-end speech translation (E2E-ST) remains
under-studied. To bridge this gap, we introduce target language context in
E2E-ST, enhancing coherence and overcoming memory constraints of extended audio
segments. Additionally, we propose context dropout to ensure robustness to the
absence of context, and further improve performance by adding speaker
information. Our proposed contextual E2E-ST outperforms the isolated
utterance-based E2E-ST approach. Lastly, we demonstrate that in conversational
speech, contextual information primarily contributes to capturing context
style, as well as resolving anaphora and named entities.
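As a rough illustration of the abstract above, the following minimal sketch shows one way the target-language context, context dropout, and speaker information could be assembled into a decoder prefix for a contextual E2E-ST model. All token names, the context size, and the dropout rate are illustrative assumptions, not the paper's exact configuration.

```python
import random

# Hypothetical special tokens; the paper's actual vocabulary may differ.
SEP = "<sep>"                      # separates context from the current target
SPEAKER_TAGS = {"A": "<spkA>", "B": "<spkB>"}

def build_decoder_prefix(prev_translations, prev_speakers, current_speaker,
                         context_size=1, context_dropout=0.3, training=True):
    """Prepend the most recent target-language translations (with speaker tags)
    to the decoder input, applying context dropout during training."""
    context = prev_translations[-context_size:]
    speakers = prev_speakers[-context_size:]

    # Context dropout: occasionally train without context so the model stays
    # robust when no context is available at inference time.
    if training and random.random() < context_dropout:
        context, speakers = [], []

    pieces = [f"{SPEAKER_TAGS[spk]} {sent}" for spk, sent in zip(speakers, context)]
    prefix = " ".join(pieces)

    # The decoder continues after the separator with the current utterance's
    # translation; the training loss would be computed only on that part.
    return f"{prefix} {SEP} {SPEAKER_TAGS[current_speaker]}".strip()

# Example: one context utterance from speaker A, current utterance by speaker B.
print(build_decoder_prefix(["How was the trip?"], ["A"], current_speaker="B"))
```

In this sketch the model only ever conditions on previously generated target-language text, which avoids feeding extended audio segments through the encoder and so sidesteps the memory constraints mentioned above.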
Related papers
- Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling [40.32021786228235]
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting.
We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS.
To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk).
arXiv Detail & Related papers (2024-10-12T13:02:31Z)
- Long-form Simultaneous Speech Translation: Thesis Proposal [3.252719444437546]
Simultaneous speech translation (SST) aims to provide real-time translation of spoken language, even before the speaker finishes their sentence.
Deep learning has sparked significant interest in end-to-end (E2E) systems.
This thesis proposal addresses end-to-end simultaneous speech translation, particularly in the long-form setting.
arXiv Detail & Related papers (2023-10-17T10:44:05Z)
- Leveraging Large Text Corpora for End-to-End Speech Summarization [58.673480990374635]
End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech.
We present two novel methods that leverage a large amount of external text summarization data for E2E SSum training.
arXiv Detail & Related papers (2023-03-02T05:19:49Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language Understanding [14.157311972146692]
We propose a contextual E2E SLU model architecture that uses a multi-head attention mechanism over encoded previous utterances and dialogue acts.
Our method reduces average word and semantic error rates by 10.8% and 12.6%, respectively.
arXiv Detail & Related papers (2021-12-13T15:49:36Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
- Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information [33.79855927394387]
We explore contextual information from articulatory attributes as additional cues to further benefit speech enhancement (SE).
We propose to improve SE performance by leveraging losses from an end-to-end automatic speech recognition model.
Experimental results from speech denoising, speech dereverberation, and impaired speech enhancement tasks confirmed that contextual BPC information improves SE performance.
arXiv Detail & Related papers (2020-11-15T03:56:37Z)
- End-to-end Named Entity Recognition from English Speech [51.22888702264816]
We introduce the first publicly available NER-annotated dataset for English speech and present an E2E approach, which jointly optimizes the ASR and NER tagger components.
We also discuss how NER from speech can be used to handle out-of-vocabulary (OOV) words in an ASR system.
arXiv Detail & Related papers (2020-05-22T13:39:14Z)
- Contextual Neural Machine Translation Improves Translation of Cataphoric Pronouns [50.245845110446496]
We investigate the effect of future sentences as context by comparing the performance of a contextual NMT model trained with the future context to the one trained with the past context.
Our experiments and evaluation, using generic and pronoun-focused automatic metrics, show that the use of future context achieves significant improvements over the context-agnostic Transformer.
arXiv Detail & Related papers (2020-04-21T10:45:48Z)