Representation Purification for End-to-End Speech Translation
- URL: http://arxiv.org/abs/2412.04266v1
- Date: Thu, 05 Dec 2024 15:50:44 GMT
- Title: Representation Purification for End-to-End Speech Translation
- Authors: Chengwei Zhang, Yue Zhou, Rui Zhao, Yidong Chen, Xiaodong Shi
- Abstract summary: Speech-to-text translation (ST) is a cross-modal task that involves converting spoken language into text in a different language.
We conceptualize speech representation as a combination of content-agnostic and content-relevant factors.
- Score: 16.967317436711113
- Abstract: Speech-to-text translation (ST) is a cross-modal task that involves converting spoken language into text in a different language. Previous research primarily focused on enhancing speech translation by facilitating knowledge transfer from machine translation, exploring various methods to bridge the gap between speech and text modalities. Despite substantial progress, factors in speech that are irrelevant to the translation content, such as timbre and rhythm, often limit the efficiency of knowledge transfer. In this paper, we conceptualize speech representation as a combination of content-agnostic and content-relevant factors. We examine the impact of content-agnostic factors on translation performance through preliminary experiments and observe a significant performance deterioration when content-agnostic perturbations are introduced to speech signals. To address this issue, we propose a Speech Representation Purification with Supervision Enhancement (SRPSE) framework, which excludes the content-agnostic components within speech representations to mitigate their negative impact on ST. Experiments on the MuST-C and CoVoST-2 datasets demonstrate that SRPSE significantly improves translation performance across all translation directions in three settings and achieves preeminent performance under a transcript-free setting.
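The abstract does not detail the implementation, but the central decomposition can be sketched. The following is a minimal, illustrative sketch only: the module name, the linear-projection design, and the cosine-based purity penalty are assumptions, not the authors' method. It shows the idea of splitting an encoder output into content-relevant and content-agnostic branches and translating from the purified branch alone.

```python
# Illustrative sketch (not the authors' code): decompose a speech
# representation into content-relevant and content-agnostic factors,
# and feed only the purified (content-relevant) branch to translation.
import torch
import torch.nn as nn

class PurifiedSpeechEncoder(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.content_proj = nn.Linear(dim, dim)   # content-relevant factors
        self.agnostic_proj = nn.Linear(dim, dim)  # timbre, rhythm, etc.

    def forward(self, speech_repr: torch.Tensor):
        content = self.content_proj(speech_repr)
        agnostic = self.agnostic_proj(speech_repr)
        # Purification penalty: discourage overlap between the two factors.
        # (One simple choice; the paper's actual objective may differ.)
        overlap = torch.cosine_similarity(content, agnostic, dim=-1)
        purity_loss = overlap.abs().mean()
        return content, purity_loss

encoder = PurifiedSpeechEncoder()
frames = torch.randn(2, 100, 512)        # (batch, time, dim) dummy features
content, purity_loss = encoder(frames)   # translate from `content` only
```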
Related papers
- Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues [56.038123093599815]
Our objective is to translate continuous sign language into spoken language text.
We incorporate additional contextual cues together with the signing video.
We show that our contextual approach significantly enhances the quality of the translations.
arXiv Detail & Related papers (2025-01-16T18:59:03Z)
- Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? [7.682929772871941]
Prosody is rarely studied within the context of speech-to-text translation systems.
End-to-end (E2E) systems have direct access to the speech signal when making translation decisions.
A main challenge is the difficulty of evaluating prosody awareness in translation.
arXiv Detail & Related papers (2024-10-31T15:20:50Z)
- Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP [46.53595526049201]
A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images.
We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI).
SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance.
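As a rough illustration of the reweighting idea only (the function and weighting scheme below are assumptions, not SToRI's actual implementation): per-token embeddings are scaled by importance weights before being pooled into a single text embedding.

```python
# Hedged sketch of semantic token reweighting (names are illustrative):
# scale per-token embeddings by importance weights, then pool them into
# one text embedding so emphasized tokens dominate the result.
import torch

def reweighted_text_embedding(token_embs: torch.Tensor,
                              weights: torch.Tensor) -> torch.Tensor:
    """token_embs: (seq, dim); weights: (seq,) importance per token."""
    w = weights / weights.sum()                # normalize importances
    return (w.unsqueeze(-1) * token_embs).sum(dim=0)

tokens = torch.randn(5, 512)                   # dummy CLIP token embeddings
weights = torch.tensor([0.1, 0.1, 3.0, 0.1, 0.1])  # emphasize the 3rd token
emb = reweighted_text_embedding(tokens, weights)   # (512,) text embedding
```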
arXiv Detail & Related papers (2024-10-11T02:42:13Z)
- Soft Alignment of Modality Space for End-to-end Speech Translation [49.29045524083467]
End-to-end Speech Translation aims to convert speech into target text within a unified model.
The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer.
We introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities.
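A minimal sketch of adversarial modality alignment in the spirit of S-Align (the gradient-reversal setup and all names below are assumptions, not the paper's implementation): a discriminator learns to tell speech representations from text representations, while reversed gradients push the encoders to make the two spaces indistinguishable.

```python
# Illustrative adversarial alignment sketch, not the paper's code.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad  # reverse gradients flowing back into the encoders

discriminator = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 2))
speech_repr = torch.randn(8, 512, requires_grad=True)  # dummy encoder outputs
text_repr = torch.randn(8, 512, requires_grad=True)

reps = GradReverse.apply(torch.cat([speech_repr, text_repr], dim=0))
labels = torch.cat([torch.zeros(8), torch.ones(8)]).long()  # 0=speech, 1=text
adv_loss = nn.functional.cross_entropy(discriminator(reps), labels)
adv_loss.backward()  # encoders receive reversed (alignment) gradients
```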
arXiv Detail & Related papers (2023-12-18T06:08:51Z)
- MTCue: Learning Zero-Shot Control of Extra-Textual Attributes by Leveraging Unstructured Context in Neural Machine Translation [3.703767478524629]
This work introduces MTCue, a novel neural machine translation (NMT) framework that interprets all context (including discrete variables) as text.
MTCue learns an abstract representation of context, enabling transferability across different data settings.
MTCue significantly outperforms a "tagging" baseline at translating English text.
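A toy sketch of the context-as-text idea (the field names and joining scheme are invented for illustration and are not MTCue's actual format): discrete metadata is serialized into plain text and supplied to the NMT model alongside the source sentence.

```python
# Hedged sketch: render all context, including discrete variables,
# as text rather than as special tags or embeddings.
def serialize_context(metadata: dict) -> str:
    # Every attribute becomes plain text the model can read directly.
    return " ".join(f"{key}: {value}" for key, value in metadata.items())

context = serialize_context({"speaker_gender": "female",
                             "formality": "informal",
                             "domain": "subtitles"})
source = "How are you doing?"
model_input = context + " || " + source  # one way to join context and source
```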
arXiv Detail & Related papers (2023-05-25T10:06:08Z)
- Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task [26.703809355057224]
We conduct a detailed analysis to understand the impact of the auxiliary task on the primary task within the multitask learning framework.
Our analysis confirms that multitask learning tends to generate similar decoder representations from different modalities.
Inspired by these findings, we propose three methods to improve translation quality.
arXiv Detail & Related papers (2021-07-12T23:53:40Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
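A rough sketch of how a consecutive-decoding training target could be assembled (the special tokens below are assumptions, not COSTT's actual vocabulary): the single decoder is trained to emit the source transcript first, then the target translation, as one concatenated sequence.

```python
# Illustrative target construction for consecutive decoding.
def build_consecutive_target(transcript_tokens, translation_tokens):
    # Decoder output format: <bos> transcript <sep> translation <eos>
    return (["<bos>"] + transcript_tokens
            + ["<sep>"] + translation_tokens + ["<eos>"])

target = build_consecutive_target(
    ["how", "are", "you"],         # source-language transcript
    ["wie", "geht", "es", "dir"],  # target-language translation
)
print(target)
```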
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation [59.38247587308604]
We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset.
Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
arXiv Detail & Related papers (2020-03-30T21:35:09Z)