TransAug: Translate as Augmentation for Sentence Embeddings
- URL: http://arxiv.org/abs/2111.00157v1
- Date: Sat, 30 Oct 2021 03:13:28 GMT
- Title: TransAug: Translate as Augmentation for Sentence Embeddings
- Authors: Jue Wang, Haofan Wang, Xing Wu, Chaochen Gao, Debing Zhang
- Abstract summary: We present TransAug (Translate as Augmentation), which provide the first exploration of utilizing translated sentence pairs as data augmentation for text.
Instead of adopting an encoder trained in other languages setting, we first distill a Chinese encoder from a SimCSE encoder (pretrained in English), so that their embeddings are close in semantic space, which can be regraded as implicit data augmentation.
Our approach achieves a new state-of-art on standard semantic textual similarity (STS), outperforming both SimCSE and Sentence-T5, and the best performance in corresponding tracks on transfer tasks evaluated by SentEval.
- Score: 8.89078869712101
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While contrastive learning greatly advances the representation of sentence
embeddings, it is still limited by the size of the existing sentence datasets.
In this paper, we present TransAug (Translate as Augmentation), which provide
the first exploration of utilizing translated sentence pairs as data
augmentation for text, and introduce a two-stage paradigm to advances the
state-of-the-art sentence embeddings. Instead of adopting an encoder trained in
other languages setting, we first distill a Chinese encoder from a SimCSE
encoder (pretrained in English), so that their embeddings are close in semantic
space, which can be regraded as implicit data augmentation. Then, we only
update the English encoder via cross-lingual contrastive learning and frozen
the distilled Chinese encoder. Our approach achieves a new state-of-art on
standard semantic textual similarity (STS), outperforming both SimCSE and
Sentence-T5, and the best performance in corresponding tracks on transfer tasks
evaluated by SentEval.
Related papers
- Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues [56.038123093599815]
Our objective is to translate continuous sign language into spoken language text.
We incorporate additional contextual cues together with the signing video.
We show that our contextual approach significantly enhances the quality of the translations.
arXiv Detail & Related papers (2025-01-16T18:59:03Z) - Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation [28.07831604833682]
We investigate the issue of the decoder-only architecture to its lack of language transfer capability.
We propose dividing the decoding process into two stages so that target tokens are explicitly excluded in the first stage.
We impose contrastive learning on translation instructions, resulting in improved performance in zero-shot translation.
arXiv Detail & Related papers (2024-12-03T02:52:14Z) - Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach can significantly improve sentence embedding.
arXiv Detail & Related papers (2023-05-16T03:53:30Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T)
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - Trans-Encoder: Unsupervised sentence-pair modelling through self- and
mutual-distillations [22.40667024030858]
Bi-encoders produce fixed-dimensional sentence representations and are computationally efficient.
Cross-encoders can leverage their attention heads to exploit inter-sentence interactions for better performance.
Trans-Encoder combines the two learning paradigms into an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders.
arXiv Detail & Related papers (2021-09-27T14:06:47Z) - Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text
Models [10.645591218689058]
We provide the first exploration of text-to-text transformers (T5) sentence embeddings.
We investigate three methods for extracting T5 sentence embeddings.
Our encoder-only models outperforms BERT-based sentence embeddings on both transfer tasks and semantic textual similarity.
arXiv Detail & Related papers (2021-08-19T18:58:02Z) - Discrete Cosine Transform as Universal Sentence Encoder [10.355894890759377]
We use Discrete Cosine Transform (DCT) to generate universal sentence representation for different languages.
The experimental results clearly show the superior effectiveness of DCT encoding.
arXiv Detail & Related papers (2021-06-02T04:43:54Z) - Orthros: Non-autoregressive End-to-end Speech Translation with
Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter is used for selecting better translation among various length candidates generated from the former, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
arXiv Detail & Related papers (2020-10-25T06:35:30Z) - Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z) - Sign Language Transformers: Joint End-to-end Sign Language Recognition
and Translation [59.38247587308604]
We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset.
Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
arXiv Detail & Related papers (2020-03-30T21:35:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.