TransAug: Translate as Augmentation for Sentence Embeddings
- URL: http://arxiv.org/abs/2111.00157v1
- Date: Sat, 30 Oct 2021 03:13:28 GMT
- Title: TransAug: Translate as Augmentation for Sentence Embeddings
- Authors: Jue Wang, Haofan Wang, Xing Wu, Chaochen Gao, Debing Zhang
- Abstract summary: We present TransAug (Translate as Augmentation), which provides the first exploration of utilizing translated sentence pairs as data augmentation for text.
Instead of adopting an encoder trained in a multilingual setting, we first distill a Chinese encoder from a SimCSE encoder (pretrained in English), so that their embeddings are close in semantic space, which can be regarded as implicit data augmentation.
Our approach achieves a new state of the art on standard semantic textual similarity (STS), outperforming both SimCSE and Sentence-T5, and the best performance in the corresponding tracks of the transfer tasks evaluated by SentEval.
- Score: 8.89078869712101
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While contrastive learning greatly advances the representation of sentence
embeddings, it is still limited by the size of the existing sentence datasets.
In this paper, we present TransAug (Translate as Augmentation), which provides the first exploration of utilizing translated sentence pairs as data augmentation for text, and introduces a two-stage paradigm to advance state-of-the-art sentence embeddings. Instead of adopting an encoder trained in a multilingual setting, we first distill a Chinese encoder from a SimCSE encoder (pretrained in English), so that their embeddings are close in semantic space, which can be regarded as implicit data augmentation. Then, we update only the English encoder via cross-lingual contrastive learning while keeping the distilled Chinese encoder frozen. Our approach achieves a new state of the art on standard semantic textual similarity (STS), outperforming both SimCSE and Sentence-T5, and the best performance in the corresponding tracks of the transfer tasks evaluated by SentEval.
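The second training stage described above (updating only the English encoder against a frozen Chinese encoder) amounts to a cross-lingual contrastive objective with in-batch negatives. Below is a minimal NumPy sketch of such a loss; the function name, batch construction, and temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def info_nce_loss(en_emb, zh_emb, temperature=0.05):
    """Cross-lingual InfoNCE: each English sentence embedding should be
    closest to its (frozen-encoder) Chinese translation's embedding, with
    the other pairs in the batch serving as negatives."""
    # L2-normalize so dot products are cosine similarities
    en = en_emb / np.linalg.norm(en_emb, axis=1, keepdims=True)
    zh = zh_emb / np.linalg.norm(zh_emb, axis=1, keepdims=True)
    sim = en @ zh.T / temperature            # (batch, batch) similarity matrix
    # softmax cross-entropy with the diagonal (true pairs) as positives
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: matched pairs are near-duplicates, so the loss is near zero.
rng = np.random.default_rng(0)
zh = rng.normal(size=(4, 16))                # frozen Chinese encoder outputs
en = zh + 0.01 * rng.normal(size=(4, 16))    # trainable English encoder outputs
loss = info_nce_loss(en, zh)
```

In the paper's setup only the English side would receive gradients from this loss; the frozen Chinese embeddings act as fixed anchors in the shared semantic space.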
Related papers
- Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach can significantly improve sentence embeddings.
arXiv Detail & Related papers (2023-05-16T03:53:30Z)
- UTSGAN: Unseen Transition Suss GAN for Transition-Aware Image-to-image Translation [57.99923293611923]
We introduce a transition-aware approach to I2I translation, where the data translation mapping is explicitly parameterized with a transition variable.
We propose the use of transition consistency, defined on the transition variable, to enable regularization of consistency on unobserved translations.
Based on these insights, we present Unseen Transition Suss GAN (UTSGAN), a generative framework that constructs a manifold for the transition with a transition encoder.
arXiv Detail & Related papers (2023-04-24T09:47:34Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model explicitly via Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations [22.40667024030858]
Bi-encoders produce fixed-dimensional sentence representations and are computationally efficient.
Cross-encoders can leverage their attention heads to exploit inter-sentence interactions for better performance.
Trans-Encoder combines the two learning paradigms into an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders.
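The efficiency trade-off behind the bi- vs cross-encoder distinction can be made concrete by counting encoder forward passes needed to score every sentence pair in a collection. The snippet below is a toy count only (no real model is involved), but the asymmetry it shows is why bi-encoders are preferred for large-scale retrieval.

```python
from itertools import combinations

sentences = [f"sentence {i}" for i in range(10)]

# Bi-encoder: encode each sentence once, independently; pairwise scores are
# then cheap vector comparisons over cached embeddings.
bi_encoder_calls = len(sentences)

# Cross-encoder: every pair must be run through the model jointly, because
# its attention heads mix tokens across both sentences.
cross_encoder_calls = sum(1 for _ in combinations(sentences, 2))

print(bi_encoder_calls, cross_encoder_calls)  # → 10 45
```

For n sentences this is n encoder calls versus n(n-1)/2, which is the gap Trans-Encoder's iterative distillation between the two paradigms tries to exploit.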
arXiv Detail & Related papers (2021-09-27T14:06:47Z)
- Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models [10.645591218689058]
We provide the first exploration of text-to-text transformers (T5) sentence embeddings.
We investigate three methods for extracting T5 sentence embeddings.
Our encoder-only models outperform BERT-based sentence embeddings on both transfer tasks and semantic textual similarity.
arXiv Detail & Related papers (2021-08-19T18:58:02Z)
- Discrete Cosine Transform as Universal Sentence Encoder [10.355894890759377]
We use Discrete Cosine Transform (DCT) to generate universal sentence representation for different languages.
The experimental results clearly show the superior effectiveness of DCT encoding.
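As a rough illustration of the DCT idea (not that paper's exact configuration): stack a sentence's word vectors, take a DCT along the sequence axis, and keep the first few coefficients per embedding dimension, which yields a fixed-size sentence vector regardless of sentence length. The helper names and the coefficient count below are assumptions made for this sketch.

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II along the first axis (the word-sequence axis)."""
    n = x.shape[0]
    k = np.arange(n)[:, None]          # frequency index
    t = np.arange(n)[None, :]          # position index
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] *= 1 / np.sqrt(2)         # DC row rescaled for orthonormality
    return np.sqrt(2 / n) * basis @ x

def dct_sentence_embedding(word_vectors, num_coeffs=2):
    """Keep the first `num_coeffs` DCT coefficients per embedding dimension
    and flatten them into a single fixed-size sentence vector."""
    coeffs = dct_ii(np.asarray(word_vectors))
    return coeffs[:num_coeffs].ravel()

# Sentences of different lengths map to vectors of the same size.
rng = np.random.default_rng(1)
short = dct_sentence_embedding(rng.normal(size=(3, 50)))   # 3-word sentence
long_ = dct_sentence_embedding(rng.normal(size=(12, 50)))  # 12-word sentence
```

The first coefficient is proportional to mean pooling, while higher coefficients capture word-order information, which is the language-agnostic property the entry above appeals to.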
arXiv Detail & Related papers (2021-06-02T04:43:54Z)
- Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders [30.160261563657947]
Speech-to-translation data is scarce; pre-training is promising in end-to-end Speech Translation.
We propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation.
Our encoder begins with processing the acoustic sequence as usual, but later behaves more like an MT encoder for a global representation of the input sequence.
arXiv Detail & Related papers (2021-05-12T16:09:53Z)
- Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter is used to select a better translation among the various length candidates generated by the former, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
arXiv Detail & Related papers (2020-10-25T06:35:30Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation [59.38247587308604]
We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset.
Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
arXiv Detail & Related papers (2020-03-30T21:35:09Z)