Paraphrase Generation as Unsupervised Machine Translation
- URL: http://arxiv.org/abs/2109.02950v1
- Date: Tue, 7 Sep 2021 09:08:58 GMT
- Title: Paraphrase Generation as Unsupervised Machine Translation
- Authors: Chun Fan, Yufei Tian, Yuxian Meng, Nanyun Peng, Xiaofei Sun, Fei Wu
and Jiwei Li
- Abstract summary: We propose a new paradigm for paraphrase generation by treating the task as unsupervised machine translation (UMT).
The proposed paradigm first splits a large unlabeled corpus into multiple clusters, and trains multiple UMT models using pairs of these clusters.
Then, based on the paraphrase pairs produced by these UMT models, a unified surrogate model can be trained to serve as the final Seq2Seq model to generate paraphrases.
- Score: 30.99150547499427
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new paradigm for paraphrase generation by
treating the task as unsupervised machine translation (UMT) based on the
assumption that there must be pairs of sentences expressing the same meaning in
a large-scale unlabeled monolingual corpus. The proposed paradigm first splits
a large unlabeled corpus into multiple clusters, and trains multiple UMT models
using pairs of these clusters. Then, based on the paraphrase pairs produced by
these UMT models, a unified surrogate model can be trained to serve as the
final Seq2Seq model for generating paraphrases, which can be used directly for
testing in the unsupervised setup, or finetuned on labeled datasets in the
supervised setup. The proposed method offers advantages over
machine-translation-based paraphrase generation methods, as it avoids reliance
on bilingual sentence pairs. It also allows human intervention in the model, so
that more diverse paraphrases can be generated under different filtering
criteria. Extensive experiments on existing paraphrase datasets for both the
supervised and unsupervised setups demonstrate the effectiveness of the
proposed paradigm.
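As a rough sketch of this pipeline (in Python, with cluster_fn, train_umt, and keep_fn as hypothetical stand-ins for components the abstract only names, not a released API):

```python
from itertools import combinations

def build_pseudo_paraphrases(corpus, k, cluster_fn, train_umt, keep_fn):
    """Sketch of the proposed pipeline: split an unlabeled corpus into k
    clusters, train a UMT model per cluster pair, and treat its outputs
    as pseudo paraphrase pairs. All helpers are hypothetical stand-ins."""
    clusters = cluster_fn(corpus, k)               # e.g. k-means over sentence embeddings
    pairs = []
    for a, b in combinations(range(k), 2):
        umt = train_umt(clusters[a], clusters[b])  # unsupervised MT between the two clusters
        for sent in clusters[a]:
            cand = umt.translate(sent)             # "translation" = candidate paraphrase
            if keep_fn(sent, cand):                # human-chosen filtering criterion
                pairs.append((sent, cand))
    return pairs                                   # training data for the unified surrogate Seq2Seq
```

The keep_fn filter is where the human intervention mentioned above enters: swapping in different filtering criteria yields different, more or less diverse, training pairs for the surrogate model.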
Related papers
- Modeling Sequential Sentence Relation to Improve Cross-lingual Dense
Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
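One plausible reading of this objective, sketched as a standard InfoNCE-style contrastive loss over sentence vectors (our illustration of the summary, not the authors' code):

```python
import torch
import torch.nn.functional as F

def masked_sentence_contrastive_loss(predicted, target, negatives, tau=0.05):
    """The document encoder's output at a masked position ('predicted',
    shape (B, D)) should be close to the true sentence vector ('target',
    (B, D)) and far from sampled negatives ((B, N, D))."""
    pos = F.cosine_similarity(predicted, target, dim=-1) / tau                   # (B,)
    neg = F.cosine_similarity(predicted.unsqueeze(1), negatives, dim=-1) / tau   # (B, N)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                           # positive at index 0
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)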
arXiv Detail & Related papers (2023-02-03T09:54:27Z)
- Unsupervised Syntactically Controlled Paraphrase Generation with Abstract Meaning Representations [59.10748929158525]
Abstract Meaning Representations (AMR) can greatly improve the performance of unsupervised syntactically controlled paraphrase generation.
Our proposed model, the AMR-enhanced Paraphrase Generator (AMRPG), encodes the AMR graph and the constituency parse of the input sentence into two disentangled semantic and syntactic embeddings.
Experiments show that AMRPG generates more accurate syntactically controlled paraphrases, both quantitatively and qualitatively, compared to the existing unsupervised approaches.
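A minimal sketch of this disentangled two-encoder design, with every component a hypothetical placeholder:

```python
import torch

def amrpg_generate(amr_graph, parse, semantic_enc, syntactic_enc, decoder):
    """Illustrative AMRPG-style forward pass: meaning and syntax are
    encoded separately into disentangled embeddings, then both condition
    the decoder. All callables are placeholders, not the authors' code."""
    z_semantic = semantic_enc(amr_graph)      # what to say: AMR of the source sentence
    z_syntactic = syntactic_enc(parse)        # how to say it: target constituency parse
    return decoder(torch.cat([z_semantic, z_syntactic], dim=-1))
```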
arXiv Detail & Related papers (2022-11-02T04:58:38Z)
- ConRPG: Paraphrase Generation using Contexts as Regularizer [31.967883219986362]
A long-standing issue with paraphrase generation is how to obtain reliable supervision signals.
We propose an unsupervised paradigm for paraphrase generation based on the assumption that the probabilities of generating two sentences with the same meaning given the same context should be the same.
We propose a pipelined system which consists of paraphrase candidate generation based on contextual language models, candidate filtering using scoring functions, and paraphrase model training based on the selected candidates.
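The three stages can be sketched as follows (context_lm, score_fns, and train_seq2seq are hypothetical placeholders for the paper's components):

```python
def conrpg_pipeline(contexts_and_sents, context_lm, score_fns, threshold, train_seq2seq):
    """Sketch of the three ConRPG stages described above."""
    selected = []
    for context, sent in contexts_and_sents:
        cand = context_lm.generate(context)               # stage 1: regenerate the sentence from its context
        scores = [fn(sent, cand) for fn in score_fns]     # stage 2: score the candidate pair
        if all(s >= threshold for s in scores):
            selected.append((sent, cand))
    return train_seq2seq(selected)                        # stage 3: train on the selected candidates
```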
arXiv Detail & Related papers (2021-09-01T12:57:30Z)
- Bilingual alignment transfers to multilingual alignment for unsupervised parallel text mining [3.4519649635864584]
This work presents methods for learning cross-lingual sentence representations using paired or unpaired bilingual texts.
We hypothesize that the cross-lingual alignment strategy is transferable, and therefore a model trained to align only two languages can encode representations that are more aligned across many languages.
arXiv Detail & Related papers (2021-04-15T17:51:22Z)
- A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings [50.524054820564395]
We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.
The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.
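As a toy illustration of the general setup only (a plain GRU encoder standing in for the paper's correspondence VAE), mapping a variable-length frame sequence to one fixed-size vector:

```python
import torch
import torch.nn as nn

class AcousticWordEncoder(nn.Module):
    """Toy stand-in for the general problem: summarize a variable-duration
    segment as a fixed-dimensional embedding. Not the paper's model."""
    def __init__(self, n_mels=40, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, frames):        # frames: (batch, time, n_mels), time may vary
        _, h = self.rnn(frames)       # final hidden state summarizes the segment
        return h[-1]                  # (batch, dim) fixed-size embedding
```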
arXiv Detail & Related papers (2020-12-03T19:24:42Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
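A guess at the core of Dynamic Blocking based on this summary alone (the reference implementation may differ): if the decoder has just emitted a token that also occurs in the source, probabilistically forbid the source token that follows it, forcing the surface form to diverge.

```python
import random
import torch

def dynamic_blocking(source_ids, prev_id, logits, p_block=0.5):
    """Sketch of the Dynamic Blocking idea as we read the abstract:
    block successors of copied source tokens at the next decoding step.
    Details here are our paraphrase, not the reference code."""
    for i, tok in enumerate(source_ids[:-1]):
        if tok == prev_id and random.random() < p_block:
            logits[source_ids[i + 1]] = float("-inf")  # forbid the successor token
    return logits
```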
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to be able to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
- Unsupervised Paraphrase Generation using Pre-trained Language Models [0.0]
OpenAI's GPT-2 is notable for its capability to generate fluent, well-formulated, grammatically consistent text.
We leverage this generation capability of GPT-2 to generate paraphrases without any supervision from labelled data.
Our experiments show that paraphrases generated with our model are of good quality, are diverse, and improve downstream task performance when used for data augmentation.
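A minimal sketch of the generation side with off-the-shelf GPT-2 via Hugging Face transformers; the prompt format is an assumption for illustration, and the paper's task-specific adaptation is not reproduced here:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical prompt format; the paper's own prompting may differ.
prompt = "Input: How can I improve my writing skills?\nParaphrase:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,                        # sampling yields more diverse paraphrases
    top_p=0.9,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token by default
)
for seq in outputs:
    print(tokenizer.decode(seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```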
arXiv Detail & Related papers (2020-06-09T19:40:19Z)
- A Probabilistic Formulation of Unsupervised Text Style Transfer [128.80213211598752]
We present a deep generative model for unsupervised text style transfer that unifies previously proposed non-generative techniques.
By hypothesizing a parallel latent sequence that generates each observed sequence, our model learns to transform sequences from one domain to another in a completely unsupervised fashion.
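In equation form, our reconstruction of this generative story (with p_LM a language-model prior over the other domain and q an amortized inference network; notation is ours, not the paper's):

```latex
% Each observed sentence x in domain D1 is explained by a latent
% parallel sentence y in domain D2, scored by a language-model prior.
\log p(x) = \log \sum_{y} p_{\mathrm{dec}}(x \mid y)\, p_{\mathrm{LM}}(y)
% The intractable sum is lower-bounded via amortized variational inference:
\log p(x) \ge \mathbb{E}_{q(y \mid x)}\big[\log p_{\mathrm{dec}}(x \mid y)\big]
  - \mathrm{KL}\big(q(y \mid x) \,\|\, p_{\mathrm{LM}}(y)\big)
```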
arXiv Detail & Related papers (2020-02-10T16:20:49Z)