Semi-Supervised Text Simplification with Back-Translation and Asymmetric
Denoising Autoencoders
- URL: http://arxiv.org/abs/2004.14693v1
- Date: Thu, 30 Apr 2020 11:19:04 GMT
- Title: Semi-Supervised Text Simplification with Back-Translation and Asymmetric
Denoising Autoencoders
- Authors: Yanbin Zhao, Lu Chen, Zhi Chen, Kai Yu
- Abstract summary: Text simplification (TS) rephrases long sentences into simplified variants while preserving inherent semantics.
This work investigates how to leverage large amounts of unpaired corpora in the TS task.
We propose asymmetric denoising methods for sentences of different complexity levels.
- Score: 37.949101113934226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text simplification (TS) rephrases long sentences into simplified variants
while preserving inherent semantics. Traditional sequence-to-sequence models
heavily rely on the quantity and quality of parallel sentences, which limits
their applicability in different languages and domains. This work investigates
how to leverage large amounts of unpaired corpora in the TS task. We adopt the
back-translation architecture from unsupervised neural machine translation (NMT),
including denoising autoencoders for language modeling and automatic generation
of parallel data by iterative back-translation. However, it is non-trivial to
generate appropriate complex-simple pairs if we directly treat the simple and
complex corpora as two different languages, since the two types of sentences are
quite similar and it is hard for the model to capture their distinguishing
characteristics. To tackle this problem, we propose asymmetric denoising methods
for sentences of different complexity levels. When modeling simple and complex
sentences with autoencoders, we introduce different types of noise into the
training process. This significantly improves simplification performance. Our
model can be trained in both unsupervised and semi-supervised manners. Automatic
and human
evaluations show that our unsupervised model outperforms the previous systems,
and with limited supervision, our model can perform competitively with multiple
state-of-the-art simplification systems.
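
To make the asymmetric denoising idea concrete, the sketch below corrupts simple and complex sentences with different amounts of noise before each denoising autoencoder reconstructs the clean input. The particular noise operations (word dropout and bounded local shuffling) and the rates are illustrative assumptions, not necessarily the paper's exact noise design.

```python
import random

def add_noise(tokens, drop_prob, shuffle_window):
    """Corrupt a token list with word dropout and bounded local shuffling."""
    kept = [t for t in tokens if random.random() > drop_prob]
    if not kept:                      # never return an empty sequence
        kept = list(tokens)
    # Each token may move at most shuffle_window positions from its origin.
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]

def noise_simple(tokens):
    # Heavier corruption for the simple-sentence autoencoder (assumed rates).
    return add_noise(tokens, drop_prob=0.2, shuffle_window=3)

def noise_complex(tokens):
    # Lighter corruption for the complex-sentence autoencoder (assumed rates).
    return add_noise(tokens, drop_prob=0.1, shuffle_window=1)

# Each autoencoder is trained to reconstruct the clean sentence from its
# corrupted version, e.g. loss = seq2seq(noise_simple(s), target=s).
print(noise_simple("the cat sat on the mat".split()))
```

Giving the two autoencoders different corruption profiles is one way to force them to model the distinct characteristics of simple and complex text rather than learning a near-copy function.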
Related papers
- Split and Rephrase with Large Language Models [2.499907423888049]
The Split and Rephrase (SPRP) task consists of splitting complex sentences into a sequence of shorter grammatical sentences.
We evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics.
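
As a rough illustration of how an LLM can be prompted for SPRP, the sketch below builds a one-shot prompt and delegates generation to a generic `generate` callable; the prompt wording and the `generate` stand-in are hypothetical, not the paper's actual setup.

```python
def split_and_rephrase(sentence: str, generate) -> list[str]:
    """Rewrite one complex sentence as several short, grammatical sentences.

    `generate` is any callable mapping a prompt string to a completion string
    (a hypothetical stand-in for a concrete LLM API).
    """
    prompt = (
        "Rewrite the complex sentence as a sequence of short, grammatical "
        "sentences that preserve its meaning.\n\n"
        "Complex: Although it was raining, the match, which had been "
        "postponed twice, finally started.\n"
        "Simple: It was raining. The match had been postponed twice. "
        "The match finally started.\n\n"
        f"Complex: {sentence}\n"
        "Simple:"
    )
    completion = generate(prompt)
    # One output sentence per period, trimmed of surrounding whitespace.
    return [s.strip() + "." for s in completion.split(".") if s.strip()]
```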
arXiv Detail & Related papers (2023-12-18T10:16:37Z)
- Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training [0.0]
We propose a two-step approach to overcome the data scarcity issue.
First, we fine-tune language models on a corpus of German Easy Language, a specific style of German.
We show that the language models adapt to the style characteristics of Easy Language and output more accessible texts.
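
A minimal sketch of the first step, style-specific fine-tuning on an Easy Language corpus with Hugging Face Transformers, is shown below; the model name, corpus file, and hyperparameters are placeholder assumptions rather than the authors' configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "dbmdz/german-gpt2"  # placeholder German LM, not the paper's choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# easy_language.txt: one Easy Language sentence per line (hypothetical file).
dataset = load_dataset("text", data_files={"train": "easy_language.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM loss

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="easy-lm", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

This covers only the style-adaptation step; the second step of the approach is not shown here.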
arXiv Detail & Related papers (2023-05-22T10:41:30Z)
- Hierarchical Phrase-based Sequence-to-Sequence Learning [94.10257313923478]
We describe a neural transducer that maintains the flexibility of standard sequence-to-sequence (seq2seq) models while incorporating hierarchical phrases as a source of inductive bias during training and as explicit constraints during inference.
Our approach trains two models: a discriminative parser based on a bracketing grammar, whose derivation tree hierarchically aligns source and target phrases, and a neural seq2seq model that learns to translate the aligned phrases one-by-one.
arXiv Detail & Related papers (2022-11-15T05:22:40Z)
- A Template-based Method for Constrained Neural Machine Translation [100.02590022551718]
We propose a template-based method that yields high translation quality and match accuracy while maintaining decoding speed.
The generation and derivation of the template can be learned through one sequence-to-sequence training framework.
Experimental results show that the proposed template-based methods can outperform several representative baselines in lexically and structurally constrained translation tasks.
arXiv Detail & Related papers (2022-05-23T12:24:34Z)
- Unsupervised Mismatch Localization in Cross-Modal Sequential Data [5.932046800902776]
We develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal data.
We propose a hierarchical Bayesian deep learning model, named mismatch localization variational autoencoder (ML-VAE), that decomposes the generative process of the speech into hierarchically structured latent variables.
Our experimental results show that ML-VAE successfully locates the mismatch between text and speech, without the need for human annotations.
arXiv Detail & Related papers (2022-05-05T14:23:27Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Structured Reordering for Modeling Latent Alignments in Sequence Transduction [86.94309120789396]
We present an efficient dynamic programming algorithm performing exact marginal inference of separable permutations.
The resulting seq2seq model exhibits better systematic generalization than standard models on synthetic problems and NLP tasks.
arXiv Detail & Related papers (2021-06-06T21:53:54Z)
- Simplify-then-Translate: Automatic Preprocessing for Black-Box Machine Translation [5.480070710278571]
We introduce a method to improve black-box machine translation systems via automatic pre-processing (APP) using sentence simplification.
We first propose a method to automatically generate a large in-domain paraphrase corpus through back-translation with a black-box MT system.
We show that this preprocessing leads to better translation performance as compared to non-preprocessed source sentences.
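
The corpus-construction step can be sketched as round-trip translation through the black-box MT system; `translate` below is a hypothetical stand-in for whatever MT API is available, and the pivot language is an illustrative choice.

```python
def build_paraphrase_corpus(sentences, translate, pivot="de"):
    """Generate (original, paraphrase) pairs by round-tripping each sentence
    through a black-box MT system.

    `translate(text, src_lang, tgt_lang)` is a hypothetical stand-in for the
    black-box MT API; the pivot language is an assumed choice.
    """
    pairs = []
    for src in sentences:
        pivot_text = translate(src, src_lang="en", tgt_lang=pivot)
        back = translate(pivot_text, src_lang=pivot, tgt_lang="en")
        if back.strip().lower() != src.strip().lower():  # keep non-trivial pairs
            pairs.append((src, back))
    return pairs

# The resulting pairs can then train a simplification model that rewrites
# source sentences before they are sent to the MT system (the APP step).
```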
arXiv Detail & Related papers (2020-05-22T14:15:53Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
However, VAEs with a strong auto-regressive decoder tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
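
One common way to realize a compact discrete latent space is a vector-quantized bottleneck; the sketch below shows that generic mechanism (nearest-codebook lookup with a straight-through gradient) and is not necessarily the exact formulation used in the paper.

```python
import torch
import torch.nn as nn

class DiscreteBottleneck(nn.Module):
    """Map continuous encoder states to the nearest entry of a learned codebook."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, dim) continuous latent vectors from the encoder.
        distances = torch.cdist(z, self.codebook.weight)  # (batch, num_codes)
        codes = distances.argmin(dim=-1)                  # nearest code per example
        quantized = self.codebook(codes)                  # (batch, dim)
        # Straight-through estimator: the forward pass uses the discrete code,
        # while gradients flow to z as if quantization were the identity.
        return z + (quantized - z).detach()

bottleneck = DiscreteBottleneck(num_codes=512, dim=64)
print(bottleneck(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```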
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.