Unsupervised Text Embedding Space Generation Using Generative
Adversarial Networks for Text Synthesis
- URL: http://arxiv.org/abs/2306.17181v4
- Date: Tue, 17 Oct 2023 10:41:12 GMT
- Title: Unsupervised Text Embedding Space Generation Using Generative
Adversarial Networks for Text Synthesis
- Authors: Jun-Min Lee, Tae-Bin Ha
- Abstract summary: We propose Text Embedding Space Generative Adversarial Networks (TESGAN) to solve the gradient backpropagation problem.
TESGAN conducts unsupervised learning which does not directly refer to the text of the training data to overcome the data memorization issue.
- Score: 0.43512163406551996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A Generative Adversarial Network (GAN) is a model for data synthesis that
creates plausible data through competition between a generator and a discriminator.
Although the application of GANs to image synthesis has been studied extensively, GANs have
inherent limitations in natural language generation. Because natural language
is composed of discrete tokens, a generator has difficulty receiving gradient
updates through backpropagation; therefore, most text-GAN studies generate
sentences starting from a random token, guided by a reward system. Thus, the
generators of previous studies are pre-trained in an autoregressive way before
adversarial training, which causes data memorization: synthesized sentences
reproduce the training data. In this paper, we synthesize sentences using a
framework similar to the original GAN. More specifically, we propose Text
Embedding Space Generative Adversarial Networks (TESGAN) which generate
continuous text embedding spaces instead of discrete tokens to solve the
gradient backpropagation problem. Furthermore, TESGAN conducts unsupervised
learning which does not directly refer to the text of the training data to
overcome the data memorization issue. By adopting this novel method, TESGAN can
synthesize new sentences, showing the potential of unsupervised learning for
text synthesis. We expect to see extended research combining Large Language
Models with a new perspective of viewing text as a continuous space.
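The gradient backpropagation problem described in the abstract can be illustrated with a small sketch. This is not TESGAN itself: the three-word vocabulary, the `EMBEDDINGS` table, the `TARGET` point, and the toy `critic` are all hypothetical stand-ins, and gradients are estimated by finite differences only so that the example needs nothing beyond the Python standard library. It compares a generator that commits to a discrete token through an argmax with one that emits a point in a continuous embedding space.

```python
# Hypothetical toy setup (not from the paper): 3 tokens, 2-dimensional embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
TARGET = [0.2, 0.7]  # stand-in for a point the critic treats as "real"


def critic(vec):
    """Toy differentiable critic: negative squared distance to TARGET."""
    return -sum((v - t) ** 2 for v, t in zip(vec, TARGET))


def discrete_path(theta):
    """Generator picks the argmax token, then the critic scores its embedding."""
    token = max(range(len(theta)), key=lambda i: theta[i])
    return critic(EMBEDDINGS[token])


def continuous_path(theta):
    """Generator emits a continuous embedding (a linear map of theta)."""
    dim = len(EMBEDDINGS[0])
    vec = [sum(theta[i] * EMBEDDINGS[i][d] for i in range(len(theta)))
           for d in range(dim)]
    return critic(vec)


def finite_diff_grad(fn, theta, eps=1e-4):
    """Central-difference estimate of d fn / d theta."""
    grads = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grads.append((fn(up) - fn(down)) / (2 * eps))
    return grads


theta = [0.5, 0.3, -0.2]  # hypothetical generator parameters
discrete_grad = finite_diff_grad(discrete_path, theta)
continuous_grad = finite_diff_grad(continuous_path, theta)

if __name__ == "__main__":
    print("discrete path gradient:  ", discrete_grad)
    print("continuous path gradient:", continuous_grad)
```

In this toy setting, small changes to `theta` do not change which token the argmax selects, so the discrete path gives the generator no learning signal, while the continuous path yields a nonzero gradient for every parameter.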
Related papers
- Enhancing Text Generation in Joint NLG/NLU Learning Through Curriculum Learning, Semi-Supervised Training, and Advanced Optimization Techniques [0.0]
This research paper developed a novel approach to improve text generation in the context of joint Natural Language Generation (NLG) and Natural Language Understanding (NLU) learning.
The data is prepared by gathering and preprocessing annotated datasets, including cleaning, tokenization, stemming, and stop-word removal.
Transformer-based encoders and decoders are used to capture long-range dependencies and improve source-target sequence modelling.
Reinforcement learning with policy gradient techniques, semi-supervised training, improved attention mechanisms, and differentiable approximations are employed to fine-tune the models and handle complex linguistic tasks effectively.
arXiv Detail & Related papers (2024-10-17T12:43:49Z)
- Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
- A survey on text generation using generative adversarial networks [0.0]
This work presents a thorough review concerning recent studies and text generation advancements using Generative Adversarial Networks.
The usage of adversarial learning for text generation is promising as it provides alternatives to generate the so-called "natural" language.
arXiv Detail & Related papers (2022-12-20T17:54:08Z)
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications [0.02578242050187029]
This paper presents two datasets comprised of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model.
We evaluate the quality of the datasets by comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2022-02-04T08:16:56Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Collaborative Training of GANs in Continuous and Discrete Spaces for Text Generation [21.435286755934534]
We propose a novel text GAN architecture that promotes the collaborative training of the continuous-space and discrete-space methods.
Our model substantially outperforms state-of-the-art text GANs with respect to quality, diversity, and global consistency.
arXiv Detail & Related papers (2020-10-16T07:51:16Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
- PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation [92.7366819044397]
Self-supervised pre-training has emerged as a powerful technique for natural language understanding and generation.
This work presents PALM with a novel scheme that jointly pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus.
An extensive set of experiments shows that PALM achieves new state-of-the-art results on a variety of language generation benchmarks.
arXiv Detail & Related papers (2020-04-14T06:25:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.