Textual Data Distributions: Kullback Leibler Textual Distributions
Contrasts on GPT-2 Generated Texts, with Supervised, Unsupervised Learning on
Vaccine & Market Topics & Sentiment
- URL: http://arxiv.org/abs/2107.02025v1
- Date: Tue, 15 Jun 2021 21:30:46 GMT
- Title: Textual Data Distributions: Kullback Leibler Textual Distributions
Contrasts on GPT-2 Generated Texts, with Supervised, Unsupervised Learning on
Vaccine & Market Topics & Sentiment
- Authors: Jim Samuel, Ratnakar Palle and Eduardo Correa Soares
- Abstract summary: Efficient textual data distributions (TDD) alignment and generation are open research problems in textual analytics and NLP.
We develop a unique process-driven variation of Kullback-Leibler divergence applied to TDD, named KL Textual Distributions Contrasts (KL-TDC).
This study thus identifies a unique approach for generating and validating TDD by topic and sentiment.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Efficient textual data distributions (TDD) alignment and generation are open
research problems in textual analytics and NLP. It is presently difficult to
parsimoniously and methodologically confirm that two or more natural language
datasets belong to similar distributions, and to identify the extent to which
textual data possess alignment. This study focuses on addressing a segment of
the broader problem described above by applying multiple supervised and
unsupervised machine learning (ML) methods to explore the behavior of TDD by
(i) topical alignment, and (ii) sentiment alignment. Furthermore, we use
multiple text generation methods, including fine-tuned GPT-2, to generate text
by topic and by sentiment. Finally, we develop a unique process-driven
variation of Kullback-Leibler divergence (KLD) applied to TDD, named KL Textual
Distributions Contrasts (KL-TDC), to identify the alignment of machine-generated
textual corpora with naturally occurring textual corpora. This study thus
identifies a unique approach for generating and validating TDD by topic and
sentiment, which can be used to help address sparse data problems and other
research, practice and classroom situations in need of artificially generated
topic or sentiment aligned textual data.
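The abstract does not spell out the exact KL-TDC procedure, but the core idea of contrasting a machine-generated corpus against a naturally occurring one via KL divergence can be sketched as follows. This is a minimal illustration, assuming unigram (word-frequency) distributions with additive smoothing; the function and corpus names (`kl_divergence`, `natural`, `generated`) are hypothetical, not from the paper.

```python
from collections import Counter
import math

def kl_divergence(corpus_p, corpus_q, epsilon=1e-9):
    """Smoothed KL divergence D(P || Q) between the unigram
    distributions of two corpora (lists of documents)."""
    counts_p = Counter(w for doc in corpus_p for w in doc.lower().split())
    counts_q = Counter(w for doc in corpus_q for w in doc.lower().split())
    vocab = set(counts_p) | set(counts_q)
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    kl = 0.0
    for w in vocab:
        # Additive smoothing so that words absent from one corpus
        # do not produce log(0) or division by zero.
        p = (counts_p[w] + epsilon) / (total_p + epsilon * len(vocab))
        q = (counts_q[w] + epsilon) / (total_q + epsilon * len(vocab))
        kl += p * math.log(p / q)
    return kl

# Toy corpora on the paper's vaccine/market themes (illustrative only).
natural = ["the vaccine rollout was effective",
           "markets rallied on vaccine news"]
generated = ["the vaccine news moved the markets",
             "vaccine supply was effective"]

print(kl_divergence(natural, natural))    # identical corpora -> 0.0
print(kl_divergence(natural, generated))  # positive for diverging corpora
```

A low KL value would indicate that the generated corpus is distributionally aligned with the natural one; in practice one would compare against a threshold calibrated on natural-vs-natural contrasts.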
Related papers
- Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model.
A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances.
Tests demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z)
- Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text [61.22649031769564]
We propose a novel framework, paraphrased text span detection (PTD).
PTD aims to identify paraphrased text spans within a text.
We construct a dedicated dataset, PASTED, for paraphrased text span detection.
arXiv Detail & Related papers (2024-05-21T11:22:27Z)
- Human-in-the-Loop Synthetic Text Data Inspection with Provenance Tracking [11.022295941449919]
We develop INSPECTOR, a human-in-the-loop data inspection technique.
In a user study, INSPECTOR increases the number of texts with correct labels identified by 3X on a sentiment analysis task and by 4X on a hate speech detection task.
arXiv Detail & Related papers (2024-04-29T17:16:27Z)
- Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs [55.80189506270598]
X-PARADE is the first cross-lingual dataset of paragraph-level information divergences.
Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language.
Aligned paragraphs are sourced from Wikipedia pages in different languages.
arXiv Detail & Related papers (2023-09-16T04:34:55Z)
- DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text [82.5469544192645]
We propose a novel training-free detection strategy called Divergent N-Gram Analysis (DNA-GPT).
By analyzing the differences between the original and new remaining parts through N-gram analysis, we unveil significant discrepancies between the distribution of machine-generated text and human-written text.
Results show that our zero-shot approach exhibits state-of-the-art performance in distinguishing between human and GPT-generated text.
arXiv Detail & Related papers (2023-05-27T03:58:29Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.