Leveraging GPT-2 for Classifying Spam Reviews with Limited Labeled Data
via Adversarial Training
- URL: http://arxiv.org/abs/2012.13400v1
- Date: Thu, 24 Dec 2020 18:59:51 GMT
- Title: Leveraging GPT-2 for Classifying Spam Reviews with Limited Labeled Data
via Adversarial Training
- Authors: Athirai A. Irissappane, Hanfei Yu, Yankun Shen, Anubha Agrawal, Gray
Stanton
- Abstract summary: We propose an adversarial training mechanism for classifying opinion spam with limited labeled data and a large set of unlabeled data.
Experiments on TripAdvisor and YelpZip datasets show that the proposed model outperforms state-of-the-art techniques by at least 7% in terms of accuracy when labeled data is limited.
- Score: 1.8899300124593648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online reviews are a vital source of information when purchasing a service or
a product. Opinion spammers manipulate these reviews, deliberately altering the
overall perception of the service. Though there exists a corpus of online
reviews, only a few have been labeled as spam or non-spam, making it difficult
to train spam detection models. We propose an adversarial training mechanism
leveraging the capabilities of Generative Pre-Training 2 (GPT-2) for
classifying opinion spam with limited labeled data and a large set of unlabeled
data. Experiments on TripAdvisor and YelpZip datasets show that the proposed
model outperforms state-of-the-art techniques by at least 7% in terms of
accuracy when labeled data is limited. The proposed model can also generate
synthetic spam/non-spam reviews with reasonable perplexity, thereby providing
additional labeled data during training.
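The abstract judges generated reviews by their perplexity. As a rough illustration only, and not the authors' evaluation code, the sketch below computes perplexity as the exponential of the average negative log-likelihood per token. The function name `perplexity` and the per-token probabilities are hypothetical; in practice the probabilities would come from a language model such as GPT-2.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one sequence, given per-token natural-log probabilities.

    Defined here as exp(-(1/N) * sum(log p_i)); lower means the model
    found the text more predictable.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities, invented for illustration.
fluent = [math.log(p) for p in (0.5, 0.25, 0.5, 0.25)]
garbled = [math.log(p) for p in (0.05, 0.01, 0.02, 0.05)]

print(perplexity(fluent))   # lower perplexity: more predictable text
print(perplexity(garbled))  # higher perplexity: less predictable text
```

A sequence in which every token has probability 1/k has perplexity exactly k, which makes the number readable as an effective branching factor per token.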
Related papers
- Online detection and infographic explanation of spam reviews with data drift adaptation [4.278181795494584]
This paper proposes an online solution for identifying and explaining spam reviews, incorporating data drift adaptation.
It integrates (i) incremental profiling, (ii) data drift detection & adaptation, and (iii) identification of spam reviews employing Machine Learning.
The best results reached a spam F-measure of up to 87%.
arXiv Detail & Related papers (2024-06-21T10:35:46Z) - Metadata Integration for Spam Reviews Detection on Vietnamese E-commerce Websites [0.0]
We introduce the ViSpamReviews v2 dataset, which includes metadata of reviews.
We propose a novel approach to simultaneously integrate both textual and categorical attributes into the classification model.
arXiv Detail & Related papers (2024-05-22T02:19:13Z) - Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and
Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training.
We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data.
Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z) - Stop Uploading Test Data in Plain Text: Practical Strategies for
Mitigating Data Contamination by Evaluation Benchmarks [70.39633252935445]
Data contamination has become prevalent and challenging with the rise of models pretrained on large automatically-crawled corpora.
For closed models, the training data becomes a trade secret, and even for open models, it is not trivial to detect contamination.
We propose three strategies that can make a difference: (1) Test data made public should be encrypted with a public key and licensed to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect your test data by refusing to evaluate without them; and (3) avoid data which appears with its solution on the internet, and release the web-page context of internet-derived
arXiv Detail & Related papers (2023-05-17T12:23:38Z) - SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised
Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation.
We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training.
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
arXiv Detail & Related papers (2023-01-26T03:53:25Z) - Spam Review Detection Using Deep Learning [0.0]
Many online sites let users post reviews, which creates opportunities for fake paid reviews or untruthful reviews.
These concocted reviews can mislead the general public and leave readers unsure whether to believe a review.
Prominent machine learning techniques have been introduced to solve the problem of spam review detection.
arXiv Detail & Related papers (2022-11-03T09:41:48Z) - Adversarial Training with Complementary Labels: On the Benefit of
Gradually Informative Attacks [119.38992029332883]
Adversarial training with imperfect supervision is significant but receives limited attention.
We propose a new learning strategy using gradually informative attacks.
Experiments are conducted to demonstrate the effectiveness of our method on a range of benchmarked datasets.
arXiv Detail & Related papers (2022-11-01T04:26:45Z) - Opinion Spam Detection: A New Approach Using Machine Learning and
Network-Based Algorithms [2.062593640149623]
Online reviews play a crucial role in helping consumers evaluate and compare products and services.
Fake reviews (opinion spam) are becoming more prevalent and negatively impacting customers and service providers.
We propose a new method for classifying reviewers as spammers or benign, combining machine learning with a message-passing algorithm.
arXiv Detail & Related papers (2022-05-26T15:27:46Z) - Disentangling Sampling and Labeling Bias for Learning in Large-Output
Spaces [64.23172847182109]
We show that different negative sampling schemes implicitly trade off performance on dominant versus rare labels.
We provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance.
arXiv Detail & Related papers (2021-05-12T15:40:13Z) - A Robust Opinion Spam Detection Method Against Malicious Attackers in
Social Media [0.0]
A smart spammer can deceive the system in a way that lets them keep generating spam without fear of being detected and blocked.
A robust graph-based spam detection method is proposed.
arXiv Detail & Related papers (2020-08-19T19:54:44Z) - Semi-Automatic Data Annotation guided by Feature Space Projection [117.9296191012968]
We present a semi-automatic data annotation approach based on suitable feature space projection and semi-supervised label estimation.
We validate our method on the popular MNIST dataset and on images of human intestinal parasites with and without fecal impurities.
Our results demonstrate the added value of visual analytics tools that combine complementary abilities of humans and machines for more effective machine learning.
arXiv Detail & Related papers (2020-07-27T17:03:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.