Bengali Fake Reviews: A Benchmark Dataset and Detection System
- URL: http://arxiv.org/abs/2308.01987v3
- Date: Sat, 4 May 2024 08:51:21 GMT
- Title: Bengali Fake Reviews: A Benchmark Dataset and Detection System
- Authors: G. M. Shahariar, Md. Tanvir Rouf Shawon, Faisal Muhammad Shah, Mohammad Shafiul Alam, Md. Shahriar Mahbub,
- Abstract summary: This paper introduces the Bengali Fake Review Detection (BFRD) dataset, the first publicly available dataset for identifying fake reviews in Bengali.
The dataset consists of 7710 non-fake and 1339 fake food-related reviews collected from social media posts.
To convert non-Bengali words in a review, a unique pipeline has been proposed that translates English words to their corresponding Bengali meaning and also back transliterates Romanized Bengali to Bengali.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The proliferation of fake reviews on various online platforms has created a major concern for both consumers and businesses. Such reviews can deceive customers and cause damage to the reputation of products or services, making it crucial to identify them. Although the detection of fake reviews has been extensively studied in English language, detecting fake reviews in non-English languages such as Bengali is still a relatively unexplored research area. This paper introduces the Bengali Fake Review Detection (BFRD) dataset, the first publicly available dataset for identifying fake reviews in Bengali. The dataset consists of 7710 non-fake and 1339 fake food-related reviews collected from social media posts. To convert non-Bengali words in a review, a unique pipeline has been proposed that translates English words to their corresponding Bengali meaning and also back transliterates Romanized Bengali to Bengali. We have conducted rigorous experimentation using multiple deep learning and pre-trained transformer language models to develop a reliable detection system. Finally, we propose a weighted ensemble model that combines four pre-trained transformers: BanglaBERT, BanglaBERT Base, BanglaBERT Large, and BanglaBERT Generator . According to the experiment results, the proposed ensemble model obtained a weighted F1-score of 0.9843 on 13390 reviews, including 1339 actual fake reviews and 5356 augmented fake reviews generated with the nlpaug library. The remaining 6695 reviews were randomly selected from the 7710 non-fake instances. The model achieved a 0.9558 weighted F1-score when the fake reviews were augmented using the bnaug library.
Related papers
- BanglishRev: A Large-Scale Bangla-English and Code-mixed Dataset of Product Reviews in E-Commerce [2.5874041837241304]
This work presents the largest e-commerce product review dataset to date for reviews written in Bengali, English, a mixture of both and Banglish, Bengali words written with English alphabets.
The dataset comprises of 1.74 million written reviews from 3.2 million ratings information collected from a total of 128k products being sold in online e-commerce platforms targeting the Bengali population.
It includes an extensive array of related metadata for each of the reviews including the rating given by the reviewer, date the review was posted and date of purchase, number of likes, dislikes, response from the seller, images associated with the review etc.
arXiv Detail & Related papers (2024-12-17T18:39:10Z) - Generating Enhanced Negatives for Training Language-Based Object Detectors [86.1914216335631]
We propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data.
Specifically, we use large-language-models to generate negative text descriptions, and text-to-image diffusion models to also generate corresponding negative images.
Our experimental analysis confirms the relevance of the generated negative data, and its use in language-based detectors improves performance on two complex benchmarks.
arXiv Detail & Related papers (2023-12-29T23:04:00Z) - Tackling Fake News in Bengali: Unraveling the Impact of Summarization vs. Augmentation on Pre-trained Language Models [0.0]
We propose a methodology consisting of four distinct approaches to classify fake news articles in Bengali.
Our approach includes translating English news articles and using augmentation techniques to curb the deficit of fake news articles.
We show the effectiveness of summarization and augmentation in the case of Bengali fake news detection.
arXiv Detail & Related papers (2023-07-13T14:50:55Z) - Bengali Fake Review Detection using Semi-supervised Generative
Adversarial Networks [0.0]
This paper investigates the potential of semi-supervised Generative Adversarial Networks (GANs) to fine-tune pretrained language models.
We have demonstrated that the proposed semi-supervised GAN-LM architecture is a viable solution in classifying Bengali fake reviews.
arXiv Detail & Related papers (2023-04-05T20:40:09Z) - Combat AI With AI: Counteract Machine-Generated Fake Restaurant Reviews
on Social Media [77.34726150561087]
We propose to leverage the high-quality elite Yelp reviews to generate fake reviews from the OpenAI GPT review creator.
We apply the model to predict non-elite reviews and identify the patterns across several dimensions.
We show that social media platforms are continuously challenged by machine-generated fake reviews.
arXiv Detail & Related papers (2023-02-10T19:40:10Z) - Online Fake Review Detection Using Supervised Machine Learning And BERT
Model [0.0]
We propose to use BERT (Bidirectional Representation from Transformers) model to extract word embeddings from texts (i.e. reviews)
The results indicate that the SVM classifiers outperform the others in terms of accuracy and f1-score with an accuracy of 87.81%.
arXiv Detail & Related papers (2023-01-09T09:40:56Z) - Training Language Models with Natural Language Feedback [51.36137482891037]
We learn from language feedback on model outputs using a three-step learning algorithm.
In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements.
Using only 100 samples of human-written feedback, our learning algorithm finetunes a GPT-3 model to roughly human-level summarization.
arXiv Detail & Related papers (2022-04-29T15:06:58Z) - Fake or Genuine? Contextualised Text Representation for Fake Review
Detection [0.4724825031148411]
This paper proposes a new ensemble model that employs transformer architecture to discover the hidden patterns in a sequence of fake reviews and detect them precisely.
The experimental results using semi-real benchmark datasets showed the superiority of the proposed model over state-of-the-art models.
arXiv Detail & Related papers (2021-12-29T00:54:47Z) - Factorization of Fact-Checks for Low Resource Indian Languages [44.94080515860928]
We introduce FactDRIL: the first large scale multilingual Fact-checking dataset for Regional Indian languages.
Our dataset consists of 9,058 samples belonging to English, 5,155 samples to Hindi and remaining 8,222 samples are distributed across various regional languages.
We expect this dataset will be a valuable resource and serve as a starting point to fight proliferation of fake news in low resource languages.
arXiv Detail & Related papers (2021-02-23T16:47:41Z) - Bangla Text Dataset and Exploratory Analysis for Online Harassment
Detection [0.0]
The data that has been made accessible in this article has been gathered and marked from the comments of people in public posts by celebrities, government officials, athletes on Facebook.
The dataset is compiled with the aim of developing the ability of machines to differentiate whether a comment is a bully expression or not.
arXiv Detail & Related papers (2021-02-04T08:35:18Z) - The Multilingual Amazon Reviews Corpus [46.84980931183582]
We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification.
MARC contains reviews in English, Japanese, German, French, Spanish, and Chinese, which were collected between 2015 and 2019.
The corpus is balanced across the 5 possible star ratings, so each rating constitutes 20% of the reviews in each language.
arXiv Detail & Related papers (2020-10-06T09:34:01Z) - Unsupervised Opinion Summarization with Noising and Denoising [85.49169453434554]
We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof.
At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise.
arXiv Detail & Related papers (2020-04-21T16:54:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.