Plagiarism Detection in the Bengali Language: A Text Similarity-Based
Approach
- URL: http://arxiv.org/abs/2203.13430v1
- Date: Fri, 25 Mar 2022 03:11:00 GMT
- Title: Plagiarism Detection in the Bengali Language: A Text Similarity-Based
Approach
- Authors: Satyajit Ghosh, Aniruddha Ghosh, Bittaswer Ghosh, and Abhishek Roy
- Abstract summary: Plagiarism is not limited to a single language. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India.
We collected Bengali literature books from the National Digital Library of India, extracted their text with a comprehensive methodology, and constructed our corpus.
Our experiments show an average accuracy between 72.10% and 79.89% for text extraction using OCR.
We built a web application for end users and successfully tested it for plagiarism detection in Bengali texts.
- Score: 0.866842899233181
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Plagiarism means taking another person's work without giving them
credit for it. It is one of the most serious problems in academia and among
researchers. Although multiple tools are available to detect plagiarism in a
document, most of them are domain-specific and designed to work on English
texts, yet plagiarism is not limited to a single language. Bengali is the most
widely spoken language of Bangladesh and the second most spoken language in
India, with 300 million native speakers and 37 million second-language
speakers. Plagiarism detection requires a large corpus for comparison. Bengali
literature has a history of 1300 years, and most Bengali literature books have
not yet been properly digitized. As no such corpus existed for our purpose, we
collected Bengali literature books from the National Digital Library of India,
extracted their text with a comprehensive methodology, and constructed our
corpus. Our experiments show an average accuracy between 72.10% and 79.89% for
text extraction using OCR. The Levenshtein distance algorithm is used to
determine plagiarism. We built a web application for end users and
successfully tested it for plagiarism detection in Bengali texts. In the
future, we aim to construct a corpus with more books for more accurate
detection.
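The abstract names the Levenshtein distance as the similarity measure but gives no implementation details. The following is a minimal sketch of how an edit-distance similarity score could be computed for two Bengali strings; it is not the authors' code. The function names `levenshtein` and `similarity`, the NFC normalization step, and the normalization of the score by the longer string's length are all assumptions made for illustration.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance over Unicode code points (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]: 1.0 means identical after NFC normalization."""
    # Assumption: normalize so that equivalent Bengali sequences
    # (precomposed vs. combining forms) compare as equal.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(similarity("আমি বাংলায় গান গাই", "আমি বাংলায় কথা বলি"))
```

Note that this sketch compares code points, not user-perceived characters, so a Bengali conjunct or vowel sign counts as more than one edit unit; a production system might instead compare grapheme clusters or words.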
Related papers
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Vacaspati: A Diverse Corpus of Bangla Literature [4.555256739812733]
We build Vacaspati, a diverse corpus of Bangla literature.
It contains more than 11 million sentences and 115 million words.
We also built a word embedding model, Vac-FT, using FastText on Vacaspati, and trained an Electra model, Vac-BERT, on the corpus.
arXiv Detail & Related papers (2023-07-11T07:32:12Z)
- Tortured phrases: A dubious writing style emerging in science. Evidence
of critical issues affecting established journals [69.76097138157816]
Probabilistic text generators have been used to produce fake scientific papers for more than a decade.
Complex AI-powered generation techniques produce texts indistinguishable from those of humans.
Some websites offer to rewrite texts for free, generating gobbledegook full of tortured phrases.
arXiv Detail & Related papers (2021-07-12T20:47:08Z)
- Simple or Complex? Learning to Predict Readability of Bengali Texts [6.860272388539321]
We present a readability analysis tool capable of analyzing text written in the Bengali language.
Despite being the 7th most spoken language in the world with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing.
arXiv Detail & Related papers (2020-12-09T01:41:35Z)
- Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New
Datasets for Bengali-English Machine Translation [6.2418269277908065]
Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in machine translation literature due to being low in resources.
We build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation on low-resource setups.
With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising 2.75 million sentence pairs.
arXiv Detail & Related papers (2020-09-20T06:06:27Z)
- Writer Identification Using Microblogging Texts for Social Media
Forensics [53.180678723280145]
We evaluate popular stylometric features, widely used in literary analysis, and specific Twitter features like URLs, hashtags, replies or quotes.
We test varying sized author sets and varying amounts of training/test texts per author.
arXiv Detail & Related papers (2020-07-31T00:23:18Z)
- A Multilingual Parallel Corpora Collection Effort for Indian Languages [43.62422999765863]
We present sentence aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English.
The corpora are compiled from online sources which have content shared across languages.
arXiv Detail & Related papers (2020-07-15T14:00:18Z)
- Automatic Extraction of Bengali Root Verbs using Paninian Grammar [0.0]
The proposed system has been developed based on tense, person and morphological inflections of the verbs to find their root forms.
The output reaches 98% accuracy, as verified by a linguistic expert.
arXiv Detail & Related papers (2020-03-31T20:22:10Z)
- Forensic Authorship Analysis of Microblogging Texts Using N-Grams and
Stylometric Features [63.48764893706088]
This work aims at identifying authors of tweet messages, which are limited to 280 characters.
We use for our experiments a self-captured database of 40 users, with 120 to 200 tweets per user.
Results using this small set are promising, with the different features providing a classification accuracy between 92% and 98.5%.
arXiv Detail & Related papers (2020-03-24T19:32:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.