Authorship Attribution in Bangla Literature (AABL) via Transfer Learning
using ULMFiT
- URL: http://arxiv.org/abs/2403.05519v1
- Date: Fri, 8 Mar 2024 18:42:59 GMT
- Title: Authorship Attribution in Bangla Literature (AABL) via Transfer Learning
using ULMFiT
- Authors: Aisha Khatun, Anisur Rahman, Md Saiful Islam, Hemayet Ahmed Chowdhury,
Ayesha Tasnim
- Abstract summary: Authorship Attribution is the task of creating an appropriate characterization of text to identify the original author of a given piece of text.
Despite significant advancements in other languages such as English, Spanish, and Chinese, Bangla lacks comprehensive research in this field.
Existing systems are not scalable when the number of authors increases, and performance drops when only a few samples are available per author.
- Score: 0.6919386619690135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Authorship Attribution is the task of creating an appropriate
characterization of text that captures the authors' writing style to identify
the original author of a given piece of text. With increased anonymity on the
internet, this task has become increasingly crucial in various security and
plagiarism detection fields. Despite significant advancements in other
languages such as English, Spanish, and Chinese, Bangla lacks comprehensive
research in this field due to its complex linguistic features and sentence
structure. Moreover, existing systems are not scalable when the number of
authors increases, and performance drops when only a few samples are available
per author. In this paper, we propose the use of the Average-Stochastic
Gradient Descent Weight-Dropped Long Short-Term Memory (AWD-LSTM) architecture
and an effective transfer learning approach that addresses the problems of
complex linguistic feature extraction and scalability for authorship
attribution in Bangla Literature (AABL). We analyze the effect of different
tokenization schemes, such as word-, sub-word-, and character-level
tokenization, and demonstrate their effectiveness in the proposed model. Moreover, we
introduce the publicly available Bangla Authorship Attribution Dataset of 16
authors (BAAD16) containing 17,966 sample texts and 13.4+ million words to
solve the standard dataset scarcity problem and release six variations of
pre-trained language models for use in any Bangla NLP downstream task. For
evaluation, we used our developed BAAD16 dataset as well as other publicly
available datasets. Empirically, our proposed model outperformed
state-of-the-art models and achieved 99.8% accuracy on the BAAD16 dataset.
Furthermore, we showed that the proposed system scales much better as the
number of authors increases, and performance remains steady even with few
training samples.
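The ULMFiT-style pipeline described in the abstract (language model training on a Bangla corpus, followed by classifier fine-tuning on top of an AWD-LSTM encoder) can be sketched with the fastai library. This is a minimal illustrative sketch, not the authors' released code: the CSV path, column names, learning rates, and epoch counts are assumptions, and since fastai ships no pretrained Bangla weights, the language model below starts from scratch rather than from the paper's pre-trained Bangla models.

```python
# Minimal ULMFiT-style sketch with fastai; file names, column names, and
# hyperparameters are assumptions, not the authors' exact settings.
import pandas as pd
from fastai.text.all import *

df = pd.read_csv('baad16.csv')  # hypothetical CSV with 'text' and 'author' columns

# Stage 1: train an AWD-LSTM language model on the Bangla corpus.
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, pretrained=False, drop_mult=0.3)
learn_lm.fit_one_cycle(10, 2e-3)
learn_lm.save_encoder('bangla_lm_encoder')

# Stage 2: fine-tune an authorship classifier on top of the saved encoder,
# using ULMFiT's gradual unfreezing and discriminative learning rates.
dls_clf = TextDataLoaders.from_df(df, text_col='text', label_col='author',
                                  text_vocab=dls_lm.vocab, valid_pct=0.1)
learn_clf = text_classifier_learner(dls_clf, AWD_LSTM, pretrained=False,
                                    drop_mult=0.5, metrics=accuracy)
learn_clf.load_encoder('bangla_lm_encoder')
learn_clf.freeze()                                         # train the classifier head first
learn_clf.fit_one_cycle(1, 2e-2)
learn_clf.freeze_to(-2)                                    # unfreeze the last encoder layer group
learn_clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn_clf.unfreeze()                                       # unfreeze all layers
learn_clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```

The gradual unfreezing and the 2.6 factor for discriminative learning rates follow the standard ULMFiT recipe; the specific schedule is illustrative only.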
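The word-, sub-word-, and character-level tokenizations compared in the paper can be illustrated as below. The sample sentence is arbitrary, and the corpus path and SentencePiece vocabulary size are placeholders, not the paper's settings.

```python
# Word-, sub-word-, and character-level tokenization of a sample Bangla sentence.
import sentencepiece as spm

sample = 'আমি প্রতিদিন বাংলা বই পড়ি'

word_tokens = sample.split()                   # word-level: split on whitespace
char_tokens = list(sample.replace(' ', '_'))   # character-level: every character is a token

# Sub-word level: learn a unigram SentencePiece vocabulary from a raw corpus file.
spm.SentencePieceTrainer.train(input='bangla_corpus.txt', model_prefix='bangla_sp',
                               vocab_size=8000, model_type='unigram',
                               character_coverage=0.9995)
sp = spm.SentencePieceProcessor(model_file='bangla_sp.model')
subword_tokens = sp.encode(sample, out_type=str)

print(word_tokens, char_tokens, subword_tokens, sep='\n')
```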
Related papers
- A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document.
Large Language Models (LLMs), with their deep reasoning capabilities and ability to maintain long-range textual associations, offer a promising alternative.
Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors.
arXiv Detail & Related papers (2024-10-29T04:14:23Z) - Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z) - LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named
Entity Recognition [67.96794382040547]
LLM-DA is a novel data augmentation technique based on large language models (LLMs) for the few-shot NER task.
Our approach involves employing 14 contextual rewriting strategies, designing entity replacements of the same type, and incorporating noise injection to enhance robustness.
arXiv Detail & Related papers (2024-02-22T14:19:56Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontiguous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - A Unified Neural Network Model for Readability Assessment with Feature
Projection and Length-Balanced Loss [17.213602354715956]
We propose a BERT-based model with feature projection and length-balanced loss for readability assessment.
Our model achieves state-of-the-art performances on two English benchmark datasets and one dataset of Chinese textbooks.
arXiv Detail & Related papers (2022-10-19T05:33:27Z) - Transferring BERT-like Transformers' Knowledge for Authorship
Verification [8.443350618722562]
We study the effectiveness of several BERT-like transformers for the task of authorship verification.
We provide new splits for PAN-2020, where training and test data are sampled from disjoint topics or authors.
We show that those splits can enhance the models' capability to transfer knowledge over a new, significantly different dataset.
arXiv Detail & Related papers (2021-12-09T18:57:29Z) - Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z) - Offline Handwritten Chinese Text Recognition with Convolutional Neural
Networks [5.984124397831814]
In this paper, we build the models using only convolutional neural networks and use CTC as the loss function.
We achieve 6.81% character error rate (CER) on the ICDAR 2013 competition set, which is the best published result without language model correction.
arXiv Detail & Related papers (2020-06-28T14:34:38Z) - Authorship Attribution in Bangla literature using Character-level CNN [0.5243460995467893]
We investigate the effectiveness of character-level signals in Authorship Attribution of Bangla Literature.
The time and memory efficiency of the proposed model is much higher than that of its word-level counterparts.
Performance is improved by up to 10% with pre-training.
arXiv Detail & Related papers (2020-01-11T14:54:04Z)