Authorship Attribution in Bangla literature using Character-level CNN
- URL: http://arxiv.org/abs/2001.05316v1
- Date: Sat, 11 Jan 2020 14:54:04 GMT
- Title: Authorship Attribution in Bangla literature using Character-level CNN
- Authors: Aisha Khatun, Anisur Rahman, Md. Saiful Islam, Marium-E-Jannat
- Abstract summary: We investigate the effectiveness of character-level signals in Authorship Attribution of Bangla Literature.
The time and memory efficiency of the proposed model is much higher than that of word-level counterparts.
Performance is improved by up to 10% with pre-training.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Characters are the smallest unit of text that can extract stylometric signals
to determine the author of a text. In this paper, we investigate the
effectiveness of character-level signals in Authorship Attribution of Bangla
Literature and show that the results are promising but improvable. The time and
memory efficiency of the proposed model is much higher than the word level
counterparts but accuracy is 2-5% less than the best performing word-level
models. We compare various word-based models and show that the proposed model
performs increasingly better as datasets grow. We also analyze the effect of
pre-training character embeddings over the diverse Bangla character set for
authorship attribution, and find that pre-training improves performance by up
to 10%. We used two datasets ranging from 6 to 14 authors, balancing them
before training, and compare the results.
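The pipeline the abstract describes (character embeddings, 1-D convolutions over character windows, pooling, and an author classifier) can be sketched as follows. This is a minimal hand-rolled illustration, not the authors' architecture: the vocabulary, dimensions, and random untrained weights are placeholder assumptions, and a real model would use the Bangla character set with learned or pre-trained embeddings.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


class CharCNN:
    """Minimal character-level CNN text classifier (forward pass only)."""

    def __init__(self, vocab, embed_dim=8, num_filters=4, kernel_size=3,
                 num_authors=6, seed=0):
        rng = random.Random(seed)
        self.char_to_id = {c: i for i, c in enumerate(vocab)}
        # Character embedding table; this is where pre-trained character
        # vectors would be loaded instead of random initialization.
        self.embeddings = [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
                           for _ in vocab]
        # num_filters 1-D convolution kernels over windows of kernel_size chars.
        self.kernels = [[[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
                         for _ in range(kernel_size)]
                        for _ in range(num_filters)]
        # Final linear layer: pooled filter activations -> author scores.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(num_filters)]
                        for _ in range(num_authors)]
        self.kernel_size = kernel_size

    def forward(self, text):
        ids = [self.char_to_id[c] for c in text if c in self.char_to_id]
        embedded = [self.embeddings[i] for i in ids]
        pooled = []
        for kernel in self.kernels:
            # Slide the kernel over every character window, then
            # global-max-pool to a single activation per filter.
            acts = []
            for start in range(len(embedded) - self.kernel_size + 1):
                window = embedded[start:start + self.kernel_size]
                acts.append(sum(w * e
                                for krow, erow in zip(kernel, window)
                                for w, e in zip(krow, erow)))
            pooled.append(max(acts) if acts else 0.0)
        scores = [sum(w * p for w, p in zip(row, pooled))
                  for row in self.weights]
        return softmax(scores)


model = CharCNN(vocab="abcdefghijklmnopqrstuvwxyz ", num_authors=6)
probs = model.forward("character level signals carry style")
```

Because the model only stores a small character vocabulary rather than a word vocabulary, its parameter count stays tiny, which is the source of the time and memory savings the abstract reports.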
Related papers
- Text Quality-Based Pruning for Efficient Training of Language Models [66.66259229732121]
We propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets.
By proposing the text quality metric, the paper establishes a framework to identify and eliminate low-quality text instances.
Experimental results over multiple models and datasets demonstrate the efficacy of this approach.
arXiv Detail & Related papers (2024-04-26T18:01:25Z) - Authorship Attribution in Bangla Literature (AABL) via Transfer Learning
using ULMFiT [0.6919386619690135]
Authorship Attribution is the task of producing an appropriate characterization of a text in order to identify its original author.
Despite significant advancements in other languages such as English, Spanish, and Chinese, Bangla lacks comprehensive research in this field.
Existing systems do not scale as the number of authors increases, and performance drops when there are few samples per author.
arXiv Detail & Related papers (2024-03-08T18:42:59Z) - Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
Influence scores are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z) - The Languini Kitchen: Enabling Language Modelling Research at Different
Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms much larger state-of-the-art models that use 2-20x more FLOPs and bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z) - Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music
Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - Whodunit? Learning to Contrast for Authorship Attribution [22.37948005237967]
Authorship attribution is the task of identifying the author of a given text.
We propose to fine-tune pre-trained language representations using a combination of contrastive learning and supervised learning.
We show that Contra-X advances the state-of-the-art on multiple human and machine authorship attribution benchmarks.
arXiv Detail & Related papers (2022-09-23T23:45:08Z) - An Empirical Investigation of Commonsense Self-Supervision with
Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - Analyzing the Use of Character-Level Translation with Sparse and Noisy
Datasets [20.50917929755389]
We find that character-level models cut the number of untranslated words by over 40% when applied to sparse and noisy datasets.
We explore the impact of character alignment, phrase table filtering, bitext size and the choice of pivot language on translation quality.
Neither word- nor character-level BLEU correlates perfectly with human judgments, due to BLEU's sensitivity to length.
arXiv Detail & Related papers (2021-09-27T07:35:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.