Integrating Bidirectional Long Short-Term Memory with Subword Embedding
for Authorship Attribution
- URL: http://arxiv.org/abs/2306.14933v1
- Date: Mon, 26 Jun 2023 11:35:47 GMT
- Title: Integrating Bidirectional Long Short-Term Memory with Subword Embedding
for Authorship Attribution
- Authors: Abiodun Modupe, Turgay Celik, Vukosi Marivate and Oludayo O. Olugbara
- Abstract summary: Various word-based stylistic markers have been used successfully in deep learning methods to deal with the intrinsic problem of authorship attribution.
The proposed method was experimentally evaluated against numerous state-of-the-art methods across the public corpora of CCAT50, IMDb62, Blog50, and Twitter50.
- Score: 2.3429306644730854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The problem of unveiling the author of a given text document from multiple
candidate authors is called authorship attribution. Various word-based
stylistic markers have been used successfully in deep learning methods to deal
with the intrinsic problem of authorship attribution. Unfortunately, the
performance of word-based authorship attribution systems is limited by the
vocabulary of the training corpus. Literature has recommended character-based
stylistic markers as an alternative to overcome the hidden word problem.
However, character-based methods often fail to capture the sequential
relationship of words in a text, which limits further improvement. The
question addressed in this paper is whether it is possible to address the
ambiguity of hidden words in text documents while preserving the sequential
context of words. Consequently, a method based on bidirectional long short-term
memory (BLSTM) with a 2-dimensional convolutional neural network (CNN) is
proposed to capture sequential writing styles for authorship attribution. The
BLSTM was used to obtain the sequential relationship among characters using
subword information. The 2-dimensional CNN was applied to capture the local
syntactic position of style markers from unlabeled input text. The proposed
method was experimentally evaluated against numerous state-of-the-art methods
across the public corpora of CCAT50, IMDb62, Blog50, and Twitter50.
Experimental results indicate accuracy improvements of 1.07% and 0.96% on
CCAT50 and Twitter50, respectively, with comparable results on the
remaining datasets.
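As a rough sketch of the approach described above, the following PyTorch snippet embeds subword token IDs, runs a BLSTM to capture sequential context, and scans the resulting hidden-state map with a 2-dimensional CNN before classifying the author. All names, hyperparameters, and the exact fusion of the two components are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of a BLSTM + 2-D CNN attributor over subword embeddings.
# Hyperparameters and the branch wiring are assumptions, not the paper's.
import torch
import torch.nn as nn

class BLSTMCNNAttributor(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=128, n_authors=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # BLSTM captures the sequential relationship among subword units.
        self.blstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                             bidirectional=True)
        # 2-D convolution over the (time x feature) map of hidden states
        # picks up local stylistic patterns.
        self.conv = nn.Conv2d(1, 64, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Linear(64, n_authors)

    def forward(self, subword_ids):          # (batch, seq_len)
        x = self.embed(subword_ids)          # (batch, seq, emb)
        x, _ = self.blstm(x)                 # (batch, seq, 2*hidden)
        x = x.unsqueeze(1)                   # single-channel 2-D map
        x = torch.relu(self.conv(x))         # (batch, 64, seq, 2*hidden)
        x = self.pool(x).flatten(1)          # (batch, 64)
        return self.fc(x)                    # author logits

model = BLSTMCNNAttributor(vocab_size=32000)
logits = model(torch.randint(1, 32000, (4, 256)))  # 4 toy documents
print(logits.shape)                                # torch.Size([4, 50])
```

Treating the BLSTM output as a single-channel 2-D feature map is what lets a Conv2d kernel pick up local patterns across both time steps and feature dimensions.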
Related papers
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method.
We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words.
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
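For intuition, word self-similarity can be computed as the mean pairwise cosine similarity of a word's contextual embeddings across its occurrences. The sketch below uses random stand-ins for encoder outputs; in practice the vectors would come from a contextual encoder such as BERT, and the direction of the effect is only a toy illustration.

```python
# Self-similarity of one word: mean pairwise cosine similarity of its
# contextual embeddings across occurrences. Random vectors stand in for
# real encoder outputs.
import numpy as np

def self_similarity(vectors):
    """vectors: (n_occurrences, dim) contextual embeddings of one word."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = v @ v.T                        # pairwise cosine similarities
    n = len(v)
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))  # off-diagonal mean

rng = np.random.default_rng(0)
stable = rng.normal(0.0, 0.05, (10, 768)) + rng.normal(size=768)  # topical word
drifting = rng.normal(size=(10, 768))                             # function-like
print(self_similarity(stable) > self_similarity(drifting))        # True
```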
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
- An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks [2.3624125155742064]
We propose a new word embedding based corpus consisting of more than 61 million words crawled from multiple web resources.
We design a preprocessing pipeline for the filtration of unwanted text from crawled data.
The cleaned vocabulary is fed to state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms.
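The continuous-bag-of-words and skip-gram steps can be reproduced with gensim, as in the minimal sketch below; the toy corpus and hyperparameters are placeholders, and GloVe requires its own separate toolkit.

```python
# Training CBOW (sg=0) and skip-gram (sg=1) embeddings with gensim on a
# toy stand-in for the cleaned, tokenized corpus.
from gensim.models import Word2Vec

corpus = [["this", "is", "a", "cleaned", "sentence"],
          ["another", "tokenized", "sentence"]]

cbow = Word2Vec(corpus, vector_size=300, window=5, sg=0, min_count=1)
skipgram = Word2Vec(corpus, vector_size=300, window=5, sg=1, min_count=1)
print(cbow.wv["sentence"].shape)  # (300,)
```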
arXiv Detail & Related papers (2024-08-28T11:36:29Z)
- TransLIST: A Transformer-Based Linguistically Informed Sanskrit Tokenizer [11.608920658638976]
Sanskrit Word Segmentation (SWS) is essential in making digitized texts available and in deploying downstream tasks.
We propose the Transformer-based Linguistically Informed Sanskrit Tokenizer (TransLIST).
TransLIST encodes the character input along with latent-word information, which takes into account the sandhi phenomenon specific to SWS.
arXiv Detail & Related papers (2022-10-21T06:15:40Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the resulting neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
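A minimal sketch of the mask-and-predict step: mask a word that both texts share, read the masked-LM distribution at that slot in each text, and compare the two distributions. The choice of BERT and of KL divergence here are assumptions for illustration, not the paper's exact pipeline.

```python
# Mask-and-predict sketch: compare masked-LM distributions at the slot of a
# word the two texts share. BERT and KL divergence are illustrative choices.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def masked_log_dist(text_with_mask):
    inputs = tok(text_with_mask, return_tensors="pt")
    pos = (inputs.input_ids[0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, pos]
    return torch.log_softmax(logits, dim=-1)

# The masked word lies on the longest common sequence of the two texts.
p = masked_log_dist("I watched the [MASK] last night.")
q = masked_log_dist("I really hated the [MASK] last night.")
kl = torch.sum(p.exp() * (p - q))  # KL(p || q) over the vocabulary
print(float(kl))
```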
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success and semantic preservation rates while changing the fewest words, compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- LexSubCon: Integrating Knowledge from Lexical Resources into Contextual Embeddings for Lexical Substitution [76.615287796753]
We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models.
This is achieved by combining contextual information with knowledge from structured lexical resources.
Our experiments show that LexSubCon outperforms previous state-of-the-art methods on LS07 and CoInCo benchmark datasets.
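The general recipe of mixing a structured lexical resource with a contextual model can be sketched as follows. This illustrates the idea rather than LexSubCon's actual scoring, and it assumes the NLTK WordNet data has been downloaded.

```python
# Illustration only (not LexSubCon itself): propose substitutes from WordNet,
# then re-rank them with a masked-LM fill probability.
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def substitutes(sentence, target):
    candidates = {l.name().replace("_", " ")
                  for s in wordnet.synsets(target) for l in s.lemmas()}
    single = [c for c in candidates if " " not in c and c != target]
    masked = sentence.replace(target, fill.tokenizer.mask_token, 1)
    scored = fill(masked, targets=single)   # contextual re-ranking
    return [(r["token_str"], round(r["score"], 4)) for r in scored]

print(substitutes("The movie was terrific.", "terrific"))
```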
arXiv Detail & Related papers (2021-07-11T21:25:56Z)
- Disentangling Homophemes in Lip Reading using Perplexity Analysis [10.262299768603894]
This paper proposes a new application for the Generative Pre-trained Transformer (GPT).
It serves as a language model to convert visual speech, in the form of visemes, into language in the form of words and sentences.
The network uses the search for optimal perplexity to perform the viseme-to-word mapping.
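The perplexity search can be illustrated by scoring candidate word expansions of an ambiguous viseme sequence with a causal language model and keeping the least perplexing one; GPT-2 and the hand-picked candidates below are stand-ins for the paper's setup.

```python
# Perplexity search sketch: score homopheme candidates (p, b, and m share a
# viseme) with GPT-2 and keep the most plausible sentence.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

def perplexity(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

candidates = ["the bat flew out at night",   # same lip shapes,
              "the pat flew out at night",   # different words
              "the mat flew out at night"]
print(min(candidates, key=perplexity))       # lowest-perplexity mapping
```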
arXiv Detail & Related papers (2020-11-28T12:12:17Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel hyperplane-based approach for the automatic extraction of domain-specific stop words.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information-based selection.
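One way to read "hyperplane-based" is to rank vocabulary terms by how little they contribute to a linear classifier's decision hyperplane and to treat the least discriminative terms as domain stop words. The sketch below does this with a linear SVM over TF-IDF features; it is an interpretation of the idea, not the paper's exact procedure.

```python
# Hyperplane-based stop-word sketch: terms contributing least to a linear
# decision hyperplane over TF-IDF features are candidate domain stop words.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["patient shows acute symptoms", "the patient was discharged",
        "market shows acute volatility", "the market was closed"]
labels = [0, 0, 1, 1]  # medical vs. finance

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

order = np.argsort(np.abs(clf.coef_[0]))     # least discriminative first
vocab = vec.get_feature_names_out()
print(vocab[order[:3]])                      # candidate stop words
```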
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- MuSeM: Detecting Incongruent News Headlines using Mutual Attentive Semantic Matching [7.608480381965392]
Measuring the congruence between two texts has several useful applications, such as detecting deceptive and misleading news headlines on the web.
This paper proposes a method that uses inter-mutual attention-based semantic matching between the original and synthetically generated headlines.
We observe that the proposed method significantly outperforms prior art on two publicly available datasets.
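Mutual attentive matching can be sketched as two cross-attention passes, one in each direction, whose pooled outputs feed a congruence score. Dimensions and the scoring head below are illustrative assumptions, not the paper's exact model.

```python
# Mutual attention sketch: each headline attends over the other; pooled
# views feed a congruence score. Sizes are illustrative.
import torch
import torch.nn as nn

class MutualAttentionMatcher(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, a, b):           # encoded original / generated headline
        a_ctx, _ = self.a2b(a, b, b)   # a attends to b
        b_ctx, _ = self.b2a(b, a, a)   # b attends to a
        pooled = torch.cat([a_ctx.mean(1), b_ctx.mean(1)], dim=-1)
        return torch.sigmoid(self.score(pooled))  # congruence in [0, 1]

m = MutualAttentionMatcher()
print(m(torch.randn(2, 12, 256), torch.randn(2, 20, 256)).shape)  # (2, 1)
```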
arXiv Detail & Related papers (2020-10-07T19:19:42Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)