Stylomech: Unveiling Authorship via Computational Stylometry in English and Romanized Sinhala
- URL: http://arxiv.org/abs/2501.09561v1
- Date: Thu, 16 Jan 2025 14:26:48 GMT
- Title: Stylomech: Unveiling Authorship via Computational Stylometry in English and Romanized Sinhala
- Authors: Nabeelah Faumi, Adeepa Gunathilake, Benura Wickramanayake, Deelaka Dias, TGDK Sumanathilaka,
- Abstract summary: Author's attribution in both English and Romanized Sinhala became a major requirement in the last few decades.
This research contributes significantly to the field of computational linguistics.
By expanding the scope of authorship attribution to encompass diverse linguistic contexts, the work contributes to fostering trust and accountability in digital communication.
- Score: 0.0
- License:
- Abstract: With the advent of Web 2.0, the development in social technology coupled with global communication systematically brought positive and negative impacts to society. Copyright claims and Author identification are deemed crucial as there has been a considerable amount of increase in content violation owing to the lack of proper ethics in society. The Author's attribution in both English and Romanized Sinhala became a major requirement in the last few decades. As an area largely unexplored, particularly within the context of Romanized Sinhala, the research contributes significantly to the field of computational linguistics. The proposed author attribution system offers a unique approach, allowing for the comparison of only two sets of text: suspect author and anonymous text, a departure from traditional methodologies which often rely on larger corpora. This work focuses on using the numerical representation of various pairs of the same and different authors allowing for, the model to train on these representations as opposed to text, this allows for it to apply to a multitude of authors and contexts, given that the suspected author text, and the anonymous text are of reasonable quality. By expanding the scope of authorship attribution to encompass diverse linguistic contexts, the work contributes to fostering trust and accountability in digital communication, especially in Sri Lanka. This research presents a pioneering approach to author attribution in both English and Romanized Sinhala, addressing a critical need for content verification and intellectual property rights enforcement in the digital age.
Related papers
- Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges [16.35265384114857]
The rapid advancements of Large Language Models (LLMs) have blurred the lines between human and machine authorship.
This literature review serves a roadmap for researchers and practitioners interested in understanding the state of the art in this rapidly evolving field.
arXiv Detail & Related papers (2024-08-16T17:58:49Z) - Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts [49.97673761305336]
We evaluate three large language models (LLMs) for their alignment with human narrative styles and potential gender biases.
Our findings indicate that, while these models generally produce text closely resembling human authored content, variations in stylistic features suggest significant gender biases.
arXiv Detail & Related papers (2024-06-27T19:26:11Z) - Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z) - JAMDEC: Unsupervised Authorship Obfuscation using Constrained Decoding
over Small Language Models [53.83273575102087]
We propose an unsupervised inference-time approach to authorship obfuscation.
We introduce JAMDEC, a user-controlled, inference-time algorithm for authorship obfuscation.
Our approach builds on small language models such as GPT2-XL in order to help avoid disclosing the original content to proprietary LLM's APIs.
arXiv Detail & Related papers (2024-02-13T19:54:29Z) - Detection of Machine-Generated Text: Literature Survey [0.0]
This literature survey aims to compile and synthesize accomplishments and developments in the field of machine-generated text.
It also gives an overview of machine-generated text trends and explores the larger societal implications.
arXiv Detail & Related papers (2024-01-02T01:44:15Z) - SenteCon: Leveraging Lexicons to Learn Human-Interpretable Language
Representations [51.08119762844217]
SenteCon is a method for introducing human interpretability in deep language representations.
We show that SenteCon provides high-level interpretability at little to no cost to predictive performance on downstream tasks.
arXiv Detail & Related papers (2023-05-24T05:06:28Z) - Can You Fool AI by Doing a 180? $\unicode{x2013}$ A Case Study on
Authorship Analysis of Texts by Arata Osada [2.6954666679827137]
This paper is an attempt at answering a twofold question covering the areas of ethics and authorship analysis.
Firstly, we were interested in finding out whether it would be possible for an author identification system to correctly attribute works to authors if in the course of years they have undergone a major psychological transition.
Secondly, and from the point of view of the evolution of an author's ethical values, we checked what it would mean if the authorship attribution system encounters difficulties in detecting single authorship.
arXiv Detail & Related papers (2022-07-19T05:43:49Z) - Computational analyses of the topics, sentiments, literariness,
creativity and beauty of texts in a large Corpus of English Literature [0.0]
The Gutenberg Literary English Corpus (GLEC) provides a rich source of textual data for research in digital humanities, computational linguistics or neurocognitive poetics.
We report the results of three studies providing i) topic and sentiment analyses for six text categories of GLEC and its >100 authors, ii) novel measures of semantic complexity as indices of the literariness, creativity and book beauty of the works in GLEC.
arXiv Detail & Related papers (2022-01-12T08:16:52Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Quasi Error-free Text Classification and Authorship Recognition in a
large Corpus of English Literature based on a Novel Feature Set [0.0]
We show that in the entire GLEC quasi error-free text classification and authorship recognition is possible with a method using the same set of five style and five content features.
Our data pave the way for many future computational and empirical studies of literature or experiments in reading psychology.
arXiv Detail & Related papers (2020-10-21T07:39:55Z) - MedLatinEpi and MedLatinLit: Two Datasets for the Computational
Authorship Analysis of Medieval Latin Texts [72.16295267480838]
We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis.
MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects.
arXiv Detail & Related papers (2020-06-22T14:22:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.