Reddit is all you need: Authorship profiling for Romanian
- URL: http://arxiv.org/abs/2410.09907v1
- Date: Sun, 13 Oct 2024 16:27:31 GMT
- Title: Reddit is all you need: Authorship profiling for Romanian
- Authors: Ecaterina Ştefănescu, Alexandru-Iulius Jerpelea,
- Abstract summary: Authorship profiling is the process of identifying an author's characteristics based on their writings.
In this paper, we introduce a corpus of short texts in the Romanian language, annotated with certain author characteristic keywords.
- Score: 49.1574468325115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Authorship profiling is the process of identifying an author's characteristics based on their writings. This centuries old problem has become more intriguing especially with recent developments in Natural Language Processing (NLP). In this paper, we introduce a corpus of short texts in the Romanian language, annotated with certain author characteristic keywords; to our knowledge, the first of its kind. In order to do this, we exploit a social media platform called Reddit. We leverage its thematic community-based structure (subreddits structure), which offers information about the author's background. We infer an user's demographic and some broad personal traits, such as age category, employment status, interests, and social orientation based on the subreddit and other cues. We thus obtain a 23k+ samples corpus, extracted from 100+ Romanian subreddits. We analyse our dataset, and finally, we fine-tune and evaluate Large Language Models (LLMs) to prove baselines capabilities for authorship profiling using the corpus, indicating the need for further research in the field. We publicly release all our resources.
Related papers
- BookWorm: A Dataset for Character Description and Analysis [59.186325346763184]
We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation.
We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses.
Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks.
arXiv Detail & Related papers (2024-10-14T10:55:58Z) - Prompt-based Personality Profiling: Reinforcement Learning for Relevance Filtering [8.20929362102942]
Author profiling is the task of inferring characteristics about individuals by analyzing content they share.
We propose a new method for author profiling which aims at distinguishing relevant from irrelevant content first, followed by the actual user profiling only with relevant data.
We evaluate our method for Big Five personality trait prediction on two Twitter corpora.
arXiv Detail & Related papers (2024-09-06T08:43:10Z) - ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts [0.0]
We present the development and deployment of a linguistic corpus from Twitter posts in English.
The main goal was to create a fully annotated English corpus for linguistic analysis.
We include information on morphology and syntax, as well as NLP features such as tokenization, lemmas, and n- grams.
arXiv Detail & Related papers (2024-07-22T04:48:04Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Constructing Colloquial Dataset for Persian Sentiment Analysis of Social
Microblogs [0.0]
This paper first constructs a user opinion dataset called ITRC-Opinion in a collaborative environment and insource way.
Our dataset contains 60,000 informal and colloquial Persian texts from social microblogs such as Twitter and Instagram.
Second, this study proposes a new architecture based on the convolutional neural network (CNN) model for more effective sentiment analysis of colloquial text in social microblog posts.
arXiv Detail & Related papers (2023-06-22T05:51:22Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn textbfauthorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - Urdu Speech and Text Based Sentiment Analyzer [1.4630964945453113]
This work presented a new multi-class Urdu dataset based on user evaluations.
Our proposed dataset includes 10,000 reviews that have been carefully classified into two categories by human experts: positive, negative.
Five different lexicon- and rule-based algorithms including Naivebayes, Stanza, Textblob, Vader, and Flair are employed and the experimental results show that Flair with an accuracy of 70% outperforms other tested algorithms.
arXiv Detail & Related papers (2022-07-19T10:11:22Z) - SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z) - How to Train Your Agent to Read and Write [52.24605794920856]
Reading and writing research papers is one of the most privileged abilities that a qualified researcher should master.
It would be fascinating if we could train an intelligent agent to help people read and summarize papers, and perhaps even discover and exploit the potential knowledge clues to write novel papers.
We propose a Deep ReAder-Writer (DRAW) network, which consists of a textitReader that can extract knowledge graphs (KGs) from input paragraphs and discover potential knowledge, a graph-to-text textitWriter that generates a novel paragraph, and a textit
arXiv Detail & Related papers (2021-01-04T12:22:04Z) - A Framework for Generating Annotated Social Media Corpora with
Demographics, Stance, Civility, and Topicality [0.0]
We introduce a framework for annotating a social media text corpora for various categories.
We use a case study of a Facebook comment corpora on student loan discussion which was annotated for gender, military affiliation, age-group, political leaning, race, stance, topicalilty, neoliberlistic views and civility of the comment.
arXiv Detail & Related papers (2020-12-10T04:06:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.