Hope Speech Detection in code-mixed Roman Urdu tweets: A Positive Turn in Natural Language Processing
- URL: http://arxiv.org/abs/2506.21583v1
- Date: Tue, 17 Jun 2025 06:31:04 GMT
- Title: Hope Speech Detection in code-mixed Roman Urdu tweets: A Positive Turn in Natural Language Processing
- Authors: Muhammad Ahmad, Muhammad Waqas, Ameer Hamza, Ildar Batyrshin, Grigori Sidorov
- Abstract summary: This study introduces the first multi-class annotated dataset for Roman Urdu hope speech. It explores the psychological foundations of hope and analyzes its linguistic patterns. It proposes a custom attention-based transformer model optimized for the syntactic and semantic variability of Roman Urdu.
- Score: 6.34691005108325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hope is a positive emotional state involving the expectation of favorable future outcomes, while hope speech refers to communication that promotes optimism, resilience, and support, particularly in adverse contexts. Although hope speech detection has gained attention in Natural Language Processing (NLP), existing research mainly focuses on high-resource languages and standardized scripts, often overlooking informal and underrepresented forms such as Roman Urdu. To the best of our knowledge, this is the first study to address hope speech detection in code-mixed Roman Urdu by introducing a carefully annotated dataset, thereby filling a critical gap in inclusive NLP research for low-resource, informal language varieties. This study makes four key contributions: (1) it introduces the first multi-class annotated dataset for Roman Urdu hope speech, comprising Generalized Hope, Realistic Hope, Unrealistic Hope, and Not Hope categories; (2) it explores the psychological foundations of hope and analyzes its linguistic patterns in code-mixed Roman Urdu to inform dataset development; (3) it proposes a custom attention-based transformer model optimized for the syntactic and semantic variability of Roman Urdu, evaluated using 5-fold cross-validation; and (4) it verifies the statistical significance of performance gains using a t-test. The proposed model, XLM-R, achieves the best performance with a cross-validation score of 0.78, outperforming the baseline SVM (0.75) and BiLSTM (0.76), with gains of 4% and 2.63% respectively.
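The evaluation protocol described in contributions (3) and (4), k-fold cross-validation followed by a paired t-test on per-fold scores, can be sketched as follows. The fold scores below are hypothetical placeholders, not the paper's actual results, and the t-statistic helper is a generic textbook implementation rather than the authors' code.

```python
# Sketch: comparing two classifiers via 5-fold CV scores and a paired t-test.
# All numbers below are illustrative assumptions, not the paper's data.
import math

def paired_t_statistic(a, b):
    """Paired t-test statistic for two matched lists of fold scores."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)  # compare against t-dist with df = n - 1

# Hypothetical per-fold scores from 5-fold cross-validation
xlmr_folds = [0.79, 0.77, 0.78, 0.80, 0.76]  # transformer model
svm_folds  = [0.75, 0.74, 0.76, 0.76, 0.74]  # SVM baseline

t = paired_t_statistic(xlmr_folds, svm_folds)
print(f"t = {t:.2f} with df = {len(xlmr_folds) - 1}")
```

In practice one would look up the resulting t-statistic (or use `scipy.stats.ttest_rel`) to obtain a p-value; pairing by fold controls for the fact that both models see the same data splits.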
Related papers
- CrosGrpsABS: Cross-Attention over Syntactic and Semantic Graphs for Aspect-Based Sentiment Analysis in a Low-Resource Language [0.5937476291232802]
Aspect-Based Sentiment Analysis (ABSA) is a fundamental task in natural language processing, offering fine-grained insights into opinions expressed in text. This research proposes CrosGrpsABS, a novel hybrid framework that leverages bidirectional cross-attention between syntactic and semantic graphs to enhance aspect-level sentiment classification. We evaluate CrosGrpsABS on four low-resource Bengali ABSA datasets and the high-resource English SemEval 2014 Task 4 dataset.
arXiv Detail & Related papers (2025-05-25T07:42:32Z) - Enhanced Urdu Intent Detection with Large Language Models and Prototype-Informed Predictive Pipelines [5.191443390565865]
This paper introduces a unique contrastive learning approach that leverages unlabeled Urdu data to re-train pre-trained language models. It reaps the combined potential of pre-trained LLMs and a prototype-informed attention mechanism to create an end-to-end intent detection pipeline. Under the proposed predictive pipeline, it explores the potential of 6 distinct language models and 13 distinct similarity computation methods.
arXiv Detail & Related papers (2025-05-08T08:38:40Z) - Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models [0.6554326244334868]
Transliteration between Urdu and its Romanized form, Roman Urdu, remains underexplored. We propose a transformer-based approach using the m2m100 multilingual translation model. Our model achieves strong transliteration performance, with Char-BLEU scores of 96.37 for Urdu->Roman-Urdu and 97.44 for Roman-Urdu->Urdu.
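Char-BLEU, the metric reported above, is BLEU computed over character n-grams rather than word n-grams, which suits transliteration since output quality hinges on character-level fidelity. The following is an illustrative re-implementation under that assumption, not the paper's scoring script; real evaluations typically use an established toolkit.

```python
# Rough sketch of character-level BLEU: clipped n-gram precision over
# characters (n = 1..4), geometric mean, plus a brevity penalty.
import math
from collections import Counter

def char_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    hyp, ref = list(hypothesis), list(reference)
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped match counts
        total = max(sum(hyp_ngrams.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty discourages hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(char_bleu("salam", "salam"), 2))  # identical strings score 1.0
```

Scores are usually reported on a 0-100 scale (as in the 96.37 / 97.44 figures above), i.e. this value multiplied by 100.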
arXiv Detail & Related papers (2025-03-27T14:18:50Z) - A New Method for Cross-Lingual-based Semantic Role Labeling [5.992526851963307]
A deep learning algorithm is proposed to train semantic role labeling in English and Persian.
The results show significant improvements compared to Niksirt et al.'s model.
The development of cross-lingual methods for semantic role labeling holds promise.
arXiv Detail & Related papers (2024-08-28T16:06:12Z) - PolyHope: Two-Level Hope Speech Detection from Tweets [68.8204255655161]
Despite its importance, hope has rarely been studied as a social media analysis task.
This paper presents a hope speech dataset that classifies each tweet first into "Hope" and "Not Hope" categories.
English tweets in the first half of 2022 were collected to build this dataset.
arXiv Detail & Related papers (2022-10-25T16:34:03Z) - A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z) - From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance.
The first stage targets at recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
arXiv Detail & Related papers (2021-12-09T07:31:15Z) - Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval [51.004601358498135]
Mr. TyDi is a benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages.
The goal of this resource is to spur research in dense retrieval techniques in non-English languages.
arXiv Detail & Related papers (2021-08-19T16:53:43Z) - Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z) - A Precisely Xtreme-Multi Channel Hybrid Approach For Roman Urdu Sentiment Analysis [0.8812173669205371]
This paper provides three neural word embeddings prepared using the most widely used approaches, namely Word2vec, FastText, and GloVe.
Considering the lack of publicly available benchmark datasets, it provides a first-ever Roman Urdu dataset consisting of 3241 sentiments annotated against positive, negative, and neutral classes.
It proposes a novel precisely extreme multi-channel hybrid methodology which outperforms state-of-the-art adapted machine and deep learning approaches by 9% and 4% in terms of F1-score.
arXiv Detail & Related papers (2020-03-11T04:08:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.