Arabic Tweet Act: A Weighted Ensemble Pre-Trained Transformer Model for
Classifying Arabic Speech Acts on Twitter
- URL: http://arxiv.org/abs/2401.17373v1
- Date: Tue, 30 Jan 2024 19:01:24 GMT
- Authors: Khadejaa Alshehri, Areej Alhothali and Nahed Alowidi
- Abstract summary: This paper proposes a Twitter dialectal Arabic speech act classification approach based on a transformer deep neural network.
We propose a BERT-based weighted ensemble learning approach that integrates the advantages of various BERT models for dialectal Arabic speech act classification.
The results show that the best single BERT model is araBERTv2-Twitter, with a macro-averaged F1 score of 0.73 and an accuracy of 0.84.
- Score: 0.32885740436059047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech acts are a speaker's actions when performing an utterance
within a conversation, such as asking, recommending, greeting, or thanking
someone, expressing a thought, or making a suggestion. Understanding speech
acts helps interpret the intended meaning and actions behind a speaker's or
writer's words. This paper proposes a Twitter dialectal Arabic speech act
classification approach based on a transformer deep neural network. Twitter
and other social media are becoming increasingly integrated into daily life;
as a result, they have evolved into a vital source of information that
represents the views and attitudes of their users. We propose a BERT-based
weighted ensemble learning approach that integrates the advantages of various
BERT models for dialectal Arabic speech act classification. We compared the
proposed model against several variants of Arabic BERT models and
sequence-based models. We developed a dialectal Arabic tweet act dataset by
annotating a subset of a large existing Arabic sentiment analysis dataset
(ASAD) with six speech act categories. We also evaluated the models on a
previously developed Arabic tweet act dataset (ArSAS). To overcome the class
imbalance commonly observed in speech act problems, a transformer-based data
augmentation model was implemented to generate an equal proportion of speech
act categories. The results show that the best single BERT model is
araBERTv2-Twitter, with a macro-averaged F1 score of 0.73 and an accuracy of
0.84. Performance improved further with the BERT-based ensemble method, which
achieved a macro-averaged F1 score of 0.74 and an accuracy of 0.85 on our
dataset.
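In concrete terms, the weighted ensemble amounts to averaging the class-probability distributions of several fine-tuned BERT classifiers, with each model's vote scaled by a weight (for instance, one derived from its validation F1 score). Below is a minimal soft-voting sketch of that idea; the checkpoint names, weights, and the six speech act labels are illustrative placeholders, not the authors' actual configuration.

```python
# A minimal soft-voting sketch of a BERT-based weighted ensemble, assuming
# several already fine-tuned classifiers. All checkpoint names, weights,
# and labels are illustrative placeholders, not the paper's setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical fine-tuned checkpoints with normalized weights, e.g. each
# model's validation F1 score divided by the sum of all models' scores.
MODELS = {
    "your-org/arabertv2-twitter-speech-acts": 0.40,
    "your-org/marbert-speech-acts": 0.35,
    "your-org/camelbert-speech-acts": 0.25,
}
# Six placeholder labels; the paper annotates six categories but does not
# name them in the abstract.
LABELS = ["assertion", "question", "request", "expression",
          "recommendation", "miscellaneous"]

def ensemble_predict(text: str) -> str:
    """Return the label with the highest weighted-average probability."""
    scores = torch.zeros(len(LABELS))
    for name, weight in MODELS.items():
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSequenceClassification.from_pretrained(name)
        model.eval()
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits.squeeze(0)
        scores += weight * torch.softmax(logits, dim=-1)
    return LABELS[int(scores.argmax())]
```

Weighting the soft votes lets the strongest single model (here, araBERTv2-Twitter) dominate close calls while weaker models still contribute complementary signal, consistent with the modest reported gain from 0.73 to 0.74 macro-averaged F1.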
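The abstract also mentions a transformer-based augmentation model used to balance the six classes, without specifying it. One common transformer-based recipe is masked-token replacement with an Arabic masked language model; the sketch below assumes that approach, using a public AraBERT checkpoint purely as an example.

```python
# A sketch of masked-token-replacement augmentation for oversampling minority
# speech act classes. This is one common transformer-based recipe; the paper
# does not state which augmentation model it actually used.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="aubmindlab/bert-base-arabertv2")

def augment(tweet: str, n_variants: int = 3) -> list[str]:
    """Mask one random token and keep the LM's top replacement sentences."""
    tokens = tweet.split()
    if len(tokens) < 2:
        return []
    i = random.randrange(len(tokens))
    masked = " ".join(tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:])
    return [pred["sequence"] for pred in fill_mask(masked, top_k=n_variants)]
```

Generating several such variants per minority-class tweet until every category reaches the same count is one straightforward way to arrive at the equal class proportions described above.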
Related papers
- Advanced Arabic Alphabet Sign Language Recognition Using Transfer Learning and Transformer Models [0.0]
This paper presents an Arabic Alphabet Sign Language recognition approach, using deep learning methods in conjunction with transfer learning and transformer-based models.
We study the performance of the different variants on two publicly available datasets, namely ArSL2018 and AASL.
Experimental results show that the suggested methodology can achieve high recognition accuracy, up to 99.6% and 99.43% on ArSL2018 and AASL, respectively.
arXiv Detail & Related papers (2024-10-01T13:39:26Z)
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, Arabic still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
- Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm start from a pretrained textual language model.
We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z)
- TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect [0.0]
We investigate the feasibility of training monolingual Transformer-based language models for underrepresented languages.
We show that using noisy web-crawled data instead of structured data is more suitable for such a non-standardized language.
Our best-performing TunBERT model matches or improves on the state of the art in all three downstream tasks.
arXiv Detail & Related papers (2021-11-25T15:49:50Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on sub-word linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Neural Models for Offensive Language Detection [0.0]
Offensive language detection is an ever-growing natural language processing (NLP) application.
We believe that improving and comparing different machine learning models to fight such harmful content is an important and challenging goal for this thesis.
arXiv Detail & Related papers (2021-05-30T13:02:45Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Pre-Training BERT on Arabic Tweets: Practical Considerations [11.087099497830552]
We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing.
All are intended to support Arabic dialects and social media.
The new models achieve state-of-the-art results on several downstream tasks.
arXiv Detail & Related papers (2021-02-21T20:51:33Z)
- LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification [70.1903083747775]
This paper describes our submission for the task of Propaganda Span Identification in news articles.
We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda.
arXiv Detail & Related papers (2020-08-11T16:14:47Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.