FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis
- URL: http://arxiv.org/abs/2411.04604v1
- Date: Thu, 07 Nov 2024 10:39:10 GMT
- Title: FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis
- Authors: Amin Abdedaiem, Abdelhalim Hafedh Dahou, Mohamed Amine Cheragui, Brigitte Mathiak
- Abstract summary: The Algerian dialect (AD) faces challenges due to the absence of annotated corpora.
This study outlines the development process of a specialized corpus for Fake News (FN) detection and sentiment analysis (SA) in AD called FASSILA.
- Abstract: In the context of low-resource languages, the Algerian dialect (AD) faces challenges due to the absence of annotated corpora, hindering its effective processing, notably in Machine Learning (ML) applications reliant on corpora for training and assessment. This study outlines the development process of a specialized corpus for Fake News (FN) detection and sentiment analysis (SA) in AD, called FASSILA. The corpus comprises 10,087 sentences encompassing over 19,497 unique words in AD, spans seven distinct domains, and addresses the significant lack of linguistic resources for the language. We propose an annotation scheme for FN detection and SA, detailing the data collection, cleaning, and labelling process. Remarkable Inter-Annotator Agreement indicates that the annotation scheme produces consistent annotations of high quality. Subsequent classification experiments using BERT-based models and ML models are presented; they demonstrate promising results and highlight avenues for further research. The dataset is made freely available on GitHub (https://github.com/amincoding/FASSILA) to facilitate future advancements in the field.
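Illustrative example (not from the paper): the abstract mentions both an Inter-Annotator Agreement check and baseline experiments with classical ML and BERT-based models. The sketch below shows, under stated assumptions, how the released corpus could feed both steps using scikit-learn. The file name (fassila.csv) and column names (text, fn_label, annotator_1, annotator_2) are hypothetical placeholders; the actual layout of the GitHub repository may differ.

```python
# Minimal sketch for a FASSILA-style corpus (assumed CSV layout, see note above).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("fassila.csv")  # hypothetical file name

# 1) Inter-annotator agreement between two annotators (Cohen's kappa).
kappa = cohen_kappa_score(df["annotator_1"], df["annotator_2"])
print(f"Cohen's kappa: {kappa:.3f}")

# 2) Classical ML baseline for fake-news detection: character n-gram TF-IDF
#    + logistic regression, which is fairly robust to the spelling variation
#    typical of dialectal Arabic text.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["fn_label"], test_size=0.2, random_state=42, stratify=df["fn_label"]
)
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(f"Macro-F1: {f1_score(y_test, clf.predict(X_test), average='macro'):.3f}")
```

The same train/test split could then feed a fine-tuned BERT-based classifier (e.g. via Hugging Face transformers with a multilingual or Arabic checkpoint) to mirror the transformer experiments reported in the paper; the exact models and hyperparameters used by the authors are described there, not here.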
Related papers
- A Multilingual Sentiment Lexicon for Low-Resource Language Translation using Large Language Models and Explainable AI [0.0]
South Africa and the DRC present a complex linguistic landscape with languages such as Zulu, Sepedi, Afrikaans, French, English, and Tshiluba.
This study develops a multilingual lexicon designed for French and Tshiluba, now expanded to include translations in English, Afrikaans, Sepedi, and Zulu.
A comprehensive testing corpus is created to support translation and sentiment analysis tasks, with machine learning models trained to predict sentiment.
arXiv Detail & Related papers (2024-11-06T23:41:18Z)
- RAAMove: A Corpus for Analyzing Moves in Research Article Abstracts [9.457460355411582]
RAAMove is a comprehensive corpus dedicated to the annotation of move structures in Research Article (RA) abstracts.
The corpus is constructed through two stages: first, expert annotators manually annotate high-quality data; then, based on the human-annotated data, a BERT-based model is employed for automatic annotation.
The result is a large-scale and high-quality corpus comprising 33,988 annotated instances.
arXiv Detail & Related papers (2024-03-23T15:43:30Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection [0.6116681488656472]
This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection.
We present the creation of French Annotated Resource with Semantic Information for Medical Detection (FRASIMED), an annotated corpus comprising 2,051 synthetic clinical cases in French.
arXiv Detail & Related papers (2023-09-19T17:17:28Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework that combines self-supervised representation learning with language label information for the pre-training task.
This framework, termed Label Aware Speech Representation (LASR) learning, uses a triplet-based objective function to incorporate language labels alongside the self-supervised loss function.
arXiv Detail & Related papers (2023-06-07T12:14:16Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- WLASL-LEX: a Dataset for Recognising Phonological Properties in American Sign Language [2.814213966364155]
We build a large-scale dataset of American Sign Language signs annotated with six different phonological properties.
We investigate whether data-driven end-to-end and feature-based approaches can be optimised to automatically recognise these properties.
arXiv Detail & Related papers (2022-03-11T17:21:24Z)
- An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings [18.282632348274756]
Phonetic embeddings, extracted from ASR models trained with large amounts of word-level annotations, can serve as a good representation of the content of input speech.
We propose to utilize Acoustic, Phonetic and Linguistic (APL) embedding features jointly for building a more powerful MD&D system.
arXiv Detail & Related papers (2021-10-14T11:25:02Z)