Ax-to-Grind Urdu: Benchmark Dataset for Urdu Fake News Detection
- URL: http://arxiv.org/abs/2403.14037v1
- Date: Wed, 20 Mar 2024 23:21:35 GMT
- Title: Ax-to-Grind Urdu: Benchmark Dataset for Urdu Fake News Detection
- Authors: Sheetal Harris, Jinshuo Liu, Hassan Jalil Hadi, Yue Cao,
- Abstract summary: Ax-to-Grind Urdu is the first publicly available dataset for fake and real news in Urdu.
It constitutes 10,083 fake and real news on fifteen domains from leading and authentic Urdu newspapers and news channel websites in Pakistan and India.
We benchmark the dataset with an ensemble model of mBERT,XLNet, and XLM RoBERTa.
- Score: 7.533158533458647
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Misinformation can seriously impact society, affecting anything from public opinion to institutional confidence and the political horizon of a state. Fake News (FN) proliferation on online websites and Online Social Networks (OSNs) has increased profusely. Various fact-checking websites include news in English and barely provide information about FN in regional languages. Thus the Urdu FN purveyors cannot be discerned using factchecking portals. SOTA approaches for Fake News Detection (FND) count upon appropriately labelled and large datasets. FND in regional and resource-constrained languages lags due to the lack of limited-sized datasets and legitimate lexical resources. The previous datasets for Urdu FND are limited-sized, domain-restricted, publicly unavailable and not manually verified where the news is translated from English into Urdu. In this paper, we curate and contribute the first largest publicly available dataset for Urdu FND, Ax-to-Grind Urdu, to bridge the identified gaps and limitations of existing Urdu datasets in the literature. It constitutes 10,083 fake and real news on fifteen domains collected from leading and authentic Urdu newspapers and news channel websites in Pakistan and India. FN for the Ax-to-Grind dataset is collected from websites and crowdsourcing. The dataset contains news items in Urdu from the year 2017 to the year 2023. Expert journalists annotated the dataset. We benchmark the dataset with an ensemble model of mBERT,XLNet, and XLM RoBERTa. The selected models are originally trained on multilingual large corpora. The results of the proposed model are based on performance metrics, F1-score, accuracy, precision, recall and MCC value.
Related papers
- MMCFND: Multimodal Multilingual Caption-aware Fake News Detection for Low-resource Indic Languages [0.4062349563818079]
We introduce the Multimodal Multilingual dataset for Indic Fake News Detection (MMIFND)
This meticulously curated dataset consists of 28,085 instances distributed across Hindi, Bengali, Marathi, Malayalam, Tamil, Gujarati and Punjabi.
We propose the Multimodal Caption-aware framework for Fake News Detection (MMCFND)
arXiv Detail & Related papers (2024-10-14T11:59:33Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z) - UrduFake@FIRE2020: Shared Track on Fake News Identification in Urdu [62.6928395368204]
This paper gives the overview of the first shared task at FIRE 2020 on fake news detection in the Urdu language.
The goal is to identify fake news using a dataset composed of 900 annotated news articles for training and 400 news articles for testing.
The dataset contains news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business.
arXiv Detail & Related papers (2022-07-25T03:46:51Z) - Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2021 [55.41644538483948]
The goal of the shared task is to motivate the community to come up with efficient methods for solving this vital problem.
The training set contains 1300 annotated news articles -- 750 real news, 550 fake news, while the testing set contains 300 news articles -- 200 real, 100 fake news.
The best performing system obtained an F1-macro score of 0.679, which is lower than the past year's best result of 0.907 F1-macro.
arXiv Detail & Related papers (2022-07-11T18:58:36Z) - CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking [55.75590135151682]
CHEF is the first CHinese Evidence-based Fact-checking dataset of 10K real-world claims.
The dataset covers multiple domains, ranging from politics to public health, and provides annotated evidence retrieved from the Internet.
arXiv Detail & Related papers (2022-06-06T09:11:03Z) - Offensive Language Identification in Low-resourced Code-mixed Dravidian
languages using Pseudo-labeling [0.16252563723817934]
We classify codemixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam.
A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language.
We fine-tune several recent pretrained language models on the newly constructed dataset.
arXiv Detail & Related papers (2021-08-27T08:43:08Z) - Factorization of Fact-Checks for Low Resource Indian Languages [44.94080515860928]
We introduce FactDRIL: the first large scale multilingual Fact-checking dataset for Regional Indian languages.
Our dataset consists of 9,058 samples belonging to English, 5,155 samples to Hindi and remaining 8,222 samples are distributed across various regional languages.
We expect this dataset will be a valuable resource and serve as a starting point to fight proliferation of fake news in low resource languages.
arXiv Detail & Related papers (2021-02-23T16:47:41Z) - No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet
Detection [4.411285005377513]
We propose an approach to detect fake news about COVID-19 early on from social media, such as tweets, for multiple Indic-Languages besides English.
To expand our approach to multiple Indic languages, we resort to mBERT based model which is fine-tuned over created dataset in Hindi and Bengali.
Our approach reaches around 89% F-Score in fake tweet detection which supercedes the state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2020-10-14T09:37:51Z) - Efficient Urdu Caption Generation using Attention based LSTM [0.0]
Urdu is the national language of Pakistan and also much spoken and understood in the sub-continent region of Pakistan-India.
We develop an attention-based deep learning model using techniques of sequence modeling specialized for the Urdu language.
We evaluate our proposed technique on this dataset and show that it can achieve a BLEU score of 0.83 in the Urdu language.
arXiv Detail & Related papers (2020-08-02T17:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.