Factorization of Fact-Checks for Low Resource Indian Languages
- URL: http://arxiv.org/abs/2102.11276v1
- Date: Tue, 23 Feb 2021 16:47:41 GMT
- Title: Factorization of Fact-Checks for Low Resource Indian Languages
- Authors: Shivangi Singhal, Rajiv Ratn Shah, Ponnurangam Kumaraguru
- Abstract summary: We introduce FactDRIL: the first large scale multilingual Fact-checking dataset for Regional Indian languages.
Our dataset consists of 9,058 samples in English, 5,155 samples in Hindi, and the remaining 8,222 samples distributed across various regional languages.
We expect this dataset will be a valuable resource and serve as a starting point to fight the proliferation of fake news in low-resource languages.
- Score: 44.94080515860928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in technology and the accessibility of the internet to every
individual are revolutionizing real-time information. The liberty to express
one's thoughts without passing through any credibility check is leading to the
dissemination of fake content in the ecosystem. It can have disastrous effects
on both individuals and society as a whole. The amplification of fake news is
becoming rampant in India too. Debunked information often gets republished with
a new description, claiming that it depicts a different incident. To
curb such fabricated stories, it is necessary to investigate such duplicates
and false claims made in public. The majority of studies on automatic
fact-checking and fake news detection are restricted to English only. But for a
country like India, where only 10% of the literate population speaks English,
the role of regional languages in spreading falsity cannot be underestimated. In this
paper, we introduce FactDRIL: the first large scale multilingual Fact-checking
Dataset for Regional Indian Languages. We collect an exhaustive dataset across
7 months covering 11 low-resource languages. Our proposed dataset consists of
9,058 samples in English, 5,155 samples in Hindi, and the remaining 8,222
samples distributed across various regional languages, i.e., Bangla,
Marathi, Malayalam, Telugu, Tamil, Oriya, Assamese, Punjabi, Urdu, Sinhala and
Burmese. We also present a detailed characterization of the three M's
(multi-lingual, multi-media, multi-domain) in FactDRIL, accompanied by a
complete list of other varied attributes that make it a unique dataset to study.
Lastly, we present some potential use cases of the dataset. We expect this
dataset will be a valuable resource and serve as a starting point to fight the
proliferation of fake news in low-resource languages.
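The composition figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary keys and variable names are assumptions, the overall total is derived here by summation rather than quoted from the abstract, and the abstract gives no per-language breakdown of the 8,222 regional samples.

```python
# Sample counts exactly as stated in the abstract.
counts = {
    "English": 9058,
    "Hindi": 5155,
    "Other regional languages": 8222,  # combined figure; no per-language split is given
}

# Regional languages named in the abstract as sharing the 8,222 samples.
regional_languages = [
    "Bangla", "Marathi", "Malayalam", "Telugu", "Tamil", "Oriya",
    "Assamese", "Punjabi", "Urdu", "Sinhala", "Burmese",
]

# Derived value: the total is a sum computed here, not a figure from the abstract.
total = sum(counts.values())

# Fraction of the dataset in each group.
shares = {group: n / total for group, n in counts.items()}

if __name__ == "__main__":
    print(f"Total samples: {total}")
    for group, share in shares.items():
        print(f"{group}: {share:.1%}")
    print(f"Regional languages listed: {len(regional_languages)}")
```

The list length matches the "11 low-resource languages" mentioned in the abstract, which suggests that count refers to the regional languages other than English and Hindi.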
Related papers
- Ax-to-Grind Urdu: Benchmark Dataset for Urdu Fake News Detection [7.533158533458647]
Ax-to-Grind Urdu is the first publicly available dataset for fake and real news in Urdu.
It comprises 10,083 fake and real news items across fifteen domains from leading and authentic Urdu newspapers and news channel websites in Pakistan and India.
We benchmark the dataset with an ensemble model of mBERT, XLNet, and XLM-RoBERTa.
arXiv Detail & Related papers (2024-03-20T23:21:35Z)
- Mukhyansh: A Headline Generation Dataset for Indic Languages [4.583536403673757]
Mukhyansh is an extensive multilingual dataset, tailored for Indian language headline generation.
Comprising over 3.39 million article-headline pairs, Mukhyansh spans across eight prominent Indian languages.
Mukhyansh outperforms all other models, achieving an average ROUGE-L score of 31.43 across all 8 languages.
arXiv Detail & Related papers (2023-11-29T15:49:24Z)
- MalFake: A Multimodal Fake News Identification for Malayalam using Recurrent Neural Networks and VGG-16 [0.0]
Multimodal approaches are more accurate in detecting fake news in Malayalam.
Models trained on more than one modality typically outperform models trained on only one modality.
arXiv Detail & Related papers (2023-10-27T16:51:29Z)
- Lost in Translation -- Multilingual Misinformation and its Evolution [52.07628580627591]
This paper investigates the prevalence and dynamics of multilingual misinformation through an analysis of over 250,000 unique fact-checks spanning 95 languages.
We find that while the majority of misinformation claims are only fact-checked once, 11.7%, corresponding to more than 21,000 claims, are checked multiple times.
Using fact-checks as a proxy for the spread of misinformation, we find 33% of repeated claims cross linguistic boundaries.
arXiv Detail & Related papers (2023-10-27T12:21:55Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Multiverse: Multilingual Evidence for Fake News Detection [71.51905606492376]
Multiverse is a new feature based on multilingual evidence that can be used for fake news detection.
The hypothesis that cross-lingual evidence is a useful feature for fake news detection is confirmed.
arXiv Detail & Related papers (2022-11-25T18:24:17Z)
- Cross-lingual COVID-19 Fake News Detection [54.125563009333995]
We make the first attempt to detect COVID-19 misinformation in a low-resource language (Chinese) using only the fact-checked news in a high-resource language (English).
We propose a deep learning framework named CrossFake to jointly encode the cross-lingual news body texts and capture the news content.
Empirical results on our dataset demonstrate the effectiveness of CrossFake under the cross-lingual setting.
arXiv Detail & Related papers (2021-10-13T04:44:02Z)
- Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi [2.4737119633827174]
MOLD is the first dataset of its kind compiled for Marathi, opening a new domain for research in low-resource Indo-Aryan languages.
We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers.
arXiv Detail & Related papers (2021-09-08T11:29:44Z)
- No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection [4.411285005377513]
We propose an approach to detect fake news about COVID-19 early on from social media, such as tweets, for multiple Indic languages besides English.
To expand our approach to multiple Indic languages, we resort to an mBERT-based model fine-tuned on the created dataset in Hindi and Bengali.
Our approach reaches around 89% F-Score in fake tweet detection, which supersedes the state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2020-10-14T09:37:51Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.