Fake News Detection in Spanish Using Deep Learning Techniques
- URL: http://arxiv.org/abs/2110.06461v1
- Date: Wed, 13 Oct 2021 02:56:16 GMT
- Title: Fake News Detection in Spanish Using Deep Learning Techniques
- Authors: Kevin Martínez-Gallego, Andrés M. Álvarez-Ortiz, Julián D. Arias-Londoño
- Abstract summary: This paper addresses the problem of fake news detection in Spanish using Machine Learning techniques.
It is fundamentally the same problem tackled for the English language.
There is not a significant amount of publicly available and adequately labeled fake news in Spanish to effectively train a Machine Learning model.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of fake news detection in Spanish using
Machine Learning techniques. It is fundamentally the same problem tackled for
the English language; however, there is not a significant amount of publicly
available and adequately labeled fake news in Spanish with which to effectively
train Machine Learning models similar to those proposed for the English language.
Therefore, this work explores different training strategies and architectures
to establish a baseline for further research in this area. Four datasets were
used, two in English and two in Spanish, and four experimental schemes were
tested, including a baseline with classical Machine Learning models, trained
and validated using a small dataset in Spanish. The remaining schemes include
state-of-the-art Deep Learning models trained (or fine-tuned) and validated in
English, trained and validated in Spanish, and fitted in English and validated
with automatically translated Spanish sentences. The Deep Learning architectures
were built on top of different pre-trained Word Embedding representations,
including GloVe, ELMo, BERT, and BETO (a BERT version trained on a large corpus
in Spanish). According to the results, the best strategy was a combination of a
pre-trained BETO model and a Recurrent Neural Network based on LSTM layers,
yielding an accuracy of up to 80%; nonetheless, a baseline model using a Random
Forest estimator obtained similar outcomes. Additionally, the translation
strategy did not yield acceptable results because of error propagation; a
significant difference in model performance was also observed when models were
trained in English or Spanish, mainly attributable to the number of samples
available for each language.
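The best-performing scheme above pairs a pre-trained BETO encoder with an LSTM-based recurrent classifier. The following is a minimal sketch of that kind of architecture, not the authors' implementation; it assumes PyTorch, the Hugging Face transformers library, and the publicly released BETO checkpoint dccuchile/bert-base-spanish-wwm-uncased, and it keeps the encoder frozen so that only the recurrent head is trained.

```python
# Minimal sketch (not the authors' code): a frozen BETO encoder feeding an
# LSTM classifier for binary fake-news detection.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-uncased"  # BETO (assumed checkpoint id)

class BetoLstmClassifier(nn.Module):
    def __init__(self, hidden_size: int = 128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)
        for p in self.encoder.parameters():  # keep BETO frozen; only the LSTM head is trained
            p.requires_grad = False
        self.lstm = nn.LSTM(
            input_size=self.encoder.config.hidden_size,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )
        self.classifier = nn.Linear(2 * hidden_size, 2)  # two classes: real / fake

    def forward(self, input_ids, attention_mask):
        # Contextual token embeddings from BETO: (batch, seq_len, 768)
        token_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        _, (h_n, _) = self.lstm(token_states)
        # Concatenate the final forward and backward hidden states
        sentence_repr = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.classifier(sentence_repr)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BetoLstmClassifier()
batch = tokenizer(
    ["Ejemplo de titular de noticia en español."],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
logits = model(batch["input_ids"], batch["attention_mask"])  # shape: (1, 2)
```

Freezing the encoder keeps the number of trainable parameters small, which matters given the limited amount of labeled Spanish data the paper reports.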
Related papers
- Spanish Pre-trained BERT Model and Evaluation Data [0.0]
We present a BERT-based language model pre-trained exclusively on Spanish data.
We also compiled several tasks specifically for the Spanish language in a single repository.
We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.
arXiv Detail & Related papers (2023-08-06T00:16:04Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages. (A minimal sketch of the translate-and-test idea appears after this list.)
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z)
- From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
arXiv Detail & Related papers (2021-12-09T07:31:15Z)
- HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish [4.473327661758546]
This paper presents the first ablation study focused on Polish, which, unlike the isolating English language, is a fusional language.
We design and thoroughly evaluate a pretraining procedure of transferring knowledge from multilingual to monolingual BERT-based models.
Based on the proposed procedure, a Polish BERT-based language model -- HerBERT -- is trained.
arXiv Detail & Related papers (2021-05-04T20:16:17Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Cross-lingual Information Retrieval with BERT [8.052497255948046]
We explore the use of the popular bidirectional language model, BERT, to model and learn the relevance between English queries and foreign-language documents.
A deep relevance matching model based on BERT is introduced and trained by finetuning a pretrained multilingual BERT model with weak supervision.
Experimental results of the retrieval of Lithuanian documents against short English queries show that our model is effective and outperforms the competitive baseline approaches.
arXiv Detail & Related papers (2020-04-24T23:32:13Z)
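Both the fourth experimental scheme in the abstract (models fitted in English and validated on machine-translated Spanish) and the T3L entry above follow a translate-and-test pipeline: translate the input into English, then score it with an English classifier. Below is a minimal sketch of that idea, not the implementation from either paper; it assumes the Hugging Face transformers library and the public MarianMT checkpoint Helsinki-NLP/opus-mt-es-en, while english-fake-news-classifier is a hypothetical fine-tuned English checkpoint used only for illustration.

```python
# Minimal translate-and-test sketch (not the T3L or fake-news paper implementation).
# "Helsinki-NLP/opus-mt-es-en" is a public Spanish-to-English MarianMT model;
# "english-fake-news-classifier" is a hypothetical fine-tuned English classifier.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")
classifier = pipeline("text-classification", model="english-fake-news-classifier")  # hypothetical

def predict_via_translation(spanish_texts):
    """Translate Spanish inputs to English, then classify the translations."""
    english_texts = [t["translation_text"] for t in translator(spanish_texts)]
    return classifier(english_texts)

print(predict_via_translation(["El gobierno anunció una medida económica sorprendente."]))
```

Any mistranslation is passed straight into the classifier, which is the error propagation the abstract cites as the reason this scheme did not yield acceptable results.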