Accenture at CheckThat! 2021: Interesting claim identification and
ranking with contextually sensitive lexical training data augmentation
- URL: http://arxiv.org/abs/2107.05684v1
- Date: Mon, 12 Jul 2021 18:46:47 GMT
- Title: Accenture at CheckThat! 2021: Interesting claim identification and
ranking with contextually sensitive lexical training data augmentation
- Authors: Evan Williams, Paul Rodrigues, Sieu Tran
- Abstract summary: This paper discusses the approach used by the Accenture Team for CLEF2021 CheckThat! Lab, Task 1.
The task is to identify whether a claim made on social media would be interesting to a wide audience and should be fact-checked.
Twitter training and test data were provided in English, Arabic, Spanish, Turkish, and Bulgarian.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper discusses the approach used by the Accenture Team for CLEF2021
CheckThat! Lab, Task 1, to identify whether a claim made in social media would
be interesting to a wide audience and should be fact-checked. Twitter training
and test data were provided in English, Arabic, Spanish, Turkish, and
Bulgarian. Claims were to be classified (check-worthy/not check-worthy) and
ranked in priority order for the fact-checker. Our method used deep neural
network transformer models with contextually sensitive lexical augmentation
applied on the supplied training datasets to create additional training
samples. This augmentation approach improved the performance for all languages.
Overall, our architecture and data augmentation pipeline produced the best
submitted system for Arabic, and performance scales according to the quantity
of provided training data for English, Spanish, Turkish, and Bulgarian. This
paper investigates the deep neural network architectures for each language as
well as the provided data to examine why the approach worked so effectively for
Arabic, and discusses additional data augmentation measures that could be
useful for this problem.
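The abstract describes the augmentation step only at a high level. The core idea, replacing words in a training tweet with in-context alternatives proposed by a masked language model, can be sketched as below. This is a minimal illustration rather than the authors' exact pipeline: the model choice (bert-base-multilingual-cased), the 15% substitution rate, and the whitespace tokenization are assumptions for the sketch, not details confirmed by the abstract.

```python
# Minimal sketch of contextually sensitive lexical augmentation with a
# masked language model. Assumptions (not from the paper): model name,
# a 15% substitution rate, and whitespace tokenization.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT-style models

def augment(text: str, sub_rate: float = 0.15, seed: int = 0) -> str:
    """Replace a random subset of words with in-context MLM predictions."""
    random.seed(seed)
    words = text.split()
    for i in range(len(words)):
        if random.random() >= sub_rate:
            continue
        # Mask one word at a time so the rest of the tweet supplies context.
        masked = " ".join(words[:i] + [MASK] + words[i + 1:])
        candidates = fill_mask(masked, top_k=5)
        # Keep the top prediction that differs from the original word,
        # so the augmented sample is lexically varied but stays fluent.
        for cand in candidates:
            token = cand["token_str"].strip()
            if token.lower() != words[i].lower():
                words[i] = token
                break
    return " ".join(words)

original = "The new vaccine causes severe side effects in children"
print(augment(original))  # one extra training sample per call
```

Each call yields one lexically varied copy of a labeled tweet; re-running with different seeds multiplies the training data while the masked-LM predictions keep substitutions consistent with the claim's context.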
Related papers
- IAI Group at CheckThat! 2024: Transformer Models and Data Augmentation for Checkworthy Claim Detection [1.3686993145787067]
This paper describes the IAI group's participation in automated check-worthiness estimation for claims.
The task involves the automated detection of check-worthy claims in English, Dutch, and Arabic political debates and Twitter data.
We utilize various pre-trained generative decoder and encoder transformer models, employing methods such as few-shot chain-of-thought reasoning.
arXiv Detail & Related papers (2024-08-02T08:59:09Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Leveraging Language Identification to Enhance Code-Mixed Text Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
- BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer [1.911678487931003]
Retrieval-based language models are increasingly employed in question-answering tasks.
We develop the first Norwegian retrieval-based model by adapting the REALM framework.
We show that this type of training improves the reader's performance on extractive question-answering.
arXiv Detail & Related papers (2023-04-19T13:40:47Z)
- Data Augmentation using Transformers and Similarity Measures for Improving Arabic Text Classification [0.0]
We propose a new Arabic DA method that employs the recent powerful modeling technique, namely the AraGPT-2.
The generated sentences are evaluated in terms of context, semantics, diversity, and novelty using the Euclidean, cosine, Jaccard, and BLEU distances.
The experiments were conducted on four sentiment Arabic datasets: AraSarcasm, ASTD, ATT, and MOVIE.
arXiv Detail & Related papers (2022-12-28T16:38:43Z)
- The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
- CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages.
We present the first fact-checking framework augmented with crosslingual retrieval.
We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT).
arXiv Detail & Related papers (2022-09-05T17:36:14Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- Machine Translation Pre-training for Data-to-Text Generation -- A Case Study in Czech [5.609443065827995]
We study the effectiveness of machine translation based pre-training for data-to-text generation in non-English languages.
We find that pre-training lets us train end-to-end models with significantly improved performance.
arXiv Detail & Related papers (2020-04-05T02:47:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences arising from its use.