Claim Matching Beyond English to Scale Global Fact-Checking
- URL: http://arxiv.org/abs/2106.00853v1
- Date: Tue, 1 Jun 2021 23:28:05 GMT
- Title: Claim Matching Beyond English to Scale Global Fact-Checking
- Authors: Ashkan Kazemi, Kiran Garimella, Devin Gaffney and Scott A. Hale
- Abstract summary: We construct a novel dataset of WhatsApp tipline and public group messages alongside fact-checked claims.
Our dataset contains content in high-resource (English, Hindi) and lower-resource (Bengali, Malayalam, Tamil) languages.
We train our own embedding model using knowledge distillation and a high-quality "teacher" model in order to address the imbalance in embedding quality between the low- and high-resource languages.
- Score: 5.836354423653351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Manual fact-checking does not scale well to serve the needs of the internet.
This issue is further compounded in non-English contexts. In this paper, we
discuss claim matching as a possible solution to scale fact-checking. We define
claim matching as the task of identifying pairs of textual messages containing
claims that can be served with one fact-check. We construct a novel dataset of
WhatsApp tipline and public group messages alongside fact-checked claims that
are first annotated for containing "claim-like statements" and then matched
with potentially similar items and annotated for claim matching. Our dataset
contains content in high-resource (English, Hindi) and lower-resource (Bengali,
Malayalam, Tamil) languages. We train our own embedding model using knowledge
distillation and a high-quality "teacher" model in order to address the
imbalance in embedding quality between the low- and high-resource languages in
our dataset. We provide evaluations on the performance of our solution and
compare with baselines and existing state-of-the-art multilingual embedding
models, namely LASER and LaBSE. We demonstrate that our performance exceeds
LASER and LaBSE in all settings. We release our annotated datasets, codebooks,
and trained embedding model to allow for further research.
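The pipeline the abstract describes (embed incoming messages and fact-checked claims, match by similarity, and train the multilingual embedding model by distilling a high-quality teacher) can be sketched minimally as follows. The function names, the 0.8 threshold, and the MSE objective are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_claims(message_vec, claim_vecs, threshold=0.8):
    """Indices of fact-checked claims whose embeddings are close enough
    to the incoming message to plausibly share one fact-check."""
    return [i for i, v in enumerate(claim_vecs)
            if cosine_sim(message_vec, v) >= threshold]

def distillation_loss(student_vecs, teacher_vecs):
    """MSE pushing the multilingual student's embeddings toward the
    teacher's embeddings of parallel text (a standard distillation objective)."""
    return float(np.mean((np.asarray(student_vecs) - np.asarray(teacher_vecs)) ** 2))
```

In this kind of setup, a lower distillation loss on parallel sentences tends to pull low-resource-language embeddings into the same space as the high-resource teacher, which is what makes cross-language claim matching by cosine similarity workable.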
Related papers
- FarFetched: Entity-centric Reasoning and Claim Validation for the Greek Language based on Textually Represented Environments [0.3874856507026475]
We address the need for automated claim validation based on the aggregated evidence derived from multiple online news sources.
We introduce an entity-centric reasoning framework in which latent connections between events, actions, or statements are revealed.
Our approach tries to fill the gap in automated claim validation for less-resourced languages.
arXiv Detail & Related papers (2024-07-13T13:30:20Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost transfer learning (TL) method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
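The ensemble-based framework mentioned above can be illustrated with a minimal score-averaging sketch; the averaging scheme and the `ensemble_predict` name are assumptions for illustration, not the paper's actual combination method:

```python
def ensemble_predict(model_scores):
    """Average candidate scores from several transferred models and
    pick the argmax candidate (a minimal ensembling scheme).

    model_scores: list of {candidate: score} dicts, one per model.
    """
    n = len(model_scores)
    candidates = model_scores[0].keys()
    avg = {c: sum(m[c] for m in model_scores) / n for c in candidates}
    return max(avg, key=avg.get)
```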
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Query Expansion Using Contextual Clue Sampling with Language Models [69.51976926838232]
We propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context.
Our lexical-matching-based approach achieves similar top-5/top-20 retrieval accuracy and higher top-100 accuracy compared with the well-established dense retrieval model DPR.
For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
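The fusion idea above (combining retrieved documents weighted by the generation probability of each sampled context) can be sketched as a probability-weighted score sum. The exact fusion rule and the `fuse_retrievals` name are illustrative assumptions, not the paper's formulation:

```python
def fuse_retrievals(runs, context_probs):
    """Probability-weighted fusion of per-context retrieval runs.

    runs: list of {doc_id: retrieval_score} dicts, one per sampled context.
    context_probs: generation probability of each context under the LM.
    Returns doc ids ranked by fused score, best first.
    """
    fused = {}
    for run, prob in zip(runs, context_probs):
        for doc_id, score in run.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + prob * score
    return sorted(fused, key=fused.get, reverse=True)
```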
arXiv Detail & Related papers (2022-10-13T15:18:04Z)
- Matching Tweets With Applicable Fact-Checks Across Languages [27.762055254009017]
We focus on automatically finding existing fact-checks for claims made in social media posts (tweets).
We conduct both classification and retrieval experiments, in monolingual (English only), multilingual (Spanish, Portuguese), and cross-lingual (Hindi-English) settings.
We present promising results for "match" classification (93% average accuracy) in four language pairs.
arXiv Detail & Related papers (2022-02-14T23:33:02Z)
- conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and Job Seekers [2.208694022993555]
We describe our task, in which noisy data from parsed resumes, the heterogeneous nature of the data sources, and cross- and multilinguality present domain-specific challenges.
We address these challenges by fine-tuning a Siamese Sentence-BERT (SBERT) model, which we call conSultantBERT, using a large-scale, real-world, high-quality dataset of over 270,000 resume-vacancy pairs labeled by our staffing consultants.
We show that our fine-tuned model significantly outperforms unsupervised and supervised baselines that rely on TF-IDF-weighted feature vectors and BERT embeddings.
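The Siamese setup above (one shared encoder applied to both sides of a pair, scored by cosine similarity) can be sketched with a toy mean-pooling encoder standing in for SBERT; the encoder, vocabulary table, and function names here are assumptions for illustration:

```python
import numpy as np

def encode(text, vocab_vecs, dim=4):
    """Toy shared encoder: mean-pool per-word vectors (stand-in for SBERT)."""
    vecs = [vocab_vecs[w] for w in text.lower().split() if w in vocab_vecs]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def siamese_score(resume, vacancy, vocab_vecs):
    """Cosine similarity between the two branches of the shared encoder."""
    a, b = encode(resume, vocab_vecs), encode(vacancy, vocab_vecs)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0
```

The key design point of a Siamese architecture is weight sharing: both texts pass through the same encoder, so fine-tuning on labeled pairs shapes a single embedding space in which matching pairs score high.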
arXiv Detail & Related papers (2021-09-14T07:57:05Z)
- Mixed Attention Transformer for Leveraging Word-Level Knowledge to Neural Cross-Lingual Information Retrieval [15.902630454568811]
We propose a novel Mixed Attention Transformer (MAT) that incorporates external word level knowledge, such as a dictionary or translation table.
By encoding the translation knowledge into an attention matrix, the model with MAT is able to focus on the mutually translated words in the input sequence.
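One simple way to realize "encoding translation knowledge into an attention matrix" is an additive bias that rewards positions where a source token and a target token are dictionary translations. This is a minimal sketch of that idea, not MAT's actual mechanism; the bias value and function name are assumptions:

```python
import numpy as np

def translation_attention_bias(src_tokens, tgt_tokens, dictionary, bias=1.0):
    """Bias matrix for attention logits: positions where a source token and
    a target token form a dictionary translation pair get an additive bonus."""
    m = np.zeros((len(src_tokens), len(tgt_tokens)))
    for i, s in enumerate(src_tokens):
        for j, t in enumerate(tgt_tokens):
            if (s, t) in dictionary:
                m[i, j] = bias
    return m
```

Added to the pre-softmax attention scores, such a matrix nudges the model to attend to mutually translated words in the input sequence.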
arXiv Detail & Related papers (2021-09-07T00:33:14Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Cross-lingual Information Retrieval with BERT [8.052497255948046]
We explore the use of the popular bidirectional language model, BERT, to model and learn the relevance between English queries and foreign-language documents.
A deep relevance matching model based on BERT is introduced and trained by finetuning a pretrained multilingual BERT model with weak supervision.
Experimental results of the retrieval of Lithuanian documents against short English queries show that our model is effective and outperforms the competitive baseline approaches.
arXiv Detail & Related papers (2020-04-24T23:32:13Z) - ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine
Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.