Multilingual Previously Fact-Checked Claim Retrieval
- URL: http://arxiv.org/abs/2305.07991v2
- Date: Fri, 13 Oct 2023 20:47:57 GMT
- Title: Multilingual Previously Fact-Checked Claim Retrieval
- Authors: Matúš Pikuliak and Ivan Srba and Robert Moro and Timo Hromadka
and Timotej Smolen and Martin Melisek and Ivan Vykopal and Jakub Simko and
Juraj Podrouzek and Maria Bielikova
- Abstract summary: This paper introduces a new multilingual dataset -- MultiClaim -- for fact-checked claim retrieval.
We collected 28k posts in 27 languages from social media and 206k fact-checks in 39 languages written by professional fact-checkers.
We evaluated how different unsupervised methods fare on this dataset and its various dimensions.
- Score: 1.4884363206251627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fact-checkers are often hampered by the sheer amount of online content that
needs to be fact-checked. NLP can help them by retrieving already existing
fact-checks relevant to the content being investigated. This paper introduces a
new multilingual dataset -- MultiClaim -- for previously fact-checked claim
retrieval. We collected 28k posts in 27 languages from social media, 206k
fact-checks in 39 languages written by professional fact-checkers, as well as
31k connections between these two groups. This is the most extensive and the
most linguistically diverse dataset of this kind to date. We evaluated how
different unsupervised methods fare on this dataset and its various dimensions.
We show that evaluating such a diverse dataset has its complexities and proper
care needs to be taken before interpreting the results. We also evaluated a
supervised fine-tuning approach, improving upon the unsupervised method
significantly.
Related papers
- Do We Need Language-Specific Fact-Checking Models? The Case of Chinese [17.55466402274949]
This paper investigates the potential benefits of language-specific fact-checking models, focusing on the case of Chinese.
We first demonstrate the limitations of translation-based methods and multilingual large language models, highlighting the need for language-specific systems.
We propose a Chinese fact-checking system that can better retrieve evidence from a document by incorporating context information.
arXiv Detail & Related papers (2024-01-27T20:26:03Z)
- Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context.
It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages.
It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z)
- Lost in Translation -- Multilingual Misinformation and its Evolution [52.07628580627591]
This paper investigates the prevalence and dynamics of multilingual misinformation through an analysis of over 250,000 unique fact-checks spanning 95 languages.
We find that while the majority of misinformation claims are only fact-checked once, 11.7%, corresponding to more than 21,000 claims, are checked multiple times.
Using fact-checks as a proxy for the spread of misinformation, we find 33% of repeated claims cross linguistic boundaries.
arXiv Detail & Related papers (2023-10-27T12:21:55Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages.
We present the first fact-checking framework augmented with crosslingual retrieval.
We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT).
arXiv Detail & Related papers (2022-09-05T17:36:14Z)
- Matching Tweets With Applicable Fact-Checks Across Languages [27.762055254009017]
We focus on automatically finding existing fact-checks for claims made in social media posts (tweets).
We conduct both classification and retrieval experiments, in monolingual (English only), multilingual (Spanish, Portuguese), and cross-lingual (Hindi-English) settings.
We present promising results for "match" classification (93% average accuracy) in four language pairs.
arXiv Detail & Related papers (2022-02-14T23:33:02Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- UPV at CheckThat! 2021: Mitigating Cultural Differences for Identifying Multilingual Check-worthy Claims [6.167830237917659]
In this paper, we propose a language identification task as an auxiliary task to mitigate unintended bias.
Our results show that joint training of language identification and check-worthy claim detection tasks can provide performance gains for some of the selected languages.
arXiv Detail & Related papers (2021-09-19T21:46:16Z)
- X-FACT: A New Benchmark Dataset for Multilingual Fact Checking [21.2633064526968]
We introduce X-FACT: the largest publicly available multilingual dataset for factual verification of naturally existing real-world claims.
The dataset contains short statements in 25 languages and is labeled for veracity by expert fact-checkers.
arXiv Detail & Related papers (2021-06-17T05:09:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.