Finding Already Debunked Narratives via Multistage Retrieval: Enabling
Cross-Lingual, Cross-Dataset and Zero-Shot Learning
- URL: http://arxiv.org/abs/2308.05680v1
- Date: Thu, 10 Aug 2023 16:33:17 GMT
- Title: Finding Already Debunked Narratives via Multistage Retrieval: Enabling
Cross-Lingual, Cross-Dataset and Zero-Shot Learning
- Authors: Iknoor Singh, Carolina Scarton, Xingyi Song, Kalina Bontcheva
- Abstract summary: This paper creates a novel dataset to enable research on cross-lingual retrieval of debunked narratives.
It presents an experiment to benchmark fine-tuned and off-the-shelf multilingual pre-trained Transformer models for this task.
It also proposes a novel multistage framework that divides this cross-lingual debunk retrieval task into refinement and re-ranking stages.
- Score: 6.094795148759833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of retrieving already debunked narratives aims to detect stories
that have already been fact-checked. The successful detection of claims that
have already been debunked not only reduces the manual efforts of professional
fact-checkers but can also contribute to slowing the spread of misinformation.
Mainly due to the lack of readily available data, this is an understudied
problem, particularly when considering the cross-lingual task, i.e. the
retrieval of fact-checking articles in a language different from the language
of the online post being checked. This paper fills this gap by (i) creating a
novel dataset to enable research on cross-lingual retrieval of already debunked
narratives, using tweets as queries to a database of fact-checking articles;
(ii) presenting an extensive experiment to benchmark fine-tuned and
off-the-shelf multilingual pre-trained Transformer models for this task; and
(iii) proposing a novel multistage framework that divides this cross-lingual
debunk retrieval task into refinement and re-ranking stages. Results show that
the task of cross-lingual retrieval of already debunked narratives is
challenging and off-the-shelf Transformer models fail to outperform a strong
lexical-based baseline (BM25). Nevertheless, our multistage retrieval framework
is robust, outperforming BM25 in most scenarios and enabling cross-domain and
zero-shot learning, without significantly harming the model's performance.
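
Note: the abstract describes a multistage pipeline that first refines a candidate pool and then re-ranks it. Below is a minimal, hedged sketch of such a two-stage setup, assuming BM25 (via the rank_bm25 package) for the refinement stage and a multilingual cross-encoder (via sentence-transformers) for re-ranking; the example fact-checks, tweet, and model name are illustrative placeholders and not necessarily the exact components used in the paper.

```python
# Sketch of a two-stage (refine + re-rank) debunk-retrieval pipeline.
# Assumptions: BM25 produces the candidate pool; a multilingual cross-encoder
# re-ranks (tweet, fact-check) pairs. Model name is a placeholder.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

fact_checks = [
    "COVID-19 vaccines do not alter human DNA, experts confirm.",
    "The 5G network does not spread viruses; this claim has been debunked.",
    "No, drinking bleach does not cure any infection.",
]

# Stage 1 (refinement): lexical BM25 narrows the fact-check database to top-k candidates.
tokenized_corpus = [doc.lower().split() for doc in fact_checks]
bm25 = BM25Okapi(tokenized_corpus)

def refine(query: str, k: int = 2) -> list[int]:
    scores = bm25.get_scores(query.lower().split())
    return sorted(range(len(fact_checks)), key=lambda i: scores[i], reverse=True)[:k]

# Stage 2 (re-ranking): a multilingual cross-encoder scores each (query, candidate) pair.
reranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

def rerank(query: str, candidate_ids: list[int]) -> list[tuple[int, float]]:
    pairs = [(query, fact_checks[i]) for i in candidate_ids]
    scores = reranker.predict(pairs)
    return sorted(zip(candidate_ids, scores), key=lambda x: x[1], reverse=True)

tweet = "So vaccines really change your DNA?!"
print(rerank(tweet, refine(tweet)))
```

In the cross-lingual setting studied by the paper, the query tweet and the fact-checking articles may be in different languages, so a purely lexical refinement stage could be swapped for a multilingual dense retriever; this sketch only illustrates the refine-then-re-rank structure.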
Related papers
- Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: A Tale of Two Approaches [5.850200023135349]
We examine strategies to improve the multilingual and crosslingual performance.
We evaluate approaches on a dataset containing posts and claims in 47 languages.
Most importantly, we show that crosslinguality is a setup with its own unique characteristics compared to the multilingual setup.
arXiv Detail & Related papers (2025-05-28T08:47:10Z) - Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs).
We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models.
Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z) - Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From [61.63091726904068]
We evaluate the cross-lingual context retrieval ability of over 40 large language models (LLMs) across 12 languages.
Several small, post-trained open LLMs show strong cross-lingual context retrieval ability.
Our results also indicate that larger-scale pretraining cannot improve the xMRC performance.
arXiv Detail & Related papers (2025-04-15T06:35:27Z) - Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples [38.18495961129682]
This paper introduces a novel cross-lingual search task that does not require a large semantic corpus.
It focuses on the ability of a model to cross-lingually rank the true parallel sentence higher than challenging distractors generated by a large language model.
We present a case study of the introduced cross-lingual semantic discrimination (CLSD) task for the German-French language pair in the news domain.
arXiv Detail & Related papers (2025-02-12T18:54:37Z) - mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models.
We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance.
We see strong cross-lingual performance with English-based retrievers trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z) - Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness [30.00463676754559]
We introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages.
Our experiments reveal that retrieving multilingual documents best improves response consistency and decreases geopolitical bias over using purely in-language documents.
Our further experiments and case studies investigate how cross-lingual RAG is affected by aspects from IR to document contents.
arXiv Detail & Related papers (2024-10-02T01:59:07Z) - Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z) - Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z) - Cross-lingual Editing in Multilingual Language Models [1.3062731746155414]
This paper introduces the cross-lingual model editing (XME) paradigm, wherein a fact is edited in one language, and the subsequent update propagation is observed across other languages.
The results reveal notable performance limitations of state-of-the-art METs under the XME setting, mainly when the languages involved belong to two distinct script families.
arXiv Detail & Related papers (2024-01-19T06:54:39Z) - Cross-lingual Transfer Learning for Check-worthy Claim Identification
over Twitter [7.601937548486356]
Misinformation spread over social media has become an undeniable infodemic.
We present a systematic study of six approaches for cross-lingual check-worthiness estimation across pairs of five diverse languages with the help of Multilingual BERT (mBERT) model.
Our results show that for some language pairs, zero-shot cross-lingual transfer is possible and can perform as well as monolingual models that are trained on the target language.
arXiv Detail & Related papers (2022-11-09T18:18:53Z) - CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual
Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages.
We present the first fact-checking framework augmented with crosslingual retrieval.
We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT).
arXiv Detail & Related papers (2022-09-05T17:36:14Z) - Matching Tweets With Applicable Fact-Checks Across Languages [27.762055254009017]
We focus on automatically finding existing fact-checks for claims made in social media posts (tweets).
We conduct both classification and retrieval experiments, in monolingual (English only), multilingual (Spanish, Portuguese), and cross-lingual (Hindi-English) settings.
We present promising results for "match" classification (93% average accuracy) in four language pairs.
arXiv Detail & Related papers (2022-02-14T23:33:02Z) - One Question Answering Model for Many Languages with Cross-lingual Dense
Passage Retrieval [39.061900747689094]
CORA is a Cross-lingual Open-Retrieval Answer Generation model.
It can answer questions across many languages even when language-specific annotated data or knowledge sources are unavailable.
arXiv Detail & Related papers (2021-07-26T06:02:54Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - A Study of Cross-Lingual Ability and Language-specific Information in
Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.