XFEVER: Exploring Fact Verification across Languages
- URL: http://arxiv.org/abs/2310.16278v1
- Date: Wed, 25 Oct 2023 01:20:17 GMT
- Title: XFEVER: Exploring Fact Verification across Languages
- Authors: Yi-Chen Chang, Canasai Kruengkrai, Junichi Yamagishi
- Abstract summary: This paper introduces the Cross-lingual Fact Extraction and VERification dataset designed for benchmarking the fact verification models across different languages.
We constructed it by translating the claim and evidence texts of the Fact Extraction and VERification dataset into six languages.
The training and development sets were translated using machine translation, whereas the test set includes texts translated by professional translators and machine-translated texts.
- Score: 40.1637899493061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces the Cross-lingual Fact Extraction and VERification
(XFEVER) dataset designed for benchmarking the fact verification models across
different languages. We constructed it by translating the claim and evidence
texts of the Fact Extraction and VERification (FEVER) dataset into six
languages. The training and development sets were translated using machine
translation, whereas the test set includes texts translated by professional
translators and machine-translated texts. Using the XFEVER dataset, two
cross-lingual fact verification scenarios, zero-shot learning and
translate-train learning, are defined, and baseline models for each scenario
are also proposed in this paper. Experimental results show that the
multilingual language model can be used to build fact verification models in
different languages efficiently. However, the performance varies by language
and is somewhat inferior to the English case. We also found that we can
effectively mitigate model miscalibration by considering the prediction
similarity between the English and target languages. The XFEVER dataset, code,
and model checkpoints are available at
https://github.com/nii-yamagishilab/xfever.
Related papers
- A Comparative Study of Translation Bias and Accuracy in Multilingual Large Language Models for Cross-Language Claim Verification [1.566834021297545]
This study systematically evaluates translation bias and the effectiveness of Large Language Models for cross-lingual claim verification.
We investigate two distinct translation methods: pre-translation and self-translation.
Our findings reveal that low-resource languages exhibit significantly lower accuracy in direct inference due to underrepresentation.
arXiv Detail & Related papers (2024-10-14T09:02:42Z) - Do We Need Language-Specific Fact-Checking Models? The Case of Chinese [15.619421104102516]
This paper investigates the potential benefits of language-specific fact-checking models, focusing on the case of Chinese.
We first demonstrate the limitations of translation-based methods and multilingual large language models, highlighting the need for language-specific systems.
We propose a Chinese fact-checking system that can better retrieve evidence from a document by incorporating context information.
arXiv Detail & Related papers (2024-01-27T20:26:03Z) - Improving Polish to English Neural Machine Translation with Transfer
Learning: Effects of Data Volume and Language Similarity [2.4674086273775035]
We investigate the impact of data volume and the use of similar languages on transfer learning in a machine translation task.
We fine-tune mBART model for a Polish-English translation task using the OPUS-100 dataset.
Our experiments show that a combination of related languages and larger amounts of data outperforms the model trained on related languages or larger amounts of data alone.
arXiv Detail & Related papers (2023-06-01T13:34:21Z) - Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (MS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z) - XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for
Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z) - CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual
Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages.
We present the first fact-checking framework augmented with crosslingual retrieval.
We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT)
arXiv Detail & Related papers (2022-09-05T17:36:14Z) - Translate & Fill: Improving Zero-Shot Multilingual Semantic Parsing with
Synthetic Data [2.225882303328135]
We propose a novel Translate-and-Fill (TaF) method to produce silver training data for a multilingual semantic parsing task.
Experimental results on three multilingual semantic parsing datasets show that data augmentation with TaF reaches accuracies competitive with similar systems.
arXiv Detail & Related papers (2021-09-09T14:51:11Z) - X-FACT: A New Benchmark Dataset for Multilingual Fact Checking [21.2633064526968]
We introduce X-FACT: the largest publicly available multilingual dataset for factual verification of naturally existing real-world claims.
The dataset contains short statements in 25 languages and is labeled for veracity by expert fact-checkers.
arXiv Detail & Related papers (2021-06-17T05:09:54Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.