CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking
- URL: http://arxiv.org/abs/2206.11863v1
- Date: Mon, 6 Jun 2022 09:11:03 GMT
- Title: CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking
- Authors: Xuming Hu, Zhijiang Guo, Guanyu Wu, Aiwei Liu, Lijie Wen, Philip S. Yu
- Abstract summary: CHEF is the first CHinese Evidence-based Fact-checking dataset of 10K real-world claims.
The dataset covers multiple domains, ranging from politics to public health, and provides annotated evidence retrieved from the Internet.
- Score: 55.75590135151682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The explosion of misinformation spreading through the media ecosystem creates an urgent need for automated fact-checking. While misinformation spans both geographic and linguistic boundaries, most work in the field has focused on English; datasets and tools available in other languages, such as Chinese, are limited. To bridge this gap, we construct CHEF, the first CHinese Evidence-based Fact-checking dataset, comprising 10K real-world claims. The dataset covers multiple domains, ranging from politics to public health, and provides annotated evidence retrieved from the Internet. Further, we establish baselines and develop a novel approach that models evidence retrieval as a latent variable, allowing joint training with the veracity prediction model in an end-to-end fashion. Extensive experiments show that CHEF provides a challenging testbed for developing fact-checking systems that retrieve and reason over non-English claims.
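The abstract's central technical idea, treating evidence retrieval as a latent variable so the retriever and the veracity classifier train jointly, can be illustrated with a minimal sketch. The encoders, the bilinear scorer, and the soft marginalization below are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentEvidenceFactChecker(nn.Module):
    """Sketch: evidence selection as a latent variable z over candidate
    sentences; the veracity prediction marginalizes over z, so the
    retrieval scorer receives gradients from the classification loss."""

    def __init__(self, emb_dim=300, hidden=256, num_labels=3):
        super().__init__()
        self.claim_enc = nn.LSTM(emb_dim, hidden, batch_first=True)  # stand-in encoders
        self.sent_enc = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.score = nn.Bilinear(hidden, hidden, 1)        # retrieval score s(claim, e_i)
        self.classify = nn.Linear(2 * hidden, num_labels)  # veracity head

    def forward(self, claim_emb, cand_embs):
        # claim_emb: (B, Tc, E); cand_embs: (B, N, Ts, E) candidate evidence sentences
        B, N, Ts, E = cand_embs.shape
        _, (c, _) = self.claim_enc(claim_emb)
        c = c.squeeze(0)                                   # (B, H) claim vector
        _, (e, _) = self.sent_enc(cand_embs.reshape(B * N, Ts, E))
        e = e.squeeze(0).reshape(B, N, -1)                 # (B, N, H) sentence vectors

        c_exp = c.unsqueeze(1).expand_as(e).contiguous()   # (B, N, H)
        p_z = F.softmax(self.score(c_exp, e).squeeze(-1), dim=-1)    # p(z=i | claim)

        pair = torch.cat([c_exp, e], dim=-1)               # (B, N, 2H)
        p_y_given_z = F.softmax(self.classify(pair), dim=-1)         # p(y | claim, e_i)
        return (p_z.unsqueeze(-1) * p_y_given_z).sum(dim=1)          # p(y | claim)
```

Training minimizes the negative log-likelihood of the gold veracity label under the marginal `p(y | claim)`, which back-propagates through the evidence distribution and trains the retriever without direct evidence supervision.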
Related papers
- RU22Fact: Optimizing Evidence for Multilingual Explainable Fact-Checking on Russia-Ukraine Conflict [34.2739191920746]
High-quality evidence plays a vital role in enhancing fact-checking systems.
We propose a method based on a Large Language Model to automatically retrieve and summarize evidence from the Web.
We construct RU22Fact, a novel explainable fact-checking dataset of 16K samples on the 2022 Russia-Ukraine conflict.
arXiv Detail & Related papers (2024-03-25T11:56:29Z)
- Do We Need Language-Specific Fact-Checking Models? The Case of Chinese [15.619421104102516]
This paper investigates the potential benefits of language-specific fact-checking models, focusing on the case of Chinese.
We first demonstrate the limitations of translation-based methods and multilingual large language models, highlighting the need for language-specific systems.
We propose a Chinese fact-checking system that can better retrieve evidence from a document by incorporating context information.
arXiv Detail & Related papers (2024-01-27T20:26:03Z)
- XFEVER: Exploring Fact Verification across Languages [40.1637899493061]
This paper introduces the Cross-lingual Fact Extraction and VERification (XFEVER) dataset, designed for benchmarking fact verification models across different languages.
We constructed it by translating the claim and evidence texts of the Fact Extraction and VERification dataset into six languages.
The training and development sets were translated using machine translation, whereas the test set includes texts translated by professional translators and machine-translated texts.
arXiv Detail & Related papers (2023-10-25T01:20:17Z)
- FactLLaMA: Optimizing Instruction-Following Language Models with External Knowledge for Automated Fact-Checking [10.046323978189847]
We propose combining the power of instruction-following language models with external evidence retrieval to enhance fact-checking performance.
Our approach involves leveraging search engines to retrieve relevant evidence for a given input claim.
Then, we instruction-tune an open-source language model, LLaMA, using this evidence, enabling it to predict the veracity of the input claim more accurately.
arXiv Detail & Related papers (2023-09-01T04:14:39Z)
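The FactLLaMA recipe above is a retrieve-then-verify pipeline: fetch evidence snippets with a search engine, fold them into an instruction prompt, and fine-tune the model to emit a verdict. A minimal sketch of such training data follows; the template, label set, and example are illustrative assumptions, not the paper's actual format:

```python
# Sketch of evidence-augmented instruction data for fine-tuning; the prompt
# template and labels below are assumptions, not FactLLaMA's actual format.
from dataclasses import dataclass

LABELS = ["SUPPORTED", "REFUTED", "NOT ENOUGH INFO"]  # assumed label set

@dataclass
class Example:
    claim: str
    evidence: list[str]  # snippets returned by a search engine
    label: str

def build_prompt(ex: Example) -> str:
    """Fold retrieved snippets into an instruction-following prompt."""
    evidence_block = "\n".join(f"[{i+1}] {s}" for i, s in enumerate(ex.evidence))
    return (
        "Instruction: Given the evidence, decide whether the claim is "
        f"{', '.join(LABELS)}.\n\n"
        f"Evidence:\n{evidence_block}\n\n"
        f"Claim: {ex.claim}\n\n"
        "Answer:"
    )

ex = Example(
    claim="Drinking hot water cures influenza.",
    evidence=["WHO guidance lists no cure for influenza other than antivirals."],
    label="REFUTED",
)
# (prompt, completion) pairs like this would feed a standard supervised
# fine-tuning loop over the base LLaMA model.
print(build_prompt(ex), ex.label)
```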
- CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
arXiv Detail & Related papers (2023-06-28T14:14:44Z)
- Give Me More Details: Improving Fact-Checking with Latent Retrieval [58.706972228039604]
Evidence plays a crucial role in automated fact-checking.
Existing fact-checking systems either assume the evidence sentences are given or use the search snippets returned by the search engine.
We propose to incorporate full text from source documents as evidence and introduce two enriched datasets.
arXiv Detail & Related papers (2023-05-25T15:01:19Z)
- WiCE: Real-World Entailment for Claims in Wikipedia [63.234352061821625]
We propose WiCE, a new fine-grained textual entailment dataset built on natural claim and evidence pairs extracted from Wikipedia.
In addition to standard claim-level entailment, WiCE provides entailment judgments over sub-sentence units of the claim.
We show that real claims in our dataset involve challenging verification and retrieval problems that existing models fail to address.
arXiv Detail & Related papers (2023-03-02T17:45:32Z)
- CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval [73.48591773882052]
Most fact-checking approaches focus only on English due to data scarcity in other languages.
We present the first fact-checking framework augmented with cross-lingual retrieval.
We train the retriever with our proposed Cross-lingual Inverse Cloze Task (XICT).
arXiv Detail & Related papers (2022-09-05T17:36:14Z)
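An Inverse Cloze Task trains a retriever by removing a sentence from a passage and treating it as a pseudo-query for the remaining passage; a cross-lingual variant would pair the query and passage across languages. Below is a minimal in-batch contrastive objective under that assumption; it is not necessarily CONCRETE's exact recipe:

```python
import torch
import torch.nn.functional as F

def xict_loss(query_vecs: torch.Tensor, ctx_vecs: torch.Tensor, tau: float = 0.05):
    """In-batch contrastive loss for an inverse-cloze-style retriever.

    query_vecs: (B, H) encodings of sentences removed from their passages
                (optionally machine-translated into another language to
                make the task cross-lingual -- an assumption here).
    ctx_vecs:   (B, H) encodings of the remaining passages; row i is the
                positive for query i, all other rows are in-batch negatives.
    """
    q = F.normalize(query_vecs, dim=-1)
    c = F.normalize(ctx_vecs, dim=-1)
    logits = q @ c.t() / tau            # (B, B) similarity matrix
    targets = torch.arange(q.size(0))   # positive pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random encodings standing in for a multilingual encoder:
loss = xict_loss(torch.randn(8, 128, requires_grad=True),
                 torch.randn(8, 128, requires_grad=True))
loss.backward()
```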
- CsFEVER and CTKFacts: Czech Datasets for Fact Verification [0.0]
We present two Czech datasets aimed at training automated fact-checking machine learning models.
The first, CsFEVER, contains approximately 112k claims and is an automatically generated Czech version of the well-known Wikipedia-based FEVER dataset.
The second, CTKFacts, contains 3,097 claims and is built on a corpus of approximately two million Czech News Agency reports.
arXiv Detail & Related papers (2022-01-26T18:48:42Z)
- FacTeR-Check: Semi-automated fact-checking through Semantic Similarity and Natural Language Inference [61.068947982746224]
FacTeR-Check enables retrieving fact-checked information, verifying unchecked claims, and tracking dangerous information over social media.
The architecture is validated using NLI19-SP, a newly released public dataset of COVID-19-related hoaxes and tweets from Spanish social media.
Our results show state-of-the-art performance on the individual benchmarks, as well as useful analyses of how 61 different hoaxes evolved over time.
arXiv Detail & Related papers (2021-10-27T15:44:54Z)
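The two stages named in the FacTeR-Check title, semantic similarity to retrieve a previously fact-checked claim and natural language inference to judge a new claim against it, can be sketched as below. The model names and the similarity threshold are illustrative assumptions, not the paper's configuration:

```python
# Sketch of a similarity-then-NLI verification pipeline; model choices and
# the threshold are assumptions, not FacTeR-Check's actual configuration.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
nli = pipeline("text-classification",
               model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")

# A store of already fact-checked statements (the paper targets Spanish
# social media; English examples are used here for readability).
fact_checked = [
    "COVID-19 vaccines do not contain microchips.",
    "Drinking hot water does not cure COVID-19.",
]
fact_vecs = embedder.encode(fact_checked, convert_to_tensor=True)

def check(claim: str, threshold: float = 0.6) -> str:
    """Retrieve the closest fact-checked statement, then run NLI against it."""
    claim_vec = embedder.encode(claim, convert_to_tensor=True)
    scores = util.cos_sim(claim_vec, fact_vecs)[0]
    best = int(scores.argmax())
    if float(scores[best]) < threshold:
        return "unmatched"  # no sufficiently similar fact-check found
    # Premise = verified statement, hypothesis = incoming claim.
    result = nli([{"text": fact_checked[best], "text_pair": claim}])[0]
    return result["label"]  # entailment / neutral / contradiction

print(check("Vaccines carry a microchip that tracks you."))
```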