Pipeline and Dataset Generation for Automated Fact-checking in Almost
Any Language
- URL: http://arxiv.org/abs/2312.10171v1
- Date: Fri, 15 Dec 2023 19:43:41 GMT
- Title: Pipeline and Dataset Generation for Automated Fact-checking in Almost
Any Language
- Authors: Jan Drchal and Herbert Ullrich and Tomáš Mlynář and
Václav Moravec
- Abstract summary: This article presents a pipeline for automated fact-checking leveraging publicly available Language Models and data.
The pipeline consists of two main modules -- the evidence retrieval and the claim veracity evaluation.
We provide open access to all data and fine-tuned models for Czech, English, Polish, and Slovak pipelines.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This article presents a pipeline for automated fact-checking leveraging
publicly available Language Models and data. The objective is to assess the
accuracy of textual claims using evidence from a ground-truth evidence corpus.
The pipeline consists of two main modules -- the evidence retrieval and the
claim veracity evaluation. Our primary focus is on the ease of deployment in
various languages that remain unexplored in the field of automated
fact-checking. Unlike most similar pipelines, which work with evidence
sentences, our pipeline processes data on a paragraph level, simplifying the
overall architecture and data requirements. Given the high cost of annotating
language-specific fact-checking training data, our solution builds on the
Question Answering for Claim Generation (QACG) method, which we adapt and use
to generate the data for all models of the pipeline. Our strategy enables the
introduction of new languages through machine translation of only two fixed
datasets of moderate size. Subsequently, any number of training samples can be
generated based on an evidence corpus in the target language. We provide open
access to all data and fine-tuned models for Czech, English, Polish, and Slovak
pipelines, as well as to our codebase that may be used to reproduce the
results. We comprehensively evaluate the pipelines for all four languages,
including human annotations and per-sample difficulty assessment using
Pointwise V-information. The presented experiments are based on full Wikipedia
snapshots to promote reproducibility. To facilitate implementation and user
interaction, we develop the FactSearch application featuring the proposed
pipeline and the preliminary feedback on its performance.
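As a rough illustration of the two-module design described in the abstract, the following Python sketch chains a paragraph-level retrieval step with a claim veracity step. It is not the authors' implementation: the token-overlap scoring, the `toy_classifier` rule, the label strings, and all function names are assumptions made for this example, whereas the pipeline in the paper relies on fine-tuned language models.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def retrieve(claim: str, corpus: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Module 1 (assumed stand-in): rank whole paragraphs by IDF-weighted
    token overlap with the claim and return the top k as (score, paragraph)."""
    docs = [set(tokenize(paragraph)) for paragraph in corpus]
    n = len(corpus)
    # Document frequency: in how many paragraphs each token occurs.
    df = Counter(token for tokens in docs for token in tokens)
    claim_tokens = set(tokenize(claim))
    scored = []
    for paragraph, tokens in zip(corpus, docs):
        score = sum(math.log(1 + n / df[t]) for t in claim_tokens if t in tokens)
        scored.append((score, paragraph))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def toy_classifier(claim: str, paragraph: str) -> str:
    """Module 2 (placeholder only): a real system would call a fine-tuned
    model here. The label names are assumptions for this sketch."""
    if set(tokenize(claim)) <= set(tokenize(paragraph)):
        return "SUPPORTS"
    return "NOT ENOUGH INFO"


def check_claim(
    claim: str,
    corpus: List[str],
    classifier: Callable[[str, str], str],
    k: int = 2,
) -> List[Tuple[str, str]]:
    """Run retrieval, then label the claim against each retrieved paragraph."""
    return [(p, classifier(claim, p)) for _, p in retrieve(claim, corpus, k)]


corpus = [
    "Prague is the capital of the Czech Republic.",
    "Warsaw is the capital of Poland.",
    "Bratislava lies on the Danube river.",
]
for paragraph, label in check_claim("Prague is the capital of the Czech Republic", corpus, toy_classifier):
    print(label, "|", paragraph)
```

Passing the classifier in as a callable keeps the two modules independent, so either stage could be swapped for a trained model without touching the other.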
Related papers
- XFEVER: Exploring Fact Verification across Languages [40.1637899493061]
This paper introduces the Cross-lingual Fact Extraction and VERification dataset designed for benchmarking the fact verification models across different languages.
We constructed it by translating the claim and evidence texts of the Fact Extraction and VERification dataset into six languages.
The training and development sets were translated using machine translation, whereas the test set includes texts translated by professional translators and machine-translated texts.
arXiv Detail & Related papers (2023-10-25T01:20:17Z)
- GECTurk: Grammatical Error Correction and Detection Dataset for Turkish [1.804922416527064]
Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners.
Synthetic data generation is a common practice to overcome the scarcity of such data.
We present a flexible and synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules.
arXiv Detail & Related papers (2023-09-20T14:25:44Z)
- A deep Natural Language Inference predictor without language-specific training data [44.26507854087991]
We present a technique of NLP to tackle the problem of inference relation (NLI) between pairs of sentences in a target language of choice without a language-specific training dataset.
We exploit a generic translation dataset, manually translated, along with two instances of the same pre-trained model.
The model has been evaluated over machine translated Stanford NLI test dataset, machine translated Multi-Genre NLI test dataset, and manually translated RTE3-ITA test dataset.
arXiv Detail & Related papers (2023-09-06T10:20:59Z)
- Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines [0.0]
This paper presents a set of industrial-grade text processing models for Hungarian.
Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit.
All experiments are reproducible and the pipelines are freely available under a permissive license.
arXiv Detail & Related papers (2023-08-24T08:19:51Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- Coalescing Global and Local Information for Procedural Text Understanding [70.10291759879887]
A complete procedural understanding solution should combine three core aspects: local and global views of the inputs, and global view of outputs.
In this paper, we propose Coalescing Global and Local Information (CGLI), a new model that builds entity and time representations.
Experiments on a popular procedural text understanding dataset show that our model achieves state-of-the-art results.
arXiv Detail & Related papers (2022-08-26T19:16:32Z)
- CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking [55.75590135151682]
CHEF is the first CHinese Evidence-based Fact-checking dataset of 10K real-world claims.
The dataset covers multiple domains, ranging from politics to public health, and provides annotated evidence retrieved from the Internet.
arXiv Detail & Related papers (2022-06-06T09:11:03Z)
- CsFEVER and CTKFacts: Czech Datasets for Fact Verification [0.0]
We present two Czech datasets aimed at training automated fact-checking machine learning models.
The first dataset is CsFEVER of approximately 112k claims which is an automatically generated Czech version of the well-known Wikipedia-based FEVER dataset.
The second dataset CTKFacts of 3,097 claims is built on the corpus of approximately two million Czech News Agency news reports.
arXiv Detail & Related papers (2022-01-26T18:48:42Z)
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)