CsFEVER and CTKFacts: Czech Datasets for Fact Verification
- URL: http://arxiv.org/abs/2201.11115v1
- Date: Wed, 26 Jan 2022 18:48:42 GMT
- Title: CsFEVER and CTKFacts: Czech Datasets for Fact Verification
- Authors: Jan Drchal, Herbert Ullrich, Martin Rýpar, Hana Vincourová,
Václav Moravec
- Abstract summary: We present two Czech datasets aimed at training automated fact-checking machine learning models.
The first dataset is CsFEVER of approximately 112k claims which is an automatically generated Czech version of the well-known Wikipedia-based FEVER dataset.
The second dataset CTKFacts of 3,097 claims is built on the corpus of approximately two million Czech News Agency news reports.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we present two Czech datasets aimed at training
automated fact-checking machine learning models. Specifically, we deal with
the task of assessing the veracity of a textual claim with respect to a
(presumably) verified corpus. The system outputs the claim classification
SUPPORTS or REFUTES, complemented with evidence documents, or NEI (Not Enough
Info) alone. First, we publish CsFEVER, a dataset of approximately 112k
claims, which is an automatically generated Czech version of the well-known
Wikipedia-based FEVER dataset. We took a hybrid approach of machine
translation and language alignment, and the same method (and the tools we
provide) can easily be applied to other languages. The second dataset,
CTKFacts, of 3,097 claims, is built on a corpus of approximately two million
Czech News Agency news reports. We present an extended methodology based on
the FEVER approach. Most notably, we describe a method to automatically
generate wider claim contexts (dictionaries) for non-hyperlinked corpora. The
datasets are analyzed for spurious cues, i.e., annotation patterns leading to
model overfitting. CTKFacts is further examined for inter-annotator
agreement, and a typology of common annotator errors is extracted. Finally,
we provide baseline models for all stages of the fact-checking pipeline.
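The three-way output described in the abstract (SUPPORTS or REFUTES together with evidence documents, or NEI alone) maps naturally onto a small record type. The sketch below is a minimal illustration under assumed field names (`claim`, `label`, `evidence`); it is not the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import List

# Verdict labels used by FEVER-style datasets such as CsFEVER and CTKFacts.
LABELS = {"SUPPORTS", "REFUTES", "NEI"}

@dataclass
class ClaimRecord:
    """One annotated claim; field names are illustrative, not the released schema."""
    claim: str
    label: str                                         # SUPPORTS / REFUTES / NEI
    evidence: List[str] = field(default_factory=list)  # evidence document IDs; empty for NEI

    def __post_init__(self):
        # Reject unknown labels and enforce that NEI carries no evidence,
        # mirroring the "NEI (Not Enough Info) alone" convention.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")
        if self.label == "NEI" and self.evidence:
            raise ValueError("NEI claims carry no evidence documents")

# A SUPPORTS verdict must point at the documents that verify the claim:
record = ClaimRecord(
    claim="Praha je hlavní město České republiky.",  # "Prague is the capital of the Czech Republic."
    label="SUPPORTS",
    evidence=["wiki:Praha"],  # hypothetical document ID
)
```

Keeping the label and its evidence in one record makes the consistency check (no evidence for NEI, evidence required otherwise) a single validation step at load time.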
Related papers
- Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language [0.0]
This article presents a pipeline for automated fact-checking leveraging publicly available Language Models and data.
The pipeline consists of two main modules -- the evidence retrieval and the claim veracity evaluation.
We provide open access to all data and fine-tuned models for Czech, English, Polish, and Slovak pipelines.
arXiv Detail & Related papers (2023-12-15T19:43:41Z)
- Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors [80.22825549235556]
Existing approaches cannot simultaneously consider error position and type.
We build an FG-TED model to predict both addition and omission errors.
Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z)
- A Machine Learning Approach to Classifying Construction Cost Documents into the International Construction Measurement Standard [0.0]
We introduce the first automated models for classifying the natural language descriptions provided in cost documents called "Bills of Quantities".
We learn from a dataset of more than 50 thousand descriptions of items retrieved from 24 large infrastructure construction projects across the United Kingdom.
arXiv Detail & Related papers (2022-10-24T11:35:53Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only, due to data scarcity in other languages.
We present the first fact-checking framework augmented with cross-lingual retrieval.
We train the retriever with our proposed Cross-lingual Inverse Cloze Task (XICT).
arXiv Detail & Related papers (2022-09-05T17:36:14Z)
- CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking [55.75590135151682]
CHEF is the first CHinese Evidence-based Fact-checking dataset of 10K real-world claims.
The dataset covers multiple domains, ranging from politics to public health, and provides annotated evidence retrieved from the Internet.
arXiv Detail & Related papers (2022-06-06T09:11:03Z)
- Leveraging Advantages of Interactive and Non-Interactive Models for Vector-Based Cross-Lingual Information Retrieval [12.514666775853598]
We propose a novel framework to leverage the advantages of interactive and non-interactive models.
We introduce a semi-interactive mechanism, which builds our model on a non-interactive architecture but encodes each document together with its associated multilingual queries.
Our methods significantly boost the retrieval accuracy while maintaining the computational efficiency.
arXiv Detail & Related papers (2021-11-03T03:03:19Z)
- Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer [2.8273701718153563]
This work focuses on building reading comprehension systems for Czech, without requiring any manually annotated Czech training data.
We automatically translated the SQuAD 1.1 and SQuAD 2.0 datasets to Czech to create training and development data.
We then trained and evaluated several BERT and XLM-RoBERTa baseline models.
arXiv Detail & Related papers (2020-07-03T13:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.