Examining the Metrics for Document-Level Claim Extraction in Czech and Slovak
- URL: http://arxiv.org/abs/2511.14566v1
- Date: Tue, 18 Nov 2025 15:09:09 GMT
- Title: Examining the Metrics for Document-Level Claim Extraction in Czech and Slovak
- Authors: Lucia Makaiová, Martin Fajčík, Antonín Jarolím,
- Abstract summary: Document-level claim extraction remains an open challenge in the field of fact-checking. We investigate techniques to identify the best possible alignment and evaluation method between claim sets. We conduct experiments on a newly collected dataset of claims extracted from comments under Czech and Slovak news articles.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document-level claim extraction remains an open challenge in the field of fact-checking, and subsequently, methods for evaluating extracted claims have received limited attention. In this work, we explore approaches to aligning two sets of claims pertaining to the same source document and computing their similarity through an alignment score. We investigate techniques to identify the best possible alignment and evaluation method between claim sets, with the aim of providing a reliable evaluation framework. Our approach enables comparison between model-extracted and human-annotated claim sets, serving as a metric for assessing the extraction performance of models and also as a possible measure of inter-annotator agreement. We conduct experiments on a newly collected dataset of claims extracted from comments under Czech and Slovak news articles, domains that pose additional challenges due to the informal language, strong local context, and subtleties of these closely related languages. The results draw attention to the limitations of current evaluation approaches when applied to document-level claim extraction and highlight the need for more advanced methods, ones able to correctly capture semantic similarity and evaluate essential claim properties such as atomicity, checkworthiness, and decontextualization.
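As a rough illustration of the kind of metric being examined (not the method evaluated in the paper), one plausible baseline embeds the claims, matches the two sets one-to-one with the Hungarian algorithm on a cosine-similarity matrix, and normalises the matched similarity by the larger set size:

```python
# Minimal sketch of a claim-set alignment score (an assumed baseline, not the
# authors' method): one-to-one matching over embedding cosine similarity.
import numpy as np
from scipy.optimize import linear_sum_assignment


def alignment_score(emb_extracted: np.ndarray, emb_reference: np.ndarray) -> float:
    """emb_extracted: (n, d) claim embeddings; emb_reference: (m, d) claim embeddings."""
    a = emb_extracted / np.linalg.norm(emb_extracted, axis=1, keepdims=True)
    b = emb_reference / np.linalg.norm(emb_reference, axis=1, keepdims=True)
    sim = a @ b.T                                   # pairwise cosine similarities, (n, m)
    rows, cols = linear_sum_assignment(-sim)        # Hungarian algorithm, maximising similarity
    matched = sim[rows, cols].sum()
    return float(matched / max(sim.shape))          # unmatched claims lower the score
```

Any sentence-embedding model could supply the vectors; normalising by the larger set size is one of several reasonable choices and penalises both missing and spurious claims.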
Related papers
- FactSim: Fact-Checking for Opinion Summarization [0.0]
We explore the need for more comprehensive and precise evaluation techniques for generative artificial intelligence (GenAI) in text summarization tasks. Traditional methods, which leverage automated metrics to compare machine-generated summaries from a collection of opinion pieces, have shown limitations due to the paradigm shift introduced by large language models (LLMs). This paper proposes a novel, fully automated methodology for assessing the factual consistency of such summaries.
arXiv Detail & Related papers (2026-02-09T14:21:19Z)
- T-Retrievability: A Topic-Focused Approach to Measure Fair Document Exposure in Information Retrieval [22.953432572278597]
We propose a topic-focused localised retrievability measure, which first computes retrievability scores over multiple groups of topically-related documents. Our analysis uncovers new insights into the exposure characteristics of various neural ranking models.
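A hedged sketch of the retrievability idea under a simple rank cutoff (the paper's exact credit function and aggregation over topic groups may differ):

```python
# Rank-cutoff retrievability per document, then averaged within topical groups.
# The cutoff-based credit and the mean aggregation are assumptions.
from collections import defaultdict


def retrievability(rankings, cutoff=100):
    """rankings: one ranked list of document ids per query; returns doc id -> score."""
    r = defaultdict(int)
    for ranked_docs in rankings:
        for doc_id in ranked_docs[:cutoff]:
            r[doc_id] += 1                      # credit 1 whenever the document is retrievable
    return r


def topic_localised_scores(rankings, topic_groups, cutoff=100):
    """topic_groups: topic -> collection of document ids belonging to that topic."""
    r = retrievability(rankings, cutoff)
    return {t: sum(r[d] for d in docs) / len(docs) for t, docs in topic_groups.items() if docs}
```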
arXiv Detail & Related papers (2025-08-29T15:14:16Z)
- SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection. Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains. We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
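A minimal sketch of such a semantic F1; the judge callback stands in for the LLM evaluation agent and is an assumed interface, not the paper's implementation:

```python
# Semantic F1 over event types: a judge decides matches by meaning, not exact strings.
def semantic_f1(predicted, gold, judge):
    """judge(pred_type, gold_type) -> bool is assumed to be an LLM-backed semantic match."""
    if not predicted or not gold:
        return 0.0
    tp_pred = sum(1 for p in predicted if any(judge(p, g) for g in gold))
    tp_gold = sum(1 for g in gold if any(judge(p, g) for p in predicted))
    precision = tp_pred / len(predicted)
    recall = tp_gold / len(gold)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```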
arXiv Detail & Related papers (2025-03-05T09:37:05Z)
- Towards Effective Extraction and Evaluation of Factual Claims [1.8262547855491458]
A common strategy for fact-checking long-form content generated by Large Language Models (LLMs) is extracting simple claims that can be verified independently. We propose a framework for evaluating claim extraction in the context of fact-checking along with automated, scalable, and replicable methods for applying this framework. We also introduce Claimify, an LLM-based claim extraction method, and demonstrate that it outperforms existing methods under our evaluation framework.
arXiv Detail & Related papers (2025-02-15T16:58:05Z)
- Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models [65.8478860180793]
Event extraction has gained extensive research attention due to its broad range of applications. The current evaluation method for event extraction relies on token-level exact match. We propose a reliable and semantic evaluation framework for event extraction, named RAEE.
arXiv Detail & Related papers (2024-10-12T07:54:01Z)
- Document-level Claim Extraction and Decontextualisation for Fact-Checking [11.994189446360433]
We propose a method for document-level claim extraction for fact-checking.
We first recast claim extraction as extractive summarization in order to identify central sentences from documents.
We then rewrite them to include necessary context from the originating document through sentence decontextualisation.
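Read as a pipeline, the approach could be approximated as below; the centrality scorer and the rewrite step (in practice an LLM or a dedicated decontextualisation model) are placeholders, not the authors' components:

```python
# Toy two-step pipeline: select central sentences, then decontextualise them.
import numpy as np


def extract_claims(sentences, sent_embs, rewrite, top_k=3):
    """sent_embs: (n, d) sentence embeddings; rewrite(sentence, document) is an assumed
    decontextualisation step that resolves pronouns and adds missing context."""
    embs = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    centrality = (embs @ embs.T).mean(axis=1)       # extractive-summarisation-style scoring
    central_idx = np.argsort(-centrality)[:top_k]
    document = " ".join(sentences)
    return [rewrite(sentences[i], document) for i in sorted(central_idx)]
```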
arXiv Detail & Related papers (2024-06-05T13:16:46Z)
- Rethinking Evaluation Metrics of Open-Vocabulary Segmentaion [78.76867266561537]
The evaluation process still heavily relies on closed-set metrics without considering the similarity between predicted and ground truth categories.
To tackle this issue, we first survey eleven similarity measurements between two categorical words.
We designed novel evaluation metrics, namely Open mIoU, Open AP, and Open PQ, tailored for three open-vocabulary segmentation tasks.
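An illustrative soft-IoU sketch in this spirit, where an assumed word-similarity function grants partial credit when the predicted category is close to the ground-truth one (the paper surveys eleven such measurements and defines its metrics more carefully):

```python
# Soft IoU for one ground-truth class: near-miss predictions earn partial credit.
import numpy as np


def soft_iou(pred_labels, gt_labels, gt_class, sim):
    """pred_labels/gt_labels: integer class maps of equal shape;
    sim(a, b) in [0, 1] is an assumed similarity between category names."""
    gt_mask = gt_labels == gt_class
    pred_mask = pred_labels == gt_class
    inter = np.logical_and(pred_mask, gt_mask).sum(dtype=float)
    union = np.logical_or(pred_mask, gt_mask).sum(dtype=float)
    # Ground-truth pixels predicted as a different but similar class get partial credit.
    miss = gt_mask & ~pred_mask
    inter += sum(sim(int(c), gt_class) for c in pred_labels[miss])
    return inter / union if union else 1.0
```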
arXiv Detail & Related papers (2023-11-06T18:59:01Z)
- Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
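For reference, the two ERASER metrics in question have simple definitions; the sketch below stubs the model interface and uses a naive token-removal helper, so it is illustrative rather than the benchmark's implementation:

```python
# comprehensiveness = p(y | x) - p(y | x with the rationale removed)
# sufficiency       = p(y | x) - p(y | rationale alone)
def _without(x: str, rationale: str) -> str:
    drop = set(rationale.split())
    return " ".join(t for t in x.split() if t not in drop)   # naive token-level removal


def comprehensiveness(prob, x, rationale, y):
    """prob(text, label) -> probability is an assumed model interface."""
    return prob(x, y) - prob(_without(x, rationale), y)


def sufficiency(prob, x, rationale, y):
    return prob(x, y) - prob(rationale, y)
```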
arXiv Detail & Related papers (2023-08-28T03:03:03Z)
- SAIS: Supervising and Augmenting Intermediate Steps for Document-Level Relation Extraction [51.27558374091491]
We propose to explicitly teach the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps (SAIS) for relation extraction.
Based on a broad spectrum of carefully designed tasks, our proposed SAIS method not only extracts relations of better quality due to more effective supervision, but also retrieves the corresponding supporting evidence more accurately.
arXiv Detail & Related papers (2021-09-24T17:37:35Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
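A rough sketch of the two components (the authors' exact weighting and combination may differ):

```python
# Centrality-weighted relevance and self-referenced redundancy from sentence embeddings.
import numpy as np


def centrality_weighted_relevance(sum_emb, doc_emb):
    """sum_emb: (s, d) summary sentence embeddings; doc_emb: (n, d) document sentence embeddings."""
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    s = sum_emb / np.linalg.norm(sum_emb, axis=1, keepdims=True)
    centrality = (d @ d.T).mean(axis=1)         # how central each document sentence is
    coverage = (s @ d.T).max(axis=0)            # best summary match for each document sentence
    return float((centrality * coverage).sum() / centrality.sum())


def self_referenced_redundancy(sum_emb):
    s = sum_emb / np.linalg.norm(sum_emb, axis=1, keepdims=True)
    sim = s @ s.T
    np.fill_diagonal(sim, 0.0)
    return float(sim.max(axis=1).mean())        # high value = summary repeats itself

# A combined score could then be, e.g., relevance minus redundancy.
```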
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
- SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study of a weakly supervised summary evaluation approach without the presence of reference summaries.
Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries.
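A toy version of the negative-sampling idea, with an illustrative corruption scheme (the paper explores several ways of corrupting references):

```python
# Build (document, summary, label) training pairs by corrupting reference summaries.
import random


def make_training_pairs(document, reference, other_reference, corrupt_ratio=0.5):
    """other_reference: a summary of an unrelated document, used as a source of distractor sentences."""
    sents = reference.split(". ")
    distractors = other_reference.split(". ")
    n_corrupt = max(1, int(len(sents) * corrupt_ratio))
    for i in random.sample(range(len(sents)), k=min(n_corrupt, len(sents))):
        sents[i] = random.choice(distractors)
    negative = ". ".join(sents)
    # The scorer is trained to prefer the intact reference over the corrupted one.
    return (document, reference, 1.0), (document, negative, 0.0)
```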
arXiv Detail & Related papers (2020-05-13T15:40:13Z)