DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification
- URL: http://arxiv.org/abs/2507.06195v1
- Date: Tue, 08 Jul 2025 17:22:22 GMT
- Title: DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification
- Authors: Maximilian Heil, Aleksandar Pramov
- Abstract summary: Numerical claims, statements involving quantities, comparisons, and temporal references, pose unique challenges for automated fact-checking systems. We evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and our own evidence retrieval pipeline. Our best-performing system achieves a competitive macro-average F1 score of 0.57 and places us among the Top-4 submissions in Task 3 of CheckThat! 2025.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerical claims, statements involving quantities, comparisons, and temporal references, pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of retrieving more evidence with longer input context windows using ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings in arithmetic reasoning tasks, R2L tokenization does not boost natural language inference (NLI) on numerical claims. A longer context window does not enhance veracity performance either, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves a competitive macro-average F1 score of 0.57 and places us among the Top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at https://github.com/dsgt-arc/checkthat-2025-numerical.
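As a concrete illustration of factor (2), below is a minimal sketch of one common right-to-left number tokenization scheme from the arithmetic reasoning literature: digit runs are regrouped into three-digit chunks from the right before subword tokenization, so token boundaries align with place value. The group size of 3 and the space separator are assumptions for illustration; the abstract does not spell out the paper's exact R2L variant.

```python
import re

def r2l_digit_groups(text: str, group: int = 3) -> str:
    """Regroup every digit run into right-aligned chunks of `group` digits,
    so a left-to-right subword tokenizer splits numbers the way R2L
    tokenization intends (e.g. '1234567' -> '1 234 567')."""
    def regroup(match: re.Match) -> str:
        digits = match.group(0)
        chunks = []
        # Walk the digit string from the right, taking `group` digits at a time.
        for end in range(len(digits), 0, -group):
            chunks.append(digits[max(end - group, 0):end])
        return " ".join(reversed(chunks))
    return re.sub(r"\d+", regroup, text)

print(r2l_digit_groups("The deficit grew from 98765 to 1234567."))
# -> "The deficit grew from 98 765 to 1 234 567."
```

Applied as a preprocessing step, this makes the low-order chunk of every number consistent regardless of its length, which is the alignment property R2L schemes aim for.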
Related papers
- Which Data Attributes Stimulate Math and Code Reasoning? An Investigation via Influence Functions [8.540135660509058]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities in math and coding. We leverage influence functions to attribute LLMs' reasoning ability on math and coding to individual training examples, sequences, and tokens. High-difficulty math examples improve both math and code reasoning, while low-difficulty code tasks most effectively benefit code reasoning.
arXiv Detail & Related papers (2025-05-26T13:15:26Z)
- Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains [13.58151841630302]
We propose METEORA, a novel method that replaces re-ranking in RAG with a rationale-driven selection approach. METEORA improves generation accuracy by 33.34% while using approximately 50% fewer chunks than state-of-the-art re-ranking methods. In adversarial settings, it significantly improves the F1 score from 0.10 to 0.44.
arXiv Detail & Related papers (2025-05-21T20:57:16Z)
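The summary does not detail METEORA's actual algorithm, so the following is only a schematic sketch of the selection-instead-of-re-ranking idea: each retrieved chunk is kept or dropped based on an elicited rationale, so the number of chunks forwarded to generation adapts to the evidence rather than a fixed top-k. The `rationale_verdict` heuristic below is a hypothetical stand-in for an LLM call, kept as a toy word-overlap rule so the sketch runs.

```python
def rationale_verdict(query: str, chunk: str) -> tuple[str, bool]:
    """Hypothetical stand-in for an LLM that returns a rationale and a
    keep/drop decision; a toy word-overlap heuristic keeps this runnable."""
    overlap = set(query.lower().split()) & set(chunk.lower().split())
    keep = len(overlap) >= 2
    rationale = f"shares terms {sorted(overlap)} with the query"
    return rationale, keep

def select_chunks(query: str, chunks: list[str]) -> list[tuple[str, str]]:
    # Keep every chunk whose rationale supports it -- no scores, no top-k cutoff.
    kept = []
    for chunk in chunks:
        rationale, keep = rationale_verdict(query, chunk)
        if keep:
            kept.append((chunk, rationale))  # the rationale travels with the chunk
    return kept

chunks = [
    "Inflation fell to 3.2% in October according to the report.",
    "The committee met on a Tuesday.",
]
print(select_chunks("What was inflation in October?", chunks))
```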
- Improving the fact-checking performance of language models by relying on their entailment ability [2.4588375162098877]
We propose a simple yet effective strategy that relies on the entailment ability of language models to improve fact-checking performance. We have shared our code repository to reproduce the results.
arXiv Detail & Related papers (2025-05-21T03:15:06Z)
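A minimal sketch of the entailment idea, assuming a generic off-the-shelf NLI checkpoint (roberta-large-mnli is only an example; the paper's exact models and prompts may differ): the retrieved evidence is the premise, the claim is the hypothesis, and the entailment/contradiction/neutral probabilities feed the veracity decision.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # example NLI checkpoint, not the paper's choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_probs(evidence: str, claim: str) -> dict[str, float]:
    # Encode (premise=evidence, hypothesis=claim) as a sentence pair.
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze()
    return {model.config.id2label[i]: p.item() for i, p in enumerate(probs)}

print(entailment_probs(
    "Unemployment fell from 6.2% to 5.4% over the year.",
    "Unemployment dropped by 0.8 percentage points.",
))
```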
- START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long chain-of-thought (CoT) reasoning LLM. START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z)
- Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z)
- Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data [89.2410799619405]
We introduce the Quantitative Reasoning with Data benchmark to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data.
The benchmark comprises a dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers.
To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText.
arXiv Detail & Related papers (2024-02-27T16:15:03Z)
- Can We Verify Step by Step for Incorrect Answer Detection? [22.984011562264147]
We introduce R2PE, a benchmark designed specifically to explore the relationship between reasoning chains and performance in various reasoning tasks. This benchmark aims to measure the falsehood of the final output of LLMs based on the reasoning steps. We propose the process discernibility score (PDS) framework, which beats the answer-checking baseline by a large margin.
arXiv Detail & Related papers (2024-02-16T09:29:50Z)
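The summary does not give the PDS formula, so the following is only a toy agreement score in the spirit of process-level answer checking, not the paper's definition: sample several reasoning chains, extract each chain's final answer, and treat low agreement as a signal that the output is likely wrong.

```python
from collections import Counter

def agreement_score(final_answers: list[str]) -> float:
    """Fraction of sampled reasoning chains agreeing with the majority answer."""
    majority_answer, freq = Counter(final_answers).most_common(1)[0]
    return freq / len(final_answers)

# Final answers extracted from four independently sampled chains.
answers = ["42", "42", "41", "42"]
score = agreement_score(answers)
# The 0.75 threshold below is purely hypothetical.
print(score, "likely incorrect" if score < 0.75 else "likely correct")
```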
- Chain of Evidences and Evidence to Generate: Prompting for Context Grounded and Retrieval Augmented Reasoning [3.117335706912261]
Chain of Evidences (CoE) and Evidence to Generate (E2G) are built upon two unique strategies. Instead of relying on unverified reasoning claims, our innovative approaches leverage the power of "evidence for decision making." Our framework consistently achieves remarkable results across various knowledge-intensive reasoning and generation tasks.
arXiv Detail & Related papers (2024-01-11T09:49:15Z)
- LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society.
In this work, we reformulate such tasks as modular neurosymbolic programming, which we call LINC.
We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z)
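A minimal sketch of the modular neurosymbolic recipe, with the LLM translation step stubbed out by hand-written first-order logic strings (in LINC these would come from the model's parse) and NLTK's pure-Python resolution prover standing in for the paper's prover:

```python
# pip install nltk
from nltk.inference import ResolutionProver
from nltk.sem import Expression

read = Expression.fromstring

# In the neurosymbolic pipeline, an LLM would emit these FOL formulas
# from natural-language premises; they are hand-written here.
premises = [
    read("all x.(man(x) -> mortal(x))"),
    read("man(socrates)"),
]
conclusion = read("mortal(socrates)")

# The symbolic prover, not free-form generation, decides entailment.
print(ResolutionProver().prove(conclusion, premises))  # True
```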
- PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
- Current Limitations of Language Models: What You Need is Retrieval [0.0]
We classify and re-examine some of the current approaches to improve the performance-computes trade-off of language models.
We argue that approach (5) would resolve many of these limitations, and that it can (a) reduce the amount of supervision and (b) efficiently extend the context over the entire training dataset and the entire past of the current sample.
arXiv Detail & Related papers (2020-09-15T04:04:20Z)