Recon, Answer, Verify: Agents in Search of Truth
- URL: http://arxiv.org/abs/2507.03671v1
- Date: Fri, 04 Jul 2025 15:44:28 GMT
- Title: Recon, Answer, Verify: Agents in Search of Truth
- Authors: Satyam Shukla, Himanshu Dutta, Pushpak Bhattacharyya
- Abstract summary: We present Politi Fact Only (PFO), a benchmark dataset of 2,982 political claims from politifact.com. All post-claim analysis and annotator cues have been removed manually. We propose RAV, an agentic framework with three agents: a question generator, an answer generator, and a label generator.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated fact checking with large language models (LLMs) offers a scalable alternative to manual verification. Evaluating fact checking is challenging because existing benchmark datasets often include post-claim analysis and annotator cues, which are absent in real-world scenarios where claims are fact-checked immediately after being made. This limits the realism of current evaluations. We present Politi Fact Only (PFO), a 5-class benchmark dataset of 2,982 political claims from politifact.com, where all post-claim analysis and annotator cues have been removed manually. This ensures that models are evaluated using only the information that would have been available prior to the claim's verification. Evaluating LLMs on PFO, we see an average performance drop of 22% in macro F1 compared to PFO's unfiltered version. Based on the identified challenges of existing LLM-based fact-checking systems, we propose RAV (Recon, Answer, Verify), an agentic framework with three agents: a question generator, an answer generator, and a label generator. Our pipeline iteratively generates and answers sub-questions to verify different aspects of the claim before finally generating the label. RAV generalizes across domains and label granularities, and it outperforms state-of-the-art approaches on the well-known benchmarks RAWFC (fact checking, 3-class) by 25.28% and HOVER (encyclopedia, 2-class) by 1.54%, 4.94%, and 1.78% on its 2-hop, 3-hop, and 4-hop subcategories, respectively. RAV also shows the smallest performance drop among baselines, 16.3% in macro F1, when comparing PFO with its unfiltered version.
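The abstract describes RAV's three-agent loop only at a high level. A minimal sketch of such an iterative question-answer-verify pipeline might look like the following; every function body, the prompts, and the stopping rule here are assumptions for illustration, not the authors' implementation (in the paper each agent is an LLM call):

```python
# Hypothetical sketch of a Recon-Answer-Verify style agent loop.
# The three "agents" are stand-in functions; in the paper they are
# LLM calls with retrieval.

def question_generator(claim, qa_history):
    # Return the next sub-question to investigate, or None when done.
    asked = {q for q, _ in qa_history}
    for q in (f"Who made the claim: {claim!r}?",
              f"What evidence supports: {claim!r}?"):
        if q not in asked:
            return q
    return None

def answer_generator(question):
    # Placeholder: an LLM with evidence retrieval would answer here.
    return f"stub answer to: {question}"

def label_generator(claim, qa_history):
    # Placeholder: map the accumulated Q/A evidence to a verdict label.
    return "half-true" if qa_history else "unverifiable"

def rav(claim, max_rounds=5):
    # Iteratively generate and answer sub-questions, then label.
    qa_history = []
    for _ in range(max_rounds):
        q = question_generator(claim, qa_history)
        if q is None:
            break
        qa_history.append((q, answer_generator(q)))
    return label_generator(claim, qa_history)
```

The loop structure (generate, answer, accumulate, then label once no further questions remain) is the part the abstract actually specifies; everything else is scaffolding.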
Related papers
- SpatialBench-UC: Uncertainty-Aware Evaluation of Spatial Prompt Following in Text-to-Image Generation [0.0]
SpatialBench-UC is a small, reproducible benchmark for pairwise spatial relations. We release a benchmark package, versioned prompts, pinned configs, per-sample checker outputs, and report tables. We evaluate three baselines: Stable Diffusion 1.5, SD 1.5 BoxDiff, and SD 1.4 GLIGEN.
arXiv Detail & Related papers (2026-01-19T23:37:10Z) - Large Language Models Require Curated Context for Reliable Political Fact-Checking -- Even with Reasoning and Web Search [3.282845873351502]
We evaluate 15 recent large language models (LLMs) on more than 6,000 claims fact-checked by PolitiFact. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains. A curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants.
arXiv Detail & Related papers (2025-11-24T04:22:32Z) - PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora [0.0]
We introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has a higher density of distractor documents. We propose PluriHopRAG, a RAG architecture that follows a "check all documents individually, filter cheaply" approach.
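The "check all documents individually, filter cheaply" idea above can be sketched as a two-stage pass: a cheap relevance filter visits every document (keeping recall high against distractors), and only the survivors reach an expensive per-document reader. The keyword-overlap filter and stub reader below are hypothetical stand-ins for the paper's LLM-based components:

```python
# Hypothetical sketch of a recall-sensitive "filter cheaply" pass.

def cheap_filter(question, doc, threshold=0.5):
    # Cheap proxy for relevance: fraction of question terms in the doc.
    q_terms = set(question.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1) >= threshold

def answer_pluri_hop(question, corpus, reader):
    # Visit *every* document, filter cheaply, then run the expensive
    # reader only on the survivors.
    survivors = [d for d in corpus if cheap_filter(question, d)]
    return [reader(question, d) for d in survivors]

corpus = [
    "turbine output rose 12 percent in 2023",
    "annual report on turbine output and maintenance",
    "unrelated press release about staffing",
]
hits = answer_pluri_hop("turbine output", corpus, lambda q, d: d)
```

The design point is that the filter's cost is paid once per document while the reader's cost is paid only per survivor, which is what makes exhaustively checking a distractor-rich corpus affordable.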
arXiv Detail & Related papers (2025-10-16T07:22:58Z) - Fact or Facsimile? Evaluating the Factual Robustness of Modern Retrievers [34.31192184496381]
Dense retrievers and rerankers are central to retrieval-augmented generation (RAG) pipelines. We evaluate how much factual competence these components inherit or lose from the large language models (LLMs) they are based on. For every embedding model, cosine-similarity scores between queries and correct completions are significantly higher than those for incorrect ones.
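A probe in the spirit of that cosine-similarity test compares a query's similarity to a correct versus an incorrect completion. The bag-of-words "embedding" below is a toy stand-in for the dense embedding models the paper actually evaluates, and the example sentences are made up:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding": term -> count. A real probe would
    # call a dense embedding model here.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = embed("the capital of france is")
correct = embed("paris is the capital of france")
wrong = embed("berlin is the capital of germany")

# A factually competent embedder should put the query closer to the
# correct completion than to the incorrect one.
score_gap = cosine(query, correct) - cosine(query, wrong)
```

With real embedding models the gap reflects factual knowledge rather than lexical overlap, which is precisely the inheritance the paper measures.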
arXiv Detail & Related papers (2025-08-28T04:13:51Z) - RePaCA: Leveraging Reasoning Large Language Models for Static Automated Patch Correctness Assessment [0.0]
We introduce RePaCA, a novel static APCA technique that leverages Large Language Models (LLMs) specialized in thinking tasks. Our approach achieves state-of-the-art performance, with 83.1% accuracy and an 84.8% F1-score.
arXiv Detail & Related papers (2025-07-30T11:21:09Z) - DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification [49.1574468325115]
Numerical claims (statements involving quantities, comparisons, and temporal references) pose unique challenges for automated fact-checking systems. We evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and build our own evidence retrieval pipeline. Our best-performing system achieves a competitive macro-average F1 score of 0.57, placing us among the Top-4 submissions in Task 3 of CheckThat! 2025.
arXiv Detail & Related papers (2025-07-08T17:22:22Z) - Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z) - Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains [13.58151841630302]
We propose METEORA, a novel method that replaces re-ranking in RAG with a rationale-driven selection approach. We show that METEORA improves generation accuracy by 33.34% while using approximately 50% fewer chunks than state-of-the-art re-ranking methods. In adversarial settings, METEORA significantly improves the F1 score from 0.10 to 0.44.
arXiv Detail & Related papers (2025-05-21T20:57:16Z) - AIC CTU system at AVeriTeC: Re-framing automated fact-checking as a simple RAG task [0.0]
This paper describes our solution to the challenge of fact-checking with evidence retrieved in the wild, using a simple scheme of Retrieval-Augmented Generation (RAG).
We release our system and explain its two modules - the Retriever and the Evidence & Label generator - in detail, justifying features such as MMR-reranking and Likert-scale confidence estimation.
We perform an empirical error analysis and find that faults in our predictions often coincide with noise in the data or ambiguous fact-checks, motivating further research and data augmentation.
arXiv Detail & Related papers (2024-10-15T09:50:19Z) - Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z) - AttributionBench: How Hard is Automatic Attribution Evaluation? [19.872081697282002]
We present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets.
Our experiments show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation.
A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model's inability to process nuanced information.
arXiv Detail & Related papers (2024-02-23T04:23:33Z) - The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z) - Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses.
We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document.
Preliminary experiments show that FacTool, FactScore and Perplexity are struggling to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.