NewsScope: Schema-Grounded Cross-Domain News Claim Extraction with Open Models
- URL: http://arxiv.org/abs/2601.08852v1
- Date: Fri, 26 Dec 2025 19:17:21 GMT
- Title: NewsScope: Schema-Grounded Cross-Domain News Claim Extraction with Open Models
- Authors: Nidhi Pandya
- Abstract summary: NewsScope is a cross-domain dataset, benchmark, and fine-tuned model for schema-grounded news claim extraction. The dataset contains 455 articles across politics, health, science/environment, and business. LLaMA 3.1 8B was fine-tuned using LoRA on 315 training examples and evaluated on held-out in-domain (80 articles) and out-of-source (60 articles) test sets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated news verification requires structured claim extraction, but existing approaches either lack schema compliance or generalize poorly across domains. This paper presents NewsScope, a cross-domain dataset, benchmark, and fine-tuned model for schema-grounded news claim extraction. The dataset contains 455 articles across politics, health, science/environment, and business, consisting of 395 in-domain articles and 60 out-of-source articles for generalization testing. LLaMA 3.1 8B was fine-tuned using LoRA on 315 training examples and evaluated on held-out in-domain (80 articles) and out-of-source (60 articles) test sets. Human evaluation on 400 claims shows NewsScope achieves 89.4% human-evaluated accuracy compared to GPT-4o-mini's 93.7% (p=0.07). NewsScope outperforms GPT-4o-mini on political claims (94.3% vs. 87.8%). A numeric grounding filter further improves accuracy to 91.6%, narrowing the gap to 2.1 percentage points. Inter-annotator agreement studies (160 claims) confirm labeling reliability (94.6% positive agreement on SUPPORTED judgments). The open-weight model enables offline deployment at approximately $15 on-demand compute (or $0 on free tiers). Code and benchmark are publicly released.
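The abstract credits a numeric grounding filter with lifting accuracy from 89.4% to 91.6%, but does not spell out how it works. A minimal sketch of one plausible implementation, assuming the filter simply drops extracted claims containing a number that cannot be found in the source article (the function name, regex, and example data below are illustrative, not taken from the paper):

```python
import re

# Matches integers, thousands-separated numbers, and decimals (e.g. 455, 1,234, 4.5)
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def extract_numbers(text: str) -> set[str]:
    """Collect numeric tokens, normalizing thousands separators so
    '1,234' in a claim matches '1234' in the article text."""
    return {n.replace(",", "") for n in NUMBER_RE.findall(text)}

def numeric_grounding_filter(claims: list[str], article_text: str) -> list[str]:
    """Keep only claims whose numeric tokens all appear in the article.

    Claims with no numbers pass through unchanged; a claim containing a
    number absent from the article is treated as ungrounded and dropped.
    """
    article_numbers = extract_numbers(article_text)
    return [c for c in claims if extract_numbers(c) <= article_numbers]

article = "The city council approved a $4.5 million budget for 455 new housing units."
claims = [
    "The council approved a 4.5 million dollar budget.",  # grounded
    "The budget covers 455 housing units.",               # grounded
    "The plan allocates 900 units for seniors.",          # 900 not in article: dropped
]
print(numeric_grounding_filter(claims, article))
```

Such a filter trades recall for precision: it cannot rescue a hallucinated number that happens to appear elsewhere in the article, but it cheaply removes the most common failure mode of fabricated statistics.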
Related papers
- Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis [0.0]
Adversarial comments produce small, statistically non-significant effects on detection accuracy. Complex adversarial strategies offer no advantage over simple manipulative comments. Comment stripping reduces detection for weaker models by removing helpful context.
arXiv Detail & Related papers (2026-02-18T00:34:17Z)
- FormationEval, an open multiple-choice benchmark for petroleum geoscience [0.0]
FormationEval is an open multiple-choice question benchmark for evaluating language models on petroleum geoscience disciplines. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open-weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist.
arXiv Detail & Related papers (2026-01-05T14:36:02Z)
- A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media [11.463924147467297]
We develop a domain-adapted extraction pipeline for structured information extraction from police incident announcements. We use a high-quality, manually annotated dataset of 4,933 instances derived from 27,822 police briefing posts on Chinese Weibo. We show that LoRA-based fine-tuning significantly improved performance over both the base and instruction-tuned models.
arXiv Detail & Related papers (2025-12-18T05:08:26Z)
- Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations [49.671779378073886]
We study question answering in the domain of radio regulations. We propose a telecom-specific Retrieval-Augmented Generation (RAG) pipeline. Our approach consistently improves generation accuracy across all tested models.
arXiv Detail & Related papers (2025-09-11T17:43:42Z)
- An Auditable Pipeline for Fuzzy Full-Text Screening in Systematic Reviews: Integrating Contrastive Semantic Highlighting and LLM Judgment [0.0]
Full-text screening is the major bottleneck of systematic reviews. We present a scalable, auditable pipeline that reframes inclusion/exclusion as a fuzzy decision problem.
arXiv Detail & Related papers (2025-08-17T17:41:50Z)
- Recon, Answer, Verify: Agents in Search of Truth [36.56689822791777]
We present Politi Fact Only (PFO), a benchmark dataset of 2,982 political claims from politifact.com. All post-claim analysis and annotator cues have been removed manually. We propose RAV, an agentic framework with three agents: question generator, answer generator, and label generator.
arXiv Detail & Related papers (2025-07-04T15:44:28Z)
- Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight the reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
- Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability [0.0]
Large Language Models (LLMs) have shown significant advances in text generation but often lack the reliability needed for autonomous deployment.
We introduce a novel framework that repurposes ensemble methods for content validation through model consensus.
In tests across 78 complex cases requiring factual accuracy and causal consistency, our framework improved precision from 73.1% to 93.9%.
arXiv Detail & Related papers (2024-11-10T17:32:16Z)
- Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios [49.53589774730807]
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. We reveal a response uncertainty phenomenon: twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue.
arXiv Detail & Related papers (2024-11-05T01:11:28Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- Less is More: Fewer Interpretable Region via Submodular Subset Selection [54.07758302264416]
This paper re-models the above image attribution problem as a submodular subset selection problem.
We construct a novel submodular function to discover more accurate small interpretation regions.
For correctly predicted samples, the proposed method improves the Deletion and Insertion scores with an average of 4.9% and 2.5% gain relative to HSIC-Attribution.
arXiv Detail & Related papers (2024-02-14T13:30:02Z)
- Box-Level Active Detection [47.41635810670186]
We introduce a box-level active detection framework that controls a box-based budget per cycle.
We propose Complementary Pseudo Active Strategy (ComPAS) to exploit both human annotations and the model intelligence.
ComPAS consistently outperforms 10 competitors across 4 settings under a unified evaluation.
arXiv Detail & Related papers (2023-03-23T08:06:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.