NewsScope: Schema-Grounded Cross-Domain News Claim Extraction with Open Models
- URL: http://arxiv.org/abs/2601.08852v1
- Date: Fri, 26 Dec 2025 19:17:21 GMT
- Title: NewsScope: Schema-Grounded Cross-Domain News Claim Extraction with Open Models
- Authors: Nidhi Pandya
- Abstract summary: NewsScope is a cross-domain dataset, benchmark, and fine-tuned model for schema-grounded news claim extraction. The dataset contains 455 articles across politics, health, science/environment, and business. LLaMA 3.1 8B was fine-tuned using LoRA on 315 training examples and evaluated on held-out in-domain (80 articles) and out-of-source (60 articles) test sets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated news verification requires structured claim extraction, but existing approaches either lack schema compliance or generalize poorly across domains. This paper presents NewsScope, a cross-domain dataset, benchmark, and fine-tuned model for schema-grounded news claim extraction. The dataset contains 455 articles across politics, health, science/environment, and business, consisting of 395 in-domain articles and 60 out-of-source articles for generalization testing. LLaMA 3.1 8B was fine-tuned using LoRA on 315 training examples and evaluated on held-out in-domain (80 articles) and out-of-source (60 articles) test sets. Human evaluation on 400 claims shows NewsScope achieves 89.4% human-evaluated accuracy compared to GPT-4o-mini's 93.7% (p=0.07). NewsScope outperforms GPT-4o-mini on political claims (94.3% vs. 87.8%). A numeric grounding filter further improves accuracy to 91.6%, narrowing the gap to 2.1 percentage points. Inter-annotator agreement studies (160 claims) confirm labeling reliability (94.6% positive agreement on SUPPORTED judgments). The open-weight model enables offline deployment at approximately $15 on-demand compute (or $0 on free tiers). Code and benchmark are publicly released.
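The abstract credits a numeric grounding filter with lifting accuracy from 89.4% to 91.6%, but does not spell out how it works. A minimal sketch of one plausible implementation, assuming the filter simply drops extracted claims containing a number that cannot be found in the source article (the function name, regex, and example data below are illustrative, not taken from the paper):

```python
import re

# Matches integers, thousands-separated numbers, and decimals (e.g. 455, 1,234, 4.5)
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def extract_numbers(text: str) -> set[str]:
    """Collect numeric tokens, normalizing thousands separators so
    '1,234' in a claim matches '1234' in the article text."""
    return {n.replace(",", "") for n in NUMBER_RE.findall(text)}

def numeric_grounding_filter(claims: list[str], article_text: str) -> list[str]:
    """Keep only claims whose numeric tokens all appear in the article.

    Claims with no numbers pass through unchanged; a claim containing a
    number absent from the article is treated as ungrounded and dropped.
    """
    article_numbers = extract_numbers(article_text)
    return [c for c in claims if extract_numbers(c) <= article_numbers]

article = "The city council approved a $4.5 million budget for 455 new housing units."
claims = [
    "The council approved a 4.5 million dollar budget.",  # grounded
    "The budget covers 455 housing units.",               # grounded
    "The plan allocates 900 units for seniors.",          # 900 not in article: dropped
]
print(numeric_grounding_filter(claims, article))
```

Such a filter trades recall for precision: it cannot rescue a hallucinated number that happens to appear elsewhere in the article, but it cheaply removes the most common failure mode of fabricated statistics.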
Related papers
- Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis [0.0]
Adversarial comments produce small, statistically non-significant effects on detection accuracy. Complex adversarial strategies offer no advantage over simple manipulative comments. Comment stripping reduces detection for weaker models by removing helpful context.
arXiv Detail & Related papers (2026-02-18T00:34:17Z)
- FormationEval, an open multiple-choice benchmark for petroleum geoscience [0.0]
FormationEval is an open multiple-choice question benchmark for evaluating language models on petroleum geoscience disciplines. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open-weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist.
arXiv Detail & Related papers (2026-01-05T14:36:02Z)
- A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media [11.463924147467297]
We develop a domain-adapted extraction pipeline for structured information extraction from police incident announcements. We use a high-quality, manually annotated dataset of 4,933 instances derived from 27,822 police briefing posts on Chinese Weibo. We show that LoRA-based fine-tuning significantly improved performance over both the base and instruction-tuned models.
arXiv Detail & Related papers (2025-12-18T05:08:26Z)
- Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations [49.671779378073886]
We study question answering in the domain of radio regulations. We propose a telecom-specific Retrieval-Augmented Generation (RAG) pipeline. Our approach consistently improves generation accuracy across all tested models.
arXiv Detail & Related papers (2025-09-11T17:43:42Z)
- An Auditable Pipeline for Fuzzy Full-Text Screening in Systematic Reviews: Integrating Contrastive Semantic Highlighting and LLM Judgment [0.0]
Full-text screening is the major bottleneck of systematic reviews. We present a scalable, auditable pipeline that reframes inclusion/exclusion as a fuzzy decision problem.
arXiv Detail & Related papers (2025-08-17T17:41:50Z)
- Recon, Answer, Verify: Agents in Search of Truth [36.56689822791777]
We present Politi Fact Only (PFO), a benchmark dataset of 2,982 political claims from politifact.com. All post-claim analysis and annotator cues have been removed manually. We propose RAV, an agentic framework with three agents: question generator, answer generator, and label generator.
arXiv Detail & Related papers (2025-07-04T15:44:28Z)
- Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight the reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
- Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability [0.0]
Large Language Models (LLMs) have shown significant advances in text generation but often lack the reliability needed for autonomous deployment.
We introduce a novel framework that repurposes ensemble methods for content validation through model consensus.
In tests across 78 complex cases requiring factual accuracy and causal consistency, our framework improved precision from 73.1% to 93.9%.
arXiv Detail & Related papers (2024-11-10T17:32:16Z)
- Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios [49.53589774730807]
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. We reveal a response uncertainty phenomenon: twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue.
arXiv Detail & Related papers (2024-11-05T01:11:28Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- Less is More: Fewer Interpretable Region via Submodular Subset Selection [54.07758302264416]
This paper re-models the above image attribution problem as a submodular subset selection problem.
We construct a novel submodular function to discover more accurate small interpretation regions.
For correctly predicted samples, the proposed method improves the Deletion and Insertion scores with an average of 4.9% and 2.5% gain relative to HSIC-Attribution.
arXiv Detail & Related papers (2024-02-14T13:30:02Z)
- Box-Level Active Detection [47.41635810670186]
We introduce a box-level active detection framework that controls a box-based budget per cycle.
We propose Complementary Pseudo Active Strategy (ComPAS) to exploit both human annotations and the model intelligence.
ComPAS consistently outperforms 10 competitors across 4 settings under a unified evaluation.
arXiv Detail & Related papers (2023-03-23T08:06:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.