On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search
- URL: http://arxiv.org/abs/2509.25494v1
- Date: Mon, 29 Sep 2025 20:50:40 GMT
- Title: On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search
- Authors: Nick Hagar, Nicholas Diakopoulos, Jeremy Gilbert
- Abstract summary: Large language models (LLMs) with retrieval-augmented generation (RAG) capabilities promise to accelerate the process of document discovery. We present a journalist-centered approach to search that prioritizes transparency and editorial control through a five-stage pipeline. We evaluate three quantized models (Gemma 3 12B, Qwen 3 14B, and GPT-OSS 20B) on two corpora and find substantial variation in reliability.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Investigative journalists routinely confront large document collections. Large language models (LLMs) with retrieval-augmented generation (RAG) capabilities promise to accelerate the process of document discovery, but newsroom adoption remains limited due to hallucination risks, verification burden, and data privacy concerns. We present a journalist-centered approach to LLM-powered document search that prioritizes transparency and editorial control through a five-stage pipeline -- corpus summarization, search planning, parallel thread execution, quality evaluation, and synthesis -- using small, locally-deployable language models that preserve data security and maintain complete auditability through explicit citation chains. Evaluating three quantized models (Gemma 3 12B, Qwen 3 14B, and GPT-OSS 20B) on two corpora, we find substantial variation in reliability. All models achieved high citation validity and ran effectively on standard desktop hardware (e.g., 24 GB of memory), demonstrating feasibility for resource-constrained newsrooms. However, systematic challenges emerged, including error propagation through multi-stage synthesis and dramatic performance variation based on training data overlap with corpus content. These findings suggest that effective newsroom AI deployment requires careful model selection and system design, alongside human oversight for maintaining standards of accuracy and accountability.
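The five-stage pipeline described in the abstract can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: every function, class, and field name below is a hypothetical stand-in, and each stage is reduced to a placeholder that shows only the data flow and the explicit citation chain the paper emphasizes.

```python
# Hypothetical sketch of the five-stage document-search pipeline:
# summarize -> plan -> execute threads -> evaluate -> synthesize.
# All names are illustrative assumptions, not the paper's actual API.
from dataclasses import dataclass


@dataclass
class Finding:
    claim: str
    citations: list  # explicit chain of (doc_id, passage) pairs


def summarize_corpus(docs):
    # Stage 1: condense the corpus so the model can plan against it.
    return {doc_id: text[:80] for doc_id, text in docs.items()}


def plan_searches(question, summary):
    # Stage 2: derive independent search threads from the question.
    return [f"{question} :: {doc_id}" for doc_id in summary]


def run_thread(query, docs):
    # Stage 3: execute one thread, recording (doc_id, passage) citations
    # so every claim stays auditable back to a source document.
    doc_id = query.split(" :: ")[1]
    return Finding(claim=f"evidence for '{query}'",
                   citations=[(doc_id, docs[doc_id][:40])])


def evaluate(findings):
    # Stage 4: quality gate; drop findings with no traceable citation.
    return [f for f in findings if f.citations]


def synthesize(findings):
    # Stage 5: merge surviving findings, preserving the citation chain.
    return {"answer": " | ".join(f.claim for f in findings),
            "citations": [c for f in findings for c in f.citations]}


def pipeline(question, docs):
    summary = summarize_corpus(docs)
    threads = plan_searches(question, summary)
    findings = [run_thread(q, docs) for q in threads]
    return synthesize(evaluate(findings))
```

In the paper's actual system, the per-stage placeholders would be calls to a locally deployed quantized model; the point of the sketch is that citations are threaded through every stage rather than reconstructed after the fact.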
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source robustness.
arXiv Detail & Related papers (2026-03-05T18:42:51Z) - ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction [14.012874564599272]
ZoFia is a novel two-stage zero-shot fake news detection framework. First, we introduce Hierarchical Salience to quantify the importance of entities in the news content. We then propose the SC-MMR algorithm to effectively select an informative and diverse set of keywords.
arXiv Detail & Related papers (2025-11-03T03:29:42Z) - Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation [72.34977512403643]
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. Existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). We propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for Universal Retrieval-Augmented Generation scenarios.
arXiv Detail & Related papers (2025-10-20T09:56:43Z) - Towards Repository-Level Program Verification with Large Language Models [8.05666536952624]
Scaling automated formal verification to real-world projects requires resolving cross-module dependencies and global contexts. We introduce RVBench, the first verification benchmark explicitly designed for repository-level evaluation, constructed from four diverse and complex open-source Verus projects. RagedVerus is a framework that synergizes retrieval-augmented generation with context-aware prompting to automate proofs for multi-module repositories.
arXiv Detail & Related papers (2025-08-31T02:44:04Z) - NEWSAGENT: Benchmarking Multimodal Agents as Journalists with Real-World Newswriting Tasks [21.577527868033343]
NEWSAGENT is a benchmark for evaluating how agents can automatically search available raw contents, select desired information, and edit and rephrase to form a news article. NEWSAGENT includes 6k human-verified examples derived from real news, with multimodal contents converted to text for broad model compatibility. We believe NEWSAGENT serves as a realistic testbed for iterating and evaluating agent capabilities in terms of multimodal web data manipulation for real-world productivity.
arXiv Detail & Related papers (2025-08-30T10:31:34Z) - Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? [16.717935491483146]
Double-Bench is a large-scale, multilingual, and multimodal evaluation system. It produces fine-grained assessments of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages.
arXiv Detail & Related papers (2025-08-05T16:55:02Z) - BiMark: Unbiased Multilayer Watermarking for Large Language Models [68.64050157343334]
We propose BiMark, a novel watermarking framework that balances text quality preservation and message embedding capacity. BiMark achieves up to 30% higher extraction rates for short texts while maintaining text quality, as indicated by lower perplexity.
arXiv Detail & Related papers (2025-06-19T11:08:59Z) - CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision [15.604947362541415]
CrEst is a weakly supervised framework for assessing the credibility of context documents during inference. Experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-06-17T18:44:21Z) - Document Attribution: Examining Citation Relationships using Large Language Models [62.46146670035751]
We propose a zero-shot approach that frames attribution as a straightforward textual entailment task. We also explore the role of the attention mechanism in enhancing the attribution process.
arXiv Detail & Related papers (2025-05-09T04:40:11Z) - FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework. Our study reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness.
arXiv Detail & Related papers (2024-09-30T06:27:53Z) - Re-Search for The Truth: Multi-round Retrieval-augmented Large Language Models are Strong Fake News Detectors [38.75533934195315]
Large Language Models (LLMs) are known for their remarkable reasoning and generative capabilities.
We introduce a novel retrieval-augmented LLM framework -- the first of its kind to automatically and strategically extract key evidence from web sources for claim verification.
Our framework ensures the acquisition of sufficient, relevant evidence, thereby enhancing performance.
arXiv Detail & Related papers (2024-03-14T00:35:39Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - You Can Generate It Again: Data-to-Text Generation with Verification and Correction Prompting [24.738004421537926]
Small language models like T5 excel in generating high-quality text for data-to-text tasks. They frequently miss keywords, which is considered one of the most severe and common errors in this task. We explore the potential of using feedback systems to enhance semantic fidelity in smaller language models for data-to-text generation tasks.
arXiv Detail & Related papers (2023-06-28T05:34:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.