On Finding Inconsistencies in Documents
- URL: http://arxiv.org/abs/2512.18601v1
- Date: Sun, 21 Dec 2025 05:20:21 GMT
- Title: On Finding Inconsistencies in Documents
- Authors: Charles J. Lovering, Seth Ebner, Brandon Smock, Michael Krumdick, Saad Rabbani, Ahmed Muhammad, Varshini Reddy, Chris Tanner
- Abstract summary: We introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies.
- Score: 6.773356807601893
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model's suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.
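The 64% figure above implies a recovery-rate evaluation over expert-inserted inconsistencies. A minimal sketch of how such scoring might work, assuming a character-span overlap criterion; the data format and matching protocol here are illustrative assumptions, not the paper's actual evaluation:

```python
# Hypothetical FIND-style scoring: a model "recovers" an inserted
# inconsistency if one of its flagged spans overlaps the expert's insertion.

def spans_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if character spans [a0, a1) and [b0, b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def recall(inserted: list[tuple[int, int]], flagged: list[tuple[int, int]]) -> float:
    """Fraction of inserted inconsistencies matched by at least one model flag."""
    if not inserted:
        return 0.0
    hits = sum(any(spans_overlap(gold, f) for f in flagged) for gold in inserted)
    return hits / len(inserted)

# Toy example: three gold insertions, two model flags, one overlap.
gold = [(10, 25), (100, 130), (400, 410)]
flags = [(12, 20), (350, 360)]
print(recall(gold, flags))  # 1 of 3 gold spans recovered -> 0.333...
```

Note that this only measures recall of the inserted errors; the paper's finding of legitimate pre-existing inconsistencies would require separate human judgment, as the authors did for the arXiv subset.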
Related papers
- Improved Evidence Extraction for Document Inconsistency Detection with LLMs [10.610567456326235]
We introduce new comprehensive evidence-extraction metrics and a redact-and-retry framework with constrained filtering.
We back our claims with promising experimental results.
arXiv Detail & Related papers (2026-01-06T00:58:20Z)
- DocPrism: Local Categorization and External Filtering to Identify Relevant Code-Documentation Inconsistencies [5.693844702145728]
This paper introduces DocPrism, a code-documentation inconsistency detection tool.
DocPrism uses a standard large language model (LLM) to analyze and explain inconsistencies.
In a broad evaluation across Python, TypeScript, C++, and Java, DocPrism maintains a low flag rate of 15% and achieves a precision of 0.62 without any fine-tuning.
arXiv Detail & Related papers (2025-10-31T19:22:54Z)
- DocReward: A Document Reward Model for Structuring and Stylizing [107.03974018371058]
DocReward is a document reward model that evaluates documents based on their structure and style.
It is trained using the Bradley-Terry loss to score documents, penalizing predictions that contradict the annotated ranking.
It achieves a significantly higher win rate of 60.8%, compared to GPT-5's 37.7% win rate.
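The Bradley-Terry loss mentioned above has a simple closed form: for a preferred/rejected pair, it is the negative log-probability that the preferred document wins. A minimal sketch in plain Python, with scalar scores standing in for the outputs of a real document scorer (the scoring model itself is omitted here):

```python
import math

def bradley_terry_loss(s_preferred: float, s_rejected: float) -> float:
    """Negative log-likelihood that the preferred document wins the pair:
    -log sigmoid(s_preferred - s_rejected)."""
    margin = s_preferred - s_rejected
    # -log sigmoid(m) == log(1 + exp(-m)); log1p keeps this numerically stable
    return math.log1p(math.exp(-margin))

# The loss shrinks as the preferred document's score pulls ahead of the
# rejected one, and grows when the model's scores contradict the ranking.
print(bradley_terry_loss(2.0, 0.0))  # small: annotated ranking respected
print(bradley_terry_loss(0.0, 2.0))  # large: annotated ranking contradicted
```

Minimizing this over annotated pairs pushes the scorer to reproduce the human ranking, which matches the "penalizing predictions that contradict the annotated ranking" description above.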
arXiv Detail & Related papers (2025-10-13T13:36:32Z)
- Improving Document Retrieval Coherence for Semantically Equivalent Queries [63.97649988164166]
We propose a variation of the Multi-Negative Ranking loss for training dense retrievers (DR) that improves the coherence of models in retrieving the same documents.
The loss penalizes discrepancies between the top-k documents retrieved for diverse but semantically equivalent queries.
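A common form of multiple-negatives ranking treats each query's gold document as the positive and the other in-batch documents as negatives, scored by dot product under a softmax cross-entropy. A hedged sketch of that base loss; whether this matches the paper's exact variant is an assumption:

```python
import math

def mnr_loss(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Mean cross-entropy of picking doc i for query i against in-batch negatives."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        # Dot-product score of this query against every document in the batch.
        scores = [sum(qk * dk for qk, dk in zip(q, d)) for d in doc_vecs]
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[i]  # -log softmax(score of the gold doc)
    return total / len(query_vecs)

queries = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0.9, 0.1], [0.1, 0.9]]  # doc i is aligned with query i
print(mnr_loss(queries, docs))   # low loss when gold pairs score highest
```

The coherence variant described above would additionally compare the top-k rankings produced by paraphrased queries, rather than only each query's own positive.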
arXiv Detail & Related papers (2025-08-11T13:34:59Z)
- ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge [50.93758649363798]
ImpliRet is a benchmark that shifts the reasoning challenge to document-side processing.
We evaluate a range of sparse and dense retrievers, all of which struggle in this setting.
arXiv Detail & Related papers (2025-06-17T11:08:29Z)
- Towards identifying and minimizing customer-facing documentation debt [5.318531077716712]
Lack of correct, complete, and up-to-date documentation results in an increasing number of documentation defects.
We identify the defect types that contribute to documentation defects, thereby surfacing documentation debt.
In practice, documentation debt can easily go undetected since a large share of resources and focus is dedicated to delivering high-quality software.
arXiv Detail & Related papers (2024-02-16T19:51:04Z)
- ContraDoc: Understanding Self-Contradictions in Documents with Large Language Models [7.428236410246183]
We introduce ContraDoc, the first human-annotated dataset to study self-contradictions in long documents across multiple domains.
We analyze the current capabilities of four state-of-the-art open-source and commercially available LLMs: GPT3.5, GPT4, PaLM2, and LLaMAv2 on this dataset.
While GPT4 performs the best and can outperform humans on this task, we find that it is still unreliable and struggles with self-contradictions that require more nuance and context.
arXiv Detail & Related papers (2023-11-15T18:23:17Z)
- Document-Level Relation Extraction with Sentences Importance Estimation and Focusing [52.069206266557266]
Document-level relation extraction (DocRE) aims to determine the relation between two entities from a document of multiple sentences.
We propose a Sentence Estimation and Focusing (SIEF) framework for DocRE, where we design a sentence importance score and a sentence focusing loss.
Experimental results on two domains show that our SIEF not only improves overall performance, but also makes DocRE models more robust.
arXiv Detail & Related papers (2022-04-27T03:20:07Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning [62.34197797857823]
A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds.
This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly.
Our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds.
arXiv Detail & Related papers (2020-03-23T03:22:06Z)
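The linear-scaling claim in the last entry comes from encoding each shred once and then scoring any pair with a cheap vector comparison, instead of running a network on every pair. A toy sketch of that structure with a stand-in encoder; the learned asymmetric edge embeddings of the actual method are omitted:

```python
def encode(edge_pixels: list[float]) -> list[float]:
    """Stand-in feature extractor: here, just an L2-normalized copy of the
    edge. In the real method this is a learned embedding network."""
    norm = sum(x * x for x in edge_pixels) ** 0.5 or 1.0
    return [x / norm for x in edge_pixels]

def compatibility(right_edge_vec: list[float], left_edge_vec: list[float]) -> float:
    """Cosine-style score: high when two edges likely sat side by side."""
    return sum(a * b for a, b in zip(right_edge_vec, left_edge_vec))

shreds = {"a": [1.0, 2.0, 3.0], "b": [1.1, 2.1, 2.9], "c": [9.0, 0.0, 1.0]}
vecs = {k: encode(v) for k, v in shreds.items()}  # n encoder calls total
score_ab = compatibility(vecs["a"], vecs["b"])    # O(1) per pair afterwards
score_ac = compatibility(vecs["a"], vecs["c"])
print(score_ab > score_ac)  # similar edges score higher -> True
```

With n shreds this needs n encoder inferences plus cheap dot products, versus the n^2 network passes required when a model must ingest each shred pair directly.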
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.