DocPrism: Local Categorization and External Filtering to Identify Relevant Code-Documentation Inconsistencies
- URL: http://arxiv.org/abs/2511.00215v1
- Date: Fri, 31 Oct 2025 19:22:54 GMT
- Title: DocPrism: Local Categorization and External Filtering to Identify Relevant Code-Documentation Inconsistencies
- Authors: Xiaomeng Xu, Zahin Wahab, Reid Holmes, Caroline Lemieux
- Abstract summary: This paper introduces DocPrism, a code-documentation inconsistency detection tool. DocPrism uses a standard large language model (LLM) to analyze and explain inconsistencies. On a broad evaluation across Python, TypeScript, C++, and Java, DocPrism maintains a low flag rate of 15% and achieves a precision of 0.62 without performing any fine-tuning.
- Score: 5.693844702145728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code-documentation inconsistencies are common and undesirable: they can lead to developer misunderstandings and software defects. This paper introduces DocPrism, a multi-language, code-documentation inconsistency detection tool. DocPrism uses a standard large language model (LLM) to analyze and explain inconsistencies. Plain use of LLMs for this task yields unacceptably high false positive rates: LLMs identify natural gaps between high-level documentation and detailed code implementations as inconsistencies. We introduce and apply the Local Categorization, External Filtering (LCEF) methodology to reduce false positives. LCEF relies on the LLM's local completion skills rather than its long-term reasoning skills. In our ablation study, LCEF reduces DocPrism's inconsistency flag rate from 98% to 14%, and increases accuracy from 14% to 94%. On a broad evaluation across Python, TypeScript, C++, and Java, DocPrism maintains a low flag rate of 15%, and achieves a precision of 0.62 without performing any fine-tuning.
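The abstract's two-stage LCEF idea can be sketched as a pipeline: first a short, local LLM completion labels each documentation sentence with a category, then an external filter discards categories that rarely indicate real inconsistencies (e.g. high-level summaries). The category names, filter list, and stub "LLM" below are illustrative assumptions, not DocPrism's actual prompts or taxonomy.

```python
# Hypothetical sketch of Local Categorization, External Filtering (LCEF).
# Stage 1: a short local completion categorizes one doc sentence at a time.
# Stage 2: a fixed filter drops categories deemed irrelevant to real bugs.

# Assumed filter list: categories treated as benign doc/code gaps.
IRRELEVANT = {"high-level-summary", "example", "consistent"}

def categorize_locally(doc_sentence: str, code: str) -> str:
    """Stand-in for a short LLM completion labeling one sentence.

    A real system would prompt the model with the sentence and the code
    and read back a single category token; this heuristic is a placeholder.
    """
    if "returns" in doc_sentence.lower() and "return" not in code:
        return "wrong-return-description"
    return "consistent"

def detect_inconsistencies(doc_sentences, code):
    flagged = []
    for sentence in doc_sentences:
        category = categorize_locally(sentence, code)  # local categorization
        if category not in IRRELEVANT:                 # external filtering
            flagged.append((sentence, category))
    return flagged

doc = ["Adds two numbers.", "Returns the product of a and b."]
code = "def add(a, b):\n    c = a + b\n    print(c)"
print(detect_inconsistencies(doc, code))
# → [('Returns the product of a and b.', 'wrong-return-description')]
```

The point of the split is that each stage asks the model only for a short completion over local context, rather than a long chain of reasoning over the whole file.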
Related papers
- Detecting and Correcting Hallucinations in LLM-Generated Code via Deterministic AST Analysis [11.687400527666476]
This paper investigates whether a deterministic, static-analysis framework can reliably detect and auto-correct KCHs. We propose a post-processing framework that parses generated code into an Abstract Syntax Tree (AST) and validates it against a dynamically generated Knowledge Base (KB). This non-executing approach uses deterministic rules to find and fix both API- and identifier-level conflicts.
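The non-executing validation the summary describes can be illustrated with Python's standard `ast` module: parse the generated code, walk the tree, and flag any called name absent from a knowledge base of known APIs. The hand-written `KNOWN_APIS` set is an illustrative stand-in; the paper builds its KB dynamically.

```python
# Minimal sketch of AST-based hallucination detection: flag calls to
# names that do not appear in a knowledge base of known APIs.
import ast

KNOWN_APIS = {"len", "print", "sorted"}  # illustrative knowledge base

def find_unknown_calls(source: str) -> list[str]:
    tree = ast.parse(source)
    unknown = []
    for node in ast.walk(tree):
        # Only simple-name calls like f(x); attribute calls need extra handling.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in KNOWN_APIS:
                unknown.append(node.func.id)  # possible hallucinated API
    return unknown

generated = "x = sorted(data)\ny = frobnicate(x)\nprint(len(y))"
print(find_unknown_calls(generated))  # → ['frobnicate']
```

Because the code is parsed rather than executed, the check is deterministic and safe to run on untrusted LLM output.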
arXiv Detail & Related papers (2026-01-27T02:16:37Z) - DoPE: Decoy Oriented Perturbation Encapsulation Human-Readable, AI-Hostile Documents for Academic Integrity [10.808479217513181]
DoPE is a document-layer defense framework that embeds semantic decoys into PDF/HTML assessments. FewSoRT-Q generates question-level semantic decoys, and FewSoRT-D encapsulates them into watermarked documents. DoPE yields strong empirical gains against black-box MLLMs from OpenAI and Anthropic.
arXiv Detail & Related papers (2026-01-18T17:34:29Z) - ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge [50.93758649363798]
ImpliRet is a benchmark that shifts the reasoning challenge to document-side processing. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting.
arXiv Detail & Related papers (2025-06-17T11:08:29Z) - Long-Form Information Alignment Evaluation Beyond Atomic Facts [60.25969380388974]
We introduce MontageLie, a benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency.
arXiv Detail & Related papers (2025-05-21T17:46:38Z) - METAMON: Finding Inconsistencies between Program Documentation and Behavior using Metamorphic LLM Queries [10.9334354663311]
This paper proposes METAMON, which uses an existing search-based test generation technique to capture the current program behavior in the form of test cases. METAMON is supported in this task by metamorphic testing and self-consistency. An empirical evaluation against 9,482 pairs of code documentation and code snippets, generated using five open-source projects from Defects4J v2.0.1, shows that METAMON can classify code-documentation inconsistencies with a precision of 0.72 and a recall of 0.48.
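The metamorphic-testing idea behind this summary can be sketched simply (this is an illustration of the general technique, not METAMON's actual pipeline): if the documentation claims a property such as order-independence, then permuting a generated test input must not change the output, and a violation suggests a code/documentation inconsistency.

```python
# Illustrative metamorphic check: documentation says the result does not
# depend on input order, so shuffling the input must preserve the output.
import random

def unique_count(items):  # program under test
    return len(set(items))

def check_order_independence(fn, test_input, trials=5):
    expected = fn(test_input)
    for _ in range(trials):
        shuffled = test_input[:]
        random.shuffle(shuffled)
        if fn(shuffled) != expected:  # metamorphic relation violated
            return False
    return True

print(check_order_independence(unique_count, [1, 2, 2, 3]))  # → True
```

The appeal of metamorphic relations here is that they need no ground-truth oracle: only the documented property and pairs of related inputs.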
arXiv Detail & Related papers (2025-02-05T00:42:50Z) - Utilizing Precise and Complete Code Context to Guide LLM in Automatic False Positive Mitigation [2.787944528438214]
Static Application Security Testing (SAST) tools are critical to software quality, identifying potential code issues early in development. They often produce false positive warnings that require manual review, slowing down development. We propose LLM4FPM, a lightweight and efficient false positive mitigation framework.
arXiv Detail & Related papers (2024-11-05T13:24:56Z) - Fine-Grained and Multi-Dimensional Metrics for Document-Level Machine Translation [15.987448306012167]
Large language models (LLMs) have excelled in various NLP tasks, including machine translation (MT). This work investigates the inherent capability of instruction-tuned LLMs for document-level translation (docMT).
arXiv Detail & Related papers (2024-10-28T11:49:58Z) - Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses.
We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document.
Preliminary experiments show that FacTool, FactScore and Perplexity are struggling to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z) - Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - Precise Zero-Shot Dense Retrieval without Relevance Labels [60.457378374671656]
Hypothetical Document Embeddings (HyDE) is a zero-shot dense retrieval system.
We show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever.
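HyDE's core trick is to embed an LLM-generated "hypothetical" answer document instead of the query itself, then retrieve real documents nearest to it. In the toy sketch below, the lambda generator and bag-of-words embedder are stand-ins for an LLM and a dense encoder such as Contriever.

```python
# Toy sketch of Hypothetical Document Embeddings (HyDE).
from collections import Counter
import math

def embed(text):  # stand-in for a dense encoder
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_retrieve(query, corpus, generate):
    hypothetical = generate(query)  # LLM writes a plausible answer
    q_vec = embed(hypothetical)     # embed the answer, not the query
    return max(corpus, key=lambda doc: cosine(q_vec, embed(doc)))

corpus = ["the capital of france is paris", "bananas are yellow"]
fake_llm = lambda q: "paris is the capital city of france"
print(hyde_retrieve("what is france's capital?", corpus, fake_llm))
# → 'the capital of france is paris'
```

The hypothetical document may be factually wrong; it only needs to land near relevant real documents in embedding space, which is why no relevance labels are required.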
arXiv Detail & Related papers (2022-12-20T18:09:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.