Towards identifying and minimizing customer-facing documentation debt
- URL: http://arxiv.org/abs/2402.11048v1
- Date: Fri, 16 Feb 2024 19:51:04 GMT
- Title: Towards identifying and minimizing customer-facing documentation debt
- Authors: Lakmal Silva, Michael Unterkalmsteiner, Krzysztof Wnuk
- Abstract summary: Lack of correct, complete, and up-to-date documentation results in an increasing number of documentation defects.
We identify documentation defect types contributing to documentation defects, thereby identifying documentation debt.
In practice, documentation debt can easily go undetected since a large share of resources and focus is dedicated to delivering high-quality software.
- Score: 5.318531077716712
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Software documentation often struggles to catch up with the pace of software
evolution. The lack of correct, complete, and up-to-date documentation results
in an increasing number of documentation defects which could introduce delays
in integrating software systems. In our previous study on a bug analysis tool
called MultiDimEr, we provided evidence that documentation-related defects
contribute to many bug reports. First, we want to identify documentation defect
types contributing to documentation defects, thereby identifying documentation
debt. Secondly, we aim to find pragmatic solutions to minimize the most common
documentation defects to pay off the documentation debt in the long run. We
investigated documentation defects related to an industrial software system.
First, we looked at different documentation types and associated bug reports.
We categorized the defects according to an existing documentation defect
taxonomy. Based on a sample of 101 defects, we found that most defects are
caused by documentation defects falling into the Information Content (What)
category (86). Within this category, the documentation defect types Erroneous
code examples (23), Missing documentation (35), and Outdated content (19)
contributed to most of the documentation defects. We propose to adapt two
solutions to mitigate these types of documentation defects. In practice,
documentation debt can easily go undetected since a large share of resources
and focus is dedicated to delivering high-quality software. We suggest adapting
two main solutions to tackle documentation debt by implementing (i) Dynamic
Documentation Generation (DDG) and/or (ii) Automated Documentation Testing
(ADT), which are both based on defining a single and robust information source
for documentation.
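
As a rough illustration of what Automated Documentation Testing (ADT) could look like in practice, the sketch below runs the code examples embedded in a documentation page against the real code, so that an erroneous or outdated example fails a check instead of reaching customers. The abstract does not describe an implementation; the use of Python's standard `doctest` module, the `add` function, the `DOC_PAGE` text, and the `check_examples` helper are all assumptions made for this example.

```python
import doctest

# Hypothetical documentation page; the text and the add() example are illustrative only.
DOC_PAGE = """
To add two numbers, call add():

>>> add(2, 3)
5
"""


def add(a, b):
    """Hypothetical function that the documentation page describes."""
    return a + b


def check_examples(doc_text, namespace):
    """Run every interactive example found in doc_text.

    Returns a (failed, attempted) tuple so a CI job can fail the build
    when the documentation no longer matches the code's behavior.
    """
    parser = doctest.DocTestParser()
    # Copy the namespace so running the examples cannot mutate the caller's dict.
    test = parser.get_doctest(doc_text, dict(namespace), "doc_page", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the textual failure report; only the counts are needed here.
    result = runner.run(test, out=lambda _msg: None)
    return result.failed, result.attempted


if __name__ == "__main__":
    failed, attempted = check_examples(DOC_PAGE, {"add": add})
    print(f"{attempted} example(s) run, {failed} failed")
```

Because the documented examples are executed against the code itself, the code acts as the single information source: if `add` changed its behavior, the example in `DOC_PAGE` would be reported as failed until the documentation was updated.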
Related papers
- Improved Evidence Extraction for Document Inconsistency Detection with LLMs [10.610567456326235]
We introduce new comprehensive evidence-extraction metrics and a redact-and-retry framework with constrained filtering. We back our claims with promising experimental results.
arXiv Detail & Related papers (2026-01-06T00:58:20Z) - On Finding Inconsistencies in Documents [6.773356807601893]
We introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies.
arXiv Detail & Related papers (2025-12-21T05:20:21Z) - DocFetch - Towards Generating Software Documentation from Multiple Software Artifacts [5.780991619197141]
Existing automated approaches to generate documentation largely focus on source code. We propose DocFetch, to generate different types of documentation from multiple software artifacts. We evaluate the performance of DocFetch using a manually curated groundtruth dataset.
arXiv Detail & Related papers (2025-08-25T06:54:27Z) - Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z) - DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model [25.504170988714783]
Document dewarping aims to rectify deformations in photographic document images, thus improving text readability. We propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework.
arXiv Detail & Related papers (2025-05-28T05:05:51Z) - WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild? [64.62909376834601]
This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. Evaluation of state-of-the-art MLLMs on WildDoc exposes substantial performance declines and underscores the models' inadequate robustness compared to traditional benchmarks.
arXiv Detail & Related papers (2025-05-16T09:09:46Z) - Validating Network Protocol Parsers with Traceable RFC Document Interpretation [11.081773172066766]
Oracle and traceability problems determine when a protocol implementation can be considered buggy.
This work considers both and provides an effective solution using recent advances in large language models (LLMs).
We have extensively evaluated our approach using nine network protocols and their implementations written in C, Python, and Go.
arXiv Detail & Related papers (2025-04-25T03:39:19Z) - Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence [56.09494651178128]
Retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG).
We show that retrievers often rely on superficial patterns like over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches.
We show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs.
arXiv Detail & Related papers (2025-03-06T23:23:13Z) - METAMON: Finding Inconsistencies between Program Documentation and Behavior using Metamorphic LLM Queries [10.9334354663311]
This paper proposes METAMON, which uses an existing search-based test generation technique to capture the current program behavior in the form of test cases.
METAMON is supported in this task by metamorphic testing and self-consistency.
An empirical evaluation against 9,482 pairs of code documentation and code snippets, generated using five open-source projects from Defects4J v2.0.1, shows that METAMON can classify the code-and-documentation inconsistencies with a precision of 0.72 and a recall of 0.48.
arXiv Detail & Related papers (2025-02-05T00:42:50Z) - Supporting Software Maintenance with Dynamically Generated Document Hierarchies [41.407915858583344]
We present HGEN, a fully automated pipeline that transforms source code through a series of six stages into a well-organized hierarchy of formatted documents.
We evaluate HGEN both quantitatively and qualitatively.
Results show that HGEN produces artifact hierarchies similar in quality to manually constructed documentation, with much higher coverage of the core concepts than the baseline approach.
arXiv Detail & Related papers (2024-08-11T17:11:14Z) - Managing Human-Centric Software Defects: Insights from GitHub and Practitioners' Perspectives [8.285109854002307]
Human-centric defects (HCDs) are nuanced and subjective defects that often occur due to end-user perceptions or differences.
Development teams have a limited understanding of these issues, which leads to the neglect of these defects.
Defect reporting tools do not adequately handle the capture and fixing of HCDs.
arXiv Detail & Related papers (2024-08-03T01:08:38Z) - GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence [64.95492752484171]
We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks.
We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users.
To ensure that most errors are flagged by the system, we propose a method that can increase the error recall while minimizing impact on precision.
arXiv Detail & Related papers (2024-02-19T21:45:55Z) - IncDSI: Incrementally Updatable Document Retrieval [35.5697863674097]
IncDSI is a method to add documents in real time without retraining the model on the entire dataset.
We formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters.
Our approach is competitive with re-training the model on the whole dataset.
arXiv Detail & Related papers (2023-07-19T07:20:30Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z) - Fourier Document Restoration for Robust Document Dewarping and Recognition [73.44057202891011]
This paper presents FDRNet, a Fourier Document Restoration Network that can restore documents with different distortions.
It dewarps documents by a flexible Thin-Plate Spline transformation which can handle various deformations effectively without requiring deformation annotations in training.
It outperforms the state-of-the-art by large margins on both dewarping and text recognition tasks.
arXiv Detail & Related papers (2022-03-18T12:39:31Z) - DocScanner: Robust Document Image Rectification with Progressive Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification.
DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures the running efficiency.
arXiv Detail & Related papers (2021-10-28T09:15:02Z) - Timestamping Documents and Beliefs [1.4467794332678539]
Document dating is a challenging problem which requires inference over the temporal structure of the document.
In this paper we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach.
We also propose AD3: Attentive Deep Document Dater, an attention-based document dating system.
arXiv Detail & Related papers (2021-06-09T02:12:18Z) - Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning [62.34197797857823]
A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds.
This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly.
Our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds.
arXiv Detail & Related papers (2020-03-23T03:22:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.