DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification
- URL: http://arxiv.org/abs/2508.04233v1
- Date: Wed, 06 Aug 2025 09:15:32 GMT
- Title: DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification
- Authors: Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
- Abstract summary: We introduce generative document counterfactuals that provide meaningful insights into the model's decision-making through actionable explanations. To the best of the authors' knowledge, this is the first work to explore generative counterfactual explanations in document image analysis.
- Score: 5.247930659596986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As black-box AI-driven decision-making systems become increasingly widespread in modern document processing workflows, improving their transparency and reliability has become critical, especially in high-stakes applications where biases or spurious correlations in decision-making could lead to serious consequences. One vital component often found in such document processing workflows is document image classification, which, despite its widespread use, remains difficult to explain. While some recent works have attempted to explain the decisions of document image classification models through feature-importance maps, these maps are often difficult to interpret and fail to provide insights into the global features learned by the model. In this paper, we aim to bridge this research gap by introducing generative document counterfactuals that provide meaningful insights into the model's decision-making through actionable explanations. In particular, we propose DocVCE, a novel approach that leverages latent diffusion models in combination with classifier guidance to first generate plausible in-distribution visual counterfactual explanations, and then performs hierarchical patch-wise refinement to search for a refined counterfactual that is closest to the target factual image. We demonstrate the effectiveness of our approach through a rigorous qualitative and quantitative assessment on 3 different document classification datasets -- RVL-CDIP, Tobacco3482, and DocLayNet -- and 3 different models -- ResNet, ConvNeXt, and DiT -- using well-established evaluation criteria such as validity, closeness, and realism. To the best of the authors' knowledge, this is the first work to explore generative counterfactual explanations in document image analysis.
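To make the guidance step described in the abstract concrete, here is a minimal, self-contained sketch of classifier-guided reverse diffusion toward a target class. Everything below (the `classifier_guided_eps` function, the toy `denoiser` and `classifier`, and the `sigma_t` and `scale` values) is an illustrative assumption, not the authors' implementation; DocVCE additionally works in the latent space of a latent diffusion model and follows sampling with hierarchical patch-wise refinement, both omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def classifier_guided_eps(z_t, t, sigma_t, denoiser, classifier, target, scale=3.0):
    """One classifier-guided noise prediction (Dhariwal & Nichol-style guidance).

    The frozen denoiser's output is shifted along the gradient of the
    target-class log-probability, so reverse diffusion drifts toward samples
    the classifier assigns to `target`.
    """
    with torch.no_grad():
        eps = denoiser(z_t, t)                   # frozen diffusion model
    z = z_t.detach().requires_grad_(True)
    # A noise-aware classifier would also condition on t; omitted for brevity.
    logp = F.log_softmax(classifier(z), dim=-1)
    chosen = logp[torch.arange(len(z)), target].sum()
    grad, = torch.autograd.grad(chosen, z)       # d log p(target | z_t) / d z_t
    return eps - scale * sigma_t * grad          # guided noise prediction

# Toy stand-ins so the sketch runs end to end (both are placeholders).
denoiser = lambda z, t: torch.zeros_like(z)
classifier = nn.Sequential(nn.Flatten(), nn.Linear(4 * 8 * 8, 16))

z_t = torch.randn(2, 4, 8, 8)                    # batch of 2 noisy latents
eps = classifier_guided_eps(
    z_t, t=torch.tensor([500, 500]), sigma_t=0.8,
    denoiser=denoiser, classifier=classifier, target=torch.tensor([3, 7]),
)
print(eps.shape)  # torch.Size([2, 4, 8, 8])
```

Plugging the guided noise prediction into a standard reverse-diffusion sampling loop yields counterfactual candidates; per the abstract, DocVCE then searches patch-wise, coarse to fine, for the counterfactual closest to the factual image.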
Related papers
- DREAM: Document Reconstruction via End-to-end Autoregressive Model [53.51754520966657]
We present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). We establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and the DocRec1K dataset for assessing the performance of the task.
arXiv Detail & Related papers (2025-07-08T09:24:07Z)
- DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model [25.504170988714783]
Document dewarping aims to rectify deformations in photographic document images, thus improving text readability. We propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework.
arXiv Detail & Related papers (2025-05-28T05:05:51Z)
- DocXplain: A Novel Model-Agnostic Explainability Method for Document Image Classification [5.247930659596986]
This paper introduces DocXplain, a novel model-agnostic explainability method specifically designed for generating highly interpretable feature attribution maps.
We extensively evaluate our proposed approach in the context of document image classification, utilizing 4 different evaluation metrics.
To the best of the authors' knowledge, this work presents the first model-agnostic attribution-based explainability method specifically tailored for document images.
arXiv Detail & Related papers (2024-07-04T10:59:15Z)
- Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding [88.88844606781987]
Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. However, the way they model and exploit the interactions between vision and language in documents has hindered their generalization ability and accuracy. In this work, we investigate the problem of vision-language joint representation learning for VrDU, mainly from the perspective of supervisory signals.
arXiv Detail & Related papers (2022-06-27T09:58:34Z)
- Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization [73.89239820192894]
We argue that automated counterfactual generation should account for several aspects of the produced adversarial instances.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z)
- Document-Level Relation Extraction with Sentences Importance Estimation and Focusing [52.069206266557266]
Document-level relation extraction (DocRE) aims to determine the relation between two entities from a document of multiple sentences.
We propose a Sentence Importance Estimation and Focusing (SIEF) framework for DocRE, where we design a sentence importance score and a sentence focusing loss.
Experimental results on two domains show that our SIEF not only improves overall performance, but also makes DocRE models more robust.
arXiv Detail & Related papers (2022-04-27T03:20:07Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Who Explains the Explanation? Quantitatively Assessing Feature Attribution Methods [0.0]
We propose a novel evaluation metric -- the Focus -- designed to quantify the faithfulness of explanations.
We show the robustness of the metric through randomization experiments, and then use Focus to evaluate and compare three popular explainability techniques.
Our results find LRP and GradCAM to be consistent and reliable, with the latter remaining the most competitive even when applied to poorly performing models.
arXiv Detail & Related papers (2021-09-28T07:10:24Z)
- An Intelligent Hybrid Model for Identity Document Classification [0.0]
Digitization may provide opportunities (e.g., increase in productivity, disaster recovery, and environmentally friendly solutions) and challenges for businesses.
One of the main challenges would be to accurately classify numerous scanned documents uploaded every day by customers.
Few studies have addressed this challenge as an image classification problem.
The proposed approach has been implemented using Python and experimentally validated on synthetic and real-world datasets.
arXiv Detail & Related papers (2021-06-07T13:08:00Z)
- Hierarchical Interaction Networks with Rethinking Mechanism for Document-level Sentiment Analysis [37.20068256769269]
Document-level Sentiment Analysis (DSA) is more challenging due to vague semantic links and complicated sentiment information.
We study how to effectively generate a discriminative representation with explicit subject patterns and sentiment contexts for DSA.
We design a Sentiment-based Rethinking mechanism (SR) that refines the Hierarchical Interaction Network (HIN) with sentiment label information to learn a more sentiment-aware document representation.
arXiv Detail & Related papers (2020-07-16T16:27:38Z)
- Self-supervised Deep Reconstruction of Mixed Strip-shredded Text Documents [63.41717168981103]
This work extends our previous deep learning method for single-page reconstruction to a more realistic/complex scenario.
In our approach, the compatibility evaluation is modeled as a two-class (valid or invalid) pattern recognition problem.
The proposed method outperforms the competing ones in complex scenarios, achieving accuracy above 90%.
arXiv Detail & Related papers (2020-07-01T21:48:05Z)