DocVXQA: Context-Aware Visual Explanations for Document Question Answering
- URL: http://arxiv.org/abs/2505.07496v1
- Date: Mon, 12 May 2025 12:30:16 GMT
- Title: DocVXQA: Context-Aware Visual Explanations for Document Question Answering
- Authors: Mohamed Ali Souibgui, Changkyu Choi, Andrey Barsky, Kangsoo Jung, Ernest Valveny, Dimosthenis Karatzas,
- Abstract summary: We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions.
- Score: 12.416787701296236
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model's decisions. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning objectives. Unlike conventional methods that emphasize only the regions pertinent to the answer, our framework delivers explanations that are contextually sufficient while remaining representation-efficient. This fosters user trust while achieving a balance between predictive performance and interpretability in DocVQA applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method. The code is available at https://github.com/dali92002/DocVXQA.
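The abstract states that explainability principles are quantitatively formulated as explicit learning objectives balancing answer accuracy, contextual sufficiency, and representation efficiency. The PyTorch sketch below illustrates one way such a combined objective could be assembled; it is a hedged illustration only, and the heatmap masking, the KL-based sufficiency term, the mean-activation efficiency penalty, and the `qa_head` module are assumptions, not the paper's actual formulation.

```python
# Minimal sketch (not the authors' code): a combined objective trading off
# answer accuracy, contextual sufficiency, and representation efficiency.
# Module and argument names (qa_head, heatmap, lambda_*) are hypothetical.
import torch
import torch.nn.functional as F

def explanation_loss(doc_feats, question_feats, answer_logits, answer_target,
                     heatmap, qa_head, lambda_suff=1.0, lambda_eff=0.1):
    """doc_feats: (B, N, D) region features; heatmap: (B, N) relevance in [0, 1]."""
    # 1) Task loss: the model must still answer correctly.
    task = F.cross_entropy(answer_logits, answer_target)

    # 2) Sufficiency: answering from only the highlighted regions should agree
    #    with answering from the full document.
    masked_feats = doc_feats * heatmap.unsqueeze(-1)        # keep highlighted regions
    masked_logits = qa_head(masked_feats, question_feats)   # hypothetical QA head
    sufficiency = F.kl_div(F.log_softmax(masked_logits, dim=-1),
                           F.softmax(answer_logits.detach(), dim=-1),
                           reduction="batchmean")

    # 3) Efficiency: keep the heatmap small/sparse so it stays readable.
    efficiency = heatmap.mean()

    return task + lambda_suff * sufficiency + lambda_eff * efficiency
```

In a setup like this, the sufficiency weight would push the highlighted regions to carry enough context to reproduce the answer, while the efficiency weight keeps the heatmap compact; how DocVXQA actually instantiates these principles is specified in the paper and repository.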
Related papers
- A Counterfactual Explanation Framework for Retrieval Models [4.562474301450839]
We use a counterfactual framework to identify which words in a document caused it not to be favored by a retrieval model.
Our experiments show the effectiveness of the proposed approach in predicting counterfactuals for both statistical (e.g. BM25) and deep-learning-based retrieval models.
arXiv Detail & Related papers (2024-09-01T22:33:29Z) - Answer is All You Need: Instruction-following Text Embedding via Answering the Question [41.727700155498546]
This paper offers a new viewpoint, which treats the instruction as a question about the input text and encodes the expected answers to obtain the representation accordingly.
Specifically, we propose InBedder that instantiates this embed-via-answering idea by only fine-tuning language models on abstractive question answering tasks.
arXiv Detail & Related papers (2024-02-15T01:02:41Z) - A Simple Baseline for Knowledge-Based Visual Question Answering [78.00758742784532]
This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA)
Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline.
Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets.
arXiv Detail & Related papers (2023-10-20T15:08:17Z) - Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it contains graph constructing, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a blackbox fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z) - R$^2$F: A General Retrieval, Reading and Fusion Framework for Document-level Natural Language Inference [29.520857954199904]
Document-level natural language inference (DOCNLI) is a new and challenging task in natural language processing.
We establish a general solution, the Retrieval, Reading and Fusion (R2F) framework, together with a new setting.
Our experimental results show that R2F framework can obtain state-of-the-art performance and is robust for diverse evidence retrieval methods.
arXiv Detail & Related papers (2022-10-22T02:02:35Z) - Supporting Vision-Language Model Inference with Confounder-pruning Knowledge Prompt [71.77504700496004]
Vision-language models are pre-trained by aligning image-text pairs in a common space to deal with open-set visual concepts.
To boost the transferability of the pre-trained models, recent works adopt fixed or learnable prompts.
However, how and what prompts can improve inference performance remains unclear.
arXiv Detail & Related papers (2022-05-23T07:51:15Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - elBERto: Self-supervised Commonsense Learning for Question Answering [131.51059870970616]
We propose a Self-supervised Bidirectional Representation Learning of Commonsense framework, which is compatible with off-the-shelf QA model architectures.
The framework comprises five self-supervised tasks to force the model to fully exploit the additional training signals from contexts containing rich commonsense.
elBERto achieves substantial improvements on out-of-paragraph and no-effect questions where simple lexical similarity comparison does not help.
arXiv Detail & Related papers (2022-03-17T16:23:45Z) - Grow-and-Clip: Informative-yet-Concise Evidence Distillation for Answer Explanation [22.20733260041759]
We argue that the evidence for an answer is critical to enhancing the interpretability of QA models.
We are the first to explicitly define the concept of evidence as the supporting facts in a context that are informative, concise, and readable.
We propose the Grow-and-Clip Evidence Distillation (GCED) algorithm to extract evidence from contexts by trading off informativeness, conciseness, and readability.
arXiv Detail & Related papers (2022-01-13T17:18:17Z) - Visual Question Answering with Prior Class Semantics [50.845003775809836]
We show how to exploit additional information pertaining to the semantics of candidate answers.
We extend the answer prediction process with a regression objective in a semantic space.
Our method brings improvements in consistency and accuracy over a range of question types.
arXiv Detail & Related papers (2020-05-04T02:46:31Z) - Robust Explanations for Visual Question Answering [24.685231217726194]
We propose a method to obtain robust explanations for visual question answering (VQA) that correlate well with the answers.
Our model explains the answers obtained through a VQA model by providing visual and textual explanations.
We showcase the robustness of the model against a noise-based perturbation attack using corresponding visual and textual explanations.
arXiv Detail & Related papers (2020-01-23T18:43:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.