Privacy-Aware Document Visual Question Answering
- URL: http://arxiv.org/abs/2312.10108v1
- Date: Fri, 15 Dec 2023 06:30:55 GMT
- Title: Privacy-Aware Document Visual Question Answering
- Authors: Rubèn Tito, Khanh Nguyen, Marlon Tobaben, Raouf Kerkouche, Mohamed
Ali Souibgui, Kangsoo Jung, Lei Kang, Ernest Valveny, Antti Honkela, Mario
Fritz, Dimosthenis Karatzas
- Abstract summary: Document Visual Question Answering (DocVQA) is a fast-growing branch of document understanding.
Despite the fact that documents contain sensitive or copyrighted information, none of the current DocVQA methods offers strong privacy guarantees.
We highlight privacy issues in state-of-the-art multi-modal LLMs used for DocVQA, and explore possible solutions.
- Score: 47.89754310347398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document Visual Question Answering (DocVQA) is a fast-growing branch of
document understanding. Despite the fact that documents contain sensitive or
copyrighted information, none of the current DocVQA methods offers strong
privacy guarantees.
In this work, we explore privacy in the domain of DocVQA for the first time.
We highlight privacy issues in state-of-the-art multi-modal LLMs used for
DocVQA, and explore possible solutions.
Specifically, we focus on the invoice processing use case as a realistic,
widely used scenario for document understanding, and propose a large-scale
DocVQA dataset comprising invoice documents and associated questions and
answers. We employ a federated learning scheme that reflects the real-life
distribution of documents in different businesses, and we explore the use case
where the ID of the invoice issuer is the sensitive information to be
protected.
We demonstrate that non-private models tend to memorise, a behaviour that can
lead to exposing private information. We then evaluate baseline training
schemes employing federated learning and differential privacy in this
multi-modal scenario, where the sensitive information might be exposed through
either of the two input modalities: vision (document image) or language (OCR
tokens).
Finally, we design an attack exploiting the memorisation effect of the model,
and demonstrate its effectiveness in probing different DocVQA models.
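The combination of federated learning and differential privacy mentioned in the abstract can be illustrated with a minimal sketch. This is a generic clip-and-noise federated averaging round, not the paper's actual training scheme: the function names, the flat list-of-floats weight representation, and the parameters `max_norm` and `noise_multiplier` are all assumptions made for illustration, and no privacy accounting is performed.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale an update vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm or norm == 0.0:
        return list(update)
    scale = max_norm / norm
    return [x * scale for x in update]


def dp_federated_round(global_weights, client_updates, max_norm,
                       noise_multiplier, rng):
    """One hypothetical round: clip each client update, sum the clipped
    updates, add Gaussian noise to the sum, and apply the average."""
    n = len(client_updates)
    clipped = [clip_update(u, max_norm) for u in client_updates]
    # Noise standard deviation is tied to the clipping bound (assumption:
    # one update per client, so max_norm bounds each client's contribution).
    sigma = noise_multiplier * max_norm
    new_weights = []
    for i, w in enumerate(global_weights):
        total = sum(u[i] for u in clipped)
        noisy_total = total + rng.gauss(0.0, sigma)
        new_weights.append(w + noisy_total / n)
    return new_weights


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    weights = [0.0, 0.0, 0.0]
    updates = [[0.3, -0.1, 0.2], [5.0, 5.0, 5.0], [0.1, 0.0, -0.2]]
    print(dp_federated_round(weights, updates, max_norm=1.0,
                             noise_multiplier=0.5, rng=rng))
```

Clipping bounds how much any single client can shift the shared model, and the added noise masks what remains of that contribution; a real system would also need to track the cumulative privacy budget across rounds.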
Related papers
- Extracting Training Data from Document-Based VQA Models [67.1470112451617]
Vision-Language Models (VLMs) have made remarkable progress in document-based Visual Question Answering (i.e., responding to queries about the contents of an input document provided as an image).
We show these models can memorise responses for training samples and regurgitate them even when the relevant visual information has been removed.
This includes Personally Identifiable Information repeated once in the training set, indicating these models could divulge sensitive information and therefore pose a privacy risk.
arXiv Detail & Related papers (2024-07-11T17:44:41Z)
- Federated Document Visual Question Answering: A Pilot Study [11.157766332838877]
Documents tend to be copyrighted or contain private information, which prohibits their open publication.
In this work, we explore the use of a federated learning scheme as a way to train a shared model on decentralised private document data.
We show that our pretraining strategies can effectively learn and scale up under federated training with diverse DocVQA datasets.
arXiv Detail & Related papers (2024-05-10T17:53:05Z)
- BuDDIE: A Business Document Dataset for Multi-task Information Extraction [18.440587946049845]
BuDDIE is the first multi-task dataset of 1,665 real-world business documents.
Our dataset consists of publicly available business entity documents from US state government websites.
arXiv Detail & Related papers (2024-04-05T10:26:42Z)
- DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding [98.41782470335032]
This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding.
Unlike existing work, which either struggles with high-resolution documents or gives up the large language model and is thus constrained in vision or language ability, DocPedia directly processes visual input in the frequency domain rather than the pixel space.
arXiv Detail & Related papers (2023-11-20T14:42:25Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation [15.953725529361874]
Document layout analysis is a well-known problem in the document research community.
With growing internet connectivity in personal life, an enormous number of documents have become available in the public domain.
We address this challenge using self-supervision and unlike, the few existing self-supervised document segmentation approaches.
arXiv Detail & Related papers (2023-05-01T12:47:55Z)
- Unifying Vision, Text, and Layout for Universal Document Processing [105.36490575974028]
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning [5.109216329453963]
We introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks.
We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion.
arXiv Detail & Related papers (2020-09-30T05:39:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.