Test-Time Adaptation for Visual Document Understanding
- URL: http://arxiv.org/abs/2206.07240v2
- Date: Wed, 23 Aug 2023 22:54:40 GMT
- Title: Test-Time Adaptation for Visual Document Understanding
- Authors: Sayna Ebrahimi, Sercan O. Arik, Tomas Pfister
- Abstract summary: DocTTA is a novel test-time adaptation method for documents.
It performs source-free domain adaptation using unlabeled target document data.
We introduce new benchmarks using existing public datasets for various VDU tasks.
- Score: 34.79168501080629
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For visual document understanding (VDU), self-supervised pretraining has been
shown to successfully generate transferable representations, yet effective
adaptation of such representations to distribution shifts at test time remains
an unexplored area. We propose DocTTA, a novel test-time adaptation
method for documents that performs source-free domain adaptation using unlabeled
target document data. DocTTA leverages cross-modality self-supervised learning
via masked visual language modeling, as well as pseudo labeling to adapt models
learned on a \textit{source} domain to an unlabeled \textit{target} domain at
test time. We introduce new benchmarks using existing public datasets for
various VDU tasks, including entity recognition, key-value extraction, and
document visual question answering. DocTTA shows significant improvements
over the source model's performance on these benchmarks, by up to 1.89\% (F1 score),
3.43\% (F1 score), and 17.68\% (ANLS score), respectively. Our benchmark
datasets are available at \url{https://saynaebrahimi.github.io/DocTTA.html}.
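The abstract's recipe (masked visual language modeling plus pseudo labeling on unlabeled target data) can be pictured as a short adaptation loop. Below is a minimal PyTorch sketch under assumed interfaces, not the paper's implementation: `model(ids)` is taken to return both a masked-language-modeling head and a task head.

```python
# Minimal sketch of a DocTTA-style test-time adaptation loop. Hypothetical
# interface: model(ids) is assumed to return (mlm_logits, cls_logits); the
# actual DocTTA objectives are defined in the paper.
import torch
import torch.nn.functional as F

def doctta_adapt(model, target_loader, mask_token_id,
                 mask_prob=0.15, conf_threshold=0.9, lr=1e-5):
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for batch in target_loader:
        input_ids = batch["input_ids"]  # (B, T) unlabeled target tokens

        # Masked visual language modeling: hide tokens, reconstruct them.
        mask = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
        if not mask.any():
            continue
        masked_ids = input_ids.masked_fill(mask, mask_token_id)
        mlm_logits, cls_logits = model(masked_ids)
        loss = F.cross_entropy(mlm_logits[mask], input_ids[mask])

        # Pseudo labeling: train on the model's own confident predictions.
        with torch.no_grad():
            probs = F.softmax(model(input_ids)[1], dim=-1)
            conf, pseudo = probs.max(dim=-1)
            keep = conf > conf_threshold
        if keep.any():
            loss = loss + F.cross_entropy(cls_logits[keep], pseudo[keep])

        optim.zero_grad(); loss.backward(); optim.step()
    return model
```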
Related papers
- Adapting Vision-Language Models Without Labels: A Comprehensive Survey [74.17944178027015]
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. Recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms.
arXiv Detail & Related papers (2025-08-07T16:27:37Z)
- Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents [7.616682226138909]
We introduce the task of intent-based chart generation from documents. The goal is to generate a chart adhering to the intent and grounded on the document(s) in a zero-shot setting. We propose an attribution-based metric that uses a structured textual representation of charts.
arXiv Detail & Related papers (2025-07-20T04:34:59Z)
- QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder.
We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
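As a rough illustration (placeholder names, not the paper's code), query embeddings can be projected into the encoder's token space and prepended to the patch tokens, leaving the pre-trained encoder's weights and architecture untouched:

```python
# Hypothetical sketch of the query-informed idea: a small trainable adapter
# maps a text-query embedding into the vision encoder's token space, and the
# projected query attends jointly with the patch tokens.
import torch
import torch.nn as nn

class QueryInformedViT(nn.Module):
    def __init__(self, vit, query_dim, vit_dim):
        super().__init__()
        self.vit = vit                              # pre-trained encoder blocks
        self.proj = nn.Linear(query_dim, vit_dim)   # small trainable adapter

    def forward(self, patch_tokens, query_emb):
        # patch_tokens: (B, N, vit_dim); query_emb: (B, query_dim)
        q = self.proj(query_emb).unsqueeze(1)       # (B, 1, vit_dim)
        tokens = torch.cat([q, patch_tokens], dim=1)
        return self.vit(tokens)

# Demo with a stand-in encoder (a real setup would pass a pre-trained ViT):
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), 2)
model = QueryInformedViT(encoder, query_dim=32, vit_dim=64)
out = model(torch.randn(1, 16, 64), torch.randn(1, 32))  # (1, 17, 64)
```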
arXiv Detail & Related papers (2025-04-03T18:47:16Z)
- VISA: Retrieval Augmented Generation with Visual Source Attribution [100.78278689901593]
Existing approaches in RAG primarily link generated content to document-level references.
We propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution.
To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain.
arXiv Detail & Related papers (2024-12-19T02:17:35Z)
- Self-Supervised Vision Transformers for Writer Retrieval [2.949446809950691]
Methods based on Vision Transformers (ViT) have achieved state-of-the-art performance in many domains.
We present a novel method that extracts features from a ViT and aggregates them using VLAD encoding.
We show that extracting local foreground features is superior to using the ViT's class token in the context of writer retrieval.
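VLAD itself is a standard aggregation scheme; a plain NumPy version is sketched below. The paper's exact variant (for example, its normalization choices) may differ.

```python
# Standard VLAD aggregation: each local descriptor is assigned to its nearest
# cluster center, and the per-cluster residual sums are concatenated into one
# global vector for retrieval.
import numpy as np

def vlad(descriptors, centers):
    """descriptors: (N, D) local features; centers: (K, D) codebook."""
    assign = np.argmin(
        ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    K, D = centers.shape
    v = np.zeros((K, D))
    for k in range(K):
        if np.any(assign == k):
            v[k] = (descriptors[assign == k] - centers[k]).sum(axis=0)
    v = np.sign(v) * np.sqrt(np.abs(v))         # power normalization
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)      # global L2 normalization

rng = np.random.default_rng(0)
g = vlad(rng.normal(size=(200, 64)), rng.normal(size=(16, 64)))
print(g.shape)  # (1024,) = K * D
```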
arXiv Detail & Related papers (2024-09-01T15:29:58Z)
- Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]
We build HGA, a novel document semantic entity recognition framework that uses hypergraph attention to focus on entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, and XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z)
- Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose a visually guided generative text-layout pre-training method, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
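To make "interleaved text and layout sequence" concrete, here is one hypothetical serialization in which each word is followed by quantized bounding-box tokens; the token format and the 0-1000 grid are assumptions, not ViTLP's actual vocabulary:

```python
# Illustrative text-layout interleaving: word, then four quantized location
# tokens for its bounding box. The <loc_*> vocabulary is an assumption.
def interleave_text_layout(words, boxes, grid=1000):
    """words: list of str; boxes: list of (x0, y0, x1, y1) in [0, 1]."""
    seq = []
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        seq.append(word)
        seq.extend(f"<loc_{round(v * grid)}>" for v in (x0, y0, x1, y1))
    return seq

print(interleave_text_layout(
    ["Invoice", "Total"],
    [(0.1, 0.05, 0.3, 0.1), (0.1, 0.8, 0.25, 0.85)]))
```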
arXiv Detail & Related papers (2024-03-25T08:00:43Z)
- Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization [64.62570402941387]
We use a single test sample to adapt multi-modal prompts at test time by minimizing the feature distribution shift to bridge the gap in the test domain.
Our method improves zero-shot top-1 accuracy beyond existing prompt-learning techniques, with a 3.08% improvement over the baseline MaPLe.
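A hedged sketch of this kind of test-time alignment: only the prompt parameters are updated, so that the test sample's feature statistics move toward precomputed source statistics. All names below are placeholders; see the paper for the actual procedure.

```python
# Test-time distribution alignment sketch: `encode(views, prompts)` is an
# assumed callable returning (B, T, D) token features for augmented views of
# one test sample; src_mean/src_var are precomputed source statistics.
import torch

def align_prompts(encode, prompts, test_views, src_mean, src_var,
                  steps=1, lr=5e-4):
    optim = torch.optim.AdamW([prompts], lr=lr)
    for _ in range(steps):
        feats = encode(test_views, prompts)      # (B, T, D)
        mean = feats.mean(dim=(0, 1))            # test-sample statistics
        var = feats.var(dim=(0, 1))
        # L1 gap between test and source statistics drives the prompt update.
        loss = (mean - src_mean).abs().mean() + (var - src_var).abs().mean()
        optim.zero_grad(); loss.backward(); optim.step()
    return prompts
```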
arXiv Detail & Related papers (2023-11-02T17:59:32Z)
- Towards Open-Domain Topic Classification [69.21234350688098]
We introduce an open-domain topic classification system that accepts a user-defined taxonomy in real time.
Users can classify a text snippet with respect to any candidate labels they want and get an instant response from our web interface.
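The paper builds its own system; purely to illustrate classifying a snippet against arbitrary user-defined labels, an off-the-shelf entailment-based zero-shot classifier (a different technique) behaves as follows:

```python
# Not the paper's system: an entailment-based zero-shot classifier, shown
# only to illustrate classification against user-defined candidate labels.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
result = classifier("The team released a new GPU kernel for attention.",
                    candidate_labels=["hardware", "sports", "politics"])
print(result["labels"][0])  # highest-scoring user-defined label
```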
arXiv Detail & Related papers (2023-06-29T20:25:28Z)
- GVdoc: Graph-based Visual Document Classification [17.350393956461783]
We propose GVdoc, a graph-based document classification model.
Our approach generates a document graph based on its layout, and then trains a graph neural network to learn node and graph embeddings.
We show that our model, even with fewer parameters, outperforms state-of-the-art models on out-of-distribution data.
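One plausible way to generate a document graph from layout is a k-nearest-neighbour rule over OCR box centers, sketched below; this is illustrative and not necessarily GVdoc's construction:

```python
# Illustrative layout-graph construction: connect each OCR token to its k
# nearest neighbours by box-center distance; a graph neural network would
# then consume the resulting edge list.
import numpy as np

def layout_graph(boxes, k=1):
    """boxes: (N, 4) array of (x0, y0, x1, y1); returns (src, dst) edges."""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    d = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)                  # no self-loops
    nbrs = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours
    src = np.repeat(np.arange(len(boxes)), k)
    return src, nbrs.reshape(-1)

boxes = np.array([[0, 0, 10, 5], [12, 0, 20, 5], [0, 8, 10, 13]], float)
print(layout_graph(boxes, k=1))  # each token linked to its closest neighbour
```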
arXiv Detail & Related papers (2023-05-26T19:23:20Z)
- SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation [15.953725529361874]
Document layout analysis is a well-known problem in the document research community.
With the growing reach of the internet into personal life, an enormous number of documents have become available in the public domain.
We address this challenge using self-supervision, in contrast to the few existing self-supervised document segmentation approaches.
arXiv Detail & Related papers (2023-05-01T12:47:55Z)
- Spatial Dual-Modality Graph Reasoning for Key Information Extraction [31.04597531115209]
We propose an end-to-end Spatial Dual-Modality Graph Reasoning method (SDMG-R) to extract key information from unstructured document images.
We release a new dataset named WildReceipt, which is collected and annotated for the evaluation of key information extraction from document images of unseen templates in the wild.
arXiv Detail & Related papers (2021-03-26T13:46:00Z)
- Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models [23.42593796135709]
We study the problem of information extraction from visually rich documents (VRDs).
We present a model that combines the power of large pre-trained language models and graph neural networks to efficiently encode both textual and visual information in business documents.
arXiv Detail & Related papers (2020-05-22T06:04:50Z)
- Named Entity Recognition without Labelled Data: A Weak Supervision Approach [23.05371427663683]
This paper presents a simple but powerful approach to learn NER models in the absence of labelled data through weak supervision.
The approach relies on a broad spectrum of labelling functions to automatically annotate texts from the target domain.
A sequence labelling model can finally be trained on the basis of this unified annotation.
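A toy version of that pipeline: a few labelling functions vote on each token and the votes are aggregated (simple majority here; the paper uses a more principled aggregation model) into annotations that would then train a sequence labeller.

```python
# Toy weak-supervision pipeline: heuristic labelling functions vote per
# token; a majority vote produces the unified annotation.
def lf_capitalized(tok):   # heuristic: capitalized word -> maybe an entity
    return "ENT" if tok[:1].isupper() else "O"

def lf_gazetteer(tok, names=frozenset({"Paris", "Alice"})):  # tiny gazetteer
    return "ENT" if tok in names else "O"

def aggregate(tokens, lfs):
    labels = []
    for tok in tokens:
        votes = [lf(tok) for lf in lfs]
        labels.append("ENT" if votes.count("ENT") > len(lfs) / 2 else "O")
    return labels

print(aggregate("Alice visited Paris in May".split(),
                [lf_capitalized, lf_gazetteer]))
# ['ENT', 'O', 'ENT', 'O', 'O']
```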
arXiv Detail & Related papers (2020-04-30T12:29:55Z)
- Cross-Domain Document Object Detection: Benchmark Suite and Method [71.4339949510586]
Document object detection (DOD) is fundamental for downstream tasks like intelligent document editing and understanding.
We investigate cross-domain DOD, where the goal is to learn a detector for the target domain using labeled data from the source domain and only unlabeled data from the target domain.
For each dataset, we provide the page images, bounding box annotations, PDF files, and the rendering layers extracted from the PDF files.
arXiv Detail & Related papers (2020-03-30T03:04:51Z)