Self-Supervised Vision Transformers for Writer Retrieval
- URL: http://arxiv.org/abs/2409.00751v1
- Date: Sun, 1 Sep 2024 15:29:58 GMT
- Title: Self-Supervised Vision Transformers for Writer Retrieval
- Authors: Tim Raven, Arthur Matei, Gernot A. Fink
- Abstract summary: Methods based on Vision Transformers (ViT) have achieved state-of-the-art performance in many domains.
We present a novel method that extracts features from a ViT and aggregates them using VLAD encoding.
We show that extracting local foreground features is superior to using the ViT's class token in the context of writer retrieval.
- Score: 2.949446809950691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While methods based on Vision Transformers (ViT) have achieved state-of-the-art performance in many domains, they have not yet been applied successfully in the domain of writer retrieval. The field is dominated by methods using handcrafted features or features extracted from Convolutional Neural Networks. In this work, we bridge this gap and present a novel method that extracts features from a ViT and aggregates them using VLAD encoding. The model is trained in a self-supervised fashion without any need for labels. We show that extracting local foreground features is superior to using the ViT's class token in the context of writer retrieval. We evaluate our method on two historical document collections. We set a new state-of-the-art performance on the Historical-WI dataset (83.1% mAP) and the HisIR19 dataset (95.0% mAP). Additionally, we demonstrate that our ViT feature extractor can be directly applied to modern datasets such as the CVL database (98.6% mAP) without any fine-tuning.
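The pipeline described in the abstract aggregates local ViT features with VLAD encoding. Below is a minimal sketch of a generic VLAD encoder over precomputed patch descriptors; the cluster count, normalization steps, and use of scikit-learn k-means are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(descriptors: np.ndarray, n_clusters: int = 64) -> KMeans:
    """Fit a k-means codebook on local descriptors (e.g. ViT foreground patch tokens)."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(descriptors)

def vlad_encode(descriptors: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Aggregate local descriptors of one page into a single VLAD vector.

    descriptors: (N, D) array of local patch features.
    Returns a (K * D,) power- and L2-normalized VLAD encoding.
    """
    centers = codebook.cluster_centers_               # (K, D)
    assignments = codebook.predict(descriptors)       # (N,)
    k, d = centers.shape
    vlad = np.zeros((k, d), dtype=np.float64)
    for i in range(k):
        members = descriptors[assignments == i]
        if len(members):
            vlad[i] = (members - centers[i]).sum(axis=0)   # residuals to center i
    vlad = vlad.reshape(-1)
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))       # power normalization
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Usage: descriptors pooled from many pages fit the codebook; each page is then
# encoded independently and pages are compared with cosine similarity for retrieval.
```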
Related papers
- MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets in languages other than Arabic. Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z) - VORTEX: Challenging CNNs at Texture Recognition by using Vision Transformers with Orderless and Randomized Token Encodings [1.6594406786473057]
Vision Transformers (ViTs) were introduced a few years ago, but little is known about their texture recognition ability.
We introduce VORTEX, a novel method that enables the effective use of ViTs for texture analysis.
We evaluate VORTEX on nine diverse texture datasets, demonstrating its ability to achieve or surpass SOTA performance.
arXiv Detail & Related papers (2025-03-09T00:36:02Z) - VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents [66.42579289213941]
Retrieval-augmented generation (RAG) is an effective technique that enables large language models to utilize external knowledge sources for generation.
In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline.
In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded as an image using a VLM and then retrieved to enhance the generation of a VLM.
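A minimal sketch of the vision-based retrieval step, under the assumption that a CLIP model from Hugging Face transformers stands in for the VLM embedder used by VisRAG; pages are embedded directly as images, with no OCR or layout parsing.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in embedder: CLIP instead of the VLM used in VisRAG (assumption for illustration).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_pages(paths: list[str]) -> torch.Tensor:
    """Embed document pages directly as images -- no text parsing step."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve(query: str, page_feats: torch.Tensor, top_k: int = 3) -> torch.Tensor:
    """Return indices of the pages most similar to the text query."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = torch.nn.functional.normalize(q, dim=-1)
    scores = page_feats @ q.T                # cosine similarity
    return scores.squeeze(-1).topk(top_k).indices

# The retrieved page images would then be passed to a generative VLM as context.
```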
arXiv Detail & Related papers (2024-10-14T15:04:18Z) - HTR-VT: Handwritten Text Recognition with Vision Transformer [7.997204893256558]
We explore the application of Vision Transformer (ViT) for handwritten text recognition.
Previous transformer-based models required external data or extensive pre-training on large datasets to excel.
We find that incorporating a Convolutional Neural Network (CNN) for feature extraction instead of the original patch embedding, together with a Sharpness-Aware Minimization (SAM) optimizer, ensures that the model can converge towards flatter minima.
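To illustrate the idea of replacing a ViT's linear patch embedding with a convolutional feature extractor, here is a minimal sketch; the layer sizes, strides, and embedding dimension are assumptions, and the SAM optimizer is omitted.

```python
import torch
import torch.nn as nn

class CNNStem(nn.Module):
    """Small convolutional stem producing token embeddings for a ViT encoder
    (illustrative stand-in for the CNN feature extractor described in HTR-VT;
    channel counts and strides are assumptions)."""

    def __init__(self, embed_dim: int = 384):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(128, embed_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, H, W) grayscale text-line image
        feat = self.features(x)                  # (B, D, H/8, W/8)
        return feat.flatten(2).transpose(1, 2)   # (B, N, D) tokens for the ViT

tokens = CNNStem()(torch.randn(2, 1, 64, 512))   # -> (2, 512, 384)
```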
arXiv Detail & Related papers (2024-09-13T06:46:23Z) - Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
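A small sketch of what an interleaved text-and-layout target sequence could look like; the quantization of coordinates into discrete location tokens is an assumption for illustration, not ViTLP's exact tokenization.

```python
def interleave_text_layout(words, boxes, n_bins: int = 1000):
    """Build an interleaved target sequence of word tokens and quantized layout tokens.

    words: list of word strings; boxes: list of (x0, y0, x1, y1) in [0, 1].
    Coordinates are quantized into discrete bins so that text and layout can be
    generated by a single decoder (this binning scheme is an assumption).
    """
    sequence = []
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        sequence.append(word)
        sequence.extend(
            f"<loc_{int(round(v * (n_bins - 1)))}>" for v in (x0, y0, x1, y1)
        )
    return sequence

print(interleave_text_layout(["Invoice", "2024"],
                             [(0.1, 0.05, 0.3, 0.1), (0.35, 0.05, 0.5, 0.1)]))
```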
arXiv Detail & Related papers (2024-03-25T08:00:43Z) - Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention [2.466595763108917]
We propose an attention-guided visualization method applied to ViT that provides a high-level semantic explanation for its decision.
Our method provides elaborate high-level semantic explanations with strong localization performance using only class labels.
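A generic gradient-weighted attention map is sketched below only to illustrate the kind of computation such ViT explanation methods perform; it is not the formulation proposed in this paper, and the inputs (last-block attention, patch-token gradients) are assumptions.

```python
import torch

def attention_guided_cam(attn: torch.Tensor, grads: torch.Tensor, grid: int) -> torch.Tensor:
    """Generic gradient-weighted attention map for a ViT (illustrative only).

    attn:  (heads, N+1, N+1) attention weights of the last block (CLS token at index 0).
    grads: (N, D) gradients of the target class score w.r.t. the patch tokens.
    Returns a (grid, grid) saliency map, assuming N == grid * grid.
    """
    cls_to_patch = attn[:, 0, 1:].mean(dim=0)      # (N,) CLS-to-patch attention
    token_weight = grads.abs().sum(dim=-1)         # (N,) gradient magnitude per token
    cam = torch.relu(cls_to_patch * token_weight)
    cam = cam / (cam.max() + 1e-8)
    return cam.reshape(grid, grid)
```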
arXiv Detail & Related papers (2024-02-07T03:43:56Z) - IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer [26.93638840931684]
Advanced image tampering techniques are challenging the trustworthiness of multimedia.
What makes a good IML model? The answer lies in how artifacts are captured.
We term this simple but effective ViT paradigm IML-ViT, which has significant potential to become a new benchmark for IML.
arXiv Detail & Related papers (2023-07-27T13:49:27Z) - Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks [17.367599062853156]
Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets.
We propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of the vision-language foundation models.
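A minimal sketch of scoring a fine-grained attribute with a positive/negative prompt pair using CLIP; the prompt wording is an assumption, and the paper's multitask fine-tuning objective is more involved than this single zero-shot comparison.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def attribute_probability(image: Image.Image, attribute: str) -> float:
    """Score one binary attribute with a positive/negative prompt pair
    (prompt wording is an illustrative assumption)."""
    prompts = [f"a photo of a bird with {attribute}",      # positive prompt
               f"a photo of a bird without {attribute}"]   # negative prompt
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image          # (1, 2)
    return logits.softmax(dim=-1)[0, 0].item()             # probability of the positive prompt
```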
arXiv Detail & Related papers (2023-07-13T15:05:34Z) - Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in the Few-shot Learning tasks.
Key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
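To convey the general flavour of prefix tuning on a ViT attention block, a minimal sketch follows; it shows generic prefix tuning with learnable key/value tokens, not the paper's Attentive Prefix Tuning, and the DRA module is not reproduced.

```python
import torch
import torch.nn as nn

class PrefixAttention(nn.Module):
    """Self-attention with learnable prefix key/value tokens -- a generic form of
    prefix tuning shown only to illustrate the idea behind APT."""

    def __init__(self, dim: int = 384, heads: int = 6, prefix_len: int = 10):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.prefix_k = nn.Parameter(torch.randn(1, prefix_len, dim) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(1, prefix_len, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        k = torch.cat([self.prefix_k.expand(b, -1, -1), x], dim=1)
        v = torch.cat([self.prefix_v.expand(b, -1, -1), x], dim=1)
        out, _ = self.attn(x, k, v)      # only the prefix tokens are new parameters
        return out

out = PrefixAttention()(torch.randn(2, 197, 384))   # (2, 197, 384)
```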
arXiv Detail & Related papers (2023-01-06T08:42:05Z) - Unsupervised Domain Adaptation for Video Transformers in Action Recognition [76.31442702219461]
We propose a simple and novel UDA approach for video action recognition.
Our approach builds a robust source model that better generalises to the target domain.
We report results on two video action recognition benchmarks for UDA.
arXiv Detail & Related papers (2022-07-26T12:17:39Z) - Test-Time Adaptation for Visual Document Understanding [34.79168501080629]
DocTTA is a novel test-time adaptation method for documents.
It performs source-free domain adaptation using unlabeled target document data.
We introduce new benchmarks using existing public datasets for various VDU tasks.
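A generic source-free test-time adaptation loop (entropy minimization on unlabeled target batches) is sketched below to illustrate the setting; DocTTA's actual self-supervised objectives differ, and the model/loader interface here is an assumption.

```python
import torch

def test_time_adapt(model: torch.nn.Module, loader, steps: int = 1, lr: float = 1e-4):
    """Generic source-free test-time adaptation by entropy minimization on
    unlabeled target batches (an illustration of the setting, not DocTTA's
    exact objectives)."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)
    model.train()
    for _ in range(steps):
        for batch in loader:                 # unlabeled target documents
            logits = model(batch)
            probs = logits.softmax(dim=-1)
            entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
            optimizer.zero_grad()
            entropy.backward()
            optimizer.step()
    return model
```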
arXiv Detail & Related papers (2022-06-15T01:57:12Z) - MDMMT: Multidomain Multimodal Transformer for Video Retrieval [63.872634680339644]
We present a new state-of-the-art on the text to video retrieval task on MSRVTT and LSMDC benchmarks.
We show that training on different datasets can improve test results of each other.
arXiv Detail & Related papers (2021-03-19T09:16:39Z) - Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic Parsing [85.35582118010608]
Task-oriented semantic parsing is a critical component of virtual assistants.
Recent advances in deep learning have enabled several approaches to successfully parse more complex queries.
We propose a novel method that outperforms a supervised neural model at a 10-fold data reduction.
arXiv Detail & Related papers (2020-10-07T17:47:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.