ST-KeyS: Self-Supervised Transformer for Keyword Spotting in Historical
Handwritten Documents
- URL: http://arxiv.org/abs/2303.03127v1
- Date: Mon, 6 Mar 2023 13:39:41 GMT
- Title: ST-KeyS: Self-Supervised Transformer for Keyword Spotting in Historical
Handwritten Documents
- Authors: Sana Khamekhem Jemni, Sourour Ammar, Mohamed Ali Souibgui, Yousri
Kessentini, Abbas Cheddad
- Abstract summary: Keyword spotting (KWS) in historical documents is an important tool for the initial exploration of digitized collections.
We propose ST-KeyS, a masked auto-encoder model based on vision transformers where the pretraining stage is based on the mask-and-predict paradigm.
In the fine-tuning stage, the pre-trained encoder is integrated into a siamese neural network model that is fine-tuned to improve feature embedding from the input images.
- Score: 3.9688530261646653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Keyword spotting (KWS) in historical documents is an important tool for the
initial exploration of digitized collections. Nowadays, the most efficient KWS
methods rely on machine learning techniques that require a large amount of
annotated training data. However, for historical manuscripts, annotated corpora
for training are scarce. To handle this data-scarcity issue, we investigate the
merits of self-supervised learning: extracting useful representations of the
input data without relying on human annotations, and then using these
representations in the downstream task. We propose ST-KeyS, a masked
auto-encoder model based on vision transformers, whose pretraining stage
follows the mask-and-predict paradigm and needs no labeled data. In the
fine-tuning stage, the pre-trained encoder is integrated into a siamese neural
network model that is fine-tuned to improve the feature embeddings of the
input images. We further improve the image
representation using pyramidal histogram of characters (PHOC) embedding to
create and exploit an intermediate representation of images based on text
attributes. In an exhaustive experimental evaluation on three widely used
benchmark datasets (Botany, Alvermann Konzilsprotokolle and George Washington),
the proposed approach outperforms state-of-the-art methods trained on the same
datasets.
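The PHOC embedding used in the fine-tuning stage is a standard word-attribute representation: the word is split into progressively finer regions, and each region gets one bit per alphabet symbol marking whether that symbol occurs there. As an illustration only, here is a minimal plain-Python sketch assuming the common lowercase-plus-digits alphabet and pyramid levels 2 to 5; these parameters are illustrative and not necessarily those used by ST-KeyS:

```python
def build_phoc(word, alphabet="abcdefghijklmnopqrstuvwxyz0123456789",
               levels=(2, 3, 4, 5)):
    """Binary PHOC vector: one bit per (level, region, alphabet symbol),
    set when that symbol occurs in that region of the word."""
    word = word.lower()
    n = len(word)
    if n == 0:
        return [0] * (sum(levels) * len(alphabet))
    phoc = []
    for level in levels:
        for region in range(level):
            r_lo, r_hi = region / level, (region + 1) / level
            bits = [0] * len(alphabet)
            for i, ch in enumerate(word):
                if ch not in alphabet:
                    continue
                c_lo, c_hi = i / n, (i + 1) / n
                # A character is assigned to a region when their overlap
                # covers at least half of the character's own interval.
                overlap = min(r_hi, c_hi) - max(r_lo, c_lo)
                if overlap >= 0.5 * (c_hi - c_lo):
                    bits[alphabet.index(ch)] = 1
            phoc.extend(bits)
    return phoc
```

With these settings the vector has (2 + 3 + 4 + 5) x 36 = 504 dimensions; a siamese branch can then be trained to regress or score against such vectors as an intermediate text-attribute representation.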
Related papers
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to over-fit dominant image features in the training data.
This paper presents a Bayesian-based framework of prompt learning, which could alleviate the overfitting issues on few-shot learning application.
arXiv Detail & Related papers (2024-01-09T10:15:59Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z)
- Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information for it.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods.
arXiv Detail & Related papers (2022-07-15T17:10:48Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- How does a Pre-Trained Transformer Integrate Contextual Keywords? Application to Humanitarian Computing [0.0]
This paper describes how to improve a humanitarian classification task by adding the crisis event type to each tweet to be classified.
It shows how the proposed neural network approach is partially over-fitting the particularities of the Crisis Benchmark.
arXiv Detail & Related papers (2021-11-07T11:24:08Z)
- Pretrained Encoders are All You Need [23.171881382391074]
Self-supervised models have shown successful transfer to diverse settings.
We also explore fine-tuning pretrained representations with self-supervised techniques.
Our results show that pretrained representations are on par with state-of-the-art self-supervised methods trained on domain-specific data.
arXiv Detail & Related papers (2021-06-09T15:27:25Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To keep training on the enlarged dataset tractable, we apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.