VGTS: Visually Guided Text Spotting for Novel Categories in Historical Manuscripts
- URL: http://arxiv.org/abs/2304.00746v4
- Date: Fri, 29 Mar 2024 13:32:53 GMT
- Title: VGTS: Visually Guided Text Spotting for Novel Categories in Historical Manuscripts
- Authors: Wenbo Hu, Hongjian Zhan, Xinchen Ma, Cong Liu, Bing Yin, Yue Lu
- Abstract summary: We propose a Visually Guided Text Spotting (VGTS) approach that accurately spots novel characters using just one annotated support sample.
The Dual Spatial Attention (DSA) block aims to identify, focus on, and learn discriminative spatial regions in the support and query images, mimicking the human visual spotting process.
To tackle the example imbalance problem in low-resource spotting tasks, we develop a novel torus loss function that enhances the discriminative power of the embedding space for distance metric learning.
- Score: 26.09365732823049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of historical manuscript research, scholars frequently encounter novel symbols in ancient texts, investing considerable effort in their identification and documentation. Although existing object detection methods achieve impressive performance on known categories, they struggle to recognize novel symbols without retraining. To address this limitation, we propose a Visually Guided Text Spotting (VGTS) approach that accurately spots novel characters using just one annotated support sample. The core of VGTS is a spatial alignment module consisting of a Dual Spatial Attention (DSA) block and a Geometric Matching (GM) block. The DSA block aims to identify, focus on, and learn discriminative spatial regions in the support and query images, mimicking the human visual spotting process. It first refines the support image by analyzing inter-channel relationships to identify critical areas, and then refines the query image by focusing on informative key points. The GM block, on the other hand, establishes the spatial correspondence between the two images, enabling accurate localization of the target character in the query image. To tackle the example imbalance problem in low-resource spotting tasks, we develop a novel torus loss function that enhances the discriminative power of the embedding space for distance metric learning. To further validate our approach, we introduce a new dataset featuring ancient Dongba hieroglyphics (DBH) associated with the Naxi minority of China. Extensive experiments on the DBH dataset and other public datasets, including EGY, VML-HD, TKH, and NC, show that VGTS consistently surpasses state-of-the-art methods. The proposed framework exhibits great potential for application in historical manuscript text spotting, enabling scholars to efficiently identify and document novel symbols with minimal annotation effort.
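The abstract describes the DSA block functionally (channel attention to refine the support image, spatial attention to refine the query image) without giving the exact architecture. Below is a minimal PyTorch sketch of that dual-attention idea under those assumptions; the module name, layer sizes, and pooling choices are illustrative and are not the authors' implementation.

```python
# Minimal sketch of the dual-attention idea from the abstract (assumption,
# not the paper's architecture): channel attention refines support features,
# spatial attention refines query features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualSpatialAttentionSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Support branch: model inter-channel relationships with a small MLP.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Query branch: highlight informative locations from pooled channel statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, support: torch.Tensor, query: torch.Tensor):
        # support, query: (B, C, H, W) feature maps from a shared backbone.
        b, c, _, _ = support.shape

        # Refine the support image via channel attention.
        squeezed = F.adaptive_avg_pool2d(support, 1).view(b, c)
        channel_weights = torch.sigmoid(self.channel_mlp(squeezed)).view(b, c, 1, 1)
        support_refined = support * channel_weights

        # Refine the query image via spatial attention over informative key points.
        avg_map = query.mean(dim=1, keepdim=True)
        max_map = query.amax(dim=1, keepdim=True)
        spatial_weights = torch.sigmoid(
            self.spatial_conv(torch.cat([avg_map, max_map], dim=1))
        )
        query_refined = query * spatial_weights

        return support_refined, query_refined
```

The refined support and query features would then feed the Geometric Matching block, which establishes the spatial correspondence used to localize the target character.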
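The exact torus loss is defined in the paper itself; as a rough illustration of the role it plays in distance metric learning, the hypothetical double-margin loss below only penalizes pairs whose distances fall inside a margin band, so abundant easy pairs do not swamp the few hard ones that matter under example imbalance. The function name, margins, and formulation here are assumptions, not the paper's definition.

```python
# Hedged stand-in for a margin-band metric-learning loss: positives are pulled
# inside an inner margin, negatives pushed beyond an outer margin; pairs already
# satisfying their margin contribute nothing.
import torch
import torch.nn.functional as F


def ring_margin_loss(embeddings: torch.Tensor,
                     labels: torch.Tensor,
                     inner_margin: float = 0.5,
                     outer_margin: float = 1.5) -> torch.Tensor:
    """embeddings: (N, D) embedding vectors; labels: (N,) class ids."""
    dists = torch.cdist(embeddings, embeddings)          # (N, N) pairwise distances
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1))   # positive-pair mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)

    # Only pairs violating their margin produce a nonzero term.
    pos_loss = F.relu(dists - inner_margin)[same & ~eye]
    neg_loss = F.relu(outer_margin - dists)[~same]
    return torch.cat([pos_loss, neg_loss]).mean()
```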
Related papers
- LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model [20.007650672107566]
Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos.
Recent methods track the zero-shot results of state-of-the-art image text spotters directly.
Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements.
arXiv Detail & Related papers (2024-05-29T15:35:09Z)
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
- Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval [8.00022369501487]
We propose a novel Direction-Oriented Visual-semantic Embedding Model (DOVE) to mine the relationship between vision and language.
Our highlight is to align the visual and textual representations in latent space, directing them as close as possible to a redundancy-free regional visual representation.
We exploit a global visual-semantic constraint to reduce single visual dependency and serve as an external constraint for the final visual and textual representations.
arXiv Detail & Related papers (2023-10-12T12:28:47Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Spatial Reasoning for Few-Shot Object Detection [21.3564383157159]
We propose a spatial reasoning framework that detects novel objects with only a few training examples in a context.
We employ a graph convolutional network in which RoIs and their relatedness are defined as nodes and edges, respectively.
We demonstrate that the proposed method significantly outperforms the state-of-the-art methods and verify its efficacy through extensive ablation studies.
arXiv Detail & Related papers (2022-11-02T12:38:08Z)
- Scene Graph Generation: A Comprehensive Survey [35.80909746226258]
The scene graph has been a focus of research because of its powerful semantic representation and its applications to scene understanding.
Scene Graph Generation (SGG) refers to the task of automatically mapping an image into a semantic structural scene graph.
We review 138 representative works that cover different input modalities, and systematically summarize existing methods of image-based SGG.
arXiv Detail & Related papers (2022-01-03T00:55:33Z)
- Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting [168.91748514706995]
We propose two novel cross-modal translation pre-text tasks for self-supervised feature learning: Vectorization and Rasterization.
Our learned encoder modules benefit both raster-based and vector-based downstream approaches to analysing hand-drawn data.
arXiv Detail & Related papers (2021-03-25T09:47:18Z)
- Exploring Bottom-up and Top-down Cues with Attentive Learning for Webly Supervised Object Detection [76.9756607002489]
We propose a novel webly supervised object detection (WebSOD) method for novel classes.
Our proposed method combines bottom-up and top-down cues for novel class detection.
We demonstrate our proposed method on PASCAL VOC dataset with three different novel/base splits.
arXiv Detail & Related papers (2020-03-22T03:11:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.