Hyper-Local Deformable Transformers for Text Spotting on Historical Maps
- URL: http://arxiv.org/abs/2506.15010v1
- Date: Tue, 17 Jun 2025 22:41:10 GMT
- Title: Hyper-Local Deformable Transformers for Text Spotting on Historical Maps
- Authors: Yijun Lin, Yao-Yi Chiang
- Abstract summary: Text on historical maps contains valuable information providing georeferenced historical, political, and cultural contexts. Previous approaches use ad-hoc steps tailored only to specific map styles. Recent machine learning-based text spotters have the potential to solve these challenges. This paper proposes PALETTE, an end-to-end text spotter for scanned historical maps.
- Score: 2.423679070137552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text on historical maps contains valuable information providing georeferenced historical, political, and cultural contexts. However, text extraction from historical maps is challenging due to the lack of (1) effective methods and (2) training data. Previous approaches use ad-hoc steps tailored only to specific map styles. Recent machine learning-based text spotters (e.g., for scene images) have the potential to solve these challenges because of their flexibility in supporting various types of text instances. However, these methods still face challenges in extracting precise image features for predicting every sub-component (boundary points and characters) in a text instance. This is critical because map text can be lengthy and highly rotated with complex backgrounds, posing difficulties in detecting relevant image features from a rough text region. This paper proposes PALETTE, an end-to-end text spotter for a wide variety of scanned historical maps. PALETTE introduces a novel hyper-local sampling module to explicitly learn localized image features around the target boundary points and characters of a text instance for detection and recognition. PALETTE also introduces hyper-local positional embeddings to learn spatial interactions between boundary points and characters within and across text instances. In addition, this paper presents a novel approach, SynthMap+, to automatically generate synthetic map images for training text spotters for historical maps. Experiments show that PALETTE with SynthMap+ outperforms SOTA text spotters on two new benchmark datasets of historical maps, particularly for long and angled text. We have deployed PALETTE with SynthMap+ to process over 60,000 maps in the David Rumsey Historical Map collection and generated over 100 million text labels to support map searching. The project is released at https://github.com/kartta-foundation/mapkurator-palette-doc.
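The hyper-local sampling described in the abstract can be pictured as extracting features only at a text instance's predicted sub-component locations rather than from a rough region. The following PyTorch sketch is a minimal illustration of that idea, not the authors' implementation; the tensor shapes, function name, and (x, y) normalized-coordinate convention are all assumptions:

```python
import torch
import torch.nn.functional as F

def sample_point_features(feature_map: torch.Tensor,
                          points: torch.Tensor) -> torch.Tensor:
    """Sample one feature vector per predicted point (e.g., a boundary
    point or character center of a text instance).

    feature_map: (B, C, H, W) backbone features.
    points:      (B, N, 2) (x, y) coordinates normalized to [0, 1].
    Returns:     (B, N, C) localized features, one per point.
    """
    # grid_sample expects coordinates in [-1, 1] with shape (B, H_out, W_out, 2).
    grid = points * 2.0 - 1.0        # map [0, 1] -> [-1, 1]
    grid = grid.unsqueeze(1)         # (B, 1, N, 2)
    sampled = F.grid_sample(feature_map, grid, align_corners=False)  # (B, C, 1, N)
    return sampled.squeeze(2).permute(0, 2, 1)

# Toy usage: 16 boundary points sampled from a 256-channel feature map.
feats = torch.randn(1, 256, 64, 64)
pts = torch.rand(1, 16, 2)
print(sample_point_features(feats, pts).shape)  # torch.Size([1, 16, 256])
```

In a deformable-transformer setting, such per-point features would feed the detection and recognition heads; per the abstract, PALETTE additionally attaches hyper-local positional embeddings so points can interact within and across instances.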
Related papers
- LIGHT: Multi-Modal Text Linking on Historical Maps [1.8399976559754367]
LIGHT is a novel multi-modal approach that integrates linguistic, image, and geometric features for linking text on historical maps. It outperforms existing methods on the ICDAR 2024/2025 MapText Competition data.
arXiv Detail & Related papers (2025-06-27T19:18:00Z) - MapExplorer: New Content Generation from Low-Dimensional Visualizations [60.02149343347818]
Low-dimensional visualizations, or "projection maps," are widely used to interpret large-scale and complex datasets.<n>These visualizations not only aid in understanding existing knowledge spaces but also implicitly guide exploration into unknown areas.<n>We introduce MapExplorer, a novel knowledge discovery task that translates coordinates within any projection map into coherent, contextually aligned textual content.
arXiv Detail & Related papers (2024-12-24T20:16:13Z) - Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt [10.17947324152468]
The region prompt tuning method decomposes the region text prompt into individual characters and splits the visual feature map into region visual tokens.
This allows each character to match the local features of a token, thereby avoiding the omission of detailed features and fine-grained text.
Our proposed method combines a general score map from the image-text process with a region score map derived from character-token matching.
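As a rough sketch of that combination (the elementwise blend and the mixing weight below are assumptions, not the paper's exact formulation):

```python
import torch

def fuse_score_maps(general_map: torch.Tensor,
                    region_map: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Blend a general image-text score map with a region score map
    derived from character-token matching.

    Both inputs are (H, W) score maps; alpha is a hypothetical
    mixing weight, not taken from the paper.
    """
    assert general_map.shape == region_map.shape
    return alpha * general_map + (1.0 - alpha) * region_map

# Toy usage on 32x32 score maps.
fused = fuse_score_maps(torch.rand(32, 32), torch.rand(32, 32))
```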
arXiv Detail & Related papers (2024-09-20T15:24:26Z) - Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z) - The mapKurator System: A Complete Pipeline for Extracting and Linking
Text from Historical Maps [7.209761597734092]
mapKurator is an end-to-end system integrating machine learning models with a comprehensive data processing pipeline.
We deployed the mapKurator system and enabled the processing of over 60,000 maps and over 100 million text/place names in the David Rumsey Historical Map collection.
arXiv Detail & Related papers (2023-06-29T16:05:40Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
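A one-line sketch of such joint optimization over shared features (the uniform weights are assumptions, not TextFormer's configuration):

```python
import torch

def multitask_loss(cls_loss: torch.Tensor,
                   seg_loss: torch.Tensor,
                   rec_loss: torch.Tensor,
                   weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Combine classification, segmentation, and recognition losses so
    all three branches are trained jointly and gradients flow through
    the shared encoder/decoder features."""
    w_cls, w_seg, w_rec = weights
    return w_cls * cls_loss + w_seg * seg_loss + w_rec * rec_loss
```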
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Expressive Text-to-Image Generation with Rich Text [42.923053338525804]
We propose a rich-text editor supporting formats such as font style, size, color, and footnote. We extract each word's attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis.
arXiv Detail & Related papers (2023-04-13T17:59:55Z) - Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text in an image with desired text while preserving the background and style of the original text.
Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously.
We propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL).
arXiv Detail & Related papers (2022-12-05T02:10:59Z) - SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z) - Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z) - Synthetic Map Generation to Provide Unlimited Training Data for
Historical Map Text Detection [5.872532529455414]
We propose a method to automatically generate an unlimited amount of annotated historical map images for training text detection models.
We show that state-of-the-art text detection models can benefit from the synthetic historical maps.
arXiv Detail & Related papers (2021-12-12T00:27:03Z) - An Automatic Approach for Generating Rich, Linked Geo-Metadata from
Historical Map Images [6.962949867017594]
This paper presents an end-to-end approach to address the real-world problem of finding and indexing historical map images.
We have implemented the approach in a system called mapKurator.
arXiv Detail & Related papers (2021-12-03T01:44:38Z) - Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and search, from an image gallery, all text instances that are the same as or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
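A minimal sketch of that ranking step (cosine similarity stands in for the learned cross-modal similarity; embedding shapes and names are assumptions):

```python
import torch
import torch.nn.functional as F

def rank_instances(query_emb: torch.Tensor,
                   instance_embs: torch.Tensor) -> torch.Tensor:
    """Rank detected text instances by similarity to a query text.

    query_emb:     (D,) embedding of the query text.
    instance_embs: (N, D) embeddings of detected text instances.
    Returns instance indices sorted from most to least similar.
    """
    sims = F.cosine_similarity(instance_embs, query_emb.unsqueeze(0), dim=1)
    return torch.argsort(sims, descending=True)

# Toy usage: rank five detected instances against one query.
order = rank_instances(torch.randn(128), torch.randn(5, 128))
```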
arXiv Detail & Related papers (2021-04-04T07:18:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.