TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification
- URL: http://arxiv.org/abs/2503.06501v1
- Date: Sun, 09 Mar 2025 08:03:41 GMT
- Title: TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification
- Authors: Huaqi Tao, Bingxi Liu, Calvin Chen, Tingjun Huang, He Li, Jinqiang Cui, Hong Zhang
- Abstract summary: TextInPlace is a framework that integrates Scene Text Spotting (STS) to mitigate visual perceptual ambiguity in repetitive indoor environments. To bridge the gap between current text-based repetitive indoor scene datasets and the typical scenarios encountered in robot navigation, we establish an indoor Visual Place Recognition benchmark dataset.
- Score: 6.113831719528347
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual Place Recognition (VPR) is a crucial capability for long-term autonomous robots, enabling them to identify previously visited locations using visual information. However, existing methods remain limited in indoor settings due to the highly repetitive structures inherent in such environments. We observe that scene text typically appears in indoor spaces, serving to distinguish visually similar but different places. This inspires us to propose TextInPlace, a simple yet effective VPR framework that integrates Scene Text Spotting (STS) to mitigate visual perceptual ambiguity in repetitive indoor environments. Specifically, TextInPlace adopts a dual-branch architecture within a local parameter sharing network. The VPR branch employs attention-based aggregation to extract global descriptors for coarse-grained retrieval, while the STS branch utilizes a bridging text spotter to detect and recognize scene text. Finally, the discriminative text is filtered to compute text similarity and re-rank the top-K retrieved images. To bridge the gap between current text-based repetitive indoor scene datasets and the typical scenarios encountered in robot navigation, we establish an indoor VPR benchmark dataset, called Maze-with-Text. Extensive experiments on both custom and public datasets demonstrate that TextInPlace achieves superior performance over existing methods that rely solely on appearance information. The dataset, code, and trained models are publicly available at https://github.com/HqiTao/TextInPlace.
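To make the pipeline concrete, here is a minimal sketch of the coarse-retrieval-plus-text-re-ranking flow described in the abstract. All function names, the difflib-based string score, and the blending weight `alpha` are illustrative assumptions; the paper's actual descriptor aggregation, text filtering, and similarity computation differ.

```python
from difflib import SequenceMatcher

import numpy as np


def text_similarity(query_texts, cand_texts):
    """Best pairwise string similarity between two sets of spotted words;
    a difflib stand-in for the paper's text-similarity measure."""
    if not query_texts or not cand_texts:
        return 0.0
    return max(
        SequenceMatcher(None, q.lower(), c.lower()).ratio()
        for q in query_texts
        for c in cand_texts
    )


def retrieve_and_rerank(query_desc, db_descs, query_texts, db_texts, k=10, alpha=0.5):
    """Coarse retrieval with global descriptors, then text-based re-ranking.

    query_desc: (D,) L2-normalized global descriptor of the query image.
    db_descs:   (N, D) L2-normalized descriptors of the database images.
    query_texts / db_texts: recognized text strings per image.
    alpha: hypothetical weight balancing appearance and text scores.
    """
    sims = db_descs @ query_desc      # cosine similarity (VPR branch)
    topk = np.argsort(-sims)[:k]      # coarse top-K candidates
    # Re-rank the top-K with spotted-text similarity (STS branch).
    scored = [
        (i, alpha * sims[i] + (1 - alpha) * text_similarity(query_texts, db_texts[i]))
        for i in topk
    ]
    scored.sort(key=lambda t: -t[1])
    return [i for i, _ in scored]
```

Note that the paper additionally filters for discriminative text before scoring; the uniform blend above is only meant to show where text verification enters the loop.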
Related papers
- Arbitrary Reading Order Scene Text Spotter with Local Semantics Guidance [6.93632116687419]
Local semantic knowledge includes not only text content but also spatial information in the correct reading order. We propose the Local Semantics Guided scene text Spotter (LSGSpotter), which auto-regressively decodes the position and content of characters guided by local semantics. LSGSpotter achieves arbitrary reading order spotting without being constrained by sophisticated detection.
arXiv Detail & Related papers (2024-12-13T14:20:43Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
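A minimal sketch of the general idea above: replace one-hot recognition targets with distributions informed by a text corpus. The unigram prior and the smoothing scheme below are illustrative assumptions; the paper derives its text distributions differently.

```python
from collections import Counter

import torch
import torch.nn.functional as F

CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_TO_IDX = {c: i for i, c in enumerate(CHARSET)}


def corpus_char_distribution(corpus):
    """Character unigram distribution estimated from a large text corpus."""
    counts = Counter(c for line in corpus for c in line.lower() if c in CHAR_TO_IDX)
    total = sum(counts.values())
    return torch.tensor([counts.get(c, 0) / total for c in CHARSET])


def soft_targets(label, prior, eps=0.1):
    """Blend one-hot targets with the corpus prior (an assumed smoothing scheme)."""
    one_hot = F.one_hot(
        torch.tensor([CHAR_TO_IDX[c] for c in label.lower()]), len(CHARSET)
    ).float()
    return (1 - eps) * one_hot + eps * prior  # (len(label), |CHARSET|)


# Training would then minimize cross-entropy against the soft targets,
# e.g. loss = -(targets * log_probs).sum(-1).mean().
prior = corpus_char_distribution(["room 101", "exit", "elevator hall"])
targets = soft_targets("exit", prior)
```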
- Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards Enhancing Text Spotting Performance [15.513912470752041]
The ability to adapt to a wide range of domains is crucial for scene text spotting models deployed in real-world conditions.
Here, we investigate the problem of domain-adaptive scene text spotting, i.e., training a model on multi-domain source data.
The results clearly demonstrate the potential of intermediate representations to achieve significant performance on text spotting benchmarks across multiple domains.
arXiv Detail & Related papers (2023-10-02T06:08:01Z)
- Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes [11.478236584340255]
We present a text spotting validation benchmark called Under-Water Text (UWT) for noisy underwater scenes.
We also design an efficient super-resolution-based end-to-end Transformer baseline called DA-TextSpotter.
The dataset, code and pre-trained models will be released upon acceptance.
arXiv Detail & Related papers (2023-10-01T03:27:41Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
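As a toy illustration of the multi-branch training idea above (not the TextFormer architecture), the sketch below shares one encoder across classification, segmentation, and recognition heads so that a joint loss optimizes all three; every name and shape here is an assumption.

```python
import torch
import torch.nn as nn


class MultiTaskSpotter(nn.Module):
    """Shared-encoder model with classification, segmentation, and
    recognition heads trained jointly; a simplified stand-in."""

    def __init__(self, dim=64, num_chars=37, decode_steps=25):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cls_head = nn.Linear(dim, 2)          # text / no-text
        self.seg_head = nn.Conv2d(dim, 1, 1)       # text-mask logits
        self.rec_head = nn.Linear(dim, num_chars)  # per-step character logits
        self.steps = decode_steps

    def forward(self, images):
        feats = self.encoder(images)               # (B, dim, H/4, W/4)
        pooled = feats.mean(dim=(2, 3))            # (B, dim)
        cls_logits = self.cls_head(pooled)
        seg_logits = self.seg_head(feats)
        # Naive recognition: reuse pooled features at every decode step.
        rec_logits = self.rec_head(pooled).unsqueeze(1).expand(-1, self.steps, -1)
        return cls_logits, seg_logits, rec_logits


model = MultiTaskSpotter()
cls_logits, seg_logits, rec_logits = model(torch.rand(2, 3, 64, 64))
# In training, a weighted sum of the three task losses would be minimized,
# which is what drives the deeper feature sharing described above.
```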
- DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer [94.35116535588332]
Transformer-based methods, which predict polygon points or Bezier curve control points to localize text, are quite popular in scene text detection.
However, the point label form used implies a human reading order, which affects the robustness of the Transformer model.
We propose DPText-DETR, which directly uses point coordinates as queries and dynamically updates them between decoder layers.
arXiv Detail & Related papers (2022-07-10T15:45:16Z)
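The core loop described above, treating point coordinates themselves as queries and refining them layer by layer, can be sketched as follows; the layer internals are simplified stand-ins, not the actual DPText-DETR code.

```python
import torch
import torch.nn as nn


class PointRefineLayer(nn.Module):
    """One simplified decoder layer: attend over image features, then
    predict coordinate offsets that update the point queries."""

    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.point_embed = nn.Linear(2, dim)  # embed (x, y) coordinates
        self.offset_head = nn.Linear(dim, 2)  # predict per-point offsets

    def forward(self, points, memory):
        # points: (B, N*P, 2) normalized polygon point coordinates.
        # memory: (B, HW, dim) flattened image features from the encoder.
        q = self.point_embed(points)
        feats, _ = self.attn(q, memory, memory)
        # Dynamic update: points move by the predicted offsets each layer.
        return (points + self.offset_head(feats)).clamp(0.0, 1.0)


layers = nn.ModuleList(PointRefineLayer() for _ in range(6))
points = torch.rand(2, 100 * 16, 2)   # 100 instances x 16 polygon points
memory = torch.rand(2, 32 * 32, 256)  # dummy encoder features
for layer in layers:
    points = layer(points, memory)    # coordinates refined between layers
```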
- Text Detection & Recognition in the Wild for Robot Localization [1.52292571922932]
We propose an end-to-end scene text spotting model that simultaneously outputs the text string and bounding boxes.
Our central contribution is the use of an end-to-end scene text spotting framework to adequately capture irregular and occluded text regions.
arXiv Detail & Related papers (2022-05-17T18:16:34Z)
- Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z)
- SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition [73.61592015908353]
We propose a new end-to-end scene text spotting framework termed SwinTextSpotter.
Using a Transformer with a dynamic head as the detector, we unify the two tasks with a novel Recognition Conversion mechanism.
The design results in a concise framework that requires neither an additional rectification module nor character-level annotation.
arXiv Detail & Related papers (2022-03-19T01:14:42Z)
- Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and search for all text instances in an image gallery that are the same as or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
arXiv Detail & Related papers (2021-04-04T07:18:38Z)
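To make the ranking step above concrete, here is a minimal sketch of cross-modal retrieval: the query text and each detected text instance are assumed to be embedded into a shared space by learned encoders (omitted here), and gallery images are ranked by their best-matching instance.

```python
import numpy as np


def l2norm(x):
    """Row-wise L2 normalization."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def rank_gallery(query_emb, gallery_instance_embs):
    """Rank gallery images by the best cosine similarity between the query
    text embedding and any detected text instance in each image."""
    scores = []
    for inst in gallery_instance_embs:
        # Images with no detected text get the lowest possible score.
        scores.append(float((inst @ query_emb).max()) if len(inst) else -1.0)
    return np.argsort(scores)[::-1], scores


# Toy usage: random unit vectors stand in for learned cross-modal embeddings.
rng = np.random.default_rng(0)
query = l2norm(rng.normal(size=64))
gallery = [l2norm(rng.normal(size=(m, 64))) for m in (3, 5)] + [np.zeros((0, 64))]
order, scores = rank_gallery(query, gallery)  # best-matching image first
```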