Word length-aware text spotting: Enhancing detection and recognition in dense text image
- URL: http://arxiv.org/abs/2312.15690v1
- Date: Mon, 25 Dec 2023 10:46:20 GMT
- Title: Word length-aware text spotting: Enhancing detection and recognition in dense text image
- Authors: Hao Wang, Huabing Zhou, Yanduo Zhang, Tao Lu and Jiayi Ma
- Abstract summary: We present WordLenSpotter, a novel word length-aware spotter for scene text image detection and recognition.
We improve the spotting capabilities for long and short words, particularly in the tail data of dense text images.
- Score: 33.44340604133642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text spotting is essential in various computer vision applications,
enabling the extraction and interpretation of textual information from images. However,
existing methods often neglect the spatial semantics of word images, leading to
suboptimal detection recall rates for long and short words within long-tailed
word length distributions that exist prominently in dense scenes. In this
paper, we present WordLenSpotter, a novel word length-aware spotter for scene
text image detection and recognition, improving the spotting capabilities for
long and short words, particularly in the tail data of dense text images. We
first design an image encoder equipped with a dilated convolutional fusion
module to integrate multiscale text image features effectively. Then,
leveraging the Transformer framework, we synergistically optimize text
detection and recognition accuracy after iteratively refining text region image
features using the word length prior. Specifically, we design a Spatial Length
Predictor (SLP) module that uses a character count prior tailored to different word
lengths to constrain the regions of interest effectively. Furthermore, we
introduce a specialized word Length-aware Segmentation (LenSeg) proposal head,
enhancing the network's capacity to capture the distinctive features of long
and short terms within categories characterized by long-tailed distributions.
Comprehensive experiments on public datasets and our dense text spotting
dataset DSTD1500 demonstrate the superiority of our proposed methods,
particularly in dense text image detection and recognition tasks involving
long-tailed word length distributions encompassing a range of long and short
words.
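The abstract names a dilated convolutional fusion module for integrating multiscale features but gives no implementation details. The sketch below shows one plausible reading, assuming the common pattern of parallel 3x3 convolutions with increasing dilation rates fused by a 1x1 convolution; the class name DilatedFusion, the rates, and the residual connection are hypothetical, not taken from the paper.

```python
# Minimal sketch of a dilated convolutional fusion module, assuming the
# common design of parallel 3x3 convolutions with increasing dilation
# rates whose outputs are fused by a 1x1 convolution. Names and rates
# are hypothetical; the paper does not publish this exact code.
import torch
import torch.nn as nn

class DilatedFusion(nn.Module):
    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        # One branch per dilation rate; padding keeps the spatial size fixed.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        # A 1x1 convolution fuses the concatenated multiscale responses.
        self.fuse = nn.Conv2d(channels * len(rates), channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multiscale = torch.cat([self.act(b(x)) for b in self.branches], dim=1)
        return self.act(self.fuse(multiscale)) + x  # residual connection

feats = torch.randn(2, 256, 64, 64)  # e.g. a backbone feature map
out = DilatedFusion(256)(feats)      # same shape: (2, 256, 64, 64)
```

Dilated branches enlarge the receptive field without downsampling, which is why such modules are a natural fit for integrating multiscale features in dense text images.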
Related papers
- LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [71.04947115945349]
Long text understanding is in great demand in language-image pre-training models.
We relabel the data with long captions; however, training directly on them may degrade short-text understanding.
We validate the effectiveness of our approach using a self-constructed large-scale dataset.
Notably, on long-text image retrieval, our method outperforms the competing method that uses long captions by 11.1%.
(arXiv, 2024-10-07)
- Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding text.
We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets, RefSegRS and RRSIS-D.
(arXiv, 2024-09-20)
- Out of Length Text Recognition with Sub-String Matching [54.63761108308825]
In this paper, we term this task Out of Length (OOL) text recognition.
We propose a novel method called OOL Text Recognition with sub-String Matching (SMTR).
SMTR comprises two cross-attention-based modules: one encodes a sub-string containing multiple characters into next and previous queries, and the other employs these queries to attend to the image features (see the sketch after this list).
(arXiv, 2024-07-17)
- GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching [77.0306273129475]
Video text spotting adds the further challenge of tracking text across frames.
GoMatching focuses the training efforts on tracking while maintaining strong recognition performance.
GoMatching sets new records on ICDAR15-video, DSText, BOVText, and ArTVideo, our newly proposed test set with arbitrary-shaped text.
(arXiv, 2024-01-13)
- Paragraph-to-Image Generation with Information-Enriched Diffusion Model [67.9265336953134]
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task.
It transfers the extensive semantic comprehension capabilities of large language models to image generation.
The code and dataset will be released to foster community research on long-text alignment.
(arXiv, 2023-11-24)
- LISTER: Neighbor Decoding for Length-Insensitive Scene Text Recognition [27.280917081410955]
We propose a method called Length-Insensitive Scene TExt Recognizer (LISTER).
A Neighbor Decoder is proposed to obtain accurate character attention maps with the assistance of a novel neighbor matrix.
A Feature Enhancement Module is devised to model the long-range dependency with low cost.
(arXiv, 2023-08-24)
- Semantic-Preserving Augmentation for Robust Image-Text Retrieval [27.2916415148638]
RVSE consists of novel image-based and text-based augmentation techniques, called semantic-preserving augmentation for image (SPAugI) and text (SPAugT).
Because SPAugI and SPAugT change the original data while preserving its semantic information, the feature extractors are encouraged to generate semantically aware embedding vectors.
Extensive experiments on benchmark datasets show that RVSE outperforms conventional retrieval schemes in image-text retrieval performance.
(arXiv, 2023-03-10)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
(arXiv, 2022-07-26)
- IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training [18.898969509263804]
IDEA stands for increasing text diversity via online multi-label recognition for Vision-Language Pre-training.
We show that IDEA can significantly boost the performance on multiple downstream datasets with a small extra computational cost.
(arXiv, 2022-07-12)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training (FILIP) achieves finer-grained alignment through a cross-modal late interaction mechanism (see the sketch after this list).
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
(arXiv, 2021-11-09)
- MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding [6.4901484665257545]
We propose a novel multi-head self-attention network that captures the various components of visual and textual data by attending to important parts of the data.
Our approach achieves new state-of-the-art results on image-text retrieval tasks on the MS-COCO and Flickr30K datasets.
(arXiv, 2020-01-11)
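For the SMTR entry above, the summary describes two cross-attention modules: one that turns a sub-string into "next" and "previous" queries, and one that lets those queries attend to image features. The following is a minimal sketch of that query-based decoding under stated assumptions; SubStringAttention, the learned query seeds, and all dimensions are hypothetical, not the authors' code.

```python
# Minimal sketch of the query-based cross-attention the SMTR summary
# describes: learned "next"/"previous" queries, conditioned on a sub-string,
# attend to image features to predict the adjacent characters. This is a
# simplification under stated assumptions, not the released implementation.
import torch
import torch.nn as nn

class SubStringAttention(nn.Module):
    def __init__(self, d_model=256, nhead=8, vocab_size=97):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d_model)
        # One cross-attention encodes the sub-string into two queries ...
        self.to_queries = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.next_prev = nn.Parameter(torch.randn(2, d_model))  # learned seeds
        # ... and the other lets those queries attend to image features.
        self.read_image = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.classify = nn.Linear(d_model, vocab_size)

    def forward(self, substr_ids, image_feats):
        # substr_ids: (B, L) character ids; image_feats: (B, N, d_model)
        chars = self.char_emb(substr_ids)
        seeds = self.next_prev.unsqueeze(0).expand(substr_ids.size(0), -1, -1)
        queries, _ = self.to_queries(seeds, chars, chars)      # (B, 2, d)
        context, _ = self.read_image(queries, image_feats, image_feats)
        return self.classify(context)  # logits for next / previous character

logits = SubStringAttention()(torch.randint(0, 97, (4, 5)),
                              torch.randn(4, 196, 256))
print(logits.shape)  # torch.Size([4, 2, 97])
```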
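For the FILIP entry, cross-modal late interaction is commonly realized as token-wise maximum similarity: each image token takes its best-matching text token and vice versa, and the two directions are averaged. Below is a minimal sketch of that scoring, assuming L2-normalized token embeddings and paired inputs; late_interaction_similarity is a hypothetical name, and the released method differs in details (e.g., it scores all image-text pairs in a training batch).

```python
# Minimal sketch of a cross-modal late-interaction similarity in the spirit
# of the FILIP summary above. A simplification under stated assumptions,
# not the released implementation.
import torch
import torch.nn.functional as F

def late_interaction_similarity(img_tokens, txt_tokens):
    """img_tokens: (B, Ni, d); txt_tokens: (B, Nt, d)."""
    img = F.normalize(img_tokens, dim=-1)
    txt = F.normalize(txt_tokens, dim=-1)
    sim = img @ txt.transpose(1, 2)          # (B, Ni, Nt) token-wise cosine
    i2t = sim.max(dim=2).values.mean(dim=1)  # image->text: max over text tokens
    t2i = sim.max(dim=1).values.mean(dim=1)  # text->image: max over image tokens
    return 0.5 * (i2t + t2i)                 # (B,) one similarity per pair

scores = late_interaction_similarity(torch.randn(4, 196, 512),
                                     torch.randn(4, 32, 512))
print(scores.shape)  # torch.Size([4])
```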
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.