I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection
- URL: http://arxiv.org/abs/2108.01343v1
- Date: Tue, 3 Aug 2021 07:48:12 GMT
- Title: I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection
- Authors: Jian Ye, Jing Zhang, Juhua Liu, Bo Du and Dacheng Tao
- Abstract summary: We propose a novel method named Intra- and Inter-Instance Collaborative Learning (I3CL).
Specifically, to address the first issue, we design an effective convolutional module with multiple receptive fields.
To address the second issue, we devise an instance-based transformer module to exploit the dependencies between different text instances.
- Score: 93.62705504233931
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods for arbitrary-shaped text detection in natural scenes face
two critical issues, i.e., 1) fracture detections at the gaps in a text
instance; and 2) inaccurate detections of arbitrary-shaped text instances with
diverse background context. To address these issues, we propose a novel method
named Intra- and Inter-Instance Collaborative Learning (I3CL). Specifically, to
address the first issue, we design an effective convolutional module with
multiple receptive fields, which is able to collaboratively learn better
character and gap feature representations at local and long ranges inside a
text instance. To address the second issue, we devise an instance-based
transformer module to exploit the dependencies between different text instances
and a pixel-based transformer module to exploit the global context from the
shared background, which are able to collaboratively learn more discriminative
text feature representations. In this way, I3CL can effectively exploit the
intra- and inter-instance dependencies together in a unified end-to-end
trainable framework. Experimental results show that the proposed I3CL sets new
state-of-the-art performance on three challenging public benchmarks, i.e., an
F-measure of 76.4% on ICDAR2019-ArT, 86.2% on Total-Text, and 85.8% on
CTW-1500. Moreover, I3CL with a ResNeSt-101 backbone ranked 1st on the
ICDAR2019-ArT leaderboard. The source code will be made publicly available.
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning [19.856492291263102]
We propose representation learning for real-time scene text detection.
For semantic representation learning, we propose global-dense semantic contrast (GDSC) and top-down modeling (TDM).
With the proposed GDSC and TDM, the encoder network learns stronger representation without introducing any parameters and computations during inference.
The proposed method achieves 87.2% F-measure with 48.2 FPS on Total-Text and 89.6% F-measure with 36.9 FPS on MSRA-TD500.
arXiv Detail & Related papers (2023-08-14T15:14:37Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
- AlignSeg: Feature-Aligned Segmentation Networks [109.94809725745499]
We propose Feature-Aligned Networks (AlignSeg) to address misalignment issues during the feature aggregation process.
Our network achieves new state-of-the-art mIoU scores of 82.6% and 45.95%, respectively.
arXiv Detail & Related papers (2020-02-24T10:00:58Z)
- A New Perspective for Flexible Feature Gathering in Scene Text Recognition Via Character Anchor Pooling [32.82620509088932]
We propose a pair of coupling modules, termed Character Anchoring Module (CAM) and Anchor Pooling Module (APM).
CAM localizes text in a shape-insensitive way by anchoring characters individually. APM then interpolates and gathers features flexibly along the character anchors, which enables sequence learning.
arXiv Detail & Related papers (2020-02-10T03:01:23Z)