Video Text Tracking With a Spatio-Temporal Complementary Model
- URL: http://arxiv.org/abs/2111.04987v1
- Date: Tue, 9 Nov 2021 08:23:06 GMT
- Title: Video Text Tracking With a Spatio-Temporal Complementary Model
- Authors: Yuzhe Gao, Xing Li, Jiajian Zhang, Yu Zhou, Dian Jin, Jing Wang,
Shenggao Zhu, and Xiang Bai
- Abstract summary: Text tracking is to track multiple texts in a video and construct a trajectory for each text.
Existing methods tackle this task by utilizing the tracking-by-detection framework.
We argue that the tracking accuracy of this paradigm is severely limited in more complex scenarios.
- Score: 46.99051486905713
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text tracking aims to track multiple text instances in a video and
construct a trajectory for each one. Existing methods tackle this task with the
tracking-by-detection framework, i.e., detecting the text instances in each
frame and associating the corresponding instances across consecutive frames. We
argue that the tracking accuracy of this paradigm is severely limited in more
complex scenarios: missed detections, caused for example by motion blur, break
the text trajectories, and different text instances with similar appearance are
easily confused, leading to incorrect associations. To this end, we propose a
novel spatio-temporal complementary text tracking model. We leverage a Siamese
Complementary Module to fully exploit the temporal continuity of text
instances, which effectively alleviates missed detections and hence ensures the
completeness of each text trajectory. We further integrate the semantic cues
and the visual cues of each text instance into a unified representation via a
text similarity learning network, which provides high discriminative power when
text instances share a similar appearance and thus avoids mis-association
between them. Our method achieves state-of-the-art performance on several
public benchmarks. The source code is available at
https://github.com/lsabrinax/VideoTextSCM.
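Read as an algorithm, the abstract suggests a per-frame loop: detect text
instances, associate them with existing trajectories via a descriptor that
fuses visual and semantic cues, and fall back to a single-object tracker for
any trajectory whose instance the detector missed, so the trajectory stays
unbroken. Below is a minimal NumPy sketch of that control flow only; it is not
the VideoTextSCM implementation, and `fuse_descriptor`, `step`, and the
`track_missing` callback are hypothetical stand-ins (the actual model learns
the fusion jointly and re-localizes missed instances with a Siamese network).

```python
import numpy as np


def fuse_descriptor(visual_emb, semantic_emb):
    # Concatenate visual and semantic embeddings into one unified,
    # L2-normalized descriptor. Plain concatenation is a stand-in for
    # the learned fusion in the paper's similarity network.
    d = np.concatenate([visual_emb, semantic_emb])
    return d / (np.linalg.norm(d) + 1e-8)


def step(tracks, detections, track_missing, sim_thresh=0.5):
    """One association step over a frame's detections.

    tracks:        {track_id: {"desc": unit vector, "box": (x, y, w, h)}}
    detections:    [{"desc": unit vector, "box": (x, y, w, h)}, ...]
    track_missing: callback(box) -> box, standing in for the Siamese
                   module that re-localizes a text instance the
                   detector missed in the current frame.
    """
    matched, used = {}, set()
    # 1) Greedy association by unified-descriptor similarity.
    for tid, trk in tracks.items():
        best_j, best_sim = None, sim_thresh
        for j, det in enumerate(detections):
            if j in used:
                continue
            sim = float(trk["desc"] @ det["desc"])  # cosine: unit vectors
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            matched[tid] = detections[best_j]
            used.add(best_j)
    # 2) Re-localize unmatched trajectories so they stay unbroken.
    for tid, trk in tracks.items():
        if tid not in matched:
            matched[tid] = {"desc": trk["desc"], "box": track_missing(trk["box"])}
    # 3) Leftover detections start new trajectories.
    next_id = max(tracks, default=-1) + 1
    for j, det in enumerate(detections):
        if j not in used:
            matched[next_id] = det
            next_id += 1
    return matched


# Toy usage: one existing trajectory and one matching detection.
rng = np.random.default_rng(0)
desc = fuse_descriptor(rng.normal(size=256), rng.normal(size=128))
tracks = {0: {"desc": desc, "box": (10, 10, 40, 12)}}
dets = [{"desc": desc, "box": (12, 11, 40, 12)}]
print(step(tracks, dets, track_missing=lambda box: box))
```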
Related papers
- Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis [52.34110239735265]
We present Text Grouping Adapter (TGA), a module that enables various pre-trained text detectors to learn layout analysis.
Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating our TGA into various pre-trained text detectors and text spotters can achieve superior layout analysis performance.
arXiv Detail & Related papers (2024-05-13T05:48:35Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Towards Unified Scene Text Spotting based on Sequence Generation [4.437335677401287]
We propose a UNIfied scene Text Spotter, called UNITS.
Our model unifies various detection formats, including quadrilaterals and polygons.
We apply starting-point prompting to enable the model to extract texts from an arbitrary starting point.
arXiv Detail & Related papers (2023-04-07T01:28:08Z)
- Contextual Text Block Detection towards Scene Text Understanding [85.40898487745272]
This paper presents contextual text detection, a new setup that detects contextual text blocks (CTBs) for better understanding of texts in scenes.
We formulate the new setup as a dual detection task that first detects integral text units and then groups them into a CTB.
To this end, we design a novel scene text clustering technique that treats integral text units as tokens and groups them (belonging to the same CTB) into an ordered token sequence.
arXiv Detail & Related papers (2022-07-26T14:59:25Z)
- CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning [65.57338873921168]
Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision.
In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, COntrastive RElation (CORE) module.
We integrate the CORE module into a two-stage text detector of Mask R-CNN and devise our text detector CORE-Text.
arXiv Detail & Related papers (2021-12-14T16:22:25Z)
- Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and retrieve, from an image gallery, all text instances that are the same as or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be performed simply by ranking the detected text instances with the learned similarity (a minimal sketch of this ranking step follows the entry).
arXiv Detail & Related papers (2021-04-04T07:18:38Z)
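As a rough illustration of the retrieval-by-ranking idea in the last entry:
assuming a cross-modal encoder has already embedded both the query text and
every detected text instance into a shared unit-norm space (the encoder, the
gallery layout, and the `retrieve` function below are hypothetical), retrieval
reduces to a single similarity sort.

```python
import numpy as np


def retrieve(query_emb, gallery, top_k=5):
    # gallery: [(image_id, box, unit_norm_embedding), ...]
    # Score every detected text instance by cosine similarity to the
    # query embedding and return the top_k matches.
    scored = [(float(query_emb @ emb), image_id, box)
              for image_id, box, emb in gallery]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]


# Toy usage with random embeddings, normalized to unit length.
rng = np.random.default_rng(1)
q = rng.normal(size=64)
q /= np.linalg.norm(q)
raw = [(i, (0, 0, 10, 4), q + 0.1 * rng.normal(size=64)) for i in range(3)]
gallery = [(i, b, e / np.linalg.norm(e)) for i, b, e in raw]
print(retrieve(q, gallery, top_k=2))
```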
This list is automatically generated from the titles and abstracts of the papers on this site.