Real-time End-to-End Video Text Spotter with Contrastive Representation
Learning
- URL: http://arxiv.org/abs/2207.08417v1
- Date: Mon, 18 Jul 2022 07:54:17 GMT
- Title: Real-time End-to-End Video Text Spotter with Contrastive Representation
Learning
- Authors: Weijia Wu, Zhuang Li, Jiahong Li, Chunhua Shen, Hong Zhou, Size Li,
Zhongyuan Wang, and Ping Luo
- Abstract summary: We propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText).
CoText simultaneously addresses the three tasks (i.e., text detection, tracking, and recognition) in a real-time end-to-end trainable framework.
A simple, lightweight architecture is designed for effective and accurate performance.
- Score: 91.15406440999939
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video text spotting (VTS) is the task of simultaneously detecting,
tracking, and recognizing text in video. Existing video text spotting methods
typically develop sophisticated pipelines and multiple models, which is not
friendly to real-time applications. Here we propose a real-time end-to-end
video text spotter with Contrastive Representation learning (CoText). Our
contributions are three-fold: 1) CoText simultaneously addresses the three
tasks (i.e., text detection, tracking, and recognition) in a real-time
end-to-end trainable framework. 2) With contrastive learning, CoText models
long-range dependencies and learns temporal information across multiple
frames. 3) A simple, lightweight architecture is designed for effective and
accurate performance, including GPU-parallel detection post-processing and a
CTC-based recognition head with Masked RoI. Extensive experiments show the
superiority of our method. In particular, CoText achieves a video text
spotting IDF1 of 72.0% at 41.0 FPS on ICDAR2015video, improving on the
previous best method by 10.5% IDF1 and 32.0 FPS. The code can be found at
github.com/weijiawu/CoText.
Related papers
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to about 7% in recall@K=1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation [23.080145300304018]
This paper introduces a novel video text synthesis technique called FlowText.
It synthesizes a large amount of text video data at a low cost for training robust video text spotters.
arXiv Detail & Related papers (2023-05-05T07:15:49Z)
- Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, our proposed network architecture is trained following a multiple-space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- End-to-End Video Text Spotting with Transformer [86.46724646835627]
We propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR).
TransDETR is the first end-to-end trainable video text spotting framework, simultaneously addressing the three sub-tasks (i.e., text detection, tracking, and recognition).
arXiv Detail & Related papers (2022-03-20T12:14:58Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Contrastive Learning of Semantic and Visual Representations for Text Tracking [22.817884815010856]
In this paper, we explore robust video text tracking with contrastive learning of semantic and visual representations.
We present an end-to-end video text tracker with Semantic and Visual Representations (SVRep), which detects and tracks texts by exploiting the visual and semantic relationships between different texts in a video sequence.
With a ResNet-18 backbone, SVRep achieves an IDF1 of 65.9%, running at 16.7 FPS, on the ICDAR2015 (video) dataset.
arXiv Detail & Related papers (2021-12-30T09:22:13Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set a new state of the art on three public text-video retrieval benchmarks: YouCook2, MSR-VTT, and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z)