End-to-End Video Text Spotting with Transformer
- URL: http://arxiv.org/abs/2203.10539v1
- Date: Sun, 20 Mar 2022 12:14:58 GMT
- Title: End-to-End Video Text Spotting with Transformer
- Authors: Weijia Wu, Debing Zhang, Ying Fu, Chunhua Shen, Hong Zhou, Yuanqiang
Cai, Ping Luo
- Abstract summary: We propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR).
TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (i.e., text detection, tracking, and recognition).
- Score: 86.46724646835627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent video text spotting methods usually require a three-stage pipeline,
i.e., detecting text in individual images, recognizing the localized text, and
tracking text streams with post-processing to generate the final results. These
methods typically follow the tracking-by-matching paradigm and develop
sophisticated pipelines. In this paper, rooted in Transformer sequence modeling,
we propose a simple but effective end-to-end video text DEtection, Tracking, and
Recognition framework (TransDETR). TransDETR offers two main advantages: 1)
Unlike the explicit matching paradigm between adjacent frames, TransDETR tracks
and recognizes each text instance implicitly via a dedicated query, termed a
text query, over a long-range temporal sequence (more than 7 frames). 2)
TransDETR is the first end-to-end trainable video text spotting framework, which
simultaneously addresses the three sub-tasks (i.e., text detection, tracking,
and recognition). Extensive experiments on four video text datasets (i.e.,
ICDAR2013 Video, ICDAR2015 Video, Minetto, and YouTube Video Text) demonstrate
that TransDETR achieves state-of-the-art performance, with improvements of up to
around 8.0% on video text spotting tasks. The code of TransDETR is available at
https://github.com/weijiawu/TransDETR.
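To make the query-based tracking idea concrete, below is a minimal, hypothetical PyTorch sketch of how per-instance text queries can be carried across frames so that identity is maintained by the query slot itself rather than by explicit frame-to-frame matching. It illustrates the general DETR-style mechanism, not TransDETR's actual implementation; all module and variable names are invented.

```python
import torch
import torch.nn as nn

class TextQueryTracker(nn.Module):
    """Minimal sketch of implicit tracking with text queries.

    Each query slot follows one text instance across frames: identity is
    carried by the query itself, so no explicit frame-to-frame box
    matching is needed. Illustrative only, not TransDETR's actual code.
    """
    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # learned initial text queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.box_head = nn.Linear(dim, 4)              # per-query box (cx, cy, w, h)

    def forward(self, frame_feats):
        # frame_feats: list of T tensors, each (1, HW, dim) of flattened frame features
        q = self.queries.weight.unsqueeze(0)           # (1, N, dim)
        boxes = []
        for feats in frame_feats:
            q = self.decoder(q, feats)                 # queries attend to the current frame
            boxes.append(self.box_head(q).sigmoid())   # slot i is the same instance in every frame
            # q is reused in the next iteration, so each slot implicitly tracks its instance
        return torch.stack(boxes, dim=1)               # (1, T, N, 4)

frames = [torch.randn(1, 49, 256) for _ in range(7)]   # 7 frames of dummy features
print(TextQueryTracker()(frames).shape)                # torch.Size([1, 7, 100, 4])
```

A recognition head would analogously decode a character sequence from each query slot, which is how one framework can cover detection, tracking, and recognition at once.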
Related papers
- GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching [77.0306273129475]
Video text spotting presents an added challenge due to the inclusion of tracking.
GoMatching focuses the training effort on tracking while maintaining strong recognition performance.
GoMatching sets new records on ICDAR15-video, DSText, BOVText, and a newly proposed test set with arbitrary-shaped text, termed ArTVideo.
arXiv Detail & Related papers (2024-01-13T13:59:15Z)
- Fine-grained Text-Video Retrieval with Frozen Image Encoders [10.757101644990273]
We propose CrossTVR, a two-stage text-video retrieval architecture.
In the first stage, we leverage existing TVR methods with a cosine similarity network for efficient text/video candidate selection.
In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions.
arXiv Detail & Related papers (2023-07-14T02:57:00Z)
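As a rough illustration of the two-stage pattern the CrossTVR entry above describes (cheap candidate selection, then expensive fine-grained scoring), the hypothetical sketch below prunes a video gallery by cosine similarity and re-ranks only the survivors. The `rerank` scorer merely stands in for the paper's decoupled cross-attention module; all names are invented.

```python
import torch
import torch.nn.functional as F

def two_stage_retrieval(text_emb, video_embs, video_tokens, rerank, k=10):
    """Two-stage text-to-video retrieval sketch (hypothetical names).

    Stage 1: cheap cosine similarity against pooled video embeddings
    prunes the gallery to k candidates. Stage 2: an expensive
    fine-grained scorer re-ranks only those k candidates.
    """
    sims = F.cosine_similarity(text_emb.unsqueeze(0), video_embs, dim=-1)  # (num_videos,)
    topk = sims.topk(k).indices                      # stage-1 candidate selection

    fine = torch.stack([rerank(text_emb, video_tokens[i]) for i in topk])
    return topk[fine.argmax()]                       # index of the best-matching video

# Dummy fine-grained scorer standing in for the paper's cross-attention module.
def rerank(t, tokens):
    attn = (tokens @ t).softmax(dim=0)               # text attends over video tokens
    return attn @ (tokens @ t)                       # attention-weighted similarity

best = two_stage_retrieval(torch.randn(256),         # pooled text embedding
                           torch.randn(100, 256),    # 100 pooled video embeddings
                           torch.randn(100, 20, 256),  # 20 tokens per video
                           rerank)
```

The point of the design is cost: the fine-grained scorer runs k times rather than once per gallery video.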
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
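The TextFormer entry above describes jointly trained classification, segmentation, and recognition branches over shared features. Here is a generic, hypothetical sketch of that multi-task pattern (not TextFormer's actual architecture):

```python
import torch
import torch.nn as nn

class MultiTaskSpotter(nn.Module):
    """Toy shared-backbone spotter with three jointly trained heads,
    illustrating multi-task feature sharing (not TextFormer itself)."""
    def __init__(self, dim=256, num_classes=2, vocab=97):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, 7, stride=4, padding=3)  # stand-in encoder
        self.cls_head = nn.Conv2d(dim, num_classes, 1)  # text / non-text classification
        self.seg_head = nn.Conv2d(dim, 1, 1)            # text-region segmentation mask
        self.rec_head = nn.Linear(dim, vocab)           # per-position character logits

    def forward(self, img):
        f = self.backbone(img)                          # shared features feed all heads
        rec = self.rec_head(f.flatten(2).transpose(1, 2))
        return self.cls_head(f), self.seg_head(f), rec

cls_out, seg_out, rec_out = MultiTaskSpotter()(torch.randn(1, 3, 64, 64))
# Training would sum the three task losses, so gradients from every branch
# flow through, and mutually shape, the shared backbone features.
```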
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding, where accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
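To make the moment-retrieval formulation in the entry above concrete, here is a minimal, hypothetical boundary-prediction head (not the paper's model): given per-frame features already fused with the text description, it predicts start/end distributions over frames and reads off the highest-scoring valid span.

```python
import torch
import torch.nn as nn

class MomentBoundaryHead(nn.Module):
    """Sketch of moment retrieval as start/end boundary prediction
    (illustrative only; not the paper's architecture)."""
    def __init__(self, dim=256):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, fused):                            # fused: (T, dim) text-conditioned frame features
        p_s = self.start(fused).squeeze(-1).softmax(0)   # (T,) start distribution
        p_e = self.end(fused).squeeze(-1).softmax(0)     # (T,) end distribution
        joint = p_s.unsqueeze(1) * p_e.unsqueeze(0)      # score of every (start, end) pair
        joint = joint.triu()                             # keep only spans with start <= end
        s, e = divmod(joint.argmax().item(), joint.size(1))
        return s, e                                      # predicted frame boundaries

print(MomentBoundaryHead()(torch.randn(32, 256)))        # e.g. (4, 17)
```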
- Real-time End-to-End Video Text Spotter with Contrastive Representation Learning [91.15406440999939]
We propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText).
CoText simultaneously addresses the three tasks (i.e., text detection, tracking, and recognition) in a real-time end-to-end trainable framework.
A simple, lightweight architecture is designed for effective and accurate performance.
arXiv Detail & Related papers (2022-07-18T07:54:17Z)
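The contrastive-representation idea behind CoText (and the SVRep entry that follows) can be illustrated with a standard InfoNCE objective over text-instance embeddings from two frames. This is a generic sketch under the assumption of N aligned instance pairs, not either paper's actual loss.

```python
import torch
import torch.nn.functional as F

def tracking_contrastive_loss(feats_t, feats_t1, temperature=0.07):
    """InfoNCE sketch for text tracking (illustrative, not CoText's loss).

    feats_t, feats_t1: (N, dim) embeddings of the same N text instances
    in two frames, with row i of both tensors being the same instance.
    Matching pairs are pulled together; all other pairs are pushed apart.
    """
    a = F.normalize(feats_t, dim=-1)
    b = F.normalize(feats_t1, dim=-1)
    logits = a @ b.t() / temperature           # (N, N) cosine-similarity matrix
    targets = torch.arange(a.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = tracking_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
# With embeddings trained this way, tracking at inference reduces to
# nearest-neighbor association of instances across frames.
```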
- Contrastive Learning of Semantic and Visual Representations for Text Tracking [22.817884815010856]
In this paper, we explore robust video text tracking with contrastive learning of semantic and visual representations.
We present an end-to-end video text tracker with Semantic and Visual Representations (SVRep), which detects and tracks texts by exploiting the visual and semantic relationships between different texts in a video sequence.
With a ResNet-18 backbone, SVRep achieves an ID_F1 of 65.9%, running at 16.7 FPS, on the ICDAR2015 (video) dataset.
arXiv Detail & Related papers (2021-12-30T09:22:13Z)
- End-to-End Referring Video Object Segmentation with Multimodal Transformers [0.0]
We propose a simple Transformer-based approach to the referring video object segmentation task.
Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem.
MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps.
arXiv Detail & Related papers (2021-11-29T18:59:32Z)
- Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting [49.768327669098674]
We propose an end-to-end trainable text spotting approach named Text Perceptron.
It first employs an efficient segmentation-based text detector that learns the latent text reading order and boundary information.
Then a novel Shape Transform Module (abbr. STM) is designed to transform the detected feature regions into regular morphologies.
arXiv Detail & Related papers (2020-02-17T08:07:19Z)