Tracking Based Semi-Automatic Annotation for Scene Text Videos
- URL: http://arxiv.org/abs/2103.15488v1
- Date: Mon, 29 Mar 2021 10:42:23 GMT
- Title: Tracking Based Semi-Automatic Annotation for Scene Text Videos
- Authors: Jiajun Zhu, Xiufeng Jiang, Zhiwei Jia, Shugong Xu, Shan Cao
- Abstract summary: Existing scene text video datasets are not large-scale because manual labeling is expensive.
We obtain semi-automatic scene text annotations by manually labeling the first frame and automatically tracking the text through the subsequent frames.
A paired low-quality scene text video dataset named Text-RBL is proposed, consisting of raw videos, blurry videos, and low-resolution videos.
- Score: 16.286021899032274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, video scene text detection has received increasing attention due to
its wide range of applications. However, the lack of annotated scene text video
datasets has become one of the most important problems, which hinders the
development of video scene text detection. The existing scene text video
datasets are not large-scale because manual labeling is expensive.
In addition, the text instances in these datasets are too clear to be
a challenge. To address the above issues, we propose a tracking based
semi-automatic labeling strategy for scene text videos in this paper. We obtain
semi-automatic scene text annotations by manually labeling the first frame
and automatically tracking the subsequent frames, which avoids the huge cost
of manual labeling. Moreover, a paired low-quality scene text video dataset
named Text-RBL is proposed, consisting of raw videos, blurry videos, and
low-resolution videos, labeled by the proposed convenient semi-automatic
labeling strategy. Through an averaging operation and bicubic down-sampling
operation over the raw videos, we can efficiently obtain blurry videos and
low-resolution videos, respectively, each paired with the raw videos. To verify the
effectiveness of Text-RBL, we propose a baseline model that combines a text
detector and a tracker for video scene text detection. Moreover, a failure
detection scheme is designed to alleviate the baseline model's drift caused
by complex scenes. Extensive experiments demonstrate that Text-RBL with paired
low-quality videos labeled by the semi-automatic method can significantly
improve the performance of the text detector in low-quality scenes.
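
The labeling strategy described in the abstract (manual boxes on the first frame, tracker output on the rest) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Annotation` record, the `track_step` callback signature, and the `min_confidence` threshold are all hypothetical names introduced here. The confidence check that flags frames for human review is also an assumption, loosely inspired by the failure detection idea the abstract mentions for the baseline model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)

# Hypothetical tracker interface: takes the previous box and the new frame,
# returns the updated box and a confidence score in [0, 1].
TrackStep = Callable[[Box, object], Tuple[Box, float]]


@dataclass
class Annotation:
    frame_index: int
    box: Box
    source: str  # "manual", "tracked", or "needs_review"


def propagate_labels(
    frames: Sequence[object],
    first_box: Box,
    track_step: TrackStep,
    min_confidence: float = 0.5,  # assumed threshold, not from the source
) -> List[Annotation]:
    """Label frame 0 manually, then let the tracker label the remaining frames."""
    annotations = [Annotation(0, first_box, "manual")]
    box = first_box
    for index in range(1, len(frames)):
        box, confidence = track_step(box, frames[index])
        # Assumption: low-confidence frames are flagged for a human to check.
        source = "tracked" if confidence >= min_confidence else "needs_review"
        annotations.append(Annotation(index, box, source))
    return annotations


def toy_track_step(box: Box, frame: object) -> Tuple[Box, float]:
    """Stand-in tracker: each toy frame carries its own shift and confidence."""
    x, y, w, h = box
    return (x + frame["dx"], y, w, h), frame["confidence"]


# Toy frames stand in for real video frames.
frames = [
    {"dx": 0.0, "confidence": 1.0},
    {"dx": 2.0, "confidence": 0.9},
    {"dx": 2.0, "confidence": 0.3},
]
result = propagate_labels(frames, (10.0, 20.0, 50.0, 12.0), toy_track_step)
for annotation in result:
    print(annotation)
```

In practice `track_step` would wrap a real single-object tracker, and frames marked `needs_review` would go back to an annotator instead of being accepted automatically.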
Related papers
- VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement [63.4357918830628]
VideoRepair is a model-agnostic, training-free video refinement framework.
It identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback.
VideoRepair substantially outperforms recent baselines across various text-video alignment metrics.
arXiv Detail & Related papers (2024-11-22T18:31:47Z)
- ScreenWriter: Automatic Screenplay Generation and Movie Summarisation [55.20132267309382]
Video content has driven demand for textual descriptions or summaries that allow users to recall key plot points or get an overview without watching.
We propose the task of automatic screenplay generation, and a method, ScreenWriter, that operates only on video and produces output which includes dialogue, speaker names, scene breaks, and visual descriptions.
ScreenWriter introduces a novel algorithm to segment the video into scenes based on the sequence of visual vectors, and a novel method for the challenging problem of determining character names, based on a database of actors' faces.
arXiv Detail & Related papers (2024-10-17T07:59:54Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Text in the Dark: Extremely Low-Light Text Image Enhancement [20.631833980353704]
Low-light text images are common in natural scenes, making scene text detection and recognition challenging.
We propose a novel encoder-decoder framework with an edge-aware attention module to focus on scene text regions during enhancement.
Our proposed method uses novel text detection and edge reconstruction losses to emphasize low-level scene text features, leading to successful text extraction.
arXiv Detail & Related papers (2024-04-22T12:39:12Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation [23.080145300304018]
This paper introduces a novel video text synthesis technique called FlowText.
It synthesizes a large amount of text video data at a low cost for training robust video text spotters.
arXiv Detail & Related papers (2023-05-05T07:15:49Z)
- Video text tracking for dense and small text based on pp-yoloe-r and sort algorithm [0.9137554315375919]
DSText video is 1080 * 1920, and slicing a video frame into several areas destroys the spatial correlation of text.
For text detection, we adopt PP-YOLOE-R, which has proven effective in small object detection.
For text tracking, we use the SORT algorithm for high inference speed.
arXiv Detail & Related papers (2023-03-31T05:40:39Z)
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding where the accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- Scene Text Detection with Scribble Lines [59.698806258671105]
We propose to annotate texts by scribble lines instead of polygons for text detection.
It is a general labeling method for texts with various shapes and requires low labeling costs.
Experiments show that the proposed method bridges the performance gap between the weak labeling method and the original polygon-based labeling methods.
arXiv Detail & Related papers (2020-12-09T13:14:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.