Tracking Based Semi-Automatic Annotation for Scene Text Videos
- URL: http://arxiv.org/abs/2103.15488v1
- Date: Mon, 29 Mar 2021 10:42:23 GMT
- Title: Tracking Based Semi-Automatic Annotation for Scene Text Videos
- Authors: Jiajun Zhu, Xiufeng Jiang, Zhiwei Jia, Shugong Xu, Shan Cao
- Abstract summary: Existing scene text video datasets are not large-scale due to the high cost of manual labeling.
We obtain semi-automatic scene text annotations by labeling the first frame manually and tracking the subsequent frames automatically.
A paired low-quality scene text video dataset named Text-RBL is proposed, consisting of raw videos, blurry videos, and low-resolution videos.
- Score: 16.286021899032274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, video scene text detection has received increasing attention due to
its comprehensive applications. However, the lack of annotated scene text video
datasets has become one of the most important problems, which hinders the
development of video scene text detection. Existing scene text video datasets
are not large-scale because of the high cost of manual labeling. In addition,
the text instances in these datasets are too clear to pose a real challenge.
To address the above issues, we propose a tracking-based
semi-automatic labeling strategy for scene text videos in this paper. We
obtain semi-automatic scene text annotations by labeling the first frame
manually and tracking the subsequent frames automatically, which avoids the
huge cost of fully manual labeling. Moreover, a paired low-quality scene text
video dataset
named Text-RBL is proposed, consisting of raw videos, blurry videos, and
low-resolution videos, labeled by the proposed convenient semi-automatic
labeling strategy. Through an averaging operation and a bicubic down-sampling
operation over the raw videos, we can efficiently obtain blurry videos and
low-resolution videos paired with the raw videos, respectively. To verify the
effectiveness of Text-RBL, we propose a baseline model that combines a text
detector and a tracker for video scene text detection. Moreover, a failure
detection scheme is designed to alleviate the baseline model drift issue caused
by complex scenes. Extensive experiments demonstrate that Text-RBL with paired
low-quality videos labeled by the semi-automatic method can significantly
improve the performance of the text detector in low-quality scenes.
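The two mechanisms in the abstract, first-frame labeling with automatic tracking and paired degradation of the raw videos, are simple enough to sketch. Below is a minimal, hypothetical Python sketch: the paper does not name its tracker, so OpenCV's CSRT tracker (from opencv-contrib-python), the box format, the 5-frame averaging window, and the 4x down-sampling factor are all assumptions, not the authors' exact settings.

```python
import cv2
import numpy as np

def semi_automatic_annotation(frames, first_box):
    """Label the first frame manually (first_box = (x, y, w, h)),
    then propagate the box through subsequent frames with a tracker."""
    tracker = cv2.TrackerCSRT_create()  # stand-in tracker; needs opencv-contrib-python
    tracker.init(frames[0], first_box)
    boxes = [first_box]
    for frame in frames[1:]:
        ok, box = tracker.update(frame)
        boxes.append(tuple(int(v) for v in box) if ok else None)  # None = tracking failure
    return boxes

def make_paired_degraded(frames, window=5, scale=4):
    """Blurry video: average `window` consecutive raw frames.
    Low-resolution video: bicubic down-sampling by `scale`."""
    blurry, lowres = [], []
    for i, frame in enumerate(frames):
        clip = frames[max(0, i - window + 1): i + 1]
        blurry.append(np.mean(clip, axis=0).astype(np.uint8))
        h, w = frame.shape[:2]
        lowres.append(cv2.resize(frame, (w // scale, h // scale),
                                 interpolation=cv2.INTER_CUBIC))
    return blurry, lowres
```

Because every degraded frame is derived from a raw frame, boxes produced by `semi_automatic_annotation` on the raw video transfer to the blurry video unchanged, and to the low-resolution video after dividing the coordinates by `scale`.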
Related papers
- ScreenWriter: Automatic Screenplay Generation and Movie Summarisation [55.20132267309382]
Video content has driven demand for textual descriptions or summaries that allow users to recall key plot points or get an overview without watching.
We propose the task of automatic screenplay generation, and a method, ScreenWriter, that operates only on video and produces output which includes dialogue, speaker names, scene breaks, and visual descriptions.
ScreenWriter introduces a novel algorithm to segment the video into scenes based on the sequence of visual vectors, and a novel method for the challenging problem of determining character names, based on a database of actors' faces.
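As an illustration of the segmentation step, here is a minimal sketch that declares a scene break wherever consecutive frame embeddings diverge; the cosine-distance threshold and the embeddings themselves are assumptions, not ScreenWriter's published algorithm.

```python
import numpy as np

def segment_scenes(frame_embeddings, threshold=0.35):
    """Split a video into scenes by thresholding the cosine distance
    between consecutive visual vectors (num_frames x dim array)."""
    e = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    cos_dist = 1.0 - np.sum(e[:-1] * e[1:], axis=1)  # distance between neighboring frames
    breaks = np.where(cos_dist > threshold)[0] + 1   # frame indices that start a new scene
    return np.split(np.arange(len(frame_embeddings)), breaks)
```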
arXiv Detail & Related papers (2024-10-17T07:59:54Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to images labeled with text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
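A minimal sketch of that pseudo-labeling idea follows; `caption_image` is a hypothetical stand-in for any off-the-shelf image captioning model, and the frame-sampling stride is an assumption rather than the paper's protocol.

```python
def mine_video_text_pairs(videos, caption_image, stride=25):
    """Build (video, pseudo-caption) training pairs by captioning
    a sparse sample of frames from each unlabeled video."""
    pairs = []
    for video_id, frames in videos.items():
        for frame in frames[::stride]:       # caption every `stride`-th frame
            pairs.append((video_id, caption_image(frame)))
    return pairs                             # feed these to a retrieval model
```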
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Text in the Dark: Extremely Low-Light Text Image Enhancement [20.631833980353704]
Low-light text images are common in natural scenes, making scene text detection and recognition challenging.
We propose a novel encoder-decoder framework with an edge-aware attention module to focus on scene text regions during enhancement.
Our proposed method uses novel text detection and edge reconstruction losses to emphasize low-level scene text features, leading to successful text extraction.
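The summary does not spell out the module's design, so the following PyTorch sketch is one plausible reading: a Sobel edge map of the input gates the feature maps so enhancement concentrates on text strokes. Treat it as an assumption-laden illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareAttention(nn.Module):
    """Gate feature maps with an edge map so enhancement focuses on text strokes."""
    def __init__(self, channels):
        super().__init__()
        sobel = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", sobel.view(1, 1, 3, 3))
        self.register_buffer("ky", sobel.t().view(1, 1, 3, 3))
        self.gate = nn.Conv2d(1, channels, kernel_size=1)

    def forward(self, features, gray_image):
        # gray_image: (B, 1, H, W) luminance of the low-light input
        gx = F.conv2d(gray_image, self.kx, padding=1)
        gy = F.conv2d(gray_image, self.ky, padding=1)
        edges = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)
        attn = torch.sigmoid(self.gate(edges))       # per-channel edge attention
        if attn.shape[-2:] != features.shape[-2:]:
            attn = F.interpolate(attn, size=features.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return features * attn
```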
arXiv Detail & Related papers (2024-04-22T12:39:12Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
Even with fewer text instances, the text images it produces consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation [23.080145300304018]
This paper introduces a novel video text synthesis technique called FlowText.
It synthesizes a large amount of text video data at a low cost for training robust video text spotters.
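A minimal sketch of the flow-guided idea: estimate dense optical flow between consecutive frames and warp the rendered text mask along it, so the synthetic text follows the scene's motion. The Farneback parameters below are generic defaults, an assumption rather than FlowText's actual setup.

```python
import cv2
import numpy as np

def propagate_text_mask(prev_gray, next_gray, text_mask):
    """Warp a rendered text mask from one frame to the next
    using dense optical flow."""
    # backward flow (next -> prev): where each pixel of the next frame came from
    flow = cv2.calcOpticalFlowFarneback(next_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = text_mask.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # remap pulls each output pixel from (x + flow_x, y + flow_y) in the source
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(text_mask, map_x, map_y, cv2.INTER_NEAREST)
```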
arXiv Detail & Related papers (2023-05-05T07:15:49Z)
- Video text tracking for dense and small text based on pp-yoloe-r and sort algorithm [0.9137554315375919]
The DSText video frames are 1080 * 1920, and slicing a frame into several areas would destroy the spatial correlation of the text.
For text detection, we adopt PP-YOLOE-R, which is proven effective for small object detection.
For text tracking, we use the SORT algorithm for its high inference speed.
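For reference, the core of SORT's data-association step is a bipartite matching between tracked boxes and new detections under an IoU cost; the sketch below omits the Kalman-filter motion prediction that full SORT uses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """Match existing tracks to new detections by maximizing total IoU."""
    if not tracks or not detections:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)   # Hungarian assignment
    return [(r, c) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= iou_threshold]
```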
arXiv Detail & Related papers (2023-03-31T05:40:39Z)
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding, where accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
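A minimal PyTorch sketch of the described video side follows: a transformer encoder over frame-level features, mean-pooled into a single video embedding. The layer sizes and the pooling choice are assumptions; the paired pre-trained text encoder is omitted.

```python
import torch.nn as nn

class VideoAggregator(nn.Module):
    """Aggregate per-frame features into one video representation
    with a transformer encoder."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_features):     # (B, num_frames, dim)
        x = self.encoder(frame_features)
        return x.mean(dim=1)               # mean-pool to a video embedding
```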
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
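The selection step can be sketched as a simple top-k filter over text-frame cosine similarities; the value of k and the use of plain cosine scores are assumptions, not FineCo's exact procedure.

```python
import numpy as np

def select_relevant_frames(frame_embs, text_emb, k=8):
    """Keep the k frames most similar to the text embedding;
    the contrastive objective then operates on this distilled set."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    scores = f @ t                          # cosine similarity per frame
    return np.argsort(scores)[::-1][:k]     # indices of the top-k frames
```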
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- Scene Text Detection with Scribble Lines [59.698806258671105]
We propose to annotate texts by scribble lines instead of polygons for text detection.
It is a general labeling method for text of various shapes and incurs low labeling costs.
Experiments show that the proposed method bridges the performance gap between the weakly labeling method and the original polygon-based labeling methods.
arXiv Detail & Related papers (2020-12-09T13:14:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.