Related papers: Video text tracking for dense and small text based on pp-yoloe-r and sort algorithm

URL: http://arxiv.org/abs/2304.00018v1
Date: Fri, 31 Mar 2023 05:40:39 GMT
Title: Video text tracking for dense and small text based on pp-yoloe-r and sort algorithm
Authors: Hongen Liu
Abstract summary: The resolution of ICDAR 2023 DSText is 1080 * 1920, and slicing the video frame into several areas would destroy the spatial correlation of text. For text detection, we adopt PP-YOLOE-R, which has proven effective in small object detection. For text tracking, we use the SORT algorithm for high inference speed.
Score: 0.9137554315375919
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although end-to-end video text spotting methods based on Transformer can model long-range dependencies and simplify the training process, their computation cost grows quickly as the frame size of the input video increases. Therefore, considering that the resolution of ICDAR 2023 DSText is 1080 * 1920 and that slicing the video frame into several areas would destroy the spatial correlation of text, we divide small and dense text spotting into two tasks: text detection and tracking. For text detection, we adopt PP-YOLOE-R, which has proven effective in small object detection, as our detection model. For text tracking, we use the SORT algorithm for high inference speed. Experiments on the DSText dataset demonstrate that our method is competitive on small and dense text spotting.
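
Illustrative sketch (not from the paper): the detect-then-track split described in the abstract can be pictured as a per-frame detector feeding a lightweight association step. The Python below is a deliberately simplified stand-in for that second step. It matches boxes between frames greedily by IoU, whereas SORT proper combines a Kalman-filter motion model with Hungarian assignment. It also assumes axis-aligned boxes given as (x1, y1, x2, y2), although PP-YOLOE-R predicts rotated boxes. The names `iou` and `GreedyIoUTracker`, and the parameters `iou_threshold` and `max_age`, are hypothetical and chosen only for this example.

```python
from itertools import count

Box = tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    """Simplified tracker: greedy IoU matching, no motion model (unlike SORT)."""

    def __init__(self, iou_threshold: float = 0.3, max_age: int = 1):
        self.iou_threshold = iou_threshold  # minimum IoU to accept a match
        self.max_age = max_age              # frames a track may go unmatched
        self.tracks: dict[int, tuple[Box, int]] = {}  # id -> (last box, age)
        self._ids = count(1)

    def update(self, detections: list[Box]) -> list[tuple[int, Box]]:
        # Score every (track, detection) pair and visit the best pairs first.
        pairs = sorted(
            ((iou(box, det), tid, j)
             for tid, (box, _) in self.tracks.items()
             for j, det in enumerate(detections)),
            reverse=True,
        )
        assigned: dict[int, int] = {}  # detection index -> track id
        used_tracks: set[int] = set()
        for score, tid, j in pairs:
            if score < self.iou_threshold:
                break  # remaining pairs overlap even less
            if tid in used_tracks or j in assigned:
                continue
            assigned[j] = tid
            used_tracks.add(tid)

        # Keep unmatched tracks alive until they exceed max_age.
        next_tracks = {
            tid: (box, age + 1)
            for tid, (box, age) in self.tracks.items()
            if tid not in used_tracks and age + 1 <= self.max_age
        }

        # Matched detections refresh their track; the rest start new tracks.
        results = []
        for j, det in enumerate(detections):
            tid = assigned[j] if j in assigned else next(self._ids)
            next_tracks[tid] = (det, 0)
            results.append((tid, det))
        self.tracks = next_tracks
        return results


if __name__ == "__main__":
    tracker = GreedyIoUTracker()
    # Each list stands in for one frame's detector output.
    for frame in ([(0, 0, 10, 10), (100, 100, 120, 110)],
                  [(101, 100, 121, 110), (1, 0, 11, 10)]):
        print(tracker.update(frame))
```

Greedy matching keeps the sketch dependency-free. A closer approximation of SORT would replace the greedy loop with an optimal assignment over the IoU cost matrix and predict each track's next box before matching.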

Related papers

Seeing Text in the Dark: Algorithm and Benchmark [28.865779563872977]
In this work, we propose an efficient and effective single-stage approach for localizing text in the dark. We introduce a constrained learning module as an auxiliary mechanism during the training stage of the text detector. We present a comprehensive low-light dataset for arbitrary-shaped text, encompassing diverse scenes and languages.
arXiv Detail & Related papers (2024-04-13T11:07:10Z)
Text Region Multiple Information Perception Network for Scene Text Detection [19.574306663095243]
This paper proposes a plug-and-play module called the Region Multiple Information Perception Module (RMIPM) to enhance the detection performance of segmentation-based algorithms. Specifically, we design an improved module that can perceive various types of information about scene text regions, such as text foreground classification maps, distance maps, direction maps, etc.
arXiv Detail & Related papers (2024-01-18T14:36:51Z)
Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
Fine-grained Text-Video Retrieval with Frozen Image Encoders [10.757101644990273]
We propose CrossTVR, a two-stage text-video retrieval architecture. In the first stage, we leverage existing TVR methods with cosine similarity network for efficient text/video candidate selection. In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions.
arXiv Detail & Related papers (2023-07-14T02:57:00Z)
LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network [63.554061288184165]
We propose a novel parameterized text shape method based on low-rank approximation. By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation. We implement an accurate and efficient arbitrary-shaped text detector named LRANet.
arXiv Detail & Related papers (2023-06-27T02:03:46Z)
TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture. TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling. It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer [94.35116535588332]
Transformer-based methods, which predict polygon points or Bezier curve control points to localize texts, are quite popular in scene text detection. However, the used point label form implies the reading order of humans, which affects the robustness of Transformer model. We propose DPText-DETR, which directly uses point coordinates as queries and dynamically updates them between decoder layers.
arXiv Detail & Related papers (2022-07-10T15:45:16Z)
Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis. The first hierarchical scene text dataset is introduced to enable this novel research task. We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z)
On Exploring and Improving Robustness of Scene Text Detection Models [20.15225372544634]
We evaluate scene text detection models on ICDAR2015-C (IC15-C) and CTW1500-C (CTW-C). We perform a robustness analysis of six key components: pre-training data, backbone, feature fusion module, multi-scale predictions, representation of text instances and loss function. We present a simple yet effective data-based method to destroy the smoothness of text regions by merging background and foreground.
arXiv Detail & Related papers (2021-10-12T02:36:48Z)
RayNet: Real-time Scene Arbitrary-shape Text Detection with Multiple Rays [84.15123599963239]
We propose a novel detection framework for arbitrary-shape text detection, termed RayNet. RayNet uses Center Point Set (CPS) and Ray Distance (RD) to fit text, where CPS determines the general position of the text and RD is combined with CPS to compute Ray Points (RP) that localize the accurate shape of the text. RayNet achieves impressive performance on an existing curved text dataset (CTW1500) and a quadrangle text dataset (ICDAR2015).
arXiv Detail & Related papers (2021-04-11T03:03:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.