Scalable Mask Annotation for Video Text Spotting
- URL: http://arxiv.org/abs/2305.01443v1
- Date: Tue, 2 May 2023 14:18:45 GMT
- Title: Scalable Mask Annotation for Video Text Spotting
- Authors: Haibin He, Jing Zhang, Mengyang Xu, Juhua Liu, Bo Du, Dacheng Tao
- Abstract summary: We propose a scalable mask annotation pipeline called SAMText for video text spotting.
Using SAMText, we have created a large-scale dataset, SAMText-9M, that contains over 2,400 video clips and over 9 million mask annotations.
- Score: 86.72547285886183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video text spotting refers to localizing, recognizing, and tracking textual
elements such as captions, logos, license plates, signs, and other forms of
text within consecutive video frames. However, current datasets available for
this task rely on quadrilateral ground truth annotations, which may include
excessive background content and yield inaccurate text boundaries.
Furthermore, methods trained on these datasets often produce prediction results
in the form of quadrilateral boxes, which limits their ability to handle
complex scenarios such as dense or curved text. To address these issues, we
propose a scalable mask annotation pipeline called SAMText for video text
spotting. SAMText leverages the SAM model to generate mask annotations for
scene text images or video frames at scale. Using SAMText, we have created a
large-scale dataset, SAMText-9M, that contains over 2,400 video clips sourced
from existing datasets and over 9 million mask annotations. We have also
conducted a thorough statistical analysis of the generated masks and their
quality, identifying several research topics that could be further explored
based on this dataset. The code and dataset will be released at
\url{https://github.com/ViTAE-Transformer/SAMText}.
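The abstract leaves the prompting mechanics unspecified, but the natural reading is that SAMText converts each existing quadrilateral ground-truth annotation into a box prompt for SAM. Below is a minimal sketch with the official `segment_anything` API, assuming XYXY box prompts derived from the quadrilaterals; the checkpoint path and helper names are illustrative, not the authors' released code.

```python
# A minimal sketch of box-prompted mask generation with the official
# segment-anything API (https://github.com/facebookresearch/segment-anything).
# The quadrilateral-to-box conversion, checkpoint path, and helper names are
# illustrative assumptions, not the paper's released pipeline.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def quad_to_box(quad: np.ndarray) -> np.ndarray:
    # Convert a 4x2 quadrilateral annotation to an XYXY box prompt.
    xs, ys = quad[:, 0], quad[:, 1]
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

def masks_for_frame(frame_rgb: np.ndarray, quads: list) -> list:
    # One binary mask per existing quadrilateral ground-truth annotation.
    predictor.set_image(frame_rgb)  # HxWx3 uint8 RGB; encoder runs once per frame
    masks = []
    for quad in quads:
        out, scores, _ = predictor.predict(
            box=quad_to_box(quad),
            multimask_output=False,  # one mask per text instance
        )
        masks.append(out[0])  # boolean HxW array
    return masks
```

Note that `set_image` runs the heavy image encoder once per frame, after which every box prompt reuses the cached embedding; that amortization is what makes annotating millions of masks tractable at this scale.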
Related papers
- Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation [62.56037816595509]
Mask$^2$DiT establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations.
Its mask-based attention mechanism enables precise segment-level textual-to-visual alignment.
Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description.
arXiv Detail & Related papers (2025-03-25T17:46:50Z)
- Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts [12.444549174054988]
Char-SAM is a pipeline that turns SAM into a low-cost segmentation annotator with a character-level visual prompt.
Char-SAM generates high-quality scene text segmentation annotations automatically.
Its training-free nature also enables the generation of high-quality scene text segmentation datasets from real-world datasets like COCO-Text and MLT17.
arXiv Detail & Related papers (2024-12-27T20:33:39Z)
- SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising [37.216493829454706]
We explore the potential of applying the Segment Anything Model to track and segment objects in videos.
Specifically, we iteratively propagate the bounding box of each object's mask in the preceding frame as the prompt for the next frame.
To enhance SAM's denoising capability against position and size variations, we propose a multi-prompt strategy (see the sketch after this entry).
arXiv Detail & Related papers (2024-03-07T03:52:59Z)
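A compact sketch of that propagation loop: boxes are re-derived from each predicted mask and jittered into several candidate prompts, keeping the highest-scoring mask per frame. The jitter scale, candidate count, and selection rule here are illustrative assumptions rather than SAM-PD's exact recipe; `predictor` is a `SamPredictor` as in the sketch above.

```python
# Sketch of frame-to-frame prompt propagation with multi-prompt denoising.
# Jitter magnitude and best-score selection are assumptions for illustration.
import numpy as np

def mask_to_box(mask: np.ndarray) -> np.ndarray:
    # Tight XYXY box around a boolean mask's foreground pixels.
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)

def track_object(predictor, frames, init_box, n_prompts=8, jitter=0.05):
    """Propagate a box prompt frame to frame, denoising with jittered prompts."""
    box = np.asarray(init_box, dtype=np.float32)
    results = []
    for frame in frames:
        predictor.set_image(frame)  # HxWx3 uint8 RGB; encoder runs once per frame
        w, h = box[2] - box[0], box[3] - box[1]
        noise = np.random.randn(n_prompts - 1, 4) * jitter * np.array([w, h, w, h])
        candidates = np.vstack([box, box + noise])  # keep one un-jittered prompt
        best_mask, best_score = None, -np.inf
        for cand in candidates:
            masks, scores, _ = predictor.predict(box=cand, multimask_output=False)
            if scores[0] > best_score:
                best_mask, best_score = masks[0], scores[0]
        box = mask_to_box(best_mask)  # becomes the prompt for the next frame
        results.append(best_mask)
    return results
```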
- Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation [97.90960864892966]
This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation.
Hi-SAM excels in segmentation across four hierarchies: pixel-level text, word, text-line, and paragraph.
Compared to the previous specialist model for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements.
arXiv Detail & Related papers (2024-01-31T15:10:29Z)
- Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions [22.090074821554754]
We propose a self-supervised scene text segmentation algorithm that uses layered decoupling of object-centric representations to segment images into text and background.
On several public scene text datasets, our method outperforms the state-of-the-art unsupervised segmentation algorithms.
arXiv Detail & Related papers (2023-08-25T05:00:05Z)
- TextDiffuser: Diffusion Models as Text Painters [118.30923824681642]
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds.
We contribute MARIO-10M, the first large-scale text image dataset with OCR annotations, containing 10 million image-text pairs.
We show that TextDiffuser is flexible and controllable, creating high-quality text images from text prompts alone or together with text template images, and performing text inpainting to reconstruct incomplete images containing text.
arXiv Detail & Related papers (2023-05-18T10:16:19Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce TextVR, a large-scale, cross-modal video retrieval dataset with text reading comprehension.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z)
- Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach [34.63444886780274]
Text segmentation is a prerequisite for many real-world text-related tasks.
We introduce Text Refinement Network (TexRNet), a novel text segmentation approach.
TexRNet consistently improves text segmentation performance by nearly 2% compared to other state-of-the-art segmentation methods.
arXiv Detail & Related papers (2020-11-27T22:50:09Z)
- Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting [71.6244869235243]
Most arbitrary-shape scene text spotters use region proposal networks (RPNs) to produce proposals.
Our Mask TextSpotter v3 can handle text instances of extreme aspect ratios or irregular shapes, and its recognition accuracy is not affected by nearby text or background noise.
arXiv Detail & Related papers (2020-07-18T17:25:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.