Scalable Mask Annotation for Video Text Spotting
- URL: http://arxiv.org/abs/2305.01443v1
- Date: Tue, 2 May 2023 14:18:45 GMT
- Title: Scalable Mask Annotation for Video Text Spotting
- Authors: Haibin He, Jing Zhang, Mengyang Xu, Juhua Liu, Bo Du, Dacheng Tao
- Abstract summary: We propose a scalable mask annotation pipeline called SAMText for video text spotting.
Using SAMText, we have created a large-scale dataset, SAMText-9M, that contains over 2,400 video clips and over 9 million mask annotations.
- Score: 86.72547285886183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video text spotting refers to localizing, recognizing, and tracking textual
elements such as captions, logos, license plates, signs, and other forms of
text within consecutive video frames. However, current datasets available for
this task rely on quadrilateral ground truth annotations, which may include
excessive background content and yield inaccurate text boundaries.
Furthermore, methods trained on these datasets often produce prediction results
in the form of quadrilateral boxes, which limits their ability to handle
complex scenarios such as dense or curved text. To address these issues, we
propose a scalable mask annotation pipeline called SAMText for video text
spotting. SAMText leverages the SAM model to generate mask annotations for
scene text images or video frames at scale. Using SAMText, we have created a
large-scale dataset, SAMText-9M, that contains over 2,400 video clips sourced
from existing datasets and over 9 million mask annotations. We have also
conducted a thorough statistical analysis of the generated masks and their
quality, identifying several research topics that could be further explored
based on this dataset. The code and dataset will be released at
\url{https://github.com/ViTAE-Transformer/SAMText}.
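The abstract leaves the prompting mechanics unspecified, but the natural reading is that SAMText converts each existing quadrilateral ground-truth annotation into a box prompt for SAM. Below is a minimal sketch with the official `segment_anything` API, assuming XYXY box prompts derived from the quadrilaterals; the checkpoint path and helper names are illustrative, not the authors' released code.

```python
# A minimal sketch of box-prompted mask generation with the official
# segment-anything API (https://github.com/facebookresearch/segment-anything).
# The quadrilateral-to-box conversion, checkpoint path, and helper names are
# illustrative assumptions, not the paper's released pipeline.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def quad_to_box(quad: np.ndarray) -> np.ndarray:
    # Convert a 4x2 quadrilateral annotation to an XYXY box prompt.
    xs, ys = quad[:, 0], quad[:, 1]
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

def masks_for_frame(frame_rgb: np.ndarray, quads: list) -> list:
    # One binary mask per existing quadrilateral ground-truth annotation.
    predictor.set_image(frame_rgb)  # HxWx3 uint8 RGB; encoder runs once per frame
    masks = []
    for quad in quads:
        out, scores, _ = predictor.predict(
            box=quad_to_box(quad),
            multimask_output=False,  # one mask per text instance
        )
        masks.append(out[0])  # boolean HxW array
    return masks
```

Note that `set_image` runs the heavy image encoder once per frame, after which every box prompt reuses the cached embedding; that amortization is what makes annotating millions of masks tractable at this scale.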
Related papers
- Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation [62.56037816595509]
Mask$^2$DiT establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations.
Its mask-based attention mechanism enables precise segment-level textual-to-visual alignment.
Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description.
arXiv Detail & Related papers (2025-03-25T17:46:50Z)
- Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts [12.444549174054988]
Char-SAM is a pipeline that turns SAM into a low-cost segmentation annotator with a character-level visual prompt.
Char-SAM generates high-quality scene text segmentation annotations automatically.
Its training-free nature also enables the generation of high-quality scene text segmentation datasets from real-world datasets like COCO-Text and MLT17.
arXiv Detail & Related papers (2024-12-27T20:33:39Z)
- SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising [37.216493829454706]
We explore the potential of applying the Segment Anything Model to track and segment objects in videos.
Specifically, we iteratively propagate the bounding box of each object's mask in the preceding frame as the prompt for the next frame.
To enhance SAM's denoising capability against position and size variations, we propose a multi-prompt strategy (see the sketch after this entry).
arXiv Detail & Related papers (2024-03-07T03:52:59Z)
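A compact sketch of that propagation loop: boxes are re-derived from each predicted mask and jittered into several candidate prompts, keeping the highest-scoring mask per frame. The jitter scale, candidate count, and selection rule here are illustrative assumptions rather than SAM-PD's exact recipe; `predictor` is a `SamPredictor` as in the sketch above.

```python
# Sketch of frame-to-frame prompt propagation with multi-prompt denoising.
# Jitter magnitude and best-score selection are assumptions for illustration.
import numpy as np

def mask_to_box(mask: np.ndarray) -> np.ndarray:
    # Tight XYXY box around a boolean mask's foreground pixels.
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)

def track_object(predictor, frames, init_box, n_prompts=8, jitter=0.05):
    """Propagate a box prompt frame to frame, denoising with jittered prompts."""
    box = np.asarray(init_box, dtype=np.float32)
    results = []
    for frame in frames:
        predictor.set_image(frame)  # HxWx3 uint8 RGB; encoder runs once per frame
        w, h = box[2] - box[0], box[3] - box[1]
        noise = np.random.randn(n_prompts - 1, 4) * jitter * np.array([w, h, w, h])
        candidates = np.vstack([box, box + noise])  # keep one un-jittered prompt
        best_mask, best_score = None, -np.inf
        for cand in candidates:
            masks, scores, _ = predictor.predict(box=cand, multimask_output=False)
            if scores[0] > best_score:
                best_mask, best_score = masks[0], scores[0]
        box = mask_to_box(best_mask)  # becomes the prompt for the next frame
        results.append(best_mask)
    return results
```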
- Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation [97.90960864892966]
This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation.
Hi-SAM excels in segmentation across four hierarchies: pixel-level text, word, text-line, and paragraph.
Compared to the previous specialist model for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements.
arXiv Detail & Related papers (2024-01-31T15:10:29Z)
- Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions [22.090074821554754]
We propose a self-supervised scene text segmentation algorithm that uses layered decoupling of object-centric representations to segment images into text and background.
On several public scene text datasets, our method outperforms the state-of-the-art unsupervised segmentation algorithms.
arXiv Detail & Related papers (2023-08-25T05:00:05Z)
- TextDiffuser: Diffusion Models as Text Painters [118.30923824681642]
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds.
We contribute MARIO-10M, the first large-scale text image dataset with OCR annotations, containing 10 million image-text pairs.
We show that TextDiffuser is flexible and controllable, creating high-quality text images from text prompts alone or together with text template images, and performing text inpainting to reconstruct incomplete images containing text.
arXiv Detail & Related papers (2023-05-18T10:16:19Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce TextVR, a large-scale, cross-modal video retrieval dataset with text reading comprehension.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z)
- Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach [34.63444886780274]
Text segmentation is a prerequisite for many real-world text-related tasks.
We introduce Text Refinement Network (TexRNet), a novel text segmentation approach.
TexRNet consistently improves text segmentation performance by nearly 2% compared to other state-of-the-art segmentation methods.
arXiv Detail & Related papers (2020-11-27T22:50:09Z)
- Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting [71.6244869235243]
Most arbitrary-shape scene text spotters use region proposal networks (RPNs) to produce proposals.
Our Mask TextSpotter v3 can handle text instances of extreme aspect ratios or irregular shapes, and its recognition accuracy is not affected by nearby text or background noise.
arXiv Detail & Related papers (2020-07-18T17:25:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.