Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation
- URL: http://arxiv.org/abs/2401.17904v2
- Date: Fri, 08 Nov 2024 10:45:12 GMT
- Title: Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation
- Authors: Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, Dacheng Tao
- Abstract summary: This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation.
Hi-SAM excels in segmentation across four text hierarchies: pixel-level text, word, text-line, and paragraph.
Compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements.
- Score: 97.90960864892966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Segment Anything Model (SAM), a profound vision foundation model pretrained on a large-scale dataset, breaks the boundaries of general segmentation and sparks various downstream applications. This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation. Hi-SAM excels in segmentation across four text hierarchies, namely pixel-level text, word, text-line, and paragraph, while also performing layout analysis. Specifically, we first turn SAM into a high-quality pixel-level text segmentation (TS) model through a parameter-efficient fine-tuning approach. We use this TS model to iteratively generate pixel-level text labels in a semi-automatic manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we train the end-to-end Hi-SAM, built on the TS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both an automatic mask generation (AMG) mode and a promptable segmentation (PS) mode. In AMG mode, Hi-SAM first segments the pixel-level text foreground mask, then samples foreground points for hierarchical text mask generation and obtains layout analysis in passing. In PS mode, Hi-SAM provides word, text-line, and paragraph masks from a single point click. Experimental results show the state-of-the-art performance of our TS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for pixel-level text segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 at the text-line level, and 5.49% PQ and 7.39% F1 at the paragraph level of layout analysis, while requiring $20\times$ fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.
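The two inference modes described in the abstract lend themselves to a compact sketch. The snippet below is a minimal, hedged illustration of the AMG and PS flows only; the class and method names (`HiSAMStub`, `segment_text_foreground`, `decode_hierarchy`) are hypothetical stand-ins, not the actual API of the released repository (https://github.com/ymy-k/Hi-SAM).

```python
# Hedged sketch of Hi-SAM's two inference modes as described in the abstract.
# All names here are illustrative placeholders, not the repository's real API.

import numpy as np


class HiSAMStub:
    """Toy stand-in mimicking the interface implied by the paper."""

    def segment_text_foreground(self, image: np.ndarray) -> np.ndarray:
        # Pixel-level text segmentation (the TS branch): returns a binary mask.
        return np.zeros(image.shape[:2], dtype=bool)

    def decode_hierarchy(self, image: np.ndarray, point: tuple[int, int]) -> dict:
        # Hierarchical mask decoder: one foreground point in, three masks out.
        empty = np.zeros(image.shape[:2], dtype=bool)
        return {"word": empty, "text_line": empty, "paragraph": empty}


def amg_mode(model: HiSAMStub, image: np.ndarray, num_points: int = 100) -> list[dict]:
    """Automatic mask generation: text mask first, then sample foreground points."""
    fg_mask = model.segment_text_foreground(image)
    ys, xs = np.nonzero(fg_mask)
    if len(xs) == 0:
        return []
    idx = np.random.choice(len(xs), size=min(num_points, len(xs)), replace=False)
    # Each sampled point prompts word / text-line / paragraph masks;
    # grouping text-lines into paragraphs yields the layout analysis "in passing".
    return [model.decode_hierarchy(image, (int(xs[i]), int(ys[i]))) for i in idx]


def ps_mode(model: HiSAMStub, image: np.ndarray, click: tuple[int, int]) -> dict:
    """Promptable segmentation: a single user click returns all three hierarchies."""
    return model.decode_hierarchy(image, click)


if __name__ == "__main__":
    img = np.zeros((256, 256, 3), dtype=np.uint8)
    model = HiSAMStub()
    print(len(amg_mode(model, img)), ps_mode(model, img, (128, 128)).keys())
```

The point-sampling step is what lets AMG mode recover word, text-line, and paragraph masks (and hence layout) from a single pixel-level text mask, while PS mode simply routes one user click through the same hierarchical decoder.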
Related papers
- Vision and Language Reference Prompt into SAM for Few-shot Segmentation [1.9458156037869137]
Segment Anything Model (SAM) is a large-scale segmentation model that enables powerful zero-shot capabilities with flexible prompts.
Few-shot segmentation models address this reliance on manual prompts by feeding annotated reference images to SAM as prompts, enabling segmentation of specific objects in target images without user-provided prompts.
We propose a novel few-shot segmentation model, Vision and Language reference Prompt into SAM, that utilizes the visual information of the reference images and the semantic information of the text labels.
arXiv Detail & Related papers (2025-02-02T08:40:14Z) - Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts [12.444549174054988]
Char-SAM is a pipeline that turns SAM into a low-cost segmentation annotator with a character-level visual prompt.
Char-SAM generates high-quality scene text segmentation annotations automatically.
Its training-free nature also enables the generation of high-quality scene text segmentation datasets from real-world datasets like COCO-Text and MLT17.
arXiv Detail & Related papers (2024-12-27T20:33:39Z) - Adapting Segment Anything Model for Unseen Object Instance Segmentation [70.60171342436092]
Unseen Object Instance Segmentation (UOIS) is crucial for autonomous robots operating in unstructured environments.
We propose UOIS-SAM, a data-efficient solution for the UOIS task.
UOIS-SAM integrates two key components: (i) a Heatmap-based Prompt Generator (HPG) to generate class-agnostic point prompts with precise foreground prediction, and (ii) a Hierarchical Discrimination Network (HDNet) that adapts SAM's mask decoder.
arXiv Detail & Related papers (2024-09-23T19:05:50Z) - SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation [88.80792308991867]
Segment Anything Model (SAM) has shown the ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges.
This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation.
Experiments show that SAM-CP achieves semantic, instance, and panoptic segmentation in both open and closed domains.
arXiv Detail & Related papers (2024-07-23T17:47:25Z) - WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models [43.27699553774037]
We propose a Weakly-supervised Part Segmentation (WPS) setting and an approach called WPS-SAM.
WPS-SAM is an end-to-end framework designed to extract prompt tokens directly from images and perform pixel-level segmentation of part regions.
Experiments demonstrate that, through exploiting the rich knowledge embedded in pre-trained foundation models, WPS-SAM outperforms other segmentation models trained with pixel-level strong annotations.
arXiv Detail & Related papers (2024-07-14T09:31:21Z) - PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z) - Scalable Mask Annotation for Video Text Spotting [86.72547285886183]
We propose a scalable mask annotation pipeline called SAMText for video text spotting.
Using SAMText, we have created a large-scale dataset, SAMText-9M, that contains over 2,400 video clips and over 9 million mask annotations.
arXiv Detail & Related papers (2023-05-02T14:18:45Z) - DAN: a Segmentation-free Document Attention Network for Handwritten Document Recognition [1.7875811547963403]
We propose an end-to-end segmentation-free architecture for handwritten document recognition.
The model is trained to label text parts using begin and end tags in an XML-like fashion.
We achieve competitive results on the READ dataset at page level and double-page level, with CERs of 3.53% and 3.69%, respectively.
arXiv Detail & Related papers (2022-03-23T08:40:42Z) - All You Need is a Second Look: Towards Arbitrary-Shaped Text Detection [39.17648241471479]
In this paper, we propose a two-stage segmentation-based detector, termed NASK (Need A Second looK), for arbitrary-shaped text detection.
arXiv Detail & Related papers (2021-06-24T01:44:10Z) - TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)