Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation
- URL: http://arxiv.org/abs/2401.17904v2
- Date: Fri, 08 Nov 2024 10:45:12 GMT
- Title: Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation
- Authors: Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, Dacheng Tao
- Abstract summary: This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation.
Hi-SAM excels in segmentation across four text hierarchies: pixel-level text, word, text-line, and paragraph.
Compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements.
- Score: 97.90960864892966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Segment Anything Model (SAM), a profound vision foundation model pretrained on a large-scale dataset, breaks the boundaries of general segmentation and sparks various downstream applications. This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation. Hi-SAM excels in segmentation across four text hierarchies, namely pixel-level text, word, text-line, and paragraph, while also performing layout analysis. Specifically, we first turn SAM into a high-quality pixel-level text segmentation (TS) model through a parameter-efficient fine-tuning approach. We use this TS model to iteratively generate pixel-level text labels in a semi-automatic manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we train the end-to-end Hi-SAM, built on the TS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both an automatic mask generation (AMG) mode and a promptable segmentation (PS) mode. In AMG mode, Hi-SAM first segments the pixel-level text foreground mask, then samples foreground points for hierarchical text mask generation and obtains layout analysis in passing. In PS mode, Hi-SAM provides word, text-line, and paragraph masks from a single point click. Experimental results show the state-of-the-art performance of our TS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for pixel-level text segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 at the text-line level, and 5.49% PQ and 7.39% F1 at the paragraph level of layout analysis, while requiring $20\times$ fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.
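The two inference modes described in the abstract lend themselves to a compact sketch. The snippet below is a minimal, hedged illustration of the AMG and PS flows only; the class and method names (`HiSAMStub`, `segment_text_foreground`, `decode_hierarchy`) are hypothetical stand-ins, not the actual API of the released repository (https://github.com/ymy-k/Hi-SAM).

```python
# Hedged sketch of Hi-SAM's two inference modes as described in the abstract.
# All names here are illustrative placeholders, not the repository's real API.

import numpy as np


class HiSAMStub:
    """Toy stand-in mimicking the interface implied by the paper."""

    def segment_text_foreground(self, image: np.ndarray) -> np.ndarray:
        # Pixel-level text segmentation (the TS branch): returns a binary mask.
        return np.zeros(image.shape[:2], dtype=bool)

    def decode_hierarchy(self, image: np.ndarray, point: tuple[int, int]) -> dict:
        # Hierarchical mask decoder: one foreground point in, three masks out.
        empty = np.zeros(image.shape[:2], dtype=bool)
        return {"word": empty, "text_line": empty, "paragraph": empty}


def amg_mode(model: HiSAMStub, image: np.ndarray, num_points: int = 100) -> list[dict]:
    """Automatic mask generation: text mask first, then sample foreground points."""
    fg_mask = model.segment_text_foreground(image)
    ys, xs = np.nonzero(fg_mask)
    if len(xs) == 0:
        return []
    idx = np.random.choice(len(xs), size=min(num_points, len(xs)), replace=False)
    # Each sampled point prompts word / text-line / paragraph masks;
    # grouping text-lines into paragraphs yields the layout analysis "in passing".
    return [model.decode_hierarchy(image, (int(xs[i]), int(ys[i]))) for i in idx]


def ps_mode(model: HiSAMStub, image: np.ndarray, click: tuple[int, int]) -> dict:
    """Promptable segmentation: a single user click returns all three hierarchies."""
    return model.decode_hierarchy(image, click)


if __name__ == "__main__":
    img = np.zeros((256, 256, 3), dtype=np.uint8)
    model = HiSAMStub()
    print(len(amg_mode(model, img)), ps_mode(model, img, (128, 128)).keys())
```

The point-sampling step is what lets AMG mode recover word, text-line, and paragraph masks (and hence layout) from a single pixel-level text mask, while PS mode simply routes one user click through the same hierarchical decoder.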
Related papers
- Vision and Language Reference Prompt into SAM for Few-shot Segmentation [1.9458156037869137]
Segment Anything Model (SAM) is a large-scale segmentation model that enables powerful zero-shot capabilities with flexible prompts.
Few-shot segmentation models address this reliance on manual prompts by feeding annotated reference images to SAM as prompts, enabling segmentation of specific objects in target images without user-provided prompts.
We propose a novel few-shot segmentation model, Vision and Language reference Prompt into SAM, that utilizes the visual information of the reference images and the semantic information of the text labels.
arXiv Detail & Related papers (2025-02-02T08:40:14Z) - Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts [12.444549174054988]
Char-SAM is a pipeline that turns SAM into a low-cost segmentation annotator with a character-level visual prompt.
Char-SAM generates high-quality scene text segmentation annotations automatically.
Its training-free nature also enables the generation of high-quality scene text segmentation datasets from real-world datasets like COCO-Text and MLT17.
arXiv Detail & Related papers (2024-12-27T20:33:39Z) - Adapting Segment Anything Model for Unseen Object Instance Segmentation [70.60171342436092]
Unseen Object Instance Segmentation (UOIS) is crucial for autonomous robots operating in unstructured environments.
We propose UOIS-SAM, a data-efficient solution for the UOIS task.
UOIS-SAM integrates two key components: (i) a Heatmap-based Prompt Generator (HPG) to generate class-agnostic point prompts with precise foreground prediction, and (ii) a Hierarchical Discrimination Network (HDNet) that adapts SAM's mask decoder.
arXiv Detail & Related papers (2024-09-23T19:05:50Z) - SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation [88.80792308991867]
Segment Anything Model (SAM) has shown the ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges.
This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation.
Experiments show that SAM-CP achieves semantic, instance, and panoptic segmentation in both open and closed domains.
arXiv Detail & Related papers (2024-07-23T17:47:25Z) - WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models [43.27699553774037]
We propose a Weakly-supervised Part Segmentation (WPS) setting and an approach called WPS-SAM.
WPS-SAM is an end-to-end framework designed to extract prompt tokens directly from images and perform pixel-level segmentation of part regions.
Experiments demonstrate that, through exploiting the rich knowledge embedded in pre-trained foundation models, WPS-SAM outperforms other segmentation models trained with pixel-level strong annotations.
arXiv Detail & Related papers (2024-07-14T09:31:21Z) - PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z) - Scalable Mask Annotation for Video Text Spotting [86.72547285886183]
We propose a scalable mask annotation pipeline called SAMText for video text spotting.
Using SAMText, we have created a large-scale dataset, SAMText-9M, that contains over 2,400 video clips and over 9 million mask annotations.
arXiv Detail & Related papers (2023-05-02T14:18:45Z) - DAN: a Segmentation-free Document Attention Network for Handwritten Document Recognition [1.7875811547963403]
We propose an end-to-end segmentation-free architecture for handwritten document recognition.
The model is trained to label text parts using begin and end tags in an XML-like fashion.
We achieve competitive results on the READ dataset at page level and double-page level, with CERs of 3.53% and 3.69%, respectively.
arXiv Detail & Related papers (2022-03-23T08:40:42Z) - All You Need is a Second Look: Towards Arbitrary-Shaped Text Detection [39.17648241471479]
In this paper, we propose a two-stage segmentation-based detector, termed NASK (Need A Second looK), for arbitrary-shaped text detection.
arXiv Detail & Related papers (2021-06-24T01:44:10Z) - TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)