Fugu-MT paper translation (abstract): ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis

Paper overview: ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis

arxiv url: http://arxiv.org/abs/2603.25168v1
Date: Thu, 26 Mar 2026 08:37:32 GMT
Status: translation complete
Last updated in system: 2026-03-27 20:52:48.185949
Title: ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis
Authors: Xike Zhang, Maoyuan Ye, Juhua Liu, Bo Du
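Abstract summary: ET-SAM is an efficient SAM-based framework with two decoders for unified scene text detection and layout analysis. A lightweight point decoder produces word heatmaps from which a few foreground points are obtained. Three sets of learnable task prompts are introduced in both the point decoder and the hierarchical mask decoder to mitigate discrepancies across datasets.
Reference score (independently computed attention score): 39.062952450992746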
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Previous works based on the Segment Anything Model (SAM) have achieved promising performance in unified scene text detection and layout analysis. However, they typically rely on pixel-level text segmentation to sample thousands of foreground points as prompts, which leads to unsatisfactory inference latency and limited data utilization. To address these issues, we propose ET-SAM, an Efficient framework with two decoders for unified scene Text detection and layout analysis based on SAM. Technically, we customize a lightweight point decoder that produces word heatmaps to obtain a few foreground points, thereby eliminating excessive point prompts and accelerating inference. Without the dependence on pixel-level segmentation, we further design a joint training strategy to leverage existing data with heterogeneous text-level annotations. Specifically, the datasets with multi-level, word-level only, and line-level only annotations are combined in parallel as a unified training set. For these datasets, we introduce three corresponding sets of learnable task prompts in both the point decoder and hierarchical mask decoder to mitigate discrepancies across datasets. Extensive experiments demonstrate that, compared to the previous SAM-based architecture, ET-SAM achieves about 3$\times$ inference acceleration while obtaining competitive performance on HierText, and improves the F-score by an average of 11.0% on Total-Text, CTW1500, and ICDAR15.
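
The abstract does not say how the few foreground points are read off the word heatmaps. As a rough illustration only, the sketch below assumes one plausible approach: thresholded local-maximum picking on a 2D heatmap. The function name `heatmap_to_point_prompts` and the parameters `threshold` and `max_points` are hypothetical and do not come from the paper.

```python
import numpy as np


def heatmap_to_point_prompts(heatmap, threshold=0.5, max_points=100):
    """Pick local maxima of a 2D heatmap as (x, y) point prompts.

    Assumption (not from the paper): a pixel is kept when it is the
    maximum of its 3x3 neighbourhood and its score exceeds `threshold`.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    h, w = heatmap.shape

    # Pad with -inf so border pixels are compared only with real neighbours.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)

    # Maximum over the nine shifted views, i.e. a 3x3 max filter.
    shifted = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    neighbourhood_max = np.max(shifted, axis=0)

    is_peak = (heatmap >= neighbourhood_max) & (heatmap > threshold)
    ys, xs = np.nonzero(is_peak)

    # Keep the highest-scoring peaks first, capped at max_points.
    order = np.argsort(-heatmap[ys, xs], kind="stable")[:max_points]
    return np.stack([xs[order], ys[order]], axis=1)
```

Under this assumption each word contributes roughly one point, so the number of prompts passed to the mask decoder grows with the number of words rather than with the number of foreground pixels.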
