DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning
- URL: http://arxiv.org/abs/2504.04085v1
- Date: Sat, 05 Apr 2025 07:14:53 GMT
- Title: DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning
- Authors: Xiao-Hui Li, Fei Yin, Cheng-Lin Liu
- Abstract summary: Document image segmentation is crucial for document analysis and recognition. Existing methods address segmentation tasks separately, resulting in limited generalization and wasted resources. This paper introduces DocSAM, a transformer-based unified framework designed for various document image segmentation tasks.
- Score: 39.10966524559436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document image segmentation is crucial for document analysis and recognition but remains challenging due to the diversity of document formats and segmentation tasks. Existing methods often address these tasks separately, resulting in limited generalization and resource wastage. This paper introduces DocSAM, a transformer-based unified framework designed for various document image segmentation tasks, such as document layout analysis, multi-granularity text segmentation, and table structure recognition, by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM employs Sentence-BERT to map category names from each dataset into semantic queries that match the dimensionality of instance queries. These two sets of queries interact through an attention mechanism and are cross-attended with image features to predict instance and semantic segmentation masks. Instance categories are predicted by computing the dot product between instance and semantic queries, followed by softmax normalization of scores. Consequently, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computational and storage resources. Comprehensive evaluations show that DocSAM surpasses existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation across various applications. Codes are available at https://github.com/xhli-git/DocSAM.
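To make the query-decomposition idea concrete, below is a minimal, illustrative sketch of how Sentence-BERT-derived semantic queries, learned instance queries, query interaction, cross-attention with image features, and dot-product classification could fit together. This is not the authors' implementation: the encoder checkpoint, query count, dimensions, and layer choices are assumptions for illustration, and the real DocSAM decoder is more elaborate (see the linked repository).

```python
# Minimal sketch of DocSAM-style query decomposition (illustrative only; not the authors' code).
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer  # standard Sentence-BERT library

D = 256           # shared query dimension (assumed)
N_INSTANCE = 100  # number of instance queries (assumed)

class QueryDecompositionHead(nn.Module):
    def __init__(self, sbert_dim=384, dim=D, n_instance=N_INSTANCE):
        super().__init__()
        self.instance_queries = nn.Embedding(n_instance, dim)  # learned instance queries
        self.sem_proj = nn.Linear(sbert_dim, dim)               # project Sentence-BERT vectors to the query dimension
        self.query_attn = nn.MultiheadAttention(dim, 8, batch_first=True)  # instance <-> semantic interaction
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)  # queries <-> image features
        self.mask_proj = nn.Linear(dim, dim)                     # query -> mask embedding

    def forward(self, image_feats, sem_queries):
        # image_feats: (B, HW, dim) flattened backbone features; sem_queries: (C, dim), one per category name
        B = image_feats.size(0)
        inst = self.instance_queries.weight.unsqueeze(0).expand(B, -1, -1)  # (B, N, dim)
        sem = sem_queries.unsqueeze(0).expand(B, -1, -1)                    # (B, C, dim)

        # Let instance and semantic queries interact, then cross-attend to image features.
        queries = torch.cat([inst, sem], dim=1)
        queries, _ = self.query_attn(queries, queries, queries)
        queries, _ = self.cross_attn(queries, image_feats, image_feats)
        inst, sem = queries[:, :inst.size(1)], queries[:, inst.size(1):]

        # Instance and semantic masks: dot products between query embeddings and pixel features.
        inst_masks = torch.einsum("bnd,bpd->bnp", self.mask_proj(inst), image_feats)
        sem_masks = torch.einsum("bcd,bpd->bcp", self.mask_proj(sem), image_feats)

        # Instance classification: dot product between instance and semantic queries, then softmax.
        class_logits = torch.einsum("bnd,bcd->bnc", inst, sem)
        return inst_masks, sem_masks, class_logits.softmax(dim=-1)

# Category names from a given dataset become semantic queries via Sentence-BERT.
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint (384-dim output)
names = ["text region", "table", "figure"]          # hypothetical category names
sbert_vecs = torch.tensor(encoder.encode(names))    # (C, 384)

head = QueryDecompositionHead()
sem_queries = head.sem_proj(sbert_vecs)             # (C, 256)
image_feats = torch.randn(1, 64 * 64, D)            # stand-in for backbone features
inst_masks, sem_masks, class_probs = head(image_feats, sem_queries)
```

Under this formulation, joint training on heterogeneous datasets is natural: each dataset contributes only its own category-name queries, while the instance queries and image backbone are shared across all tasks.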
Related papers
- MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents [26.39534684408116]
This work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions.
arXiv Detail & Related papers (2025-01-15T14:30:13Z) - S2 Chunking: A Hybrid Framework for Document Segmentation Through Integrated Spatial and Semantic Analysis [0.0]
Document chunking is a critical task in natural language processing (NLP). This paper introduces a novel hybrid approach that combines layout structure, semantic analysis, and spatial relationships. Experimental results demonstrate that this approach outperforms traditional methods.
arXiv Detail & Related papers (2025-01-08T09:06:29Z) - Instance-Aware Generalized Referring Expression Segmentation [32.96760407482406]
InstAlign is a method that incorporates object-level reasoning into the segmentation process.
Our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.
arXiv Detail & Related papers (2024-11-22T17:28:43Z) - Unified Multimodal Interleaved Document Representation for Retrieval [57.65409208879344]
We propose a method that holistically embeds documents interleaved with multiple modalities. We merge the representations of segmented passages into a single document representation. We show that our approach substantially outperforms relevant baselines.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$^2$).
GR$^2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$^2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z) - Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format.
We first craft the Wiki-SS dataset, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset.
For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality task of slide retrieval, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10.
arXiv Detail & Related papers (2024-06-17T06:27:35Z) - DLAFormer: An End-to-End Transformer For Document Layout Analysis [7.057192434574117]
We propose an end-to-end transformer-based approach for document layout analysis, called DLAFormer.
We treat various DLA sub-tasks as relation prediction problems and consolidate these relation prediction labels into a unified label space.
We introduce a novel set of type-wise queries to enhance the physical meaning of content queries in DETR.
arXiv Detail & Related papers (2024-05-20T03:34:24Z) - From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model, MiniSeg, which outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z) - Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction [61.998789448260005]
We propose to identify the typical structure of documents within a collection.
We abstract over arbitrary header paraphrases and ground each topic to its respective document locations.
We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
arXiv Detail & Related papers (2024-02-21T16:22:21Z) - DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer [16.03084865625318]
Business intelligence processes often require the extraction of useful semantic content from documents.
We present a transformer-based model for end-to-end segmentation of complex layouts in document images.
Our model achieves segmentation performance comparable to or better than existing state-of-the-art approaches.
arXiv Detail & Related papers (2022-01-27T10:50:22Z)