Related papers: Class-Agnostic Region-of-Interest Matching in Document Images

Class-Agnostic Region-of-Interest Matching in Document Images

URL: http://arxiv.org/abs/2506.21055v1
Date: Thu, 26 Jun 2025 07:09:19 GMT
Title: Class-Agnostic Region-of-Interest Matching in Document Images
Authors: Demin Zhang, Jiahao Lyu, Zhijie Shen, Yu Zhou,
Abstract summary: This paper defines a new task named Class-Agnostic Region-of-Interest Matching''<n>It aims to match customized regions in a flexible, efficient, multi-granularity, and open-set manner.<n>We construct a benchmark RoI-Matching-Bench, which sets three levels of difficulties following real-world conditions.<n>We also propose a new framework RoI-Matcher, which employs a siamese network to extract multi-level features.
Score: 5.0512633844625405
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Document understanding and analysis have received a lot of attention due to their widespread application. However, existing document analysis solutions, such as document layout analysis and key information extraction, are only suitable for fixed category definitions and granularities, and cannot achieve flexible applications customized by users. Therefore, this paper defines a new task named ``Class-Agnostic Region-of-Interest Matching'' (``RoI-Matching'' for short), which aims to match the customized regions in a flexible, efficient, multi-granularity, and open-set manner. The visual prompt of the reference document and target document images are fed into our model, while the output is the corresponding bounding boxes in the target document images. To meet the above requirements, we construct a benchmark RoI-Matching-Bench, which sets three levels of difficulties following real-world conditions, and propose the macro and micro metrics to evaluate. Furthermore, we also propose a new framework RoI-Matcher, which employs a siamese network to extract multi-level features both in the reference and target domains, and cross-attention layers to integrate and align similar semantics in different domains. Experiments show that our method with a simple procedure is effective on RoI-Matching-Bench, and serves as the baseline for further research. The code is available at https://github.com/pd162/RoI-Matching.

Related papers

MoDora: Tree-Based Semi-Structured Document Analysis System [62.01015188258797]
Semi-structured documents integrate diverse interleaved data elements arranged in various and often irregular layouts.<n>MoDora is an LLM-powered system for semi-structured document analysis.<n> Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy.
arXiv Detail & Related papers (2026-02-26T14:48:49Z)
CMRAG: Co-modality-based visual document retrieval and question answering [21.016544020685668]
Co-Modality-based RAG (RAG) framework can leverage texts and images for more accurate retrieval and generation.<n>Our framework consistently outperforms single-modality-based RAG in multiple visual document question-answering (VDQA) benchmarks.
arXiv Detail & Related papers (2025-09-02T09:17:57Z)
ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links [57.514511353084565]
We introduce a new domain-agnostic framework for selecting a best-performing approach and annotating cross-document links.<n>We apply our framework in two distinct domains -- peer review and news.<n>The resulting novel datasets lay foundation for numerous cross-document tasks like media framing and peer review.
arXiv Detail & Related papers (2025-09-01T11:32:24Z)
DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning [39.10966524559436]
Document image segmentation is crucial for document analysis and recognition.<n>Existing methods address these tasks separately, resulting in limited generalization and resource wastage.<n>This paper introduces DocSAM, a transformer-based unified framework designed for various document image segmentation tasks.
arXiv Detail & Related papers (2025-04-05T07:14:53Z)
S2 Chunking: A Hybrid Framework for Document Segmentation Through Integrated Spatial and Semantic Analysis [0.0]
Document chunking is a critical task in natural language processing (NLP)<n>This paper introduces a novel hybrid approach that combines layout structure, semantic analysis, and spatial relationships.<n> Experimental results demonstrate that this approach outperforms traditional methods.
arXiv Detail & Related papers (2025-01-08T09:06:29Z)
Subtopic-aware View Sampling and Temporal Aggregation for Long-form Document Matching [34.81690842091582]
Long-form document matching aims to judge the relevance between two documents.<n>We introduce a new framework to model representative matching signals.<n>Our learning framework is effective on several document-matching tasks, including news duplication and legal case retrieval.
arXiv Detail & Related papers (2024-12-10T15:06:48Z)
Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings. First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss. Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]
We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time. Our results on FUNSD, CORD, XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z)
SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation [15.953725529361874]
Document layout analysis is a known problem to the documents research community. With growing internet connectivity to personal life, an enormous amount of documents had been available in the public domain. We address this challenge using self-supervision and unlike, the few existing self-supervised document segmentation approaches.
arXiv Detail & Related papers (2023-05-01T12:47:55Z)
Cross-document Event Coreference Search: Task, Dataset and Modeling [26.36068336169796]
We propose an appealing, and often more applicable, complementary set up for the task - Cross-document Coreference Search. To support research on this task, we create a corresponding dataset, which is derived from Wikipedia. We present a novel model that integrates a powerful coreference scoring scheme into the DPR architecture, yielding improved performance.
arXiv Detail & Related papers (2022-10-23T08:21:25Z)
RDU: A Region-based Approach to Form-style Document Understanding [69.29541701576858]
Key Information Extraction (KIE) is aimed at extracting structured information from form-style documents. We develop a new KIE model named Region-based Understanding Document (RDU) RDU takes as input the text content and corresponding coordinates of a document, and tries to predict the result by localizing a bounding-box-like region.
arXiv Detail & Related papers (2022-06-14T14:47:48Z)
Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories. We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
iFacetSum: Coreference-based Interactive Faceted Summarization for Multi-Document Exploration [63.272359227081836]
iFacetSum integrates interactive summarization together with faceted search. Fine-grained facets are automatically produced based on cross-document coreference pipelines.
arXiv Detail & Related papers (2021-09-23T20:01:11Z)
VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations [40.721146438291335]
We propose a unified framework VSR for document layout analysis, combining vision, semantics and relations. On three popular benchmarks, VSR outperforms previous models by large margins.
arXiv Detail & Related papers (2021-05-13T12:20:30Z)
Cross-Domain Document Object Detection: Benchmark Suite and Method [71.4339949510586]
Document object detection (DOD) is fundamental for downstream tasks like intelligent document editing and understanding. We investigate cross-domain DOD, where the goal is to learn a detector for the target domain using labeled data from the source domain and only unlabeled data from the target domain. For each dataset, we provide the page images, bounding box annotations, PDF files, and the rendering layers extracted from the PDF files.
arXiv Detail & Related papers (2020-03-30T03:04:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.