TexTAR : Textual Attribute Recognition in Multi-domain and Multi-lingual Document Images
- URL: http://arxiv.org/abs/2509.13151v1
- Date: Tue, 16 Sep 2025 15:05:55 GMT
- Title: TexTAR : Textual Attribute Recognition in Multi-domain and Multi-lingual Document Images
- Authors: Rohan Kumar, Jyothi Swaroopa Jinka, Ravi Kiran Sarvadevabhatla
- Abstract summary: We introduce TexTAR, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR). Our architecture employs a 2D RoPE (Rotary Positional Embedding)-style mechanism to incorporate input context for more accurate predictions. We also introduce MMTAD, a diverse, multilingual, multi-domain dataset annotated with text attributes across real-world documents.
- Score: 8.505694818967674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing textual attributes such as bold, italic, underline and strikeout is essential for understanding text semantics, structure, and visual presentation. These attributes highlight key information, making them crucial for document analysis. Existing methods struggle with computational efficiency or adaptability in noisy, multilingual settings. To address this, we introduce TexTAR, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR). Our novel data selection pipeline enhances context awareness, and our architecture employs a 2D RoPE (Rotary Positional Embedding)-style mechanism to incorporate input context for more accurate attribute predictions. We also introduce MMTAD, a diverse, multilingual, multi-domain dataset annotated with text attributes across real-world documents such as legal records, notices, and textbooks. Extensive evaluations show TexTAR outperforms existing methods, demonstrating that contextual awareness contributes to state-of-the-art TAR performance.
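The 2D RoPE-style mechanism named in the abstract can be illustrated with a minimal sketch of the general technique (an assumption about how such mechanisms are typically built, not TexTAR's exact implementation): half of each feature vector is rotated by an angle derived from the token's row position and the other half by its column position, so attention dot products depend only on relative 2D offsets between tokens.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate pairs of channels of x (last dim d, d even) by angles pos * theta_i,
    the standard 1D rotary positional embedding."""
    d = x.shape[-1]
    half = d // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))  # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Each (x1_i, x2_i) pair is rotated in its own 2D plane.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, row, col):
    """2D RoPE-style embedding: half the channels encode the row index,
    the other half the column index (d must be divisible by 4)."""
    d = x.shape[-1]
    return np.concatenate(
        [rope_1d(x[..., : d // 2], row), rope_1d(x[..., d // 2 :], col)], axis=-1
    )
```

Because each rotation is orthogonal, the vector norm is preserved, and the dot product between a rotated query and key depends only on the relative offset (row_q - row_k, col_q - col_k), which is what lets the attention mechanism exploit 2D layout context.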
Related papers
- Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding [42.506971197471195]
We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks.
arXiv Detail & Related papers (2025-05-08T17:37:36Z) - Visual Text Processing: A Comprehensive Review and Unified Evaluation [99.57846940547171]
We present a comprehensive, multi-perspective analysis of recent advancements in visual text processing. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing.
arXiv Detail & Related papers (2025-04-30T14:19:29Z) - TSAL: Few-shot Text Segmentation Based on Attribute Learning [21.413607725856263]
We propose TSAL, which leverages CLIP's prior knowledge to learn text attributes for segmentation. To reduce data dependency and improve text detection accuracy, the adaptive prompt-guided branch employs effective adaptive prompt templates. Experiments demonstrate that our method achieves SOTA performance on multiple text segmentation datasets under few-shot settings.
arXiv Detail & Related papers (2025-04-15T13:12:42Z) - LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification [78.73711446918814]
We propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. Our framework can fully leverage attribute-based text knowledge to improve AG-ReID performance.
arXiv Detail & Related papers (2025-03-31T04:47:05Z) - Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs [96.54224331778195]
We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images.
We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model.
Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating the effectiveness of our method.
arXiv Detail & Related papers (2023-11-22T06:46:37Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained in an end-to-end trainable manner, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z) - StrucTexT: Structured Text Understanding with Multi-Modal Transformers [29.540122964399046]
Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence.
This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks.
We evaluate our method for structured text understanding at segment-level and token-level and show it outperforms the state-of-the-art counterparts.
arXiv Detail & Related papers (2021-08-06T02:57:07Z) - TRIE: End-to-End Text Reading and Information Extraction for Document Understanding [56.1416883796342]
We propose a unified end-to-end text reading and information extraction network.
Multimodal visual and textual features from text reading are fused for information extraction.
Our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.
arXiv Detail & Related papers (2020-05-27T01:47:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.