Dynamic Relation Transformer for Contextual Text Block Detection
- URL: http://arxiv.org/abs/2401.09232v1
- Date: Wed, 17 Jan 2024 14:17:59 GMT
- Title: Dynamic Relation Transformer for Contextual Text Block Detection
- Authors: Jiawei Wang, Shunchi Zhang, Kai Hu, Chixiang Ma, Zhuoyao Zhong, Lei
Sun, Qiang Huo
- Abstract summary: Contextual Text Block Detection (CTBD) is the task of identifying coherent text blocks within the complexity of natural scenes.
Previous methodologies have treated CTBD as either a visual relation extraction challenge within computer vision or as a sequence modeling problem.
We introduce a new framework that frames CTBD as a graph generation problem.
- Score: 9.644204545582742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual Text Block Detection (CTBD) is the task of identifying coherent
text blocks within the complexity of natural scenes. Previous methodologies
have treated CTBD as either a visual relation extraction challenge within
computer vision or as a sequence modeling problem from the perspective of
natural language processing. We introduce a new framework that frames CTBD as a
graph generation problem. This methodology consists of two essential
procedures: identifying individual text units as graph nodes and discerning the
sequential reading order relationships among these units as graph edges.
Leveraging the cutting-edge capabilities of DQ-DETR for node detection, our
framework innovates further by integrating a novel mechanism, a Dynamic
Relation Transformer (DRFormer), dedicated to edge generation. DRFormer
incorporates a dual interactive transformer decoder that deftly manages a
dynamic graph structure refinement process. Through this iterative process, the
model systematically enhances the graph's fidelity, ultimately resulting in
improved precision in detecting contextual text blocks. Comprehensive
experimental evaluations conducted on both SCUT-CTW-Context and ReCTS-Context
datasets substantiate that our method achieves state-of-the-art results,
underscoring the effectiveness and potential of our graph generation framework
in advancing the field of CTBD.
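To make the graph-generation framing concrete, the sketch below shows how predicted reading-order edges can be chained into contextual text blocks. This is a minimal illustration under stated assumptions, not the authors' implementation: it presumes the node-detection stage (DQ-DETR in the paper) has already produced text units and the edge-generation stage (DRFormer) has already predicted each unit's reading-order successor; all names and signatures here are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class TextUnit:
    """A detected text unit (e.g. a word box); a node in the reading-order graph."""
    index: int
    box: tuple  # (x1, y1, x2, y2) in image coordinates
    text: str = ""

def group_into_blocks(units: List[TextUnit],
                      successor: Dict[int, Optional[int]]) -> List[List[TextUnit]]:
    """Chain predicted reading-order edges into contextual text blocks.

    successor[i] is the predicted reading-order successor of unit i, or
    None if unit i ends its block. Assuming each unit has at most one
    predecessor, the edges form disjoint chains, and each chain is one
    contextual text block.
    """
    by_index = {u.index: u for u in units}
    has_predecessor = {j for j in successor.values() if j is not None}
    blocks: List[List[TextUnit]] = []
    for unit in units:
        if unit.index in has_predecessor:
            continue  # only start a chain at a unit with no predecessor
        block, cur, seen = [], unit.index, set()
        while cur is not None and cur not in seen:  # guard against cyclic predictions
            seen.add(cur)
            block.append(by_index[cur])
            cur = successor.get(cur)
        blocks.append(block)
    return blocks

# Toy usage: units 0 -> 1 form one block; unit 2 stands alone.
units = [TextUnit(0, (10, 10, 50, 20), "Hello"),
         TextUnit(1, (10, 25, 60, 35), "world"),
         TextUnit(2, (200, 10, 240, 20), "EXIT")]
blocks = group_into_blocks(units, {0: 1, 1: None, 2: None})
assert [[u.text for u in b] for b in blocks] == [["Hello", "world"], ["EXIT"]]
```

Note that in the paper the edge set is not produced in one shot: DRFormer's dual interactive decoder refines the graph structure iteratively, and the chaining step above would apply only to the final predicted graph.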
Related papers
- Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model.
A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances.
Tests demonstrate that DAT achieves state-of-the-art performance across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z) - Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z) - ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy
in Transformer [88.61312640540902]
We introduce the Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter).
Our model achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder.
Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2023-08-20T03:22:23Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Text-driven Video Prediction [83.04845684117835]
We propose a new task called Text-driven Video Prediction (TVP).
Taking the first frame and a text caption as inputs, this task aims to synthesize the following frames.
To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM).
arXiv Detail & Related papers (2022-10-06T12:43:07Z) - TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - R2D2: Relational Text Decoding with Transformers [18.137828323277347]
We propose a novel framework for modeling the interaction between graphical structures and the natural language text associated with their nodes and edges.
Our proposed method utilizes both the graphical structure as well as the sequential nature of the texts.
While the proposed model has wide applications, we demonstrate its capabilities on data-to-text generation tasks.
arXiv Detail & Related papers (2021-05-10T19:59:11Z) - Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z)