Visual Semantics Allow for Textual Reasoning Better in Scene Text
Recognition
- URL: http://arxiv.org/abs/2112.12916v1
- Date: Fri, 24 Dec 2021 02:43:42 GMT
- Title: Visual Semantics Allow for Textual Reasoning Better in Scene Text
Recognition
- Authors: Yue He, Chen Chen, Jing Zhang, Juhua Liu, Fengxiang He, Chaoyue Wang,
Bo Du
- Abstract summary: We make the first attempt to perform textual reasoning based on visual semantics in this paper.
We devise a graph convolutional network for textual reasoning (GTR) by supervising it with a cross-entropy loss.
S-GTR sets new state-of-the-art on six challenging STR benchmarks and generalizes well to multi-linguistic datasets.
- Score: 46.83992441581874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Scene Text Recognition (STR) methods typically use a language model
to optimize the joint probability of the 1D character sequence predicted by a
visual recognition (VR) model. This approach ignores the 2D spatial context of
visual semantics within and between character instances, so such methods do not
generalize well to arbitrarily shaped scene text. To address this issue, we make
the first attempt to perform textual reasoning based on visual semantics in this paper.
Technically, given the character segmentation maps predicted by a VR model, we
construct a subgraph for each instance, where nodes represent the pixels in it
and edges are added between nodes based on their spatial similarity. Then,
these subgraphs are sequentially connected by their root nodes and merged into
a complete graph. Based on this graph, we devise a graph convolutional network
for textual reasoning (GTR) by supervising it with a cross-entropy loss. GTR
can be easily plugged into representative STR models to improve their performance
owing to better textual reasoning. Specifically, we construct our model, namely
S-GTR, by paralleling GTR to the language model in a segmentation-based STR
baseline, which can effectively exploit the visual-linguistic complementarity
via mutual learning. S-GTR sets new state-of-the-art on six challenging STR
benchmarks and generalizes well to multi-linguistic datasets. Code is available
at https://github.com/adeline-cs/GTR.
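The graph construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes per-character binary segmentation masks are already available, uses a simple Euclidean distance threshold (`radius`) as a stand-in for the paper's "spatial similarity" edge criterion, and picks the pixel nearest each instance centroid as that instance's root node (the paper does not specify how roots are chosen).

```python
import numpy as np

def build_text_graph(char_masks, radius=1.5):
    """Merge per-character subgraphs into one graph for textual reasoning.

    char_masks: list of HxW binary arrays, one per predicted character
                instance, ordered along the reading direction.
    Returns (coords, adj): an (N, 2) array of node pixel coordinates and
    an (N, N) boolean adjacency matrix over all instances combined.
    """
    coords, adj_blocks, roots = [], [], []
    offset = 0
    for mask in char_masks:
        ys, xs = np.nonzero(mask)
        pts = np.stack([ys, xs], axis=1).astype(float)
        # Intra-instance edges: connect pixels within `radius` of each
        # other (a simple proxy for spatial similarity), no self-loops.
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        block = (d <= radius) & (d > 0)
        # Root node of this subgraph: pixel nearest the instance centroid
        # (an assumption; the paper leaves the root choice unspecified).
        root = int(np.argmin(np.linalg.norm(pts - pts.mean(axis=0), axis=1)))
        roots.append(offset + root)
        coords.append(pts)
        adj_blocks.append(block)
        offset += len(pts)
    # Assemble the block-diagonal adjacency over all subgraphs.
    adj = np.zeros((offset, offset), dtype=bool)
    o = 0
    for block in adj_blocks:
        k = block.shape[0]
        adj[o:o + k, o:o + k] = block
        o += k
    # Sequentially chain consecutive instances through their root nodes,
    # yielding the single connected graph fed to the GCN.
    for a, b in zip(roots[:-1], roots[1:]):
        adj[a, b] = adj[b, a] = True
    return np.concatenate(coords, axis=0), adj
```

A graph convolutional network supervised with cross-entropy (GTR) would then propagate features over this adjacency; that part is omitted here since it is standard GCN machinery.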
Related papers
- A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
arXiv Detail & Related papers (2024-06-19T22:30:08Z)
- Unleashing the Potential of Text-attributed Graphs: Automatic Relation Decomposition via Large Language Models [31.443478448031886]
RoSE (Relation-oriented Semantic Edge-decomposition) is a novel framework that decomposes the graph structure by analyzing raw text attributes.
Our framework significantly enhances node classification performance across various datasets, with improvements of up to 16% on the Wisconsin dataset.
arXiv Detail & Related papers (2024-05-28T20:54:47Z)
- Instruction-Guided Scene Text Recognition [51.853730414264625]
We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem.
We develop a lightweight instruction encoder, a cross-modal feature fusion module, and a multi-task answer head, which together guide nuanced text-image understanding.
IGTR outperforms existing models by significant margins, while maintaining a small model size and efficient inference speed.
arXiv Detail & Related papers (2024-01-31T14:13:01Z)
- Pretraining Language Models with Text-Attributed Heterogeneous Graphs [28.579509154284448]
We present a new pretraining framework for Language Models (LMs) that explicitly considers the topological and heterogeneous information in Text-Attributed Heterogeneous Graphs (TAHGs).
We propose a topology-aware pretraining task to predict nodes involved in the context graph by jointly optimizing an LM and an auxiliary heterogeneous graph neural network.
We conduct link prediction and node classification tasks on three datasets from various domains.
arXiv Detail & Related papers (2023-10-19T08:41:21Z)
- StrokeNet: Stroke Assisted and Hierarchical Graph Reasoning Networks [31.76016966100244]
StrokeNet is proposed to effectively detect the texts by capturing the fine-grained strokes.
Unlike existing approaches that represent the text area by a series of points or rectangular boxes, we directly localize the strokes of each text instance.
arXiv Detail & Related papers (2021-11-23T08:26:42Z)
- R2D2: Relational Text Decoding with Transformers [18.137828323277347]
We propose a novel framework for modeling the interaction between graphical structures and the natural language text associated with their nodes and edges.
Our proposed method utilizes both the graphical structure as well as the sequential nature of the texts.
While the proposed model has wide applications, we demonstrate its capabilities on data-to-text generation tasks.
arXiv Detail & Related papers (2021-05-10T19:59:11Z)
- Spatial-spectral Hyperspectral Image Classification via Multiple Random Anchor Graphs Ensemble Learning [88.60285937702304]
This paper proposes a novel spatial-spectral HSI classification method via multiple random anchor graphs ensemble learning (RAGE).
Firstly, the local binary pattern is adopted to extract the more descriptive features on each selected band, which preserves local structures and subtle changes of a region.
Secondly, adaptive neighbor assignment is introduced in the construction of the anchor graphs to reduce the computational complexity.
arXiv Detail & Related papers (2021-03-25T09:31:41Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- Graph Optimal Transport for Cross-Domain Alignment [121.80313648519203]
Cross-domain alignment is fundamental to computer vision and natural language processing.
We propose Graph Optimal Transport (GOT), a principled framework that germinates from recent advances in Optimal Transport (OT).
Experiments show consistent outperformance of GOT over baselines across a wide range of tasks.
arXiv Detail & Related papers (2020-06-26T01:14:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site makes no guarantee as to the quality of this information and accepts no responsibility for any consequences of its use.