Col-OLHTR: A Novel Framework for Multimodal Online Handwritten Text Recognition
- URL: http://arxiv.org/abs/2502.06100v1
- Date: Mon, 10 Feb 2025 02:12:24 GMT
- Title: Col-OLHTR: A Novel Framework for Multimodal Online Handwritten Text Recognition
- Authors: Chenyu Liu, Jinshui Hu, Baocai Yin, Jia Pan, Bing Yin, Jun Du, Qingfeng Liu
- Abstract summary: Online Handwritten Text Recognition (OLHTR) has gained considerable attention for its diverse range of applications.
Current approaches usually treat OLHTR as a sequence recognition task, employing either a single trajectory or image encoder, or multi-stream encoders.
We propose a Collaborative learning-based OLHTR framework, called Col-OLHTR, that learns multimodal features during training while maintaining a single-stream inference process.
- Score: 82.88856416080331
- License:
- Abstract: Online Handwritten Text Recognition (OLHTR) has gained considerable attention for its diverse range of applications. Current approaches usually treat OLHTR as a sequence recognition task, employing either a single trajectory or image encoder, or multi-stream encoders, combined with a CTC or attention-based recognition decoder. However, these approaches face several drawbacks: 1) single encoders typically focus on either local trajectories or visual regions, lacking the ability to dynamically capture relevant global features in challenging cases; 2) multi-stream encoders, while more comprehensive, suffer from complex structures and increased inference costs. To tackle this, we propose a Collaborative learning-based OLHTR framework, called Col-OLHTR, that learns multimodal features during training while maintaining a single-stream inference process. Col-OLHTR consists of a trajectory encoder, a Point-to-Spatial Alignment (P2SA) module, and an attention-based decoder. The P2SA module is designed to learn image-level spatial features through trajectory-encoded features and 2D rotary position embeddings. During training, an additional image-stream encoder-decoder is collaboratively trained to provide supervision for P2SA features. At inference, the extra streams are discarded, and only the P2SA module is used and merged before the decoder, simplifying the process while preserving high performance. Extensive experimental results on several OLHTR benchmarks demonstrate the state-of-the-art (SOTA) performance, proving the effectiveness and robustness of our design.
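The abstract describes a training-versus-inference asymmetry: a trajectory encoder and a P2SA module feed an attention-based decoder, an auxiliary image-stream encoder-decoder supervises the P2SA features during training, and only the single trajectory stream is kept at inference. The sketch below (PyTorch) illustrates that overall structure only; it is not the authors' implementation. The module sizes, the LSTM trajectory encoder, the plain MLP standing in for the 2D-rotary-position P2SA attention, the additive feature merge, the MSE alignment loss, and the `image_encoder` stand-in are all assumptions made for illustration.

```python
# Illustrative sketch of the Col-OLHTR idea from the abstract; all design
# details below are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class TrajectoryEncoder(nn.Module):
    """Encodes pen-trajectory points (x, y, pen-state) into a feature sequence."""
    def __init__(self, in_dim=3, dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        self.rnn = nn.LSTM(dim, dim // 2, num_layers=2,
                           batch_first=True, bidirectional=True)

    def forward(self, traj):                    # traj: (B, T, 3)
        feats, _ = self.rnn(self.proj(traj))    # (B, T, dim)
        return feats


class P2SAModule(nn.Module):
    """Point-to-Spatial Alignment stand-in: maps trajectory features toward
    image-level spatial features. A plain MLP is used here in place of the
    2D rotary-position-embedding mechanism described in the paper."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, traj_feats):
        return self.mlp(traj_feats)


class ColOLHTRSketch(nn.Module):
    """Single-stream recognizer: trajectory encoder + P2SA, merged before an
    attention-based decoder. This is the only path needed at inference."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.traj_enc = TrajectoryEncoder(dim=dim)
        self.p2sa = P2SAModule(dim=dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.char_emb = nn.Embedding(vocab_size, dim)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, traj, tgt_tokens):
        traj_feats = self.traj_enc(traj)
        spatial_feats = self.p2sa(traj_feats)
        memory = traj_feats + spatial_feats          # merge before the decoder
        out = self.decoder(self.char_emb(tgt_tokens), memory)
        return self.classifier(out), spatial_feats


def training_step(model, image_encoder, traj, image, tgt_tokens, ce, mse):
    """Collaborative training sketch: an image-stream encoder (hypothetical
    stand-in, assumed to return features aligned to the trajectory length)
    supervises the P2SA output; at inference this branch is simply dropped."""
    logits, p2sa_feats = model(traj, tgt_tokens)
    img_feats = image_encoder(image)                 # assumed shape (B, T, dim)
    rec_loss = ce(logits.transpose(1, 2), tgt_tokens)
    align_loss = mse(p2sa_feats, img_feats)          # supervision for P2SA features
    return rec_loss + align_loss
```

The point of the structure is that the auxiliary image branch only contributes a loss term during training, so discarding it leaves the inference graph identical to a plain single-stream trajectory recognizer with the P2SA features merged in.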
Related papers
- EVEv2: Improved Baselines for Encoder-Free Vision-Language Models [72.07868838411474]
Existing encoder-free vision-language models (VLMs) are narrowing the performance gap with their encoder-based counterparts.
We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones.
We show that properly and hierarchically associating vision and language within a unified model reduces interference between modalities.
arXiv Detail & Related papers (2025-02-10T18:59:58Z)
- A Simple Baseline with Single-encoder for Referring Image Segmentation [14.461024566536478]
We present a novel RIS method with a single-encoder, i.e., BEiT-3, maximizing the potential of shared self-attention.
Our simple baseline with a single encoder achieves outstanding performances on the RIS benchmark datasets.
arXiv Detail & Related papers (2024-08-28T04:14:01Z)
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- Efficient Encoder-Decoder and Dual-Path Conformer for Comprehensive Feature Learning in Speech Enhancement [0.2538209532048866]
This paper proposes a time-frequency (T-F) domain speech enhancement network (DPCFCS-Net)
It incorporates improved densely connected blocks, dual-path modules, convolution-augmented transformers (conformers), channel attention, and spatial attention.
Compared with previous models, our proposed model has a more efficient encoder-decoder and can learn comprehensive features.
arXiv Detail & Related papers (2023-06-09T12:52:01Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800X faster at inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Adjacent Context Coordination Network for Salient Object Detection in Optical Remote Sensing Images [102.75699068451166]
We propose a novel Adjacent Context Coordination Network (ACCoNet) to explore the coordination of adjacent features in an encoder-decoder architecture for optical RSI-SOD.
The proposed ACCoNet outperforms 22 state-of-the-art methods under nine evaluation metrics, and runs up to 81 fps on a single NVIDIA Titan X GPU.
arXiv Detail & Related papers (2022-03-25T14:14:55Z)
- LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval [117.15862403330121]
We propose LoopITR, which combines dual encoders and cross encoders in the same network for joint learning.
Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder.
arXiv Detail & Related papers (2022-03-10T16:41:12Z)
- Auto-Encoder based Co-Training Multi-View Representation Learning [10.120166898507328]
We propose a novel algorithm called Auto-encoder based Co-training Multi-View Learning (ACMVL)
The algorithm has two stages: the first trains an auto-encoder for each view, and the second trains a supervised network.
Experimental results show that the learned latent feature representation performs well, and that the auto-encoder of each view has stronger reconstruction ability than a traditional auto-encoder.
arXiv Detail & Related papers (2022-01-09T10:20:16Z)
- Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition [10.496558786568672]
We propose a Representation and Correlation Enhanced Encoder-Decoder Framework (RCEED) to address these deficiencies and break the performance bottleneck.
In the encoder module, local visual feature, global context feature, and position information are aligned and fused to generate a small-size comprehensive feature map.
In the decoder module, two methods are utilized to enhance the correlation between scene and text feature space.
arXiv Detail & Related papers (2021-06-13T10:36:56Z)