Representing Online Handwriting for Recognition in Large Vision-Language
Models
- URL: http://arxiv.org/abs/2402.15307v1
- Date: Fri, 23 Feb 2024 13:11:10 GMT
- Title: Representing Online Handwriting for Recognition in Large Vision-Language
Models
- Authors: Anastasiia Fadeeva, Philippe Schlattner, Andrii Maksai, Mark Collier,
Efi Kokiopoulou, Jesse Berent, Claudiu Musat
- Abstract summary: We propose a novel tokenized representation of digital ink (online handwriting) that includes both a time-ordered sequence of strokes as text, and as image.
We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers.
- Score: 8.344510330567495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The adoption of tablets with touchscreens and styluses is increasing, and a
key feature is converting handwriting to text, enabling search, indexing, and
AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to
solution for image understanding, thanks to both their state-of-the-art
performance across a variety of tasks and the simplicity of a unified approach
to training, fine-tuning, and inference. While VLMs obtain high performance on
image-based tasks, they perform poorly on handwriting recognition when applied
naively, i.e., by rendering handwriting as an image and performing optical
character recognition (OCR). In this paper, we study online handwriting
recognition with VLMs, going beyond naive OCR. We propose a novel tokenized
representation of digital ink (online handwriting) that includes both a
time-ordered sequence of strokes as text, and as image. We show that this
representation yields results comparable to or better than state-of-the-art
online handwriting recognizers. Wide applicability is shown through results
with two different VLM families, on multiple public datasets. Our approach can
be applied to off-the-shelf VLMs, does not require any changes in their
architecture, and can be used in both fine-tuning and parameter-efficient
tuning. We perform a detailed ablation study to identify the key elements of
the proposed representation.
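A minimal sketch of the two-part ink representation the abstract describes: the same digital-ink sample is serialized once as a time-ordered text sequence of discretized stroke coordinates and once rendered as an image, so both can be passed to an off-the-shelf VLM. The `<stroke>` separator token, the coordinate grid size, and the 224x224 rendering below are illustrative assumptions, not the paper's actual tokenization scheme.
```python
# Sketch only: an assumed serialization of online handwriting (strokes of
# (x, y) points) into (a) a text token sequence and (b) a rendered image.
from dataclasses import dataclass
from typing import List, Tuple

from PIL import Image, ImageDraw  # pip install pillow

Stroke = List[Tuple[float, float]]  # (x, y) points in writing order


@dataclass
class InkRepresentation:
    text: str           # stroke sequence serialized as text tokens
    image: Image.Image  # the same ink rendered as a bitmap


def ink_to_text(strokes: List[Stroke], grid: int = 128) -> str:
    """Quantize coordinates to a grid and join strokes with a separator
    token (illustrative format, not the paper's)."""
    xs = [p[0] for s in strokes for p in s]
    ys = [p[1] for s in strokes for p in s]
    min_x, min_y = min(xs), min(ys)
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0

    def quant(v: float, lo: float) -> int:
        return int((v - lo) / span * (grid - 1))

    parts = []
    for stroke in strokes:
        coords = " ".join(f"{quant(x, min_x)},{quant(y, min_y)}" for x, y in stroke)
        parts.append(coords)
    return " <stroke> ".join(parts)


def ink_to_image(strokes: List[Stroke], size: int = 224) -> Image.Image:
    """Render the same strokes as a grayscale image for the VLM's vision tower."""
    img = Image.new("L", (size, size), color=255)
    draw = ImageDraw.Draw(img)
    xs = [p[0] for s in strokes for p in s]
    ys = [p[1] for s in strokes for p in s]
    min_x, min_y = min(xs), min(ys)
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    scale = (size - 20) / span
    for stroke in strokes:
        pts = [(10 + (x - min_x) * scale, 10 + (y - min_y) * scale) for x, y in stroke]
        if len(pts) > 1:
            draw.line(pts, fill=0, width=2)
    return img


def represent_ink(strokes: List[Stroke]) -> InkRepresentation:
    return InkRepresentation(text=ink_to_text(strokes), image=ink_to_image(strokes))


if __name__ == "__main__":
    demo = [[(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)], [(0.5, 1.0), (1.5, 1.0)]]
    rep = represent_ink(demo)
    print(rep.text[:80], "...")
    rep.image.save("ink.png")
```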
Related papers
- Attention Prompting on Image for Large Vision-Language Models [63.794304207664176]
We propose a new prompting technique named Attention Prompting on Image.
We generate an attention heatmap for the input image dependent on the text query with an auxiliary model like CLIP.
Experiments on various vision-language benchmarks verify the effectiveness of our technique.
arXiv Detail & Related papers (2024-09-25T17:59:13Z) - UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z) - Revisiting Multimodal Representation in Contrastive Learning: From Patch
and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We propose a learning-based vision-language pre-training approach that builds on contrastive frameworks such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z) - CLIPTER: Looking at the Bigger Picture in Scene Text Recognition [10.561377899703238]
We harness the capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer.
We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer word-level features via a cross-attention gated mechanism.
arXiv Detail & Related papers (2023-01-18T12:16:19Z) - Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-11-23T07:00:11Z) - UIT-HWDB: Using Transferring Method to Construct A Novel Benchmark for
Evaluating Unconstrained Handwriting Image Recognition in Vietnamese [2.8360662552057323]
In Vietnamese, besides the modern Latin characters, there are accents and letter marks, together with characters, that cause confusion for state-of-the-art handwriting recognition methods.
As a low-resource language, there are not many datasets for researching handwriting recognition in Vietnamese.
Recent works evaluated offline handwriting recognition methods in Vietnamese using images from an online handwriting dataset constructed by connecting pen stroke coordinates without further processing.
This paper proposes the Transferring method to construct a handwriting image dataset that incorporates the crucial natural attributes required for offline handwriting images.
arXiv Detail & Related papers (2022-11-10T08:23:54Z) - Boosting Modern and Historical Handwritten Text Recognition with
Deformable Convolutions [52.250269529057014]
Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task.
We propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text.
arXiv Detail & Related papers (2022-08-17T06:55:54Z) - Content and Style Aware Generation of Text-line Images for Handwriting
Recognition [4.301658883577544]
We propose a generative method for handwritten text-line images conditioned on both visual appearance and textual content.
Our method is able to produce long text-line samples with diverse handwriting styles.
arXiv Detail & Related papers (2022-04-12T05:52:03Z) - SmartPatch: Improving Handwritten Word Imitation with Patch
Discriminators [67.54204685189255]
We propose SmartPatch, a new technique increasing the performance of current state-of-the-art methods.
We combine the well-known patch loss with information gathered from the parallel trained handwritten text recognition system.
This leads to a stronger local discriminator and results in more realistic, higher-quality generated handwritten words.
arXiv Detail & Related papers (2021-05-21T18:34:21Z) - Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z)