InkFM: A Foundational Model for Full-Page Online Handwritten Note Understanding
- URL: http://arxiv.org/abs/2503.23081v1
- Date: Sat, 29 Mar 2025 13:45:24 GMT
- Title: InkFM: A Foundational Model for Full-Page Online Handwritten Note Understanding
- Authors: Anastasiia Fadeeva, Vincent Coriou, Diego Antognini, Claudiu Musat, Andrii Maksai,
- Abstract summary: We introduce a foundational model called InkFM for analyzing full pages of handwritten content. It offers a unique combination of capabilities: recognizing text in 28 different scripts, recognizing mathematical expressions, and segmenting pages into distinct elements like text and drawings.
- Score: 10.065311465170382
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tablets and styluses are increasingly popular for taking notes. To optimize this experience and ensure a smooth and efficient workflow, it's important to develop methods for accurately interpreting and understanding the content of handwritten digital notes. We introduce a foundational model called InkFM for analyzing full pages of handwritten content. Trained on a diverse mixture of tasks, this model offers a unique combination of capabilities: recognizing text in 28 different scripts, recognizing mathematical expressions, and segmenting pages into distinct elements like text and drawings. Our results demonstrate that these tasks can be effectively unified within a single model, achieving state-of-the-art out-of-the-box text line segmentation quality that surpasses public baselines like docTR. Fine-tuning or LoRA-tuning our base model on public datasets further improves the quality of page segmentation and achieves state-of-the-art text recognition (on the DeepWriting, CASIA, SCUT, and MathWriting datasets) and sketch classification (QuickDraw). This adaptability makes InkFM a powerful starting point for developing applications with handwritten input.
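The abstract mentions LoRA-tuning the base model for downstream tasks. As a rough, hypothetical sketch of what low-rank adaptation involves (InkFM's own code and APIs are not described in the abstract, so the module below is purely illustrative), a frozen linear projection can be wrapped with a small trainable low-rank update:

```python
# Minimal LoRA sketch in PyTorch: a frozen base projection plus a trainable
# low-rank update, roughly the mechanism referred to as "LoRA-tuning" above.
# All names (rank, alpha, the wrapped layer) are illustrative, not InkFM's API.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep pretrained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scale * B (A x); only A and B receive gradients.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale


layer = LoRALinear(nn.Linear(512, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # trainable LoRA params only
```

In practice such adapters are typically attached to the attention and feed-forward projections of the pretrained model, so that only a small fraction of parameters is trained per downstream task.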
Related papers
- Handwritten Digit Recognition: An Ensemble-Based Approach for Superior Performance [9.174021241188143]
This paper presents an ensemble-based approach that combines Convolutional Neural Networks (CNNs) with traditional machine learning techniques to improve recognition accuracy and robustness. We evaluate our method on the MNIST dataset, comprising 70,000 handwritten digit images. Our hybrid model, which uses CNNs for feature extraction and Support Vector Machines (SVMs) for classification, achieves an accuracy of 99.30%.
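As a hedged sketch of the hybrid pipeline summarized above (a CNN used for feature extraction and an SVM for the final classification), the snippet below wires together torchvision's MNIST loader and scikit-learn; the network size, training budget, and SVM settings are placeholders rather than the paper's configuration:

```python
# Sketch of a CNN-feature + SVM pipeline on MNIST (illustrative sizes/settings).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

device = "cuda" if torch.cuda.is_available() else "cpu"
tfm = transforms.ToTensor()
train = Subset(datasets.MNIST("data", train=True, download=True, transform=tfm), range(10000))
test = Subset(datasets.MNIST("data", train=False, download=True, transform=tfm), range(2000))

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten())                       # -> 32 * 7 * 7 features per image
        self.head = nn.Linear(32 * 7 * 7, 10)   # used only to train the features

    def forward(self, x):
        return self.head(self.features(x))

model = SmallCNN().to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for x, y in DataLoader(train, batch_size=128, shuffle=True):   # one quick training pass
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x.to(device)), y.to(device))
    loss.backward()
    opt.step()

def extract(split):
    feats, labels = [], []
    with torch.no_grad():
        for x, y in DataLoader(split, batch_size=256):
            feats.append(model.features(x.to(device)).cpu())
            labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

svm = SVC(kernel="rbf")                        # the SVM classifies the CNN features
Xtr, ytr = extract(train)
Xte, yte = extract(test)
svm.fit(Xtr, ytr)
print("accuracy:", accuracy_score(yte, svm.predict(Xte)))
```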
arXiv Detail & Related papers (2025-03-08T07:09:49Z) - Contrastive Masked Autoencoders for Character-Level Open-Set Writer Identification [25.996617568144675]
This paper introduces Contrastive Masked Auto-Encoders (CMAE) for character-level open-set writer identification. We merge Masked Auto-Encoders (MAE) with Contrastive Learning (CL) to capture sequential information and distinguish diverse handwriting styles, respectively. Our model achieves state-of-the-art results on the CASIA online handwriting dataset, reaching a precision of 89.7%.
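A toy sketch of the kind of joint objective described above, assuming stroke-sequence inputs and combining a masked-reconstruction term with a contrastive term; the encoder, masking scheme, and loss weighting are illustrative stand-ins, not the CMAE architecture:

```python
# Toy combination of a masked-reconstruction loss and a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.GRU(input_size=3, hidden_size=64, batch_first=True)  # (dx, dy, pen) per point
decoder = nn.Linear(64, 3)
proj = nn.Linear(64, 32)                                          # contrastive projection head

strokes = torch.randn(8, 100, 3)                  # batch of toy stroke sequences
mask = (torch.rand(8, 100, 1) < 0.5).float()      # mask half of the points
states, _ = encoder(strokes * (1 - mask))         # zero out masked points (crude stand-in for dropping them)
recon_loss = F.mse_loss(decoder(states) * mask, strokes * mask)   # reconstruct only the masked part

# NT-Xent-style contrastive loss between two noisy "views" of the same batch.
def embed(x):
    h, _ = encoder(x)
    return F.normalize(proj(h[:, -1]), dim=-1)

z1 = embed(strokes + 0.01 * torch.randn_like(strokes))
z2 = embed(strokes + 0.01 * torch.randn_like(strokes))
logits = z1 @ z2.T / 0.1                          # temperature 0.1
contrastive_loss = F.cross_entropy(logits, torch.arange(len(z1)))

total_loss = recon_loss + 0.5 * contrastive_loss  # joint objective
total_loss.backward()
```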
arXiv Detail & Related papers (2025-01-21T05:15:10Z) - HAND: Hierarchical Attention Network for Multi-Scale Handwritten Document Recognition and Layout Analysis [21.25786478579275]
Handwritten document recognition is one of the most challenging tasks in computer vision. Traditionally, this problem has been approached as two separate tasks: handwritten text recognition and layout analysis. This paper introduces HAND, a novel end-to-end, segmentation-free architecture that performs text recognition and layout analysis simultaneously.
arXiv Detail & Related papers (2024-12-25T20:36:29Z) - Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]
We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z) - Representing Online Handwriting for Recognition in Large Vision-Language Models [8.344510330567495]
We propose a novel tokenized representation of digital ink (online handwriting) that encodes a time-ordered sequence of strokes both as text and as an image.
We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers.
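A minimal illustration of the two views such a representation could take, assuming a toy stroke format; the token scheme and canvas size below are made up and do not reproduce the paper's tokenization:

```python
# Turning an online-handwriting sample (a time-ordered list of strokes, each a
# list of (x, y) points) into a plain-text token sequence and a rendered image.
from PIL import Image, ImageDraw

strokes = [[(10, 12), (14, 18), (20, 25)],        # toy ink: two strokes
           [(30, 10), (31, 22)]]

# 1) Text view: quantize coordinates and join points, with a stroke separator.
ink_text = " | ".join(" ".join(f"{int(x)},{int(y)}" for x, y in s) for s in strokes)
print(ink_text)   # "10,12 14,18 20,25 | 30,10 31,22"

# 2) Image view: draw each stroke as a polyline on a small canvas.
canvas = Image.new("L", (64, 64), color=255)
draw = ImageDraw.Draw(canvas)
for stroke in strokes:
    draw.line(stroke, fill=0, width=2)
canvas.save("ink.png")
```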
arXiv Detail & Related papers (2024-02-23T13:11:10Z) - Enhancing Visually-Rich Document Understanding via Layout Structure Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z) - Sampling and Ranking for Digital Ink Generation on a tight computational budget [69.15275423815461]
We study ways to maximize the quality of the output of a trained digital ink generative model.
We compare the effect of multiple sampling and ranking techniques in the first ablation study of its kind in the digital ink domain.
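A generic sample-then-rank skeleton, with stand-in `generate` and `score` functions; the concrete sampling and ranking strategies studied in the paper are not reproduced here:

```python
# Draw several candidates from a generative model and keep the one a separate
# scorer prefers; `generate` and `score` are placeholders for real components.
import random
from typing import Callable, List


def sample_and_rank(generate: Callable[[], str],
                    score: Callable[[str], float],
                    num_samples: int = 8) -> str:
    candidates: List[str] = [generate() for _ in range(num_samples)]
    return max(candidates, key=score)          # highest-scoring candidate wins


# Toy usage: "generation" picks a random string, "ranking" prefers longer ones.
best = sample_and_rank(lambda: random.choice(["a", "ab", "abc"]), len)
print(best)
```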
arXiv Detail & Related papers (2023-06-02T09:55:15Z) - Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions [52.250269529057014]
Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task.
We propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text.
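A minimal deformable-convolution block using torchvision's `DeformConv2d`, where an ordinary convolution predicts the sampling offsets; channel sizes and the surrounding architecture are arbitrary and only illustrate the operation the paper adopts:

```python
# A regular conv predicts per-location sampling offsets, and DeformConv2d
# applies the kernel at those deformed positions.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # 2 offsets (dx, dy) per kernel position and output location
        self.offsets = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deform(x, self.offsets(x))


x = torch.randn(1, 3, 64, 256)                 # e.g. a text-line image crop
print(DeformableBlock(3, 32)(x).shape)         # torch.Size([1, 32, 64, 256])
```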
arXiv Detail & Related papers (2022-08-17T06:55:54Z) - Continuous Offline Handwriting Recognition using Deep Learning Models [0.0]
Handwritten text recognition is an open problem of great interest in the area of automatic document image analysis.
We propose a new recognition model that integrates two types of deep learning architectures: convolutional neural networks (CNNs) and sequence-to-sequence (seq2seq) models.
The new proposed model provides competitive results with those obtained with other well-established methodologies.
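A bare-bones skeleton of the CNN encoder plus sequence-to-sequence decoder idea, assuming line images and teacher forcing; the sizes, vocabulary, and omission of attention are simplifications rather than the paper's model:

```python
# CNN turns a line image into a horizontal feature sequence; a GRU decoder
# emits character logits conditioned on the encoded line.
import torch
import torch.nn as nn


class CNNSeq2Seq(nn.Module):
    def __init__(self, vocab_size: int = 80, hidden: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(                      # (B, 1, 32, W) -> (B, 64, 4, W/8)
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.encoder = nn.GRU(64 * 4, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
        f = self.cnn(image)                            # (B, C, H', W')
        seq = f.permute(0, 3, 1, 2).flatten(2)         # (B, W', C*H') feature sequence
        _, h = self.encoder(seq)                       # summary state of the line
        dec, _ = self.decoder(self.embed(target_tokens), h)   # teacher forcing
        return self.out(dec)                           # (B, T, vocab) logits


model = CNNSeq2Seq()
logits = model(torch.randn(2, 1, 32, 256), torch.randint(0, 80, (2, 20)))
print(logits.shape)                                    # torch.Size([2, 20, 80])
```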
arXiv Detail & Related papers (2021-12-26T07:31:03Z) - Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)