Towards Universal Khmer Text Recognition
- URL: http://arxiv.org/abs/2603.00702v1
- Date: Sat, 28 Feb 2026 15:23:09 GMT
- Title: Towards Universal Khmer Text Recognition
- Authors: Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing, Masakazu Iwamura, Koichi Kise
- Abstract summary: Khmer is a low-resource language characterized by a complex script. Training a separate model for each modality precludes cross-modality transfer learning. We propose a universal Khmer text recognition framework capable of handling diverse text modalities.
- Score: 3.5477182055025107
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Khmer is a low-resource language characterized by a complex script, presenting significant challenges for optical character recognition (OCR). While printed document text recognition has advanced thanks to available datasets, performance on other modalities, such as handwritten and scene text, remains limited by data scarcity. Training a separate model for each modality precludes cross-modality transfer learning, from which modalities with limited data could otherwise benefit. Moreover, deploying many modality-specific models incurs significant memory overhead and requires error-prone routing of each input image to the appropriate model. On the other hand, simply training on a combined dataset with a non-uniform data distribution across modalities often degrades performance on underrepresented modalities. To address these issues, we propose a universal Khmer text recognition (UKTR) framework capable of handling diverse text modalities. Central to our method is a novel modality-aware adaptive feature selection (MAFS) technique designed to adapt visual features to the modality of a particular input image and to enhance recognition robustness across modalities. Extensive experiments demonstrate that our model achieves state-of-the-art (SoTA) performance. Furthermore, we introduce the first comprehensive benchmark for universal Khmer text recognition, which we release to the community to facilitate future research. Our datasets and models can be accessed via this gated repository\footnote{in review}.
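The abstract describes MAFS only at a high level. The block below is a minimal sketch of one plausible reading, assuming a PyTorch encoder-decoder recognizer in which a soft modality classifier gates the visual feature channels; the class and attribute names (ModalityAwareFeatureSelection, modality_head, gates) are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityAwareFeatureSelection(nn.Module):
    """Hypothetical MAFS-style block (not the paper's code): a soft
    modality classifier predicts the input's modality from pooled
    visual features, and the prediction mixes per-modality channel
    gates that re-weight the visual feature sequence."""

    def __init__(self, feat_dim: int, num_modalities: int = 3):
        super().__init__()
        # e.g. printed / handwritten / scene text
        self.modality_head = nn.Linear(feat_dim, num_modalities)
        # one learnable channel-gating vector per modality
        self.gates = nn.Parameter(torch.zeros(num_modalities, feat_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, feat_dim) from the visual encoder
        pooled = feats.mean(dim=1)                      # (B, D)
        probs = self.modality_head(pooled).softmax(-1)  # (B, M) soft modality
        gate = torch.sigmoid(probs @ self.gates)        # (B, D) mixed gates
        return feats * gate.unsqueeze(1)                # channel re-weighting

# usage: mafs = ModalityAwareFeatureSelection(256)
# adapted = mafs(torch.randn(4, 50, 256))  # same shape, modality-adapted
```

In a full recognizer such a block would sit between the visual encoder and the text decoder, so one shared backbone can serve printed, handwritten, and scene text.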
Related papers
- UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings [9.344107676552408]
We propose UniMoCo, a vision-language model architecture designed for multi-modal embedding tasks. We develop a specialized training strategy to align embeddings from both original and modality-completed inputs. Experiments show that UniMoCo outperforms previous methods while demonstrating consistent robustness across diverse settings.
arXiv Detail & Related papers (2025-05-17T03:53:11Z)
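The summary gives no detail on UniMoCo's alignment strategy; as a minimal sketch, one could imagine a symmetric InfoNCE objective pulling the embedding of a modality-completed input toward the embedding of its original full-modality counterpart (the function name and temperature are hypothetical):

```python
import torch
import torch.nn.functional as F

def completion_alignment_loss(emb_original: torch.Tensor,
                              emb_completed: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical symmetric InfoNCE alignment: each original-input
    embedding matches the embedding of its modality-completed
    counterpart, contrasted against the rest of the batch."""
    a = F.normalize(emb_original, dim=-1)
    b = F.normalize(emb_completed, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```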
- Enhancing Visual Representation for Text-based Person Searching [9.601697802095119]
VFE-TPS is a Visual Feature Enhanced Text-based Person Search model. It introduces a pre-trained CLIP backbone to learn basic multimodal features. It constructs a Text Guided Masked Image Modeling task to enhance the model's ability to learn local visual details.
arXiv Detail & Related papers (2024-12-30T01:38:14Z)
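The Text Guided Masked Image Modeling task is only named in the summary; the following is a speculative sketch of such an objective, assuming masked patch slots attend to caption tokens before reconstructing pixels (all class and argument names are invented for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedMIM(nn.Module):
    """Speculative text-guided masked image modeling head: masked
    patch slots attend to caption tokens, then a linear decoder
    reconstructs the masked patches' pixels."""

    def __init__(self, dim: int = 512, patch_pixels: int = 16 * 16 * 3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8,
                                                batch_first=True)
        self.decoder = nn.Linear(dim, patch_pixels)

    def forward(self, patch_emb, text_emb, target_patches, mask):
        # patch_emb:      (B, N, D) image patch embeddings (e.g. from CLIP)
        # text_emb:       (B, T, D) caption token embeddings
        # target_patches: (B, N, P) flattened ground-truth patch pixels
        # mask:           (B, N) bool, True where a patch is masked
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand_as(patch_emb), patch_emb)
        x, _ = self.cross_attn(x, text_emb, text_emb)  # text guides the fill-in
        pred = self.decoder(x)                         # (B, N, P)
        return F.mse_loss(pred[mask], target_patches[mask])
```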
- Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention [45.31956918333587]
In multimodal sentiment analysis, collecting text data is often more challenging than collecting video or audio. We have developed a robust model that integrates multimodal sentiment information even in the absence of the text modality.
arXiv Detail & Related papers (2024-10-19T07:59:41Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification [49.41632476658246]
We discuss the extension of DFKD to Vision-Language Foundation Models without access to billion-scale image-text datasets.
The objective is to customize a student model for distribution-agnostic downstream tasks with given category concepts.
We propose three novel Prompt Diversification methods to encourage image synthesis with diverse styles.
arXiv Detail & Related papers (2024-07-21T13:26:30Z)
- Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities [6.9522425458326635]
We propose a multi-tower decoder architecture that flexibly composes multimodal generative models from independently pre-trained unimodal decoders.
We show the proposed architecture performs very competitively in scenarios with limited aligned text-speech data.
In cross-modal tasks such as text-to-speech generation (TTS) where the output modality is speech, we show that using a pre-trained speech backbone results in superior performance to the baseline.
arXiv Detail & Related papers (2024-05-29T00:23:55Z)
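The Zipper summary names a multi-tower decoder but shows no mechanism; a hypothetical fusion layer in that spirit might let the hidden states of two independently pre-trained towers exchange information through gated cross-attention (names and gating scheme are assumptions, not the paper's design):

```python
import torch
import torch.nn as nn

class TwoTowerFusionLayer(nn.Module):
    """Hypothetical fusion layer between a text tower and a speech
    tower: each tower's hidden states cross-attend to the other's,
    with zero-initialized gates so fusion starts as an identity."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.text_to_speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate_t = nn.Parameter(torch.zeros(1))  # identity at init
        self.gate_s = nn.Parameter(torch.zeros(1))

    def forward(self, h_text, h_speech):
        # h_text: (B, Lt, D), h_speech: (B, Ls, D) per-tower hidden states
        t_ctx, _ = self.text_to_speech(h_text, h_speech, h_speech)
        s_ctx, _ = self.speech_to_text(h_speech, h_text, h_text)
        h_text = h_text + torch.tanh(self.gate_t) * t_ctx
        h_speech = h_speech + torch.tanh(self.gate_s) * s_ctx
        return h_text, h_speech
```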
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework that exploits the knowledge of multimodal models without manually annotated text. A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal-temporal complementary information. Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning [9.949354222717773]
Cross-modal attribute insertions are a realistic perturbation strategy for vision-and-language data.
We find that augmenting input text using cross-modal insertions causes state-of-the-art approaches for text-to-image retrieval and cross-modal entailment to perform poorly.
Crowd-sourced annotations demonstrate that cross-modal insertions lead to higher quality augmentations for multimodal data.
arXiv Detail & Related papers (2023-06-19T17:00:03Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, the best-performing architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
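The entry above compares CTC and Transformer decoders across several encoders but shows no code; the block below is a minimal, hypothetical text-line recognizer in that spirit (a small CNN, a BiLSTM encoder, and a CTC output layer), not any of the paper's actual models:

```python
import torch
import torch.nn as nn

class CTCLineRecognizer(nn.Module):
    """Minimal sketch of a CTC-decoded line recognizer: conv features,
    a BiLSTM encoder over image columns, and per-column class
    log-probabilities. Handles lines of arbitrary width."""

    def __init__(self, num_classes: int, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(                      # input: (B, 1, 32, W)
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )                                               # -> (B, 128, 8, W/4)
        self.encoder = nn.LSTM(128 * 8, hidden,
                               bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_classes + 1)  # +1 = CTC blank

    def forward(self, images):
        f = self.conv(images)                           # (B, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one step per column
        out, _ = self.encoder(f)
        return self.head(out).log_softmax(-1)           # (B, W', classes+1)
```

Training would pair these per-column log-probabilities with nn.CTCLoss, which expects time-major inputs (i.e. `out.transpose(0, 1)`).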