PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
- URL: http://arxiv.org/abs/2601.21957v1
- Date: Thu, 29 Jan 2026 16:35:04 GMT
- Title: PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
- Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, Yanjun Ma
- Abstract summary: We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. We extend the model's capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency.
- Score: 16.27904802735372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model's capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: https://github.com/PaddlePaddle/PaddleOCR
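The linked repository ships Python pipelines for these models. Below is a minimal sketch using the long-standing PaddleOCR Python API to parse a photographed or skewed page; the dedicated PaddleOCR-VL-1.5 pipeline may expose a different entry point, so the model selection and call signature here should be treated as assumptions based on older releases of the package.

```python
# Minimal sketch: OCR on a distorted document capture with the classic
# PaddleOCR Python API. PaddleOCR-VL-1.5 itself may require a newer
# pipeline entry point; this shows the established interface only.
from paddleocr import PaddleOCR

# Text-direction classification helps on the rotated/skewed captures
# that Real5-OmniDocBench is designed to stress.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

result = ocr.ocr("scanned_page.jpg", cls=True)
for box, (text, confidence) in result[0]:
    print(f"{confidence:.2f}  {text}")
```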
Related papers
- LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR [0.29410438275861583]
We present LightOnOCR-2-1B, a multilingual vision-language model that converts document images into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench. We release model checkpoints under Apache 2.0, and publicly release the dataset and the LightOnOCR-bbox-bench evaluation under their respective licenses.
arXiv Detail & Related papers (2026-01-20T18:58:32Z)
- NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards [41.87267797252411]
Vision-language-action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization. We introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by adding a flow-matching-based action expert. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies.
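A "flow-matching-based action expert" is trained to predict the velocity field that carries Gaussian noise to an expert action chunk. The sketch below shows the standard flow-matching objective applied to actions; the model signature and shapes are illustrative assumptions, not NORA-1.5's implementation.

```python
# Illustrative flow-matching loss for an action expert (not NORA-1.5's code).
# The network learns a velocity field v(x_t, t, ctx) moving Gaussian noise
# x_0 toward an expert action chunk x_1 along the straight path
# x_t = (1 - t) * x_0 + t * x_1, whose true velocity is x_1 - x_0.
import torch

def flow_matching_loss(model, actions, context):
    """actions: (B, horizon, action_dim); context: VLM features (B, d)."""
    noise = torch.randn_like(actions)                          # x_0
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)
    x_t = (1 - t) * noise + t * actions                        # interpolant
    target_velocity = actions - noise                          # d x_t / d t
    pred_velocity = model(x_t, t.flatten(), context)           # v_theta
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)
```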
arXiv Detail & Related papers (2025-11-18T16:55:48Z)
- PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model [24.435689905776744]
PaddleOCR-VL-0.9B is a compact yet powerful vision-language model (VLM). It integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. The model efficiently supports 109 languages and excels at recognizing complex elements.
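A NaViT-style dynamic-resolution encoder processes each page near its native aspect ratio, so the visual token count varies per image instead of being fixed by square resizing. A sketch of that bookkeeping follows; the patch size and token budget are made-up values, not the model's actual configuration.

```python
# Sketch of NaViT-style dynamic-resolution preprocessing: keep the native
# aspect ratio, floor each side to a multiple of the patch size, and stay
# within a total token budget. PATCH and MAX_TOKENS are assumed values.
import math

PATCH = 14          # ViT patch edge (assumed)
MAX_TOKENS = 2048   # per-image visual token budget (assumed)

def dynamic_resolution(width: int, height: int) -> tuple[int, int, int]:
    """Map a native page size to a patch-aligned size within the budget."""
    scale = min(1.0, math.sqrt(MAX_TOKENS * PATCH * PATCH / (width * height)))
    w = max(PATCH, int(width * scale) // PATCH * PATCH)  # floor to patch grid
    h = max(PATCH, int(height * scale) // PATCH * PATCH)
    return w, h, (w // PATCH) * (h // PATCH)

print(dynamic_resolution(1654, 2339))  # A4 scan at ~200 DPI -> 2014 tokens
```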
arXiv Detail & Related papers (2025-10-16T10:18:48Z)
- PocketSR: The Super-Resolution Expert in Your Pocket Mobiles [69.26751136689533]
Real-world image super-resolution (RealSR) aims to enhance the visual quality of in-the-wild images, such as those captured by mobile phones. Existing methods leveraging large generative models demonstrate impressive results, but their high computational cost and latency make them impractical for edge deployment. We introduce PocketSR, an ultra-lightweight, single-step model that brings generative modeling capabilities to RealSR while maintaining high fidelity.
arXiv Detail & Related papers (2025-10-03T13:56:18Z)
- On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations [52.1029745126386]
In vision-language-action (VLA) models, robustness to real-world perturbations is critical for deployment. We propose RobustVLA, which defends against perturbations in VLA inputs and outputs. Experiments on LIBERO demonstrate that RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone.
arXiv Detail & Related papers (2025-09-26T14:42:23Z)
- E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition [3.186993645370078]
In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system specifically optimized for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand-annotated dataset of multilingual (54 languages) images. Our findings demonstrate that the most suitable OCR systems for edge deployment are the traditional ones, owing to their low compute requirements, low...
arXiv Detail & Related papers (2025-09-03T18:08:41Z)
- Ultra-Resolution Adaptation with Ease [62.56434979517156]
We propose a set of key guidelines for ultra-resolution adaptation, termed URAE. We show that tuning minor components of the weight matrices outperforms widely used low-rank adapters when synthetic data are unavailable. Experiments validate that URAE achieves 2K-generation performance comparable to state-of-the-art closed-source models such as FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations.
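"Tuning minor components of the weight matrices" can be read as freezing the dominant singular directions of a pretrained weight and training only the residual tail, in contrast to a LoRA adapter that adds a fresh low-rank term. A hedged sketch of that split follows; the rank cutoff is an assumption and URAE's actual parameterization may differ.

```python
# Sketch: split a pretrained weight W by SVD, freeze the top-r "major"
# component, and fine-tune only the "minor" remainder. Illustrative only.
import torch

def split_major_minor(weight: torch.Tensor, r: int = 16):
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    major = (U[:, :r] * S[:r]) @ Vh[:r]       # frozen dominant part
    minor = weight - major                    # trainable residual tail
    return major.detach(), torch.nn.Parameter(minor)

W = torch.randn(512, 512)
frozen, trainable = split_major_minor(W)
# The layer's forward pass uses frozen + trainable; only `trainable`
# receives gradients during adaptation.
```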
arXiv Detail & Related papers (2025-03-20T16:44:43Z)
- SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer [49.1761733723771]
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. We introduce three key innovations: Efficient Training Scaling, Model Depth Pruning, and Inference-time Scaling. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.81 on GenEval, which can be further improved to 0.96 through inference-time scaling with VILA-Judge.
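Judge-guided inference-time scaling reduces to a best-of-N search: sample several candidates per prompt and keep the one the judge scores highest. A minimal sketch under assumed `generate` and `judge_score` interfaces (neither is SANA's actual API):

```python
# Sketch of judge-guided inference-time scaling (best-of-N selection).
# `generate` and `judge_score` are assumed stand-ins for the diffusion
# sampler and a VILA-Judge-style scorer.
def best_of_n(prompt, generate, judge_score, n=8):
    candidates = [generate(prompt, seed=i) for i in range(n)]
    scores = [judge_score(prompt, img) for img in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```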
arXiv Detail & Related papers (2025-01-30T15:31:48Z)
- Robust Fine-tuning of Zero-shot Models via Variance Reduction [56.360865951192324]
When fine-tuning zero-shot models, our desideratum is for the fine-tuned model to excel in both in-distribution (ID) and out-of-distribution (OOD) accuracy.
We propose a sample-wise ensembling technique that can simultaneously attain the best ID and OOD accuracy without the trade-offs.
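Sample-wise ensembling picks a mixing weight per input rather than one global coefficient shared across the dataset. A hedged sketch follows, where `weight_fn` is an assumed stand-in for the paper's variance-reduction rule:

```python
# Sketch of sample-wise output ensembling between a zero-shot model and its
# fine-tuned counterpart. `weight_fn` stands in for the paper's rule for
# choosing the per-sample coefficient; this is not the authors' code.
import torch

def sample_wise_ensemble(zeroshot, finetuned, weight_fn, x):
    logits_zs = zeroshot(x)                # (B, num_classes)
    logits_ft = finetuned(x)
    alpha = weight_fn(x).unsqueeze(-1)     # per-sample weight in [0, 1], (B, 1)
    return alpha * logits_ft + (1 - alpha) * logits_zs
```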
arXiv Detail & Related papers (2024-11-11T13:13:39Z)
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection [23.464027681439706]
Grounding DINO 1.5 is a suite of advanced open-set object detection models developed by IDEA Research.
Grounding DINO 1.5 Pro is a high-performance model designed for stronger generalization capability across a wide range of scenarios.
Grounding DINO 1.5 Edge is an efficiency-optimized model built for the faster speeds demanded by applications requiring edge deployment.
arXiv Detail & Related papers (2024-05-16T17:54:15Z)
- LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaptation (LoRA) has emerged as a parameter-efficient way to fine-tune large language models (LLMs).
LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
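Because the base weights stay frozen under LoRA, channel importance for structured pruning can be approximated from the LoRA factors and their gradients alone, which is what keeps memory usage low. The criterion below is a rough Taylor-style illustration of that idea, not LoRAPrune's exact published formula:

```python
# Rough sketch of LoRA-guided channel importance for structured pruning.
# With frozen base weight W and LoRA update B @ A, the full-weight gradient
# is approximated from the LoRA gradients (run after a backward pass).
# Illustrative criterion only, not LoRAPrune's exact formula.
import torch

def channel_importance(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor):
    """W: (out, in) frozen; A: (r, in), B: (out, r), both with .grad set."""
    delta = B @ A                             # effective LoRA update
    grad_W_approx = B.grad @ A + B @ A.grad   # LoRA-based gradient estimate
    # |(W + delta) * grad| summed over inputs approximates the loss change
    # from zeroing each output channel (first-order Taylor saliency).
    saliency = ((W + delta) * grad_W_approx).abs().sum(dim=1)
    return saliency                           # prune channels with low scores
```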
arXiv Detail & Related papers (2023-05-28T15:15:48Z)
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups; a sketch of this split follows the results below.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
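The SDTA encoder begins with the channel split referenced above: the feature map is chunked along channels, each chunk gets its own depth-wise convolution, and branch outputs cascade forward. A sketch of that split-and-cascade pattern (sizes are illustrative assumptions; the transpose-attention half is omitted):

```python
# Sketch of the channel-group split in an SDTA-style encoder: chunk the
# feature map along channels, give each chunk a depth-wise conv, and feed
# each branch the previous branch's output (a Res2Net-style cascade).
import torch
import torch.nn as nn

class ChannelGroupSplit(nn.Module):
    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        self.group_ch = channels // groups
        self.dwconvs = nn.ModuleList(
            nn.Conv2d(self.group_ch, self.group_ch, kernel_size=3,
                      padding=1, groups=self.group_ch)   # depth-wise conv
            for _ in range(groups)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.chunk(x, len(self.dwconvs), dim=1)
        outs, prev = [], 0
        for chunk, conv in zip(chunks, self.dwconvs):
            prev = conv(chunk + prev)   # cascade previous branch's output
            outs.append(prev)
        return torch.cat(outs, dim=1)

y = ChannelGroupSplit()(torch.randn(1, 64, 56, 56))  # -> (1, 64, 56, 56)
```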
arXiv Detail & Related papers (2022-06-21T17:59:56Z)