HunyuanOCR Technical Report
- URL: http://arxiv.org/abs/2511.19575v1
- Date: Mon, 24 Nov 2025 17:59:59 GMT
- Title: HunyuanOCR Technical Report
- Authors: Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Houwen Peng, Hongming Yang, Senhao Xie, Binghong Wu, Mana Yang, Sergey Wang, Raccoon Liu, Dick Zhu, Jie Jiang, Linus, Han Hu, Chengquan Zhang,
- Abstract summary: HunyuanOCR is a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. It surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation). It achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.
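The abstract points to a vLLM-based deployment path for the open-sourced model. A minimal sketch of what offline inference through vLLM's Python API could look like is given below; the HuggingFace model id and the prompt string are illustrative assumptions, as the abstract does not state them.

```python
# Minimal sketch: serving an OCR VLM through vLLM's offline inference API.
# Assumptions: the model id "tencent/HunyuanOCR" and the instruction prompt
# are illustrative; consult the official HuggingFace release for the real ones.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="tencent/HunyuanOCR")      # assumed model id
params = SamplingParams(temperature=0.0,   # greedy decoding suits OCR
                        max_tokens=2048)

image = Image.open("receipt.png")
outputs = llm.generate(
    {
        "prompt": "Extract all text from this image.",  # illustrative prompt
        "multi_modal_data": {"image": image},
    },
    params,
)
print(outputs[0].outputs[0].text)
```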
Related papers
- FireRed-OCR Technical Report
We introduce FireRed-OCR, a framework that transforms general-purpose VLMs into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. We propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation.
arXiv Detail & Related papers (2026-03-02T13:19:23Z)
- A Generalizable Framework for Building Executable Domain-Specific LLMs under Data Scarcity: Demonstration on Semiconductor TCAD Simulation
We present a framework for building compact, executable domain-specific LLMs in low-resource settings. We demonstrate the framework by instantiating TcadGPT for semiconductor Technology Computer-Aided Design (TCAD). Using 1.5M synthetic QA pairs and an IR-driven DPO dataset, TcadGPT attains 85.6% semantic accuracy and an 80.0% syntax pass rate on SDE executability tests (see the DPO sketch below).
arXiv Detail & Related papers (2026-01-15T07:13:34Z)
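To make the DPO ingredient concrete, here is a hypothetical preference record and the standard DPO objective for a single pair. The field names and example content are illustrative assumptions, not the paper's actual schema.

```python
import math

# One preference record (hypothetical schema/content for illustration):
pair = {
    "prompt":   "Write an SDE command sequence defining a 10 nm oxide layer.",
    "chosen":   "...",  # e.g., a response that passed the executability check
    "rejected": "...",  # e.g., a response that failed to run in the simulator
}

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective for one pair: -log(sigmoid(beta * margin)),
    where the margin compares policy-vs-reference log-prob gaps."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```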
- Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. We propose a novel dual-process thinking framework dubbed R3, integrating LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner.
arXiv Detail & Related papers (2025-11-18T04:32:00Z)
- AI Progress Should Be Measured by Capability-Per-Resource, Not Scale Alone: A Framework for Gradient-Guided Resource Allocation in LLMs
We argue that AI development should be fundamentally reoriented toward capability-per-resource rather than capability alone. We present a theoretical framework demonstrating that resource-allocation decisions guided by gradient influence patterns can dramatically improve efficiency throughout the AI lifecycle (one minimal reading is sketched below).
arXiv Detail & Related papers (2025-11-02T20:59:51Z)
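One minimal reading of gradient-guided allocation, sketched in PyTorch: score parameters by gradient norm and split a tuning budget proportionally. This scoring rule is a plain illustration under assumed definitions, not the paper's actual framework.

```python
import torch

def allocate_budget(model: torch.nn.Module, loss: torch.Tensor,
                    total_budget: float) -> dict:
    """Split a resource budget across parameter groups in proportion to
    their gradient norms (illustrative influence proxy, not the paper's)."""
    loss.backward()
    scores = {name: p.grad.norm().item()
              for name, p in model.named_parameters() if p.grad is not None}
    total = sum(scores.values()) or 1.0  # avoid division by zero
    return {name: total_budget * s / total for name, s in scores.items()}
```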
- CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning
Composed Image Retrieval (CIR) aims to find a target image given a reference image and a modification text. CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z)
- SAIL-VL2 Technical Report
We introduce SAIL-VL2, an open-suite vision foundation model (LVM) for comprehensive multimodal understanding and reasoning. SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks.
arXiv Detail & Related papers (2025-09-17T14:34:02Z)
- VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. The framework formalizes ARLT (agentic reinforcement learning with tool use) as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. A modular plugin architecture enables rapid tool integration, requiring only lightweight Python definitions (see the sketch below).
arXiv Detail & Related papers (2025-09-01T01:45:18Z)
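To illustrate what a lightweight Python tool definition might look like, here is a generic registry-plus-decorator sketch. The names TOOL_REGISTRY and register_tool are hypothetical and are not VerlTool's actual API.

```python
# Hypothetical plugin shape (NOT VerlTool's actual API): a tool is just a
# small, named Python function that the rollout loop can dispatch to.
TOOL_REGISTRY = {}

def register_tool(name: str):
    def wrap(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return wrap

@register_tool("calculator")
def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result as text."""
    return str(eval(expression, {"__builtins__": {}}, {}))  # demo only

# During a rollout, the framework would route a model's tool call by name:
print(TOOL_REGISTRY["calculator"]("3 * (4 + 5)"))  # -> 27
```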
- DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model
Large vision-language models (LVLMs) are prone to hallucinations, generating words that do not exist in input images. We propose DianJin-OCR-R1, a reasoning-and-tool interleaved VLM trained on domain-specific datasets.
arXiv Detail & Related papers (2025-08-18T03:28:57Z)
- Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains. Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA). We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z)
- Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
Reason-RFT is a two-stage reinforcement fine-tuning framework for visual reasoning. First, supervised fine-tuning (SFT) with curated CoT data activates the reasoning potential of Vision-Language Models (VLMs). Second, reinforcement learning based on Group Relative Policy Optimization (GRPO) generates multiple reasoning-response pairs to enhance adaptability to domain shifts (the group-normalization step is sketched below).
arXiv Detail & Related papers (2025-03-26T17:38:06Z)
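GRPO's central step is simple to state: sample a group of responses per prompt, score them, and use the group-normalized rewards as advantages. A generic sketch of that computation follows; it reflects the standard GRPO formulation, not Reason-RFT's exact implementation.

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalize each reward by the mean and
    standard deviation of its own group (standard GRPO formulation)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored 0/1 for correctness:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct ones get ~+1, wrong ~-1
```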
- SOLO: A Single Transformer for Scalable Vision-Language Modeling
We present SOLO, a single transformer for visiOn-Language mOdeling. A unified single-Transformer architecture like SOLO effectively addresses scalability concerns in LVLMs. We introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM (a schematic of the unified input path is sketched below).
arXiv Detail & Related papers (2024-07-08T22:40:15Z)
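The design point behind a single-transformer LVLM is that image patches and text tokens enter one shared sequence. A schematic sketch of that input path, assuming a plain linear patch embedding and illustrative dimensions (not SOLO's exact recipe):

```python
import torch
import torch.nn as nn

class UnifiedInput(nn.Module):
    """Embed raw image patches and text ids into one token sequence for a
    single transformer (schematic; sizes are illustrative)."""
    def __init__(self, vocab: int = 32000, dim: int = 4096, patch: int = 32):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)
        self.patch_emb = nn.Linear(3 * patch * patch, dim)  # raw-pixel projection
        self.patch = patch

    def forward(self, pixels: torch.Tensor, token_ids: torch.Tensor):
        b, c, h, w = pixels.shape
        p = self.patch
        # (B, C, H, W) -> (B, num_patches, C*p*p)
        patches = pixels.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        # Image tokens and text tokens share one embedding space and sequence.
        return torch.cat([self.patch_emb(patches), self.text_emb(token_ids)], dim=1)
```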
- Donut: Document Understanding Transformer without OCR
We propose a novel visual document understanding (VDU) model that is end-to-end trainable without relying on an underlying OCR framework. Our approach achieves state-of-the-art performance on various document understanding tasks across public benchmark datasets and private industrial service datasets.
arXiv Detail & Related papers (2021-11-30T18:55:19Z)
This list was automatically generated from the titles and abstracts of the papers on this site.