Fugu-MT Paper Translation (Summary): MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

Paper summary: MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

arxiv url: http://arxiv.org/abs/2604.04771v1
Date: Mon, 06 Apr 2026 15:44:18 GMT
Status: Translation complete
System update date: 2026-04-07 15:49:19.255378
Title: MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
Title (reference translation): MinerU2.5-Pro: Pushing the Limits of Large-Scale Data-Centric Document Parsing
Authors: Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei, Jing Yu, Jie Yang, Wei Li, Shasha Wang, Qianqian Wu, Xuanhe Zhou, Weijia Li, Zhenxiang Li, Zhongying Tu, Jiang Wu, Lijun Wu, Chao Xu, Kai Chen, Wentao Zhang, Yu Qiao, Bowen Zhou, Dahua Lin, Conghui He
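Abstract summary: We present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization. MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods.
Reference score (attention score computed independently by the site): 92.09717763663873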
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy -- large-scale pre-training, hard sample fine-tuning, and GRPO alignment -- sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters.
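Abstract (reference translation): Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. However, SOTA models with different architectures and parameter scales show highly consistent failure patterns on the same set of hard samples, which suggests that the performance bottleneck stems from shared deficiencies in training data rather than from the architecture itself. We therefore present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture completely fixed. Diversity-and-Difficulty-Aware Sampling expands the training data from under 10M to 65.5M samples while correcting distribution shift. Cross-Model Consistency Verification uses output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations. A three-stage progressive training strategy (large-scale pre-training, hard sample fine-tuning, and GRPO alignment) sequentially exploits these data at different quality tiers. On the evaluation side, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including models with over 200× more parameters.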
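
The abstract describes Cross-Model Consistency Verification only at a high level: output agreement among heterogeneous models is used to assess sample difficulty and to generate reliable annotations. The following is a minimal sketch of that general idea, not the authors' implementation. The similarity metric (`difflib.SequenceMatcher`), the `easy_threshold` value, the model names, and the rule of taking the output that agrees most with the others as a pseudo-label are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def pairwise_agreement(a: str, b: str) -> float:
    """Similarity in [0, 1] between two parser outputs (stand-in metric, assumed)."""
    return SequenceMatcher(None, a, b).ratio()


def consistency_report(outputs: dict[str, str], easy_threshold: float = 0.9) -> dict:
    """Score one sample from the outputs of several heterogeneous parsers.

    `outputs` maps a hypothetical model name to the text it produced.
    `easy_threshold` is an illustrative cutoff, not a value from the paper.
    """
    if len(outputs) < 2:
        raise ValueError("need at least two model outputs")
    names = list(outputs)

    # Agreement score for every unordered pair of models.
    scores = {
        (m, n): pairwise_agreement(outputs[m], outputs[n])
        for m, n in combinations(names, 2)
    }
    consistency = mean(scores.values())

    def support(name: str) -> float:
        # Mean agreement of one model with all the other models.
        return mean(s for pair, s in scores.items() if name in pair)

    # The output that agrees most with the rest serves as the pseudo-label.
    consensus = max(names, key=support)
    return {
        "consistency": consistency,
        "difficulty": "easy" if consistency >= easy_threshold else "hard",
        "consensus_model": consensus,
        "pseudo_label": outputs[consensus],
    }


if __name__ == "__main__":
    sample = {
        "model_a": "x = 1",
        "model_b": "x = 1",
        "model_c": "totally different text here",
    }
    report = consistency_report(sample)
    print(report["difficulty"], report["consensus_model"])
```

In this sketch, low agreement flags a sample as hard, which is where a correction stage such as the abstract's Judge-and-Refine pipeline would presumably take over. High agreement lets the consensus output be kept as an annotation.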


Related papers