Fugu-MT Paper Translation (Abstract): A document is worth a structured record: Principled inductive bias design for document recognition

Paper overview: A document is worth a structured record: Principled inductive bias design for document recognition

arxiv url: http://arxiv.org/abs/2507.08458v1
Date: Fri, 11 Jul 2025 10:02:08 GMT
Status: Translation complete
Last updated in system: 2025-07-14 18:03:54.316353
Title: A document is worth a structured record: Principled inductive bias design for document recognition
Authors: Benjamin Meyer, Lukas Tuggener, Sascha Hänzi, Daniel Schmid, Erdal Ayfer, Benjamin F. Grewe, Ahmed Abdulkadir, Thilo Stadelmann
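Abstract summary: State-of-the-art approaches treat document recognition as a computer vision problem. We propose a novel perspective that frames document recognition as a transcription task from a document to a record. This implies a natural grouping of documents based on the intrinsic structure inherent in their transcription.
Reference score (independently computed attention measure): 3.4332178437507936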
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Many document types use intrinsic, convention-driven structures that serve to encode precise and structured information, such as the conventions governing engineering drawings. However, state-of-the-art approaches treat document recognition as a mere computer vision problem, neglecting these underlying document-type-specific structural properties, making them dependent on sub-optimal heuristic post-processing and rendering many less frequent or more complicated document types inaccessible to modern document recognition. We suggest a novel perspective that frames document recognition as a transcription task from a document to a record. This implies a natural grouping of documents based on the intrinsic structure inherent in their transcription, where related document types can be treated (and learned) similarly. We propose a method to design structure-specific inductive biases for the underlying machine-learned end-to-end document recognition systems, and a respective base transformer architecture that we successfully adapt to different structures. We demonstrate the effectiveness of the so-found inductive biases in extensive experiments with progressively complex record structures from monophonic sheet music, shape drawings, and simplified engineering drawings. By integrating an inductive bias for unrestricted graph structures, we train the first-ever successful end-to-end model to transcribe engineering drawings to their inherently interlinked information. Our approach is relevant to inform the design of document recognition systems for document types that are less well understood than standard OCR, OMR, etc., and serves as a guide to unify the design of future document foundation models.
