Fugu-MT Paper Translation (Abstract): SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia

Paper overview: SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia

arxiv url: http://arxiv.org/abs/2603.15409v1
Date: Mon, 16 Mar 2026 15:21:12 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:58.534778
Title: SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia
Title (reference translation): SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia
Authors: Pengfei Yue, Xingran Zhao, Juntao Chen, Peng Hou, Wang Longchao, Jianghang Lin, Shengchuan Zhang, Anxiang Zeng, Liujuan Cao
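Abstract summary: We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding.
Reference score (independently computed attention score): 40.4434142867308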
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.
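Abstract (reference translation): Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.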
