Fugu-MT Paper Translation (Abstract): INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

Paper summary: INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

arxiv url: http://arxiv.org/abs/2604.11970v1
Date: Mon, 13 Apr 2026 19:03:10 GMT
Status: Translation complete
System update date: 2026-04-15 19:11:32.075187
Title: INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents
Title (reference translation): INDOTABVQA: A benchmark for cross-lingual table understanding in Bahasa Indonesia documents
Authors: Somraj Gautam, Anathapindika Dravichi, Gaurav Harit
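Abstract summary: INDOTABVQA is a benchmark for evaluating cross-lingual table visual question answering (VQA) on real-world document images in Bahasa Indonesia. The dataset consists of 1,593 document images in three visual styles and 1,593 question-answer sets in four languages. Fine-tuning a compact 3B model and a LoRA-finetuned 7B model on the dataset improves accuracy by 11.6% and 17.8%, respectively.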
Reference score (independently computed attention score): 1.9881456274482427
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful), each containing one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual settings (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B model and a LoRA-finetuned 7B model on our dataset yields 11.6% and 17.8% improvements in accuracy, respectively. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. The full dataset is available on Hugging Face at: https://huggingface.co/datasets/NusaBharat/INDOTABVQA
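Abstract (reference translation): This paper introduces INDOTABVQA, a benchmark for evaluating cross-lingual table visual question answering (VQA) on real-world document images in Bahasa Indonesia. The dataset consists of 1,593 question-answer sets in four languages (Bahasa Indonesia, English, Hindi, and Arabic) and 1,593 document images containing one or more tables across three visual styles (bordered, borderless, and colorful). This allows vision-language models (VLMs) to be evaluated in both monolingual settings (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and find substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B model and a LoRA-finetuned 7B model on our dataset improves accuracy by 11.6% and 17.8%, respectively. Providing explicit table region coordinates as additional input improves performance by a further 4-7%, showing the value of spatial priors for table-based reasoning. The study highlights the importance of language-diverse, domain-specific datasets and shows that targeted fine-tuning can substantially improve VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research on cross-lingual, structure-aware document understanding. https://huggingface.co/datasets/NusaBharat/INDOTABVQA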
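
The monolingual versus cross-lingual split described in the abstract can be illustrated with a minimal sketch. It assumes exact-match accuracy as the metric (the abstract does not say which metric the authors use), and the record keys (`question_lang`, `prediction`, `reference`) and language codes are hypothetical, not the dataset's actual schema. The toy records are made up for illustration and are not results from the paper.

```python
from collections import defaultdict

# Assumption: documents are always in Bahasa Indonesia, encoded here as "id".
DOCUMENT_LANGUAGE = "id"


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def accuracy_by_setting(records):
    """Return exact-match accuracy per question language and per setting.

    Each record is a dict with the hypothetical keys
    'question_lang', 'prediction', and 'reference'.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        correct = normalize(rec["prediction"]) == normalize(rec["reference"])
        # A question in the document's language is monolingual; anything else is cross-lingual.
        if rec["question_lang"] == DOCUMENT_LANGUAGE:
            setting = "monolingual"
        else:
            setting = "cross-lingual"
        for key in (rec["question_lang"], setting):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


# Toy records, invented for illustration only.
toy_records = [
    {"question_lang": "id", "prediction": "Jakarta", "reference": "jakarta"},
    {"question_lang": "en", "prediction": "12", "reference": "12"},
    {"question_lang": "hi", "prediction": "2019", "reference": "2020"},
    {"question_lang": "ar", "prediction": "5", "reference": "5"},
]
scores = accuracy_by_setting(toy_records)
```

Reporting both the per-language and the per-setting numbers makes it easy to see whether a gap comes from one question language or from the cross-lingual setting as a whole.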

Related papers