Fugu-MT paper translation (abstract): TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

Paper overview: TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

arxiv url: http://arxiv.org/abs/2604.03660v1
Date: Sat, 04 Apr 2026 09:26:09 GMT
Status: translation complete
Last updated in system: 2026-04-07 15:49:18.706005
Title: TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
Authors: Xiaoyu Chen, Lu Dai, Hanqing Wang, Zhuoyu Li, Wenbin Dai, Yanzong Zheng, Zhenggang Xia, Junyong Lin, Hui Xiong
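Abstract summary: As task complexity scales, the number of involved discrete visual regions increases disproportionately. This processing density leads to an internal "Perceptual Overload." This work introduces TableVision, a trajectory-aware benchmark for spatially grounded reasoning.
Reference score (independently computed attention level): 13.218805579902048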
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Structured tables are essential for conveying high-density information in professional domains such as finance, healthcare, and scientific research. Despite the progress in Multimodal Large Language Models (MLLMs), reasoning performance remains limited for complex tables with hierarchical layouts. In this paper, we identify a critical Perception Bottleneck through quantitative analysis. We find that as task complexity scales, the number of involved discrete visual regions increases disproportionately. This processing density leads to an internal "Perceptual Overload," where MLLMs struggle to maintain accurate spatial attention during implicit generation. To address this bottleneck, we introduce TableVision, a large-scale, trajectory-aware benchmark designed for spatially grounded reasoning. TableVision stratifies tabular tasks into three cognitive levels (Perception, Reasoning, and Analysis) across 13 sub-categories. By utilizing a rendering-based deterministic grounding pipeline, the dataset explicitly couples multi-step logical deductions with pixel-perfect spatial ground truths, comprising 6,799 high-fidelity reasoning trajectories. Our empirical results, supported by diagnostic probing, demonstrate that explicit spatial constraints significantly recover the reasoning potential of MLLMs. Furthermore, our two-stage decoupled framework achieves a robust 12.3% overall accuracy improvement on the test set. TableVision provides a rigorous testbed and a fresh perspective on the synergy between perception and logic in document understanding.

Related papers

From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs [65.04549036809557]
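We introduce a benchmark built from pedestrian-perspective videos captured with stereo cameras, LiDAR, and IMU/GPS sensors. The dataset provides metrically accurate 3D information and enables automatic generation of spatial reasoning questions. Evaluation reveals that the performance gains observed on structured indoor benchmarks vanish in open-world settings.
Paper reference translation (metadata) (2025-12-22T18:58:12Z)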
Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective [17.592210658831902]
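Spatial reasoning is a core aspect of human intelligence, enabling perception, inference, and planning in 3D environments. Current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. This paper introduces ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align, and maintain spatial mental models across complementary viewpoints.
Paper reference translation (metadata) (2025-12-02T02:21:29Z)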
LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
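LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores into directly observable visual outputs. These visual outputs enable powerful diagnostic analysis and offer a potential approach for investigating model similarity.
Paper reference translation (metadata) (2025-11-04T08:11:23Z)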
Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture [16.15618237704827]
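We systematically analyze spatial understanding from both data and architecture perspectives. From the data perspective, spatial understanding performance converges quickly as training data increases. From the architecture perspective, we find that spatial understanding relies more on the positional encoder within the vision encoder than on the language model.
Paper reference translation (metadata) (2025-09-02T14:22:43Z)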
Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation [50.81551581148339]
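This paper introduces Relevant Reasoning (R$2$S), a reasoning-based segmentation framework. It also introduces 3D ReasonSeg, a reasoning-based segmentation dataset. Experiments show that both R$2$S and 3D ReasonSeg effectively give 3D point cloud perception stronger spatial reasoning capabilities.
Paper reference translation (metadata) (2025-06-29T06:58:08Z)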
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
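Vision-language models (VLMs) have shown remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for evaluating multi-viewpoint spatial localization.
Paper reference translation (metadata) (2025-05-27T17:59:26Z)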
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
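Multimodal large language models (MLLMs) have achieved remarkable success in question-answering tasks, but their spatial understanding capabilities remain poor. Do existing MLLMs possess 3D spatial perception and understanding abilities?
Paper reference translation (metadata) (2025-05-22T17:59:03Z)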
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models [13.768090541138571]
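Vision-language models (VLMs) excel at identifying and describing objects but often fail at spatial reasoning. Vision token embeddings have much larger norms than text tokens. Tools are used to reveal that vision tokens and the system attract attention.
Paper reference translation (metadata) (2025-03-21T17:51:14Z)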

The related papers list is generated automatically from the titles and abstracts of papers on this site.
