Fugu-MT Paper Translation (Abstract): MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Paper abstract: MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

arxiv url: http://arxiv.org/abs/2310.02255v1
Date: Tue, 3 Oct 2023 17:57:24 GMT
Status: Translation complete
System update date: 2023-10-04 12:47:49.936072
Title: MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Authors: Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao
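Abstract summary: Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive skills in various domains, but their ability for mathematical reasoning within visual contexts has not been formally examined. MathVista is a benchmark designed to amalgamate challenges from diverse mathematical and visual tasks. MathVista fuels future research in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks.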
Reference score (independently computed attention score): 170.01089233942594
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive skills in various domains, their ability for mathematical reasoning within visual contexts has not been formally examined. Equipping LLMs and LMMs with this capability is vital for general-purpose AI assistants and showcases promising potential in education, data analysis, and scientific discovery. To bridge this gap, we present MathVista, a benchmark designed to amalgamate challenges from diverse mathematical and visual tasks. We first taxonomize the key task types, reasoning skills, and visual contexts from the literature to guide our selection from 28 existing math-focused and visual question answering datasets. Then, we construct three new datasets, IQTest, FunctionQA, and PaperQA, to accommodate for missing types of visual contexts. The problems featured often require deep visual understanding beyond OCR or image captioning, and compositional reasoning with rich domain-specific tools, thus posing a notable challenge to existing models. We conduct a comprehensive evaluation of 11 prominent open-source and proprietary foundation models (LLMs, LLMs augmented with tools, and LMMs), and early experiments with GPT-4V. The best-performing model, Multimodal Bard, achieves only 58% of human performance (34.8% vs 60.3%), indicating ample room for further improvement. Given this significant gap, MathVista fuels future research in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. Preliminary tests show that MathVista also presents challenges to GPT-4V, underscoring the benchmark's importance. The project is available at https://mathvista.github.io/.
