Fugu-MT Paper Translation (Summary): InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

Paper summary: InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

arxiv url: http://arxiv.org/abs/2311.11567v3
Date: Mon, 4 Dec 2023 20:55:53 GMT
Status: Translation complete
Last updated in system: 2023-12-06 18:58:54.512672
Title: InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models
Title (reference translation): InfiMM-Eval: Complex Open-Ended Reasoning Evaluation for Multi-Modal Large Language Models
Authors: Xiaotian Han, Quanzeng You, Yongfei Liu, Wentao Chen, Huangjie Zheng, Khalil Mrini, Xudong Lin, Yiqi Wang, Bohan Zhai, Jianbo Yuan, Heng Wang, Hongxia Yang
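Abstract summary: Multi-modal Large Language Models (MLLMs) are attracting attention in the field of artificial intelligence. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
Reference score (independently computed attention score): 50.03163753638256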
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. These models not only excel in traditional vision-language tasks but also demonstrate impressive performance in contemporary multi-modal benchmarks. Although many of these benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding only simple yes/no or multi-choice responses. These methods naturally lead to confusion and difficulties in conclusively determining the reasoning capabilities of MLLMs. To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. The queries in our dataset are intentionally constructed to engage the reasoning capabilities of MLLMs in the process of generating answers. For a fair comparison across various MLLMs, we incorporate intermediate reasoning steps into our evaluation criteria. In instances where an MLLM is unable to produce a definitive answer, its reasoning ability is evaluated by requesting intermediate reasoning steps. If these steps align with our manual annotations, appropriate scores are assigned. This evaluation scheme resembles methods commonly used in human assessments, such as exams or assignments, and represents what we consider a more effective assessment technique compared with existing benchmarks. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark, designed to challenge and accurately measure their reasoning capabilities. The code and data will be released at https://infimm.github.io/InfiMM-Eval/
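Abstract (reference translation): Multi-modal Large Language Models (MLLMs) are attracting growing attention in the field of artificial intelligence. These models not only excel at traditional vision-language tasks but also show remarkable performance on contemporary multi-modal benchmarks. Many of these benchmarks attempt a holistic evaluation of MLLMs, but they generally concentrate on basic reasoning tasks and elicit only simple yes/no or multi-choice responses. Such methods naturally cause confusion and difficulty in determining the reasoning capabilities of MLLMs. To mitigate this problem, we manually curate a benchmark dataset designed for MLLMs, focusing on complex reasoning tasks. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. The queries in our dataset are intentionally constructed to engage the reasoning capabilities of MLLMs. To compare MLLMs fairly, we incorporate intermediate reasoning steps into the evaluation criteria. When an MLLM cannot produce a definitive answer, its reasoning ability is evaluated by requesting intermediate reasoning steps. If these steps match the manual annotations, appropriate scores are assigned. This evaluation scheme resembles methods commonly used in human assessment, such as exams and assignments, and represents what we consider a more effective assessment technique than existing benchmarks. We evaluate a selection of representative MLLMs using this rigorously developed open-ended, multi-step reasoning benchmark, which is designed to challenge and accurately measure their reasoning capabilities. The code and data will be released at https://infimm.github.io/InfiMM-Eval/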
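
The abstract says that intermediate reasoning steps earn credit when a model cannot give a definitive answer, but it does not state the scoring rule. The Python sketch below is one possible reading, not the authors' implementation: the `Item` schema, the `score_response` function, the `partial_cap` weight of 0.5, and the exact-string matching of steps are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """One benchmark question with manual annotations (hypothetical schema)."""
    answer: str
    reasoning_steps: List[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def score_response(item: Item, final_answer: Optional[str],
                   model_steps: List[str], partial_cap: float = 0.5) -> float:
    """Return a score in [0, 1].

    Assumed rule: full credit for a correct final answer; otherwise partial
    credit proportional to the share of annotated steps the model reproduced,
    scaled by partial_cap (an illustrative weight, not a value from the paper).
    """
    if final_answer is not None and normalize(final_answer) == normalize(item.answer):
        return 1.0
    if not item.reasoning_steps:
        return 0.0
    produced = {normalize(step) for step in model_steps}
    matched = sum(1 for step in item.reasoning_steps if normalize(step) in produced)
    return partial_cap * matched / len(item.reasoning_steps)


item = Item(answer="the glass will fall",
            reasoning_steps=["the table is tilted", "the glass is near the edge"])
full = score_response(item, "The glass will fall", [])
partial = score_response(item, None, ["The table is tilted"])
```

In practice, matching free-form reasoning steps against annotations would need a softer comparison than exact string equality (for example a human or model judge); the exact match here only keeps the sketch self-contained.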

Related papers

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
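One of the most widely used tasks for evaluating large language models (LLMs) is Multiple-Choice Question Answering (MCQA). This work sheds light on inconsistencies in MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
Paper reference translation (metadata) (2025-03-19T08:45:03Z)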
CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models [5.409370027524351]
This study evaluates the performance of large language models (LLMs). We introduce CounterBench, a new benchmark dataset.
Paper reference translation (metadata) (2025-02-16T06:19:37Z)
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
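EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. It covers a broad range of existing embodied AI tasks with significantly enhanced diversity. We evaluate state-of-the-art MLLMs on EmbodiedEval and find that they fall significantly short of human level on embodied tasks.
Paper reference translation (metadata) (2025-01-21T03:22:10Z)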
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
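Large language models (LLMs) have shown improving capabilities in problem solving and decision making. We present MR-Ben, a process-based benchmark that requires meta-reasoning skills. The meta-reasoning paradigm is particularly suited to System-2 slow thinking.
Paper reference translation (metadata) (2024-06-20T03:50:23Z)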
Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning [68.83624133567213]
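We show that the most prevalent MLLMs can be easily fooled by introducing a presupposition into the question. We also propose Active Deduction (AD), a simple yet effective method that encourages the model to actively perform composite reasoning.
Paper reference translation (metadata) (2024-04-19T15:53:27Z)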
NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models [34.91372939329467]
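We introduce NPHardEval4V, a benchmark for evaluating the pure reasoning abilities of MLLMs. We observe significant differences in reasoning ability across models. We also examine how vision, text, and combined vision-and-text inputs affect the reasoning abilities of MLLMs.
Paper reference translation (metadata) (2024-03-04T07:10:31Z)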
Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark [39.64489055580211]
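We introduce CofCA (Step-wise Counterfactual benchmark), a new evaluation benchmark consisting of factual and counterfactual data. Experiments reveal a large performance gap between Wikipedia-based factual data and counterfactual data, suggesting data contamination issues in existing benchmarks.
Paper reference translation (metadata) (2024-02-19T08:12:30Z)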
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
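This paper introduces a new task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly modify a given argumentative text to maintain a predetermined logical relationship. We propose the Self-Evaluation Score (SES), an innovative evaluation metric that directly evaluates the natural language output of LLMs.
Paper reference translation (metadata) (2023-11-29T08:29:54Z)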
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
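Multimodal large language models (MLLMs) are broadening the scope of AI applications. Existing automatic evaluation methods for MLLMs are mostly limited to evaluating queries without considering user experience. We propose a new evaluation paradigm for MLLMs in which an MLLM serves as the judge.
Paper reference translation (metadata) (2023-11-23T12:04:25Z)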
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
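Multimodal large language models (MLLMs) rely on powerful LLMs to perform multimodal tasks. This paper presents MME, an MLLM evaluation benchmark. It measures both perception and cognition abilities across a total of 14 subtasks.
Paper reference translation (metadata) (2023-06-23T09:22:36Z)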

The related-papers list is generated automatically from the titles and abstracts of papers on this site.
