Fugu-MT Paper Translation (Abstract): The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?

Paper abstract: The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?

arxiv url: http://arxiv.org/abs/2508.21143v2
Date: Wed, 08 Oct 2025 07:49:55 GMT
Status: Translation complete
Last updated in system: 2025-10-09 14:21:18.110968
Title: The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?
Title (reference translation): The Percept-V Challenge: Can Multimodal LLMs Handle Simple Perception Problems?
Authors: Samrajnee Ghosh, Naman Agarwal, Hemanshu Garg, Chinmay Mittal, Mausam, Parag Singla
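Abstract summary: Percept-V is a dataset of 6000 program-generated, uncontaminated images divided into 30 domains. The focus is on perception, so the domains are kept quite simple and the reasoning and knowledge required to solve them are minimal. Contrary to the authors' expectation, the experiments show weak performance of SoTA proprietary and open-source MLLMs compared to very high human performance on Percept-V.
Reference score (independently computed attention score): 23.22049250636057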
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cognitive science research treats visual perception, the ability to understand and make sense of a visual input, as one of the early developmental signs of intelligence. Its TVPS-4 framework categorizes human perception into seven skills, such as visual discrimination and form constancy, and tests each of them. Do Multimodal Large Language Models (MLLMs) match up to humans in basic perception? Even though there are many benchmarks that evaluate MLLMs on advanced reasoning and knowledge skills, there is limited research that focuses evaluation on simple perception. In response, we introduce Percept-V, a dataset containing 6000 program-generated uncontaminated images divided into 30 domains, where each domain tests one or more TVPS-4 skills. Our focus is on perception, so we make our domains quite simple, and the reasoning and knowledge required for solving them are minimal. Since modern-day MLLMs can solve much more complex tasks, our a priori expectation is that they will solve these domains very easily. Contrary to our belief, our experiments show a weak performance of SoTA proprietary and open-source MLLMs compared to very high human performance on Percept-V. We find that as the number of objects in the image increases, performance goes down rather fast. Our experiments also identify the perception skills that are considerably harder for all models.
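Abstract (reference translation): Cognitive science research treats visual perception, the ability to understand and make sense of visual input, as one of the early developmental signs of intelligence. Its TVPS-4 framework categorizes and tests human perception across seven skills, such as visual discrimination and form constancy. Do Multimodal Large Language Models (MLLMs) match humans in basic perception? Although many benchmarks evaluate MLLMs on advanced reasoning and knowledge skills, research focused on simple perception is limited. In response, this work introduces Percept-V, a dataset containing 6000 program-generated uncontaminated images divided into 30 domains, where each domain tests one or more TVPS-4 skills. The focus is on perception, so the domains are kept quite simple and the reasoning and knowledge required to solve them are minimal. Since modern MLLMs can solve much more complex tasks, the expectation was that they would solve these domains very easily. Contrary to that belief, the experiments show weak performance of SoTA proprietary and open-source MLLMs compared to very high human performance on Percept-V. As the number of objects in the image increases, performance drops quite fast. The experiments also identify the perception skills that are considerably harder for all models.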
