Fugu-MT paper translation (summary): MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

Paper summary: MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.03485v1
Date: Tue, 05 May 2026 08:20:48 GMT
Status: Translation complete
System update date: 2026-05-06 19:35:43.834651
Title: MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models
Title (reference translation): MHPR: A Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models
Authors: Kangkang Wang, Qinting Jiang, Wanping Zhang, Bowen Ren, Shengzhao Wen,
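Abstract summary: We introduce MHPR, a benchmark for joint perception-reasoning over human-centric scenes. MHPR comprises a multi-level data design: Captioned Raw Data (C-RD), Supervised Fine-Tuning Data (SFT-D), Reinforcement Learning Data (RL-D), and Test Data (T-D). State-of-the-art vision-language models are evaluated on fine-grained attributes and high-level semantics.
Reference score (independently calculated attention score): 2.1348189297234685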
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans, yet current LVLM benchmarks largely focus on single-task settings and lack fine-grained, human-centric evaluation. In this work, we introduce MHPR, a comprehensive benchmark for joint perception-reasoning over human-centric scenes spanning individual, multi-person, and human-object interaction dimensions. MHPR comprises a multi-level data design (Captioned Raw Data (C-RD), Supervised Fine-Tuning Data (SFT-D), Reinforcement Learning Data (RL-D), and Test Data (T-D)) together with an automated caption/VQA generation pipeline (ACVG) that performs category-wise attribute decomposition, attribute-specific rewriting, and multi-model voting to ensure high-quality, scalable annotations. We evaluate state-of-the-art vision-language models on fine-grained attributes (appearance, clothing, pose, parts) and high-level semantics (social relations, action semantics, spatial relations, intent and functionality). Our findings show that: 1) format-aligned SFT data substantially improves instruction following and stability; 2) challenge-focused RL data derived from bad-case analysis further enhances perception and reasoning on difficult instances; and 3) training Qwen2.5-VL-7B with MHPR yields significant gains, achieving near-parity with considerably larger models. We release ACVG and MHPR to facilitate reproducible, extensible research on human-centric perception and reasoning.
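Abstract (reference translation): Multidimensional understanding of humans is essential for real-world applications such as film analysis and virtual digital humans, but current LVLM benchmarks focus mainly on single-task settings and lack fine-grained, human-centric evaluation. This work introduces MHPR, a comprehensive benchmark for joint perception-reasoning over human-centric scenes that spans the individual, multi-person, and human-object interaction dimensions. MHPR consists of a multi-level data design (Captioned Raw Data (C-RD), Supervised Fine-Tuning Data (SFT-D), Reinforcement Learning Data (RL-D), and Test Data (T-D)) together with an automated caption/VQA generation pipeline (ACVG) that performs category-wise attribute decomposition, attribute-specific rewriting, and multi-model voting to ensure high-quality, scalable annotations. State-of-the-art vision-language models are evaluated on fine-grained attributes (appearance, clothing, pose, parts) and high-level semantics (social relations, action semantics, spatial relations, intent and functionality). The findings are: 1) format-aligned SFT data substantially improves instruction following and stability; 2) challenge-focused RL data derived from bad-case analysis further strengthens perception and reasoning on difficult instances; 3) training Qwen2.5-VL-7B with MHPR yields significant gains, reaching near-parity with considerably larger models. ACVG and MHPR are released to facilitate reproducible, extensible research on human-centric perception and reasoning.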
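The abstract names multi-model voting as one ACVG step but does not describe how it is done. As a rough illustration of the general idea only, the sketch below applies a simple majority vote to answers from several models. The function name `vote`, the `min_agreement` threshold, and the lowercase/strip normalization are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def vote(answers: Sequence[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the majority answer, or None when agreement is too weak.

    Hypothetical helper: the paper's actual voting rule is not specified
    in the abstract, so this is plain majority voting.
    """
    if not answers:
        return None
    # Normalize so that trivially different strings count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    # Accept only when strictly more than `min_agreement` of the models agree.
    return top if count / len(normalized) > min_agreement else None
```

Under this scheme, an annotation on which the models split evenly is returned as `None`, so it could be dropped or sent for review rather than kept.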
