Fugu-MT Paper Translation (Summary): FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

Paper summary: FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

arxiv url: http://arxiv.org/abs/2605.19846v3
Date: Sat, 23 May 2026 07:31:13 GMT
Status: Translation complete
System update date: 2026-05-26 16:32:37.758594
Title: FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
Title (reference translation): FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
Authors: Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu
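Abstract summary: Vision-Language Models (VLMs) show remarkable capabilities in general video understanding, but they often struggle with the fine-grained understanding that real-world applications depend on. The authors introduce FineBench, a benchmark specifically designed to evaluate fine-grained understanding.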
Reference score (independently calculated attention score): 30.42523020030251
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.
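Abstract (reference translation): Vision-Language Models (VLMs) have shown remarkable capabilities in general video understanding, but they often struggle with the fine-grained understanding that matters for real-world applications requiring nuanced interpretation of human actions and interactions. Recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, but they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark. FineBench densely annotates 199,420 multiple-choice QA pairs across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. While proprietary models like GPT-5 achieve respectable performance, current open-source VLMs struggle, particularly with spatial reasoning in multi-person scenes and with distinguishing subtle differences in human movements and interactions. To address these weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhancing such reasoning in current VLMs. The project page and code are at https://joslefaure.github.io/assets/html/finebench.html.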
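
To give a rough sense of what "densely annotated" means here, the figures quoted in the abstract (199,420 QA pairs, 64 videos, 15 minutes each) can be turned into average annotation densities. The sketch below is only back-of-the-envelope arithmetic: it assumes every video is exactly 15 minutes long and that QA pairs are spread evenly across videos and time, neither of which the abstract states. The function name `qa_density` is an illustrative choice, not something from the FineBench code.

```python
# Figures taken from the abstract; the uniform-spread assumption is ours.
TOTAL_QA_PAIRS = 199_420
NUM_VIDEOS = 64
MINUTES_PER_VIDEO = 15  # assumed to be exact for every video


def qa_density(total_pairs: int, num_videos: int, minutes_per_video: float):
    """Return average QA pairs per video, per minute, and per second.

    Assumes QA pairs are distributed uniformly over videos and over time.
    """
    per_video = total_pairs / num_videos
    per_minute = per_video / minutes_per_video
    per_second = per_minute / 60
    return per_video, per_minute, per_second


per_video, per_minute, per_second = qa_density(
    TOTAL_QA_PAIRS, NUM_VIDEOS, MINUTES_PER_VIDEO
)
total_hours = NUM_VIDEOS * MINUTES_PER_VIDEO / 60

print(f"Total footage: {total_hours:.0f} hours")
print(f"QA pairs per video:  {per_video:.1f}")
print(f"QA pairs per minute: {per_minute:.1f}")
print(f"QA pairs per second: {per_second:.2f}")
```

Under these assumptions the benchmark covers 16 hours of footage, with roughly 3,116 QA pairs per video, or about 3.5 per second of video.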

Related papers