Fugu-MT paper translation (abstract): VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation

Paper overview: VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation

arxiv url: http://arxiv.org/abs/2603.26015v1
Date: Fri, 27 Mar 2026 02:16:22 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.327993
Title: VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation
Title (reference translation): VLAgeBench: A Benchmark of Large Vision-Language Models for Zero-Shot Human Age Estimation
Authors: Rakib Hossain Sajib, Md Kishor Morol, Rajan Das Gupta, Mohammad Sakib Mahmood, Shuvra Smaran Das
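Abstract summary: This study presents a comprehensive zero-shot evaluation of large vision-language models (LVLMs) for facial age estimation. It shows that general-purpose LVLMs can deliver competitive performance in zero-shot settings. The work positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction.
Reference score (independently computed attention score): 0.19573380763700718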
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human age estimation from facial images represents a challenging computer vision task with significant applications in biometrics, healthcare, and human-computer interaction. While traditional deep learning approaches require extensive labeled datasets and domain-specific training, recent advances in large vision-language models (LVLMs) offer the potential for zero-shot age estimation. This study presents a comprehensive zero-shot evaluation of state-of-the-art Large Vision-Language Models (LVLMs) for facial age estimation, a task traditionally dominated by domain-specific convolutional networks and supervised learning. We assess the performance of GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on two benchmark datasets, UTKFace and FG-NET, without any fine-tuning or task-specific adaptation. Using eight evaluation metrics, including MAE, MSE, RMSE, MAPE, MBE, $R^2$, CCC, and $\pm$5-year accuracy, we demonstrate that general-purpose LVLMs can deliver competitive performance in zero-shot settings. Our findings highlight the emergent capabilities of LVLMs for accurate biometric age estimation and position these models as promising tools for real-world applications. Additionally, we highlight performance disparities linked to image quality and demographic subgroups, underscoring the need for fairness-aware multimodal inference. This work introduces a reproducible benchmark and positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction. The benchmark focuses on strict zero-shot inference without fine-tuning and highlights remaining challenges related to prompt sensitivity, interpretability, computational cost, and demographic fairness.
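Abstract (reference translation): Estimating human age from facial images is a challenging computer vision task with important applications in biometrics, healthcare, and human-computer interaction. Traditional deep learning approaches require labeled datasets and domain-specific training, whereas recent advances in large vision-language models (LVLMs) open up the possibility of zero-shot age estimation. This study presents a comprehensive zero-shot evaluation of state-of-the-art LVLMs for facial age estimation, a task traditionally dominated by domain-specific convolutional networks and supervised learning. We evaluate GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on two benchmark datasets (UTKFace and FG-NET) without fine-tuning or task-specific adaptation. Using eight evaluation metrics (MAE, MSE, RMSE, MAPE, MBE, $R^2$, CCC, and $\pm$5-year accuracy), we show that general-purpose LVLMs can deliver competitive performance in zero-shot settings. These results show that LVLMs are effective for biometric age estimation and position the models as promising tools for real-world applications. We also highlight performance disparities linked to image quality and demographic subgroups, underscoring the need for fairness-aware multimodal inference. This work introduces a reproducible benchmark and positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction. The benchmark focuses on strict zero-shot inference without fine-tuning and highlights remaining challenges related to prompt sensitivity, interpretability, computational cost, and demographic fairness.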
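
The abstract names eight evaluation metrics without giving their formulas on this page. The sketch below computes them from paired true and predicted ages using common textbook definitions. It is not the authors' evaluation code: the function name `age_metrics`, the sign convention for MBE (predicted minus true), the population-variance form of CCC, and the choice to skip age-0 samples in MAPE are all assumptions made for illustration.

```python
import math


def age_metrics(y_true, y_pred, tolerance=5.0):
    """Compute eight regression metrics for age estimation (hypothetical helper).

    Assumptions: errors are predicted minus true, variances are population
    variances, and samples with a true age of 0 are skipped in MAPE.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    errors = [p - t for t, p in zip(y_true, y_pred)]
    nan = float("nan")

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mbe = sum(errors) / n  # positive means ages are overestimated on average

    # MAPE is undefined when the true age is 0, so those samples are skipped.
    pct = [abs(e) / abs(t) for t, e in zip(y_true, errors) if t != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else nan

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n

    # Coefficient of determination: 1 - SS_res / SS_tot.
    r2 = 1.0 - mse / var_t if var_t > 0 else nan

    # Lin's concordance correlation coefficient.
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    ccc = 2.0 * cov / denom if denom > 0 else nan

    # Share of predictions within +/- tolerance years of the true age.
    within = 100.0 * sum(1 for e in errors if abs(e) <= tolerance) / n

    return {
        "mae": mae, "mse": mse, "rmse": rmse, "mape": mape, "mbe": mbe,
        "r2": r2, "ccc": ccc, "acc_within_tol": within,
    }


if __name__ == "__main__":
    # Made-up ages for demonstration only; not data from the paper.
    for name, value in age_metrics([10, 20, 30, 40], [12, 18, 33, 50]).items():
        print(f"{name}: {value:.4f}")
```

MBE and CCC complement MAE here: MBE exposes systematic over- or underestimation that MAE hides, and CCC penalizes both poor correlation and a shift in mean or scale between predicted and true ages.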

Related papers