Fugu-MT Paper Translation (Abstract): Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

Paper overview: Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

arxiv url: http://arxiv.org/abs/2504.09480v1
Date: Sun, 13 Apr 2025 08:28:13 GMT
Status: Translation complete
Last updated in system: 2025-04-23 07:00:55.085345
Title: Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation
Authors: Yongchao Feng, Yajie Liu, Shuai Yang, Wenrui Cai, Jinqing Zhang, Qiqi Zhan, Ziyue Huang, Hongxi Yan, Qiao Wan, Chenguang Liu, Junzhe Wang, Jiahui Lv, Ziqi Liu, Tengyuan Shi, Qingjie Liu, Yunhong Wang
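Abstract summary: Vision-Language Models (VLMs) have been widely adopted for Open-Vocabulary (OV) object detection and segmentation tasks. Although they have shown promise on OV-related tasks, their effectiveness on conventional vision tasks has not yet been evaluated.
Reference score (independently computed attention score): 38.20492321295552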
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Although they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated. In this work, we present a systematic review of VLM-based detection and segmentation, view VLMs as foundational models, and conduct comprehensive evaluations across multiple downstream tasks for the first time: 1) The evaluation spans eight detection scenarios (closed-set detection, domain adaptation, crowded objects, etc.) and eight segmentation scenarios (few-shot, open-world, small object, etc.), revealing distinct performance advantages and limitations of various VLM architectures across tasks. 2) For detection tasks, we evaluate VLMs under three fine-tuning granularities: zero prediction, visual fine-tuning, and text prompt, and further analyze how different fine-tuning strategies impact performance across varied tasks. 3) Based on empirical findings, we provide in-depth analysis of the correlations between task characteristics, model architectures, and training methodologies, offering insights for future VLM design. 4) We believe that this work will be valuable to pattern recognition experts working in the fields of computer vision, multimodal learning, and vision foundation models by introducing them to the problem and familiarizing them with the current state of progress, while providing promising directions for future research. A project associated with this review and evaluation has been created at https://github.com/better-chao/perceptual_abilities_evaluation.

Related papers