Fugu-MT Paper Translation (Abstract): What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation

Paper overview: What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation

arxiv url: http://arxiv.org/abs/2505.19569v1
Date: Mon, 26 May 2025 06:33:48 GMT
Status: Translation complete
Last updated in system: 2025-05-27 16:58:43.217482
Title: What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation
Authors: Jianghang Lin, Yue Hu, Jiangtao Shen, Yunhang Shen, Liujuan Cao, Shengchuan Zhang, Rongrong Ji
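Abstract summary: Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time. Existing paradigms typically perform class-agnostic region segmentation followed by category matching, which leads to poor alignment between region segmentation and target concepts. The authors propose a novel cognition-inspired framework for open vocabulary image segmentation that emulates the human visual recognition process.
Reference score (independently computed attention score): 65.80512502962071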
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system's process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and target concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic guidance for region segmentation. (2) A Concept-Aware Visual Enhancer Module that fuses textual concept features with global visual representations, enabling adaptive visual perception based on target concepts. (3) A Cognition-Inspired Decoder that integrates local instance features with G-VLM-provided semantic cues, allowing selective classification over a subset of relevant categories. Extensive experiments demonstrate that our framework achieves significant improvements, reaching $27.2$ PQ, $17.0$ mAP, and $35.3$ mIoU on A-150. It further attains $56.2$, $28.2$, $15.4$, $59.2$, $18.7$, and $95.8$ mIoU on Cityscapes, Mapillary Vistas, A-847, PC-59, PC-459, and PAS-20, respectively. In addition, our framework supports vocabulary-free segmentation, offering enhanced flexibility in recognizing unseen categories. Code will be public.
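The abstract's third component, "selective classification over a subset of relevant categories", can be illustrated with a toy sketch. This is not the authors' implementation: the cosine-similarity scoring, the function names (`cosine`, `selective_classify`), and the hand-written three-dimensional embeddings are all assumptions made for illustration. The sketch only shows why restricting the candidate set to concepts proposed by a generative model can change the predicted label.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def selective_classify(instance_feature, category_embeddings, proposed_concepts):
    """Score an instance only against the proposed concepts (hypothetical helper).

    Falls back to the full vocabulary when none of the proposed concepts
    is present in the vocabulary.
    """
    candidates = [c for c in proposed_concepts if c in category_embeddings]
    if not candidates:
        candidates = list(category_embeddings)
    scores = {c: cosine(instance_feature, category_embeddings[c]) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Toy vocabulary with made-up embeddings (assumption, not real text features).
category_embeddings = {
    "road": [1.0, 0.0, 0.0],
    "car": [0.0, 1.0, 0.0],
    "sky": [0.0, 0.0, 1.0],
    "river": [0.9, 0.1, 0.3],
}

# Concepts a generative vision-language model might propose for the image (assumption).
proposed_concepts = ["road", "car", "sky"]

# Made-up feature for one segmented instance.
instance_feature = [0.9, 0.1, 0.35]

subset_label, subset_scores = selective_classify(
    instance_feature, category_embeddings, proposed_concepts
)
full_label, _ = selective_classify(
    instance_feature, category_embeddings, list(category_embeddings)
)
print(subset_label, full_label)
```

In this toy setup the full vocabulary contains a distractor category ("river") that is closer to the instance feature than the intended one, so restricting the candidates to the proposed concepts changes the prediction. Whether the paper's decoder scores candidates this way is not stated in the abstract.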

Related papers