Fugu-MT Paper Translation (Abstract): Multi-Grained Vision-Language Alignment for Domain Generalized Person Re-Identification

Paper abstract: Multi-Grained Vision-Language Alignment for Domain Generalized Person Re-Identification

arxiv url: http://arxiv.org/abs/2603.14012v1
Date: Sat, 14 Mar 2026 16:33:12 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.544873
Title: Multi-Grained Vision-Language Alignment for Domain Generalized Person Re-Identification
Title (reference translation): Multi-Grained Vision-Language Alignment for Domain-Generalized Person Re-Identification
Authors: Jiachen Li, Xiaojin Gong, Dongping Zhang
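Abstract summary: Domain generalized person re-identification (DG Re-ID) is a challenging task in which models are trained on source domains but tested on unseen target domains. Recently, vision-language models (VLMs) have shown outstanding generalization capabilities in various visual applications. This paper proposes a CLIP-based multi-grained vision-language alignment framework.
Reference score (independently computed attention measure): 15.307492395180658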
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Domain Generalized person Re-identification (DG Re-ID) is a challenging task, where models are trained on source domains but tested on unseen target domains. Although previous pure vision-based models have achieved significant progress, their performance still leaves room for improvement. Recently, Vision-Language Models (VLMs) have shown outstanding generalization capabilities in various visual applications. However, directly adapting a VLM to Re-ID yields only limited generalization improvement. This is because the VLM only produces global features that are insensitive to ID nuances. To tackle this problem, we propose a CLIP-based multi-grained vision-language alignment framework in this work. Specifically, several multi-grained prompts are introduced in the language modality to describe different body parts and align with their counterparts in the vision modality. To obtain fine-grained visual information, an adaptively masked multi-head self-attention module is employed to precisely extract specific part features. To train the proposed module, an MLLM-based visual grounding expert is employed to automatically generate pseudo labels of body parts for supervision. Extensive experiments conducted on both single- and multi-source generalization protocols demonstrate the superior performance of our approach. The implementation code will be released at https://github.com/RikoLi/MUVA.
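Abstract (reference translation): Domain generalized person re-identification (DG Re-ID) is a challenging task in which models are trained on source domains but tested on unseen target domains. Previous purely vision-based models have made significant progress, but their performance can still be improved. Recently, vision-language models (VLMs) have shown outstanding generalization capabilities in various visual applications. However, applying a VLM directly to Re-ID brings only limited generalization improvement, because the VLM produces only global features that are insensitive to ID nuances. To address this problem, this work proposes a CLIP-based multi-grained vision-language alignment framework. Specifically, several multi-grained prompts are introduced in the language modality to describe different body parts and are aligned with their counterparts in the vision modality. To obtain fine-grained visual information, an adaptively masked multi-head self-attention module is used to precisely extract the features of specific parts. To train the proposed module, an MLLM-based visual grounding expert automatically generates pseudo labels of body parts for supervision. Extensive experiments on both single- and multi-source generalization protocols demonstrate the superior performance of the approach. The implementation code will be released at https://github.com/RikoLi/MUVA.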
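
The part-level alignment described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it uses a single attention head, a fixed binary mask in place of the learned adaptive mask, and hand-written 2-D vectors in place of CLIP embeddings. The names `masked_attention_pool`, `cosine`, `patch_tokens`, `upper_body_mask`, and `prompt_embeddings` are all hypothetical and chosen for illustration only.

```python
import math


def masked_attention_pool(query, tokens, mask):
    """Pool patch tokens with softmax attention restricted to a part mask.

    query:  part query vector (list of floats)
    tokens: patch token vectors (list of lists of floats)
    mask:   list of bools; True means the patch belongs to the body part
    Returns the pooled part feature and the attention weights.
    """
    dim = len(query)
    # Scaled dot-product scores between the part query and every patch.
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
              for tok in tokens]
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask selects no patches")
    peak = max(kept)  # subtract the max for numerical stability
    # Masked-out patches receive exactly zero weight.
    weights = [math.exp(s - peak) if m else 0.0
               for s, m in zip(scores, mask)]
    total = sum(weights)
    weights = [w / total for w in weights]
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens))
              for d in range(dim)]
    return pooled, weights


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data (assumed): four patch tokens, the first two cover the upper body.
patch_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
upper_body_mask = [True, True, False, False]
part_query = [1.0, 1.0]

# Stand-ins for text embeddings of part-level prompts (assumed values).
prompt_embeddings = {
    "upper body": [1.0, 0.0],
    "lower body": [0.0, 1.0],
}

pooled, weights = masked_attention_pool(part_query, patch_tokens, upper_body_mask)
similarities = {name: cosine(pooled, emb)
                for name, emb in prompt_embeddings.items()}
best_prompt = max(similarities, key=similarities.get)
print(best_prompt, similarities)
```

In this toy setup the pooled feature is built only from the patches inside the mask, so it sits closer to the matching part prompt than to the other one. A training objective would then push each part feature toward its own prompt embedding; that loss, and how the masks are learned from pseudo labels, are not specified in the abstract and are left out here.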

Related papers