Fugu-MT Paper Translation (Summary): CILP-FGDI: Exploiting Vision-Language Model for Generalizable Person Re-Identification

Paper summary: CILP-FGDI: Exploiting Vision-Language Model for Generalizable Person Re-Identification

arxiv url: http://arxiv.org/abs/2501.16065v1
Date: Mon, 27 Jan 2025 14:08:25 GMT
Status: Translation complete
Last updated in system: 2025-01-28 21:57:03.945215
Title: CILP-FGDI: Exploiting Vision-Language Model for Generalizable Person Re-Identification
Title (reference translation): CILP-FGDI: Exploiting a Vision-Language Model for Generalizable Person Re-Identification
Authors: Huazhong Zhao, Lei Qi, Xin Geng
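Abstract summary: CLIP (Contrastive Language-Image Pretraining) is a vision-language model pretrained on large-scale image-text pairs. Adapting CLIP to the task presents two major challenges: learning more fine-grained features to enhance discriminative ability, and learning more domain-invariant features to improve the model's generalization capability.
Reference score (independently computed attention score): 42.429118831928214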
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Visual Language Model, known for its robust cross-modal capabilities, has been extensively applied in various computer vision tasks. In this paper, we explore the use of CLIP (Contrastive Language-Image Pretraining), a vision-language model pretrained on large-scale image-text pairs to align visual and textual features, for acquiring fine-grained and domain-invariant representations in generalizable person re-identification. The adaptation of CLIP to the task presents two primary challenges: learning more fine-grained features to enhance discriminative ability, and learning more domain-invariant features to improve the model's generalization capabilities. To address the first challenge and thereby enhance the ability to learn fine-grained features, a three-stage strategy is proposed to boost the accuracy of text descriptions. In the first stage, the image encoder is trained to adapt effectively to person re-identification tasks. In the second stage, the features extracted by the image encoder are used to generate textual descriptions (i.e., prompts) for each image. Finally, the text encoder with the learned prompts is employed to guide the training of the final image encoder. To enhance the model's generalization capabilities to unseen domains, a bidirectional guiding method is introduced to learn domain-invariant image features. Specifically, domain-invariant and domain-relevant prompts are generated, and both positive (pulling together image features and domain-invariant prompts) and negative (pushing apart image features and domain-relevant prompts) views are used to train the image encoder. Collectively, these strategies contribute to the development of an innovative CLIP-based framework for learning fine-grained generalized features in person re-identification.
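Abstract (reference translation): The Visual Language Model, known for its robust cross-modal capabilities, has been widely applied to various computer vision tasks. In this paper, we examine the alignment of visual and textual features using CLIP (Contrastive Language-Image Pretraining), a vision-language model pretrained on large-scale image-text pairs. Adapting CLIP to the task presents two major challenges: learning more fine-grained features to enhance discriminative ability, and learning more domain-invariant features to improve the model's generalization capability. To address the first challenge and enhance the ability to learn fine-grained features, a three-stage strategy is proposed to raise the accuracy of the text descriptions. First, the image encoder is trained to adapt effectively to person re-identification tasks. In the second stage, the features extracted by the image encoder are used to generate a textual description (i.e., a prompt) for each image. Finally, the text encoder with the learned prompts is used to guide the training of the final image encoder. To extend the model's generalization capability to unseen domains, a bidirectional guiding method is introduced to learn domain-invariant image features. Specifically, domain-invariant and domain-relevant prompts are generated, and both positive views (pulling image features and domain-invariant prompts together) and negative views (pushing image features and domain-relevant prompts apart) are used to train the image encoder. Together, these strategies contribute to the development of an innovative CLIP-based framework for learning fine-grained generalized features in person re-identification.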
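
The bidirectional guidance described in the abstract (pull image features toward domain-invariant prompts, push them away from domain-relevant prompts) can be illustrated with a minimal sketch. The abstract does not state the loss, so the cosine-based pull and push terms, the `margin` parameter, and the function names below are all assumptions made for illustration, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bidirectional_guidance_loss(image_feat, invariant_prompt, relevant_prompt,
                                margin=0.0):
    """Hypothetical loss (an assumption, not taken from the paper).

    Positive view: penalize low similarity between the image feature and
    the domain-invariant prompt feature (pull together).
    Negative view: penalize similarity above `margin` between the image
    feature and the domain-relevant prompt feature (push apart).
    """
    pull = 1.0 - cosine(image_feat, invariant_prompt)
    push = max(0.0, cosine(image_feat, relevant_prompt) - margin)
    return pull + push


# Toy 2-D features: the image feature points along the first axis.
image_feat = [1.0, 0.0]

# Aligned with the domain-invariant prompt, orthogonal to the domain-relevant one.
aligned = bidirectional_guidance_loss(image_feat, [1.0, 0.0], [0.0, 1.0])

# Orthogonal to the domain-invariant prompt, aligned with the domain-relevant one.
misaligned = bidirectional_guidance_loss(image_feat, [0.0, 1.0], [1.0, 0.0])

print(aligned, misaligned)
```

In this toy setting the loss is lower when the image feature matches the domain-invariant prompt and ignores the domain-relevant one. A real training setup would compute both prompt features with the text encoder and backpropagate through the image encoder.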

Related papers

Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization [75.88719716002014]
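Domain Generalization (DG) aims to develop general-purpose models that perform effectively on unseen target domains. Recent advances in pre-trained Visual Foundation Models (VFMs) show great potential for improving the generalization ability of deep learning models. We propose to address this problem by leveraging the controllable and flexible language prompts of VFMs.
Paper reference translation (metadata) (2025-07-03T03:52:37Z)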
Enhancing Visual Representation for Text-based Person Searching [9.601697802095119]
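VFE-TPS is a Visual Feature Enhanced Text-based Person Search model. It introduces a pre-trained CLIP backbone to learn basic multimodal features. It constructs a Text Guided Masked Image Modeling task to strengthen the model's ability to learn local visual details.
Paper reference translation (metadata) (2024-12-30T01:38:14Z)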
CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
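Person re-identification (ReID) has benefited from large-scale pre-trained vision-language models such as Contrastive Language-Image Pre-Training (CLIP). This paper proposes a method that uses existing image captioning models to generate pseudo captions for person images. CLIP-SCGI is a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
Paper reference translation (metadata) (2024-10-12T06:24:33Z)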
Unity in Diversity: Multi-expert Knowledge Confrontation and Collaboration for Generalizable Vehicle Re-identification [60.20318058777603]
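Generalizable vehicle re-identification (ReID) aims to develop models that can adapt to unknown target domains without fine-tuning or retraining. Previous work has mainly focused on extracting domain-invariant features by aligning data distributions across source domains. To solve this problem, we propose a two-stage Multi-expert Knowledge Confrontation and Collaboration (MiKeCoCo) method.
Paper reference translation (metadata) (2024-07-10T04:06:39Z)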
ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization [0.0]
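We propose a two-stage training method to improve visual performance. We validate the effectiveness of the proposed method on multiple large-scale visual geo-localization datasets.
Paper reference translation (metadata) (2024-06-04T02:28:51Z)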
WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
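We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representation. It first estimates a language embedding with fine-grained alignment, which can be used to adaptively identify and remove the domain-specific component. We show that WIDIn can be applied to both pre-trained vision-language models such as CLIP and separately trained uni-modal models such as MoCo and BERT.
Paper reference translation (metadata) (2024-05-28T17:46:27Z)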
Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
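Single-source domain generalization promises more reliable and consistent image segmentation across real-world clinical settings. This paper proposes an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by text encoder features. The approach achieves favorable performance against existing methods in the literature.
Paper reference translation (metadata) (2024-04-01T17:48:15Z)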
Improving Generalization of Image Captioning with Unsupervised Prompt Learning [63.26197177542422]
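Generalization of Image Captioning (GeneIC) learns a domain-specific prompt vector for the target domain without requiring annotated data. GeneIC aligns the visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model.
Paper reference translation (metadata) (2023-08-05T12:27:01Z)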
CoPL: Contextual Prompt Learning for Vision-Language Understanding [21.709017504227823]
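We propose a Contextual Prompt Learning (CoPL) framework that can align prompts to the localized features of an image. The key innovations over earlier work are using local image features as part of the prompt learning process and, more importantly, weighting these prompts based on the local features appropriate for the task. The method significantly improves performance compared to current state-of-the-art methods.
Paper reference translation (metadata) (2023-07-03T10:14:33Z)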
CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
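We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). CRIS relies on vision-language decoding and contrastive learning to achieve text-to-pixel alignment. Without any post-processing, the proposed framework significantly outperforms the state of the art.
Paper reference translation (metadata) (2021-11-30T07:29:08Z)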

The related papers list is generated automatically from the titles and abstracts of papers on this site.
