Fugu-MT Paper Translation (Abstract): OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection

Paper overview: OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection

arxiv url: http://arxiv.org/abs/2409.19899v1
Date: Mon, 30 Sep 2024 02:58:05 GMT
Status: Translation complete
Last updated in system: 2024-11-05 16:57:15.440788
Title: OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection
Title (reference translation): OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection
Authors: Changsheng Lu, Zheyuan Liu, Piotr Koniusz
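Abstract summary: We open up prompt diversity from three aspects: modality, semantics (seen vs. unseen), and language. We propose a novel OpenKD model that leverages a multimodal prototype set to support both visual and textual prompting.
Reference score (independently computed attention score): 35.57926269889791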
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Exploiting foundation models (e.g., CLIP) to build a versatile keypoint detector has gained increasing attention. Most existing models accept either a text prompt (e.g., "the nose of a cat") or a visual prompt (e.g., a support image with keypoint annotations) to detect the corresponding keypoints in a query image, thereby exhibiting either zero-shot or few-shot detection ability. However, research on taking multimodal prompts is still underexplored, and prompt diversity in semantics and language is far from being opened up. For example, how should unseen text prompts for novel keypoint detection be handled, and how should diverse text prompts such as "Can you detect the nose and ears of a cat?" be handled? In this work, we open up prompt diversity from three aspects: modality, semantics (seen vs. unseen), and language, to enable more generalized zero- and few-shot keypoint detection (Z-FSKD). We propose a novel OpenKD model which leverages a multimodal prototype set to support both visual and textual prompting. Further, to infer the keypoint locations of unseen texts, we add auxiliary keypoints and texts interpolated from the visual and textual domains into training, which improves the spatial reasoning of our model and significantly enhances zero-shot novel keypoint detection. We also find that a large language model (LLM) is a good parser, achieving over 96% accuracy in parsing keypoints from texts. With an LLM, OpenKD can handle diverse text prompts. Experimental results show that our method achieves state-of-the-art performance on Z-FSKD and initiates new ways to deal with unseen text and diverse texts. The source code and data are available at https://github.com/AlanLuSun/OpenKD.
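Abstract (reference translation): Using foundation models (e.g., CLIP) to build a general-purpose keypoint detector is attracting growing attention. Most existing models accept either a text prompt (e.g., "the nose of a cat") or a visual prompt (e.g., a support image with keypoint annotations), detect the corresponding keypoints in the query image, and thus show either zero-shot or few-shot detection ability. However, research on accepting multimodal prompts remains underexplored, and prompt diversity in semantics and language has not yet been opened up. For example, it is unclear how to handle unseen text prompts for novel keypoint detection, or diverse text prompts such as "Can you detect the nose and ears of a cat?" In this work, prompt diversity is opened up from three aspects, namely modality, semantics (seen vs. unseen), and language, to enable more generalized zero- and few-shot keypoint detection (Z-FSKD). The authors propose a novel OpenKD model that uses a multimodal prototype set to support both visual and textual prompting. In addition, to infer the keypoint locations of unseen texts, auxiliary keypoints and texts interpolated from the visual and textual domains are added to training, which improves the model's spatial reasoning and significantly enhances zero-shot novel keypoint detection. The authors also find that a large language model (LLM) is a good parser that achieves over 96% accuracy in parsing keypoints from texts. With an LLM, OpenKD can handle diverse text prompts. Experiments show that the method achieves state-of-the-art performance on Z-FSKD and opens new ways to deal with unseen and diverse texts. The source code and data are available at https://github.com/AlanLuSun/OpenKD.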
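
The abstract does not spell out how the auxiliary keypoints and texts are interpolated. One simple reading, assumed here purely for illustration, is linear interpolation between two annotated keypoints, with the same weight applied to their text embeddings. The function names, the toy coordinates, and the 3-dimensional stand-in embeddings below are hypothetical and are not taken from the OpenKD code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def interpolate_point(a: Point, b: Point, t: float) -> Point:
    """Linearly interpolate between two annotated keypoints (t in [0, 1])."""
    return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))


def interpolate_embedding(
    u: Sequence[float], v: Sequence[float], t: float
) -> List[float]:
    """Blend two text embeddings with weight t, then L2-normalize the result."""
    mixed = [(1.0 - t) * x + t * y for x, y in zip(u, v)]
    norm = sum(m * m for m in mixed) ** 0.5
    return [m / norm for m in mixed] if norm > 0 else mixed


# Hypothetical toy inputs: pixel coordinates and 3-d stand-ins for
# text embeddings (a real text encoder would give much longer vectors).
nose, left_ear = (120.0, 200.0), (90.0, 140.0)
emb_nose, emb_ear = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]

# An auxiliary keypoint halfway between the two annotated keypoints,
# paired with an embedding blended using the same weight.
aux_point = interpolate_point(nose, left_ear, 0.5)
aux_embedding = interpolate_embedding(emb_nose, emb_ear, 0.5)
```

Using one shared weight `t` for both the location and the embedding keeps the auxiliary visual target and its auxiliary text aligned. Whether the paper uses this exact scheme is not stated in the abstract.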

Related papers

KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension [31.283133365170052]
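We introduce semantic keypoint comprehension, which aims to understand keypoints across diverse task scenarios. We also present KptLLM, a unified multimodal model that uses an identify-then-detect strategy. KptLLM adeptly handles inputs of various modalities, facilitating the interpretation of both semantic content and keypoint locations.
Paper reference translation (metadata) (2024-11-04T06:42:24Z)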
Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval [13.315951821189538]
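Scene text retrieval aims to find all images containing the query text in an image gallery. Current efforts tend to adopt an optical character recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes. We explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval.
Paper reference translation (metadata) (2024-08-01T10:25:14Z)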
CountGD: Multi-Modal Open-World Counting [54.88804890463491]
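This paper aims to improve the generality and accuracy of open-vocabulary object counting in images. We introduce CountGD, the first open-world counting model.
Paper reference translation (metadata) (2024-07-05T16:20:48Z)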
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document [60.01330653769726]
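We propose TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. By adopting shifted window attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions. By extending the model's capabilities to cover text spotting and grounding, and by incorporating positional information into responses, we improve interpretability.
Paper reference translation (metadata) (2024-03-07T13:16:24Z)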
X-Pose: Detecting Any Keypoints [28.274913140048003]
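X-Pose is a novel framework for detecting keypoints of multiple objects in an image. UniKPT is a large-scale dataset built from keypoint detection datasets. X-Pose achieves notable improvements over state-of-the-art non-promptable, visual prompt-based, and textual prompt-based methods.
Paper reference translation (metadata) (2023-10-12T17:22:58Z)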
Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [74.75284453828017]
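The Open-Vocabulary Keypoint Detection (OVKD) task is designed to use text prompts to identify arbitrary kinds of keypoints. We develop a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM). The framework combines vision and language models to create interaction between language features and local keypoint visual features.
Paper reference translation (metadata) (2023-10-08T07:42:41Z)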
Towards Unified Scene Text Spotting based on Sequence Generation [4.437335677401287]
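We propose the UNIfied scene Text Spotter (UNITS). Our model unifies various detection formats, including quadrilaterals and polygons. We apply starting-point prompting to extract text from an arbitrary starting point.
Paper reference translation (metadata) (2023-04-07T01:28:08Z)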
Open-Vocabulary Point-Cloud Object Detection without 3D Annotation [62.18197846270103]
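The goal of open-vocabulary 3D point-cloud detection is to identify novel objects based on arbitrary textual descriptions. We develop a point-cloud detector that can learn a general representation for localizing various objects. We also propose debiased triplet cross-modal contrastive learning that connects the image, point-cloud, and text modalities.
Paper reference translation (metadata) (2023-04-03T08:22:02Z)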
AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting [98.08853679310603]
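This work proposes a novel text spotter named Ambiguity Elimination Text Spotter (AE TextSpotter). AE TextSpotter learns both visual and linguistic features, significantly reducing ambiguity in text detection. To our knowledge, this is the first time a language model has been used to improve text detection.
Paper reference translation (metadata) (2020-08-03T08:40:01Z)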

The related papers list is generated automatically from the titles and abstracts of papers on this site.
