Fugu-MT 論文翻訳(概要): Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection

論文の概要: Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection

arxiv url: http://arxiv.org/abs/2307.13529v2
Date: Mon, 18 Sep 2023 09:28:46 GMT
ステータス: 翻訳完了
システム内更新日: 2023-09-19 23:08:16.304293
Title: Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection
Title（参考訳）: Re-mine, Learn and Reason: 言語誘導HOI検出のためのクロスモーダルセマンティック相関の探索
Authors: Yichao Cao, Qingfei Tang, Feng Yang, Xiu Su, Shan You, Xiaobo Lu and Chang Xu
Abstract要約: ヒューマンオブジェクトインタラクション(HOI)検出は、コンピュータビジョンの課題である。本稿では,構造化テキスト知識を組み込んだHOI検出フレームワークを提案する。
参考スコア（独自算出の注目度）: 57.13665112065285
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human-Object Interaction (HOI) detection is a challenging computer vision task that requires visual models to address the complex interactive relationship between humans and objects and predict HOI triplets. Despite the challenges posed by the numerous interaction combinations, they also offer opportunities for multimodal learning of visual texts. In this paper, we present a systematic and unified framework (RmLR) that enhances HOI detection by incorporating structured text knowledge. Firstly, we qualitatively and quantitatively analyze the loss of interaction information in the two-stage HOI detector and propose a re-mining strategy to generate more comprehensive visual representation.Secondly, we design more fine-grained sentence- and word-level alignment and knowledge transfer strategies to effectively address the many-to-many matching problem between multiple interactions and multiple texts.These strategies alleviate the matching confusion problem that arises when multiple interactions occur simultaneously, thereby improving the effectiveness of the alignment process. Finally, HOI reasoning by visual features augmented with textual knowledge substantially improves the understanding of interactions. Experimental results illustrate the effectiveness of our approach, where state-of-the-art performance is achieved on public benchmarks. We further analyze the effects of different components of our approach to provide insights into its efficacy.
Abstract（参考訳）: ヒューマン・オブジェクト・インタラクション(human-object interaction, hoi)は、人間と物体の複雑な対話的関係に対処し、hoiトリプルトを予測する視覚モデルを必要とするコンピュータビジョンタスクである。多くの相互作用の組み合わせによってもたらされる課題にもかかわらず、視覚テキストのマルチモーダル学習の機会を提供する。本稿では,構造化テキスト知識を取り入れることで,hoi検出を強化する体系的統一フレームワーク(rmlr)を提案する。 Firstly, we qualitatively and quantitatively analyze the loss of interaction information in the two-stage HOI detector and propose a re-mining strategy to generate more comprehensive visual representation.Secondly, we design more fine-grained sentence- and word-level alignment and knowledge transfer strategies to effectively address the many-to-many matching problem between multiple interactions and multiple texts.These strategies alleviate the matching confusion problem that arises when multiple interactions occur simultaneously, thereby improving the effectiveness of the alignment process. 最後に、テキスト知識を付加した視覚特徴によるHOI推論は、インタラクションの理解を大幅に改善する。実験結果は,公開ベンチマークにおいて最先端のパフォーマンスが達成される手法の有効性を示す。さらに,このアプローチのさまざまなコンポーネントの効果を解析し,その効果について考察する。

関連論文リスト

Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection [51.52749744031413]
人間オブジェクトインタラクション(HOI)検出は、画像内の人間と物体を識別し、その相互作用を解釈することを目的としている。既存のHOIメソッドは、視覚的手がかりからインタラクションを学ぶために手動アノテーションを備えた大規模なデータセットに大きく依存している。本稿では,強化意味論を用いた動的スコーリングのための新しいトレーニング不要なHOI検出フレームワークを提案する。
論文参考訳（メタデータ） (2025-07-23T12:30:19Z)
InterChat: Enhancing Generative Visual Analytics using Multimodal Interactions [22.007942964950217]
視覚要素の直接操作と自然言語入力を組み合わせた生成的視覚分析システムであるInterChatを開発した。この統合により、正確なインテント通信が可能になり、プログレッシブで視覚的に駆動された探索データ分析をサポートする。
論文参考訳（メタデータ） (2025-03-06T05:35:19Z)
Effective Context Modeling Framework for Emotion Recognition in Conversations [2.7175580940471913]
会話における感情認識(英語: Emotion Recognition in Conversations, ERC)は、会話中の各発話における話者による感情のより深い理解を促進する。最近のグラフニューラルネットワーク(GNN)は、データ関係をキャプチャする上で、その強みを実証している。本稿では,会話中の文脈情報をキャプチャする新しいGNNベースのフレームワークであるConxGNNを提案する。
論文参考訳（メタデータ） (2024-12-21T02:22:06Z)
Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
本稿では,視覚的・幾何学的手がかりを取り入れた視覚・幾何学的協調学習ネットワークを提案する。本手法は,客観的指標と視覚的品質の代表的なモデルより優れている。
論文参考訳（メタデータ） (2024-10-15T07:35:51Z)
Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models [25.070424546200293]
本稿では,大規模言語モデル(LLM)の頑健な推論機能を活用して,正確な対話関連視覚記述子を生成する手法を提案する。ベンチマークデータを用いて行った実験は、簡潔で正確な視覚記述子の導出における提案手法の有効性を検証した。本研究は,多様な視覚的手がかり,多様なLCM,異なるデータセットにまたがる手法の一般化可能性を示すものである。
論文参考訳（メタデータ） (2024-07-04T03:50:30Z)
CoSD: Collaborative Stance Detection with Contrastive Heterogeneous Topic Graph Learning [18.75039816544345]
我々はCoSD(CoSD)と呼ばれる新しい協調姿勢検出フレームワークを提案する。 CoSDは、テキスト、トピック、スタンスラベル間のトピック認識のセマンティクスと協調的なシグナルを学ぶ。 2つのベンチマークデータセットの実験では、CoSDの最先端検出性能が示されている。
論文参考訳（メタデータ） (2024-04-26T02:04:05Z)
AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents [65.16893197330589]
大規模言語モデル(LLM)は、幅広いシナリオで人間の振る舞いを再現する能力を示した。しかし、複雑なマルチ文字のソーシャルインタラクションを扱う能力については、まだ完全には研究されていない。本稿では,新しいインタラクションフレームワークと評価手法を含むマルチエージェントインタラクション評価フレームワーク(AntEval)を紹介する。
論文参考訳（メタデータ） (2024-01-12T11:18:00Z)
DER-GCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition [14.639340916340801]
本稿では,多モーダル感情認識(DER-GCN)のための新しい対話・イベント関係対応グラフ畳み込みニューラルネットワークを提案する。話者間の対話関係をモデル化し、潜在イベント関係情報をキャプチャする。 DER-GCNモデルの有効性を検証したIEMOCAPおよびMELDベンチマークデータセットについて広範な実験を行った。
論文参考訳（メタデータ） (2023-12-17T01:49:40Z)
Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
ヒューマン・オブジェクト・インタラクション(HOI)検出は、人間中心の画像理解のコアタスクである。最近のワンステージ手法では、対話予測に有用な画像ワイドキューの収集にトランスフォーマーデコーダを採用している。従来の2段階の手法は、非絡み合いで説明可能な方法で相互作用特徴を構成する能力から大きな恩恵を受ける。
論文参考訳（メタデータ） (2023-12-04T08:02:59Z)
Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCueは、HOI検出における視覚的特徴抽出を改善するための新しいアプローチである。コンテクストキューをインスタンスと相互作用検出器の両方に統合するマルチトウワーアーキテクチャを用いたトランスフォーマーベースの特徴抽出モジュールを開発した。
論文参考訳（メタデータ） (2023-11-26T09:11:32Z)
Compositional Learning in Transformer-Based Human-Object Interaction Detection [6.630793383852106]
ラベル付きインスタンスの長期分布は、HOI検出の主要な課題である。 HOI三重奏の性質にインスパイアされた既存のアプローチでは、作曲学習という概念が採用されている。我々は,構成HoI学習のためのトランスフォーマーベースのフレームワークを創造的に提案する。
論文参考訳（メタデータ） (2023-08-11T06:41:20Z)
Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
マルチモーダルエンティティリンクタスクは、マルチモーダル知識グラフへの曖昧な言及を解決することを目的としている。 MELタスクを解決するための新しいMulti-Grained Multimodal InteraCtion Network $textbf(MIMIC)$ frameworkを提案する。
論文参考訳（メタデータ） (2023-07-19T02:11:19Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。