Fugu-MT Paper Translation (Summary): From Sight to Insight: Unleashing Eye-Tracking in Weakly Supervised Video Salient Object Detection

Paper summary: From Sight to Insight: Unleashing Eye-Tracking in Weakly Supervised Video Salient Object Detection

arxiv url: http://arxiv.org/abs/2506.23519v1
Date: Mon, 30 Jun 2025 05:01:40 GMT
Status: Translation complete
Last updated in system: 2025-07-01 21:27:53.924074
Title: From Sight to Insight: Unleashing Eye-Tracking in Weakly Supervised Video Salient Object Detection
Title (reference translation): From Sight to Insight: Unleashing Eye-Tracking in Weakly Supervised Video Salient Object Detection
Authors: Qi Qin, Runmin Cong, Gen Zhan, Yiting Liao, Sam Kwong,
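Abstract summary: This paper aims to introduce fixation information to assist the detection of video salient objects under weak supervision. A Position and Semantic Embedding (PSE) module is proposed to provide location and semantic guidance during the feature learning process. An Intra-Inter Mixed Contrastive (IIMC) model improves spatiotemporal modeling capabilities under weak supervision.
Reference score (independently computed attention level): 60.11169426478452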
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The eye-tracking video saliency prediction (VSP) task and video salient object detection (VSOD) task both focus on the most attractive objects in video and show the result in the form of predictive heatmaps and pixel-level saliency masks, respectively. In practical applications, eye tracker annotations are more readily obtainable and align closely with the authentic visual patterns of human eyes. Therefore, this paper aims to introduce fixation information to assist the detection of video salient objects under weak supervision. On the one hand, we ponder how to better explore and utilize the information provided by fixation, and then propose a Position and Semantic Embedding (PSE) module to provide location and semantic guidance during the feature learning process. On the other hand, we achieve spatiotemporal feature modeling under weak supervision from the aspects of feature selection and feature contrast. A Semantics and Locality Query (SLQ) Competitor with semantic and locality constraints is designed to effectively select the most matching and accurate object query for spatiotemporal modeling. In addition, an Intra-Inter Mixed Contrastive (IIMC) model improves the spatiotemporal modeling capabilities under weak supervision by forming an intra-video and inter-video contrastive learning paradigm. Experimental results on five popular VSOD benchmarks indicate that our model outperforms other competitors on various evaluation metrics.
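Abstract (reference translation): The eye-tracking video saliency prediction (VSP) task and the video salient object detection (VSOD) task both focus on the most attractive objects in a video, and present their results as predictive heatmaps and pixel-level saliency masks, respectively. In practical applications, eye-tracker annotations are easier to obtain and closely match the authentic visual patterns of the human eye. This work therefore aims to introduce fixation information to assist the detection of video salient objects under weak supervision. On the one hand, we consider how to better explore and exploit the information provided by fixations, and propose a Position and Semantic Embedding (PSE) module to provide location and semantic guidance during feature learning. On the other hand, we achieve spatiotemporal feature modeling under weak supervision from the two aspects of feature selection and feature contrast. A Semantics and Locality Query (SLQ) Competitor with semantic and locality constraints is designed to effectively select the best-matching and most accurate object query for spatiotemporal modeling. In addition, the IIMC model improves spatiotemporal modeling capability under weak supervision by forming an intra-video and inter-video contrastive learning paradigm. Experimental results on five popular VSOD benchmarks indicate that our model outperforms other competitors on various evaluation metrics.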
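
The abstract does not give the IIMC loss, so the following is only a generic illustration of what an intra-video plus inter-video contrastive objective could look like. It is a minimal sketch using a standard InfoNCE loss, not the authors' formulation. The function names, the temperature value, and the choice to pool negatives from the same video and from other videos into one denominator are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss for a single anchor feature.

    loss = -log( exp(s_pos / t) / sum_k exp(s_k / t) ),
    where s is cosine similarity and the sum runs over the positive
    and every negative.
    """
    a = l2_normalize(anchor)
    logits = [dot(a, l2_normalize(positive)) / temperature]
    logits += [dot(a, l2_normalize(n)) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


def mixed_contrastive_loss(anchor, positive, intra_negatives, inter_negatives,
                           temperature=0.1):
    """Hypothetical mixed loss: negatives drawn from the same video (intra)
    and from other videos (inter) are pooled into one InfoNCE denominator.
    This pooling is an assumption, not taken from the paper."""
    negatives = list(intra_negatives) + list(inter_negatives)
    return info_nce(anchor, positive, negatives, temperature)


# Toy 2-D features standing in for learned object embeddings.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]        # e.g. the same object in another frame
intra_neg = [[0.0, 1.0]]     # e.g. background of the same video
inter_neg = [[-1.0, 0.0]]    # e.g. a feature from a different video
loss = mixed_contrastive_loss(anchor, positive, intra_neg, inter_neg)
```

In this sketch the loss is small when the anchor is close to its positive and far from both kinds of negatives, and it grows as more negatives are added or as the positive drifts away from the anchor.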

Related papers

MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes [35.16430027877207]
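MOVIS aims to enhance the structural awareness of view-conditioned diffusion models for multi-object NVS. We propose an auxiliary task that requires the model to simultaneously predict novel-view object masks. The proposed method shows strong generalization ability and produces consistent novel view synthesis.
Paper reference translation (metadata) (2024-12-16T05:23:45Z)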
What is Point Supervision Worth in Video Instance Segmentation? [119.71921319637748]
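Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos. During training, human annotation is reduced to a single point for each object in a video frame, yielding high-quality mask predictions close to those of fully supervised models. Comprehensive experiments on three VIS benchmarks demonstrate the competitive performance of the proposed framework, which nearly matches fully supervised methods.
Paper reference translation (metadata) (2024-04-01T17:38:25Z)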
Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
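We propose an appearance-based refinement method that exploits the temporal consistency of video streams to correct inaccurate flow-based proposals. The method uses a sequence-level selection mechanism with high-accuracy flow-predicted masks as exemplars. Performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
Paper reference translation (metadata) (2023-12-18T18:59:51Z)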
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
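We introduce an object-aware decoder to improve the performance of ego-centric representations on ego-centric videos. The model can serve as a replacement for an ego-centric video model, improving performance through visual-text grounding.
Paper reference translation (metadata) (2023-08-15T17:58:11Z)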
Look, Remember and Reason: Grounded reasoning in videos with language models [5.3445140425713245]
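Multi-modal language models (LMs) have recently shown promising performance on high-level reasoning tasks over videos. We propose end-to-end training of the LM on low-level surrogate tasks such as object detection, re-identification, and tracking, to endow the model with low-level visual capabilities. We demonstrate the effectiveness of the framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else, and STAR datasets.
Paper reference translation (metadata) (2023-06-30T16:31:14Z)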
Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
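In this paper, a network with attention modules is used to learn contrastive features for video salient object detection. A co-attention formulation is used to combine low-level and high-level features. The proposed method is shown to require less computation and to perform favorably against state-of-the-art methods.
Paper reference translation (metadata) (2021-11-03T17:40:32Z)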
Weakly Supervised Video Salient Object Detection [79.51227350937721]
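We present the first weakly supervised video salient object detection model based on relabeled "fixation guided scribble annotations". To achieve effective multi-modal learning and long-term temporal context modeling, we propose an "Appearance-motion fusion module" and a bidirectional ConvLSTM-based framework.
Paper reference translation (metadata) (2021-04-06T09:48:38Z)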

The related papers list is generated automatically from the titles and abstracts of papers on this site.

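Information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use.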