Fugu-MT Paper Translation (Abstract): Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection

Paper abstract: Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection

arxiv url: http://arxiv.org/abs/2606.02352v1
Date: Mon, 01 Jun 2026 15:01:17 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:32.371586
Title: Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection
Title (reference translation): Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection
Authors: David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen
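Abstract summary: This work proposes a novel framework for multi-modal global alignment. It introduces soft targets derived from cycle-consistency scores to relax the hard-negative assumption. The method is evaluated on the Drive&Act dataset and is shown to consistently outperform pairwise and existing global alignment baselines.
Reference score (independently computed attention score): 23.160444017943473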
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Robust self-supervised learning of multi-modal video representations is critical for real-world applications such as driver distraction detection, where multiple sensors provide complementary but noisy signals. Conventional contrastive objectives, such as InfoNCE, assume all negatives are equally informative and all positives are reliable. However, this assumption is frequently violated in multi-modal data due to viewpoint changes, occlusions, or semantic overlap across modalities. In this work, we propose a novel framework for multi-modal global alignment that addresses these challenges by jointly modeling faulty negatives and unreliable or faulty positives. We introduce soft targets derived from cycle-consistency scores to relax the hard-negative assumption, and a weighting mechanism based on similarity distributions to mitigate the impact of noisy or faulty positives. Our approach extends traditional pairwise alignment to a principled global multi-modal setting, aggregating alignment information across all modality pairs. We evaluate our method on the Drive&Act dataset, demonstrating that it consistently outperforms both pairwise and existing global alignment baselines across RGB, IR, Depth, and Skeleton modalities. Cross-view ablation studies further show strong generalization to unseen camera perspectives, highlighting the robustness of our representations. Overall, our framework provides a scalable and effective solution for self-supervised global multi-modal representation learning, enabling reliable driver distraction detection and pioneering in real-world multi-modal video understanding. Our code will be published on GitHub.
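Abstract (reference translation): Robust self-supervised learning of multi-modal video representations is essential for real-world applications such as driver distraction detection, where multiple sensors provide complementary but noisy signals. Conventional contrastive objectives such as InfoNCE assume that all negatives are equally informative and all positives are reliable. In multi-modal data, however, this assumption is often violated because of viewpoint changes, occlusions, or semantic overlap across modalities. This work proposes a novel framework for multi-modal global alignment that addresses these challenges by jointly modeling faulty negatives and unreliable or faulty positives. It introduces soft targets derived from cycle-consistency scores to relax the hard-negative assumption, and a weighting mechanism based on similarity distributions to mitigate the impact of noisy or faulty positives. The approach extends conventional pairwise alignment to a principled global multi-modal setting that aggregates alignment information across all modality pairs. The method is evaluated on the Drive&Act dataset, where it consistently outperforms both pairwise and existing global alignment baselines across the RGB, IR, Depth, and Skeleton modalities. Cross-view ablation studies further show strong generalization to unseen camera perspectives, underlining the robustness of the learned representations. Overall, the framework offers a scalable and effective solution for self-supervised global multi-modal representation learning and enables reliable driver distraction detection in real-world multi-modal video understanding. The code will be published on GitHub.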
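As a rough illustration of the contrastive objective the abstract refers to, the following is a minimal NumPy sketch of a one-directional InfoNCE loss that optionally accepts soft targets and is averaged over all modality pairs. It is not the authors' implementation: the function names (`info_nce`, `global_pairwise_loss`), the temperature value, and the way soft targets are passed in are assumptions made for illustration. The abstract does not say how cycle-consistency scores or positive weights are computed, so those are left to the caller.

```python
from itertools import combinations

import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def info_nce(za, zb, temperature=0.1, targets=None):
    """One-directional InfoNCE between two batches of embeddings.

    za, zb: arrays of shape (batch, dim); row i of za is paired with row i of zb.
    targets: optional (batch, batch) row-stochastic matrix of soft targets.
             With targets=None, the identity matrix is used, which is the
             standard hard-target InfoNCE (every other sample is a negative).
    """
    logits = l2_normalize(za) @ l2_normalize(zb).T / temperature
    if targets is None:
        targets = np.eye(len(za))
    # Cross-entropy between the target rows and the softmax over similarities.
    return float(-(targets * log_softmax(logits, axis=1)).sum(axis=1).mean())


def global_pairwise_loss(embeddings, temperature=0.1):
    """Average the hard-target loss over all unordered modality pairs.

    embeddings: dict mapping a modality name to a (batch, dim) array.
    This plain average is an assumed aggregation, not the paper's.
    """
    pairs = list(combinations(sorted(embeddings), 2))
    losses = [info_nce(embeddings[a], embeddings[b], temperature) for a, b in pairs]
    return float(np.mean(losses))
```

Passing a `targets` matrix that places some probability mass off the diagonal is one generic way to stop treating every non-paired sample as a hard negative. With four modalities such as RGB, IR, Depth, and Skeleton, `global_pairwise_loss` averages over six pairs.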

Related papers

Reference-Free Omnidirectional Stereo Matching via Multi-View Consistency Maximization [22.711843818785454]
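Free Omni MVS aggregates pairwise correlations into a robust global consensus. To achieve visibility-aware consensus, a lightweight attention mechanism is proposed that adaptively fuses correlation vectors. Experiments on diverse benchmark datasets demonstrate the method's superiority for globally consistent, visibility-aware, and scale-aware depth estimation.
Paper reference translation (metadata) (2026-03-16T09:23:23Z)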
Perception-Aware Multimodal Spatial Reasoning from Monocular Images [57.42071289037214]
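Spatial reasoning from monocular images is essential for autonomous driving. Current vision-language models (VLMs) struggle with fine-grained geometric perception. This paper proposes a perception-aware multimodal reasoning framework that gives VLMs explicit object-centric grounding capabilities.
Paper reference translation (metadata) (2026-03-07T02:05:12Z)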
DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking [10.270441242480482]
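This paper proposes DM$^3$T, a novel framework that reformulates multimodal fusion as an iterative feature alignment process. The proposed Cross-Modal Diffusion Fusion (C-MDF) module performs iterative cross-modal harmonization. To further improve tracker robustness, a hierarchical tracker is designed that adaptively handles confidence estimates.
Paper reference translation (metadata) (2025-11-28T06:02:58Z)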
Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation [61.64052577026623]
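Real-world multi-view datasets are often heterogeneous and incomplete. This paper proposes a novel robust multi-view learning (MVL) method, RML, that performs representation fusion and alignment simultaneously. RML is self-supervised and can also be applied to downstream tasks as a regularization.
Paper reference translation (metadata) (2025-03-06T07:01:08Z)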
DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
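A novel modality interaction strategy is introduced that allows individual per-modality representations to be learned and maintained. DeepInteraction++ is a multi-modal interaction framework featuring a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments show the superior performance of the approach on both 3D object detection and end-to-end autonomous driving.
Paper reference translation (metadata) (2024-08-09T14:04:21Z)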
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
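Advanced audio-visual speech recognition (AVSR) systems have been observed to be sensitive to missing video frames. Applying dropout to the video modality improves robustness to missing frames, but it also causes a performance loss on complete data inputs. This paper proposes the Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework.
Paper reference translation (metadata) (2024-03-07T06:06:55Z)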
Mutual Information Regularization for Weakly-supervised RGB-D Salient Object Detection [33.210575826086654]
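A weakly-supervised RGB-D salient object detection model is proposed. The focus is on effective multimodal representation learning via inter-modal mutual information regularization.
Paper reference translation (metadata) (2023-06-06T12:36:57Z)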
Ensemble Learning for Fusion of Multiview Vision with Occlusion and Missing Information: Framework and Evaluations with Real-World Data and Applications in Driver Hand Activity Recognition [0.0]
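Multi-sensor frameworks provide opportunities for ensemble learning and sensor fusion. Methods for handling missing information are proposed and analyzed. A late-fusion approach across parallel convolutional neural networks is shown to outperform the best-placed single-camera model.
Paper reference translation (metadata) (2023-01-30T00:24:27Z)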

The related papers list is generated automatically from the titles and abstracts of papers on this site.
