Fugu-MT Paper Translation (Summary): DeepInteraction++: Multi-Modality Interaction for Autonomous Driving

Paper summary: DeepInteraction++: Multi-Modality Interaction for Autonomous Driving

arxiv url: http://arxiv.org/abs/2408.05075v2
Date: Thu, 15 Aug 2024 11:03:41 GMT
Status: Translation complete
Last updated in system: 2024-08-16 12:51:16.374886
Title: DeepInteraction++: Multi-Modality Interaction for Autonomous Driving
Title (reference translation): DeepInteraction++: Multi-Modal Interaction for Autonomous Driving
Authors: Zeyu Yang, Nan Song, Wei Li, Xiatian Zhu, Li Zhang, Philip H. S. Torr
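Abstract summary: We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving.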
Reference score (independently computed attention score): 80.8837864849534
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing top-performance autonomous driving systems typically rely on the multi-modal fusion strategy for reliable scene understanding. This design is however fundamentally restricted due to overlooking the modality-specific strengths and finally hampering the model performance. To address this limitation, in this work, we introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout, enabling their unique characteristics to be exploited during the whole perception pipeline. To demonstrate the effectiveness of the proposed strategy, we design DeepInteraction++, a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Specifically, the encoder is implemented as a dual-stream Transformer with specialized attention operation for information exchange and integration between separate modality-specific representations. Our multi-modal representational learning incorporates both object-centric, precise sampling-based feature alignment and global dense information spreading, essential for the more challenging planning task. The decoder is designed to iteratively refine the predictions by alternately aggregating information from separate representations in a unified modality-agnostic manner, realizing multi-modal predictive interaction. Extensive experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks. Our code is available at https://github.com/fudan-zvg/DeepInteraction.
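Abstract (reference translation): Existing high-performance autonomous driving systems rely on a multi-modal fusion strategy for reliable scene understanding. This design, however, is fundamentally limited because it overlooks modality-specific strengths and ultimately hampers model performance. To address this limitation, this work introduces a novel modality interaction strategy that lets individual per-modality representations be learned and maintained, so that their unique characteristics can be exploited throughout the perception pipeline. To demonstrate the effectiveness of the proposed strategy, we design DeepInteraction++, a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Specifically, the encoder is implemented as a dual-stream Transformer with a specialized attention operation for information exchange and integration between separate modality-specific representations. Our multi-modal representation learning incorporates both object-centric, precise sampling-based feature alignment and global dense information spreading, which is essential for the more challenging planning task. The decoder is designed to iteratively refine predictions by alternately aggregating information from the separate representations in a unified, modality-agnostic manner, realizing multi-modal predictive interaction. Extensive experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving. Our code is available at https://github.com/fudan-zvg/DeepInteraction.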
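
The encoder and decoder described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes two modalities named camera and LiDAR, plain dot-product attention without learned projections, and a decoder that reads LiDAR first; all function names are hypothetical. It only shows the structural idea: both per-modality streams are updated and kept separate, and the decoder alternates between them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Scaled dot-product attention with a residual connection.

    Each query gathers a weighted average of the context vectors and adds
    it to itself. No learned projections (assumption, to keep this short).
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        gathered = [sum(w * c[i] for w, c in zip(weights, context)) for i in range(dim)]
        out.append([qi + gi for qi, gi in zip(q, gathered)])
    return out

def interaction_encoder_layer(camera_tokens, lidar_tokens):
    """One dual-stream layer: each stream reads from the other, and both
    updated streams are returned instead of being fused into one."""
    new_camera = cross_attend(camera_tokens, lidar_tokens)
    new_lidar = cross_attend(lidar_tokens, camera_tokens)
    return new_camera, new_lidar

def interaction_decoder(object_queries, camera_tokens, lidar_tokens, num_layers=4):
    """Refine queries iteratively, alternating which modality they read from.
    The LiDAR-first order is an assumption of this sketch."""
    streams = [lidar_tokens, camera_tokens]
    for layer in range(num_layers):
        object_queries = cross_attend(object_queries, streams[layer % 2])
    return object_queries

camera = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy camera tokens
lidar = [[0.5, 0.5], [2.0, 0.0]]               # 2 toy LiDAR tokens
camera, lidar = interaction_encoder_layer(camera, lidar)
queries = interaction_decoder([[0.0, 0.0]], camera, lidar)
print(len(camera), len(lidar), len(queries[0]))
```

Returning both streams from the encoder layer is the point of contrast with a fusion design, which would collapse them into a single representation at this stage.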

Related papers

Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
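Multimodal representation learning plays an important role in the field of artificial intelligence. This paper proposes modeling video and text as game players using multivariate cooperative game theory. The original structure is extended into a flexible encoder-decoder framework so that the model can adapt to various downstream tasks.
Paper reference translation (metadata) (2024-12-30T14:09:15Z)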
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving [65.04643267731122]
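General MLLMs combined with CLIP often struggle to represent driving-specific scenarios accurately. The paper proposes the Hints of Prompt (HoP) framework. These hints are fused through a Hint Fusion module, which enriches the visual representation and strengthens multimodal reasoning.
Paper reference translation (metadata) (2024-11-20T06:58:33Z)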
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
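Unsupervised pre-training has achieved great success in skeleton-based action understanding. The paper proposes a unified multi-modal unsupervised representation learning framework called UmURL. UmURL uses an efficient early-fusion strategy to jointly encode multi-modal features in a single stream.
Paper reference translation (metadata) (2023-11-06T13:56:57Z)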
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
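The paper builds a transformer-based framework for multi-modal manipulation detection and grounding. The framework explores modality-specific features while preserving the capability for multi-modal alignment. It proposes an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
Paper reference translation (metadata) (2023-09-22T06:55:41Z)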
Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
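The multimodal entity linking (MEL) task aims to resolve ambiguous mentions to a multimodal knowledge graph. The paper proposes a novel Multi-Grained Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
Paper reference translation (metadata) (2023-07-19T02:11:19Z)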
Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
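The paper proposes a framework named MMCL (MultiModal Contrastive Learning) for multimodal representation. It also designs two contrastive learning tasks, instance-based and sentiment-based contrastive learning, to promote the prediction process and learn more interactive information related to sentiment.
Paper reference translation (metadata) (2022-10-26T08:24:15Z)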
DeepInteraction: 3D Object Detection via Modality Interaction [37.85057350887215]
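The paper introduces a novel modality interaction strategy for top-performing 3D object detectors. The method ranks first on the highly competitive nuScenes object detection leaderboard.
Paper reference translation (metadata) (2022-08-23T17:52:54Z)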
Interactive Fusion of Multi-level Features for Compositional Activity Recognition [100.75045558068874]
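The paper proposes a novel framework that achieves this goal through interactive fusion. The framework is implemented in three steps: positional-to-appearance feature extraction, semantic feature interaction, and semantic-to-positional prediction. The approach is evaluated on two action recognition datasets, Something-Something and Charades.
Paper reference translation (metadata) (2020-12-10T14:17:18Z)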
Pedestrian Behavior Prediction via Multitask Learning and Categorical Interaction Modeling [13.936894582450734]
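The paper proposes a multitask learning framework that relies on multimodal data to simultaneously predict pedestrian trajectories and actions. The model improves trajectory and action prediction by up to 22% and 6%, respectively.
Paper reference translation (metadata) (2020-12-06T15:57:11Z)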

The related-papers list is generated automatically from the titles and abstracts of the papers on this site.
