Fugu-MT Paper Translation (Abstract): Toward Unified Multimodal Representation Learning for Autonomous Driving

Paper abstract: Toward Unified Multimodal Representation Learning for Autonomous Driving

arxiv url: http://arxiv.org/abs/2603.07874v1
Date: Mon, 09 Mar 2026 01:18:50 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.334604
Title: Toward Unified Multimodal Representation Learning for Autonomous Driving
Authors: Ximeng Tao, Dimitar Filev, Gaurav Pandey
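Abstract summary: Contrastive language-image pre-training has shown impressive performance in aligning visual and textual representations. A common strategy is to employ pairwise cosine similarity between modalities to guide the training of a 3D encoder. The authors propose a contrastive pre-training framework that aligns multiple modalities simultaneously in an embedding space.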
Reference score (independently computed attention score): 3.8019970256582094
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Contrastive Language-Image Pre-training (CLIP) has shown impressive performance in aligning visual and textual representations. Recent studies have extended this paradigm to 3D vision to improve scene understanding for autonomous driving. A common strategy is to employ pairwise cosine similarity between modalities to guide the training of a 3D encoder. However, considering the similarity between individual modality pairs rather than all modalities jointly fails to ensure consistent and unified alignment across the entire multimodal space. In this paper, we propose a Contrastive Tensor Pre-training (CTP) framework that simultaneously aligns multiple modalities in a unified embedding space to enhance end-to-end autonomous driving. Compared with pairwise cosine similarity alignment, our method extends the 2D similarity matrix into a multimodal similarity tensor. Furthermore, we introduce a tensor loss to enable joint contrastive learning across all modalities. For experimental validation of our framework, we construct a text-image-point cloud triplet dataset derived from existing autonomous driving datasets. The results show that our proposed unified multimodal alignment framework achieves favorable performance for both scenarios: (i) aligning a 3D encoder with pretrained CLIP encoders, and (ii) pretraining all encoders from scratch.
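The abstract contrasts pairwise cosine similarity with a multimodal similarity tensor but does not define either the tensor or the tensor loss. The sketch below is therefore only an illustration under stated assumptions, not the authors' method: it assumes the tensor entry for a (text, image, point cloud) triple is the trilinear product of unit-norm embeddings, and that the loss is a softmax cross-entropy in which each anchor must pick out its matched pair among all B x B combinations of the other two modalities. The function names and the temperature value of 0.07 are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def pairwise_similarity(a, b):
    """(B, B) cosine-similarity matrix between two batches of embeddings."""
    return l2_normalize(a) @ l2_normalize(b).T


def similarity_tensor(text, image, points):
    """(B, B, B) tensor; entry [i, j, k] scores text i, image j, point cloud k.

    ASSUMPTION: a trilinear product of unit-norm embeddings. The abstract
    does not say how the tensor entries are computed.
    """
    t, v, p = l2_normalize(text), l2_normalize(image), l2_normalize(points)
    return np.einsum("id,jd,kd->ijk", t, v, p)


def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def tensor_contrastive_loss(sim, temperature=0.07):
    """Joint contrastive loss over a (B, B, B) similarity tensor.

    ASSUMPTION: for each mode, sample i is the anchor and the B*B
    combinations of the other two modes are the candidates; the matched
    candidate is the one where both remaining indices also equal i.
    """
    b = sim.shape[0]
    logits = sim / temperature
    idx = np.arange(b)
    loss = 0.0
    for mode in range(3):
        # Bring the anchor mode to the front and flatten the other two.
        flat = np.moveaxis(logits, mode, 0).reshape(b, b * b)
        logp = log_softmax(flat, axis=1)
        # Pair (i, i) of the remaining modes sits at flat index i * b + i.
        loss += -logp[idx, idx * b + idx].mean()
    return loss / 3.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text, image, points = (rng.normal(size=(4, 8)) for _ in range(3))
    print("pairwise matrix shape:", pairwise_similarity(text, image).shape)
    sim = similarity_tensor(text, image, points)
    print("similarity tensor shape:", sim.shape)
    print("tensor loss:", tensor_contrastive_loss(sim))
```

In this sketch the pairwise matrix only relates two modalities at a time, whereas the tensor scores every triple jointly, which is the distinction the abstract draws. Other definitions of the tensor and loss are equally compatible with the abstract.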

Related papers