Fugu-MT paper translation (abstract): Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition

Paper abstract: Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition

arxiv url: http://arxiv.org/abs/2604.05947v1
Date: Tue, 07 Apr 2026 14:42:22 GMT
Status: translation complete
Last updated in system: 2026-04-08 17:42:09.886984
Title: Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition
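Title (reference translation): Mixture of Modality Experts and Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition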
Authors: Tianyi Liu, Yiming Li, Wenqian Wang, Jiaojiao Wang, Chen Cai, Yi Wang, Kim-Hui Yap
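Abstract summary: This paper proposes a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves knowledge transfer between experts. The framework is validated on driver action recognition as a representative multimodal understanding task.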
Reference score (independently computed attention measure): 35.2947975691458
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Robust multimodal visual analytics remains challenging when heterogeneous modalities provide complementary but input-dependent evidence for decision-making. Existing multimodal learning methods mainly rely on fixed fusion modules or predefined cross-modal interactions, which are often insufficient to adapt to changing modality reliability and to capture fine-grained action cues. To address this issue, we propose a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves both intra-expert refinement and inter-expert knowledge transfer through class tokens and spatio-temporal tokens. In this way, our method forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing ambiguity in multimodal fusion. We validate the proposed framework on driver action recognition as a representative multimodal understanding task. The experimental results on the public benchmark show that the proposed MoME framework and the HTL strategy jointly outperform representative single-modal and multimodal baselines. Additional ablation, validation, and visualization results further verify that the proposed HTL strategy improves subtle multimodal understanding and offers better interpretability.
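Abstract (reference translation): Robust multimodal visual analytics remains difficult when heterogeneous modalities provide complementary but input-dependent evidence for decision-making. Existing multimodal learning methods rely mainly on fixed fusion modules or predefined cross-modal interactions, which are often insufficient for adapting to changes in modality reliability and for capturing fine-grained action cues. To address this problem, we propose a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves both intra-expert refinement and inter-expert knowledge transfer through class tokens and spatio-temporal tokens. In this way, the method forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing ambiguity in multimodal fusion. We validate the framework on driver action recognition as a representative multimodal understanding task, and the results on the public benchmark show that the proposed MoME framework and HTL strategy jointly outperform representative single-modal and multimodal baselines. Additional ablation, validation, and visualization results show that the proposed HTL strategy improves subtle multimodal understanding and offers better interpretability.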
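
The abstract contrasts fixed fusion modules with adaptive collaboration among modality-specific experts. The following is a minimal sketch of that general idea only, not the authors' implementation: each expert emits class logits, and a softmax over per-expert gate scores decides how much each expert contributes for a given input. The function names, the three-expert setup, the four classes, and the hand-picked gate scores are all assumptions made for illustration; in a trained model the gate scores would presumably come from a learned network, and the HTL token mechanism is not modelled here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def fuse_modality_experts(expert_logits, gate_scores):
    """Fuse per-expert class logits with input-dependent gate weights.

    expert_logits: array of shape (num_experts, num_classes)
    gate_scores:   array of shape (num_experts,), one raw score per expert
    Returns the fused logits (num_classes,) and the gate weights (num_experts,).
    """
    weights = softmax(gate_scores)      # non-negative, sums to 1
    fused = weights @ expert_logits     # weighted sum over experts
    return fused, weights


# Hypothetical example: three modality experts, four action classes.
rng = np.random.default_rng(0)
expert_logits = rng.normal(size=(3, 4))

# Assumed gate scores for one input: the first modality is judged most reliable.
gate_scores = np.array([2.0, 0.5, -1.0])

fused, weights = fuse_modality_experts(expert_logits, gate_scores)
predicted_class = int(np.argmax(fused))
```

With equal gate scores the fusion reduces to a plain average of the experts, which corresponds to the fixed-fusion baseline the abstract argues against; unequal scores let a more reliable modality dominate for a particular input.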
