Fugu-MT paper translation (abstract): SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

Paper overview: SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

arxiv url: http://arxiv.org/abs/2602.11656v1
Date: Thu, 12 Feb 2026 07:21:24 GMT
Status: Translation complete
Last updated in system: 2026-02-13 21:07:25.691119
Title: SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving
Title (reference translation): SToRM: Supervised token reduction for multi-modal LLMs toward efficient end-to-end autonomous driving
Authors: Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Ho Gun Park, Il Yong Chun,
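Abstract summary: This paper proposes the first Supervised Token Reduction framework for multi-modal large language models (MLLMs). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens and merges the context tokens into relevant anchors, reducing redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs.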
Reference score (independently computed attention score): 11.13872942531757
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources due to its reliance on an LLM and numerous visual tokens from sensor inputs, which are limited in autonomous vehicles. Many MLLM studies have explored reducing visual tokens, but often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.
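Abstract (reference translation): In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have made significant progress. For safe driving in unexpected scenarios, these systems may also rely on human interventions such as natural-language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach demands substantial computational resources, which are limited in autonomous vehicles, because it relies on an LLM and on numerous visual tokens from sensor inputs. Many MLLM studies have explored reducing visual tokens, but they often suffer end-task performance degradation compared with using all tokens. This paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens and merges the context tokens into relevant anchors, reducing redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.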
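The anchor-context merging step described in the abstract can be illustrated with a small sketch. The abstract does not say how anchors are chosen, how relevance is measured, or how tokens are combined, so everything below is an assumption made for illustration: anchors are taken to be the highest-scoring tokens, relevance is cosine similarity, and merging is a plain mean. The names `cosine` and `anchor_context_merge` are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def anchor_context_merge(tokens, scores, num_anchors):
    """Reduce `tokens` to `num_anchors` merged tokens.

    Assumptions (not from the paper): anchors are the top-scoring tokens,
    relevance is cosine similarity, and merging is an unweighted mean.
    """
    if not 0 < num_anchors <= len(tokens):
        raise ValueError("num_anchors must be between 1 and len(tokens)")

    # Pick the highest-scoring token indices as anchors, kept in input order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    anchor_ids = sorted(ranked[:num_anchors])
    anchor_set = set(anchor_ids)

    # Every group starts with its own anchor token.
    groups = {a: [tokens[a]] for a in anchor_ids}

    # Assign each remaining (context) token to its most similar anchor.
    for i, tok in enumerate(tokens):
        if i in anchor_set:
            continue
        best = max(anchor_ids, key=lambda a: cosine(tok, tokens[a]))
        groups[best].append(tok)

    # Merge each group into one token by averaging per dimension.
    merged = []
    for a in anchor_ids:
        group = groups[a]
        dim = len(group[0])
        merged.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return merged


# Four 2-D toy tokens with made-up importance scores, reduced to two tokens.
tokens = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
scores = [0.9, 0.1, 0.8, 0.2]
merged = anchor_context_merge(tokens, scores, num_anchors=2)
```

In this toy input, tokens 0 and 2 become anchors, and each low-scoring token is folded into the anchor it points toward, so the sequence shrinks from four tokens to two. In the framework described by the abstract the scores would come from the learned importance predictor; here they are hard-coded.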

Related papers