Fugu-MT Paper Translation (Abstract): ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models

Paper overview: ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models

arxiv url: http://arxiv.org/abs/2603.21237v1
Date: Sun, 22 Mar 2026 13:54:12 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.307944
Title: ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models
Title (reference translation): ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models
Authors: Haoyu Qiao, Hao Zhang, Shanwen Mao, Siyao Cheng, Jie Liu
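Abstract summary: ConsRoute is a lightweight, semantic-aware, adaptive routing framework for large language models. Experiments show that it achieves near-cloud performance (>=95%) while reducing end-to-end latency and inference cost by nearly 40%.
Reference score (independently computed attention score): 7.869130026927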
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm by dynamically routing queries to models of different capacities across tiers. In this paper, we propose ConsRoute, a lightweight, semantic-aware, and adaptive routing framework that significantly improves inference efficiency while minimizing impact on response quality. Unlike prior routing methods that rely on predicting coarse-grained output quality gaps, ConsRoute leverages a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states from the LLM prefilling stage as compact query representations, avoiding additional encoders or inference passes. Furthermore, these representations are clustered, and Bayesian optimization is employed to learn cluster-specific routing thresholds that dynamically balance quality, latency, and cost under heterogeneous query distributions. Extensive experiments demonstrate that ConsRoute achieves near-cloud performance (>=95%) while reducing end-to-end latency and inference cost by nearly 40%, consistently outperforming existing routing baselines in both response quality and system efficiency.
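Abstract (reference translation): Large language models (LLMs) offer impressive capabilities but incur substantial inference latency and cost, which hinders deployment in latency-sensitive, resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm that dynamically routes queries to models of different capacities across tiers. This paper proposes ConsRoute, a lightweight, semantic-aware, adaptive routing framework that substantially improves inference efficiency while minimizing the impact on response quality. Unlike earlier routing methods that predict coarse-grained output quality gaps, ConsRoute uses a reranker to directly assess the semantic consistency between responses generated by models at different tiers, producing fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states from the LLM prefilling stage as compact query representations, avoiding additional encoders or inference passes. These representations are then clustered, and Bayesian optimization is used to learn cluster-specific routing thresholds that dynamically balance quality, latency, and cost under heterogeneous query distributions. Extensive experiments show that ConsRoute achieves near-cloud performance (>=95%) while reducing end-to-end latency and inference cost by nearly 40%, consistently outperforming existing routing baselines in both response quality and system efficiency.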
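
Illustrative sketch (not from the paper): the abstract describes clustering query representations and applying cluster-specific routing thresholds. The toy Python below shows one way such a threshold rule could look. Everything in it is an assumption made for illustration: the 2-D vectors stand in for prefill hidden states, the cluster names, centroids, and threshold values are invented placeholders (the paper learns its thresholds with Bayesian optimization, which is not reproduced here), and the "cheapest tier that clears the threshold" rule is a guess at how a predicted consistency score might be used.

```python
import math

# Hypothetical cluster centroids over query representations (toy 2-D vectors).
CENTROIDS = {"chat": (0.0, 0.0), "math": (1.0, 1.0)}

# Hypothetical per-cluster thresholds. Placeholder values only; not learned.
THRESHOLDS = {"chat": 0.6, "math": 0.9}

# Lower tiers ordered cheapest first; "cloud" is the fallback.
LOWER_TIERS = ("device", "edge")


def nearest_cluster(rep):
    """Return the name of the centroid closest to the query representation."""
    return min(CENTROIDS, key=lambda name: math.dist(rep, CENTROIDS[name]))


def route(rep, predicted_consistency):
    """Pick the cheapest tier whose predicted consistency with the cloud
    response meets the threshold of the query's cluster.

    predicted_consistency maps each lower tier to a score in [0, 1]
    (assumed to come from some learned router; not modeled here).
    """
    threshold = THRESHOLDS[nearest_cluster(rep)]
    for tier in LOWER_TIERS:
        if predicted_consistency[tier] >= threshold:
            return tier
    return "cloud"
```

For example, `route((0.1, -0.1), {"device": 0.7, "edge": 0.8})` falls in the toy "chat" cluster, where the device score clears the 0.6 threshold, so the query stays on the device. The same scores near the "math" centroid would be sent to the cloud, because neither lower tier reaches 0.9.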
