Fugu-MT paper translation (summary): Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion

Paper summary: Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion

arxiv url: http://arxiv.org/abs/2510.24390v1
Date: Tue, 28 Oct 2025 13:05:23 GMT
Status: Translation complete
Last updated in system: 2025-10-29 15:35:37.186451
Title: Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion
Title (reference translation): Improving LLM Reasoning through Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion
Authors: Xianjun Gao, Jianchun Liu, Hongli Xu, Liusheng Huang
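Abstract summary: Integrating Large Language Models into real-time Web applications, such as AI-powered search and conversational agents, poses a fundamental Web infrastructure challenge. We therefore propose Orion, a novel and efficient reasoning framework that enables dependency-aware query decomposition and logic-parallel content expansion. Experiments on diverse benchmarks show that Orion not only delivers up to 4.33x higher token generation speed and 3.42x lower answer latency but also improves reasoning quality by up to 18.75%.
Reference score (independently computed attention score): 29.45427036598799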
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The integration of Large Language Models (LLMs) into real-time Web applications, such as AI-powered search and conversational agents, presents a fundamental Web infrastructure challenge: reconciling the demand for high-quality, complex reasoning with the stringent low-latency and high-throughput requirements of interactive services. Current LLM reasoning, hindered by computationally inefficient sequential generation and rigid reasoning strategies, creates a critical bottleneck for Web services. Existing approaches typically optimize LLM reasoning for either efficiency or quality but struggle to achieve both, and thus fail to meet the dual requirements of modern Web platforms. To overcome these limitations, we propose Orion, a novel and efficient reasoning framework that enables dependency-aware query decomposition and logic-parallel content expansion. Concretely, Orion decomposes a single query reasoning process into two synergistic phases: (1) key point generation, which distills logically structured key points through retrieval-augmented few-shot prompting, and (2) content parallel expansion, which concurrently elaborates on these points based on a dependency graph to ensure logical consistency. Furthermore, Orion introduces a pipeline scheduling mechanism that exploits the complementary computational characteristics of the two phases (generation puts pressure on GPU compute, while expansion stresses GPU memory) across multiple queries, enabling cross-query parallelism and dramatically improving reasoning performance (i.e., efficiency and quality). Experiments on diverse benchmarks show that Orion not only delivers up to 4.33x higher token generation speed and 3.42x lower answer latency over the baselines but also improves reasoning quality by up to 18.75% through explicitly modeling inter-point dependencies.
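Abstract (reference translation): Integrating large language models (LLMs) into real-time Web applications such as AI-powered search and conversational agents poses a fundamental Web infrastructure challenge: reconciling the demand for high-quality, complex reasoning with the strict low-latency and high-throughput requirements of interactive services. Current LLM reasoning is held back by computationally inefficient sequential generation and rigid reasoning strategies, which makes it a critical bottleneck for Web services. Existing approaches usually optimize LLM reasoning for either efficiency or quality but struggle to achieve both, so they cannot meet the dual requirements of modern Web platforms. To overcome these limitations, the authors propose Orion, a novel and efficient reasoning framework that enables dependency-aware query decomposition and logic-parallel content expansion. Specifically, Orion splits the reasoning process for a single query into two synergistic phases: (1) key point generation, which distills logically structured key points through retrieval-augmented few-shot prompting, and (2) content parallel expansion, which elaborates on these points concurrently according to a dependency graph to ensure logical consistency. In addition, Orion introduces a pipeline scheduling mechanism that exploits the complementary computational characteristics of the two phases across multiple queries (generation puts pressure on GPU compute, while expansion stresses GPU memory), enabling cross-query parallelism and dramatically improving reasoning performance (i.e., efficiency and quality). Experiments on diverse benchmarks show that Orion not only delivers up to 4.33x higher token generation speed and 3.42x lower answer latency than the baselines but also improves reasoning quality by up to 18.75% by explicitly modeling inter-point dependencies.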
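
Illustrative sketch (not from the paper): the abstract describes expanding key points concurrently while respecting a dependency graph. The Python below shows one simple way to do that, assuming a wave-by-wave schedule in which every point whose prerequisites are finished is expanded in parallel. The names `expand_point` and `expand_all` are hypothetical, and `expand_point` is a placeholder for a model call; Orion's actual scheduler and prompts are not given on this page and may differ.

```python
import asyncio


async def expand_point(point: str, context: dict[str, str]) -> str:
    """Hypothetical stand-in for an LLM call that elaborates one key point."""
    await asyncio.sleep(0)  # yield control, as a real model request would
    used = ", ".join(sorted(context)) or "nothing"
    return f"{point} (builds on: {used})"


async def expand_all(deps: dict[str, list[str]]) -> dict[str, str]:
    """Expand every key point, running independent points concurrently.

    deps maps each key point to the list of points it depends on.
    """
    results: dict[str, str] = {}
    remaining = {point: set(pre) for point, pre in deps.items()}
    while remaining:
        # Points whose prerequisites are all expanded form the next wave.
        ready = sorted(p for p, pre in remaining.items() if pre.issubset(results))
        if not ready:
            raise ValueError("dependency cycle or unknown prerequisite")
        texts = await asyncio.gather(
            *(expand_point(p, {d: results[d] for d in deps[p]}) for p in ready)
        )
        results.update(zip(ready, texts))
        for p in ready:
            del remaining[p]
    return results
```

For example, `asyncio.run(expand_all({"define": [], "measure": [], "compare": ["define", "measure"]}))` expands `define` and `measure` together in the first wave and `compare` in the second, once both of its prerequisites are available. A wave schedule is the simplest choice; a finer-grained scheduler could start each point as soon as its own prerequisites finish.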

Related papers