Fugu-MT Paper Translation (Abstract): NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

Paper summary: NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

arxiv url: http://arxiv.org/abs/2604.18105v1
Date: Mon, 20 Apr 2026 11:21:06 GMT
Status: translation complete
Last updated in system: 2026-04-21 21:52:52.830893
Title: NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
Title (reference translation): NIM4-ASR: Toward Efficient, Robust, and Customizable Real-Time LLM-Based ASR
Authors: Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, Jie Wu
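Abstract summary: We present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. We reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions.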
Reference score (independently computed attention score): 22.527587147157462
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.
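Abstract (reference translation): In recent years, integrating large language models (LLMs) into automatic speech recognition (ASR) has become mainstream. Existing LLM-based ASR models show impressive performance on public benchmarks, but their training is predominantly data-driven, and key practical challenges remain insufficiently addressed, in particular limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these challenges, we propose NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Based on a principled delineation of the functional roles of the encoder and the LLM, we redesign the multi-stage training paradigm so that each module is aligned with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency, introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift, and design an ASR-specialized reinforcement learning stage to enhance recognition quality and robustness. We also incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with only 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks, especially in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.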

Related papers