Fugu-MT Paper Translation (Abstract): CORTIS: Text-Only Adaptation of Spoken Language Models for Task-Oriented Voice Agents

Paper overview: CORTIS: Text-Only Adaptation of Spoken Language Models for Task-Oriented Voice Agents

arxiv url: http://arxiv.org/abs/2606.21453v1
Date: Fri, 19 Jun 2026 14:11:10 GMT
Status: Translation complete
Last updated in system: 2026-06-25 13:22:10.916877
Title: CORTIS: Text-Only Adaptation of Spoken Language Models for Task-Oriented Voice Agents
Authors: Youngwon Choi, Hyeonyu Kim, Taeyoun Kwon, Donghyuk Jung, Myeongkyun Cho,
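Abstract summary: We present CORTIS, a text-only adaptation framework for task-oriented voice agents. CORTIS fine-tunes SLMs using text-form task supervision, enabling speech-based structured output generation. Results show that CORTIS performs competitively with matched cascades and offers clearer advantages under acoustic degradation.
Reference score (independently computed attention score): 1.681860865621691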
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Task-oriented voice agents need to map spoken user requests to structured outputs such as semantic frames, executable actions, and function calls. A common approach is to cascade ASR with a text-based LLM, but transcription errors can propagate to downstream structured output generation, especially under noisy conditions. Spoken language models (SLMs) offer a direct speech-based alternative, yet adapting them to new tasks typically requires paired speech-target annotations. Motivated by this gap, we present CORTIS, a text-only adaptation framework for task-oriented voice agents. CORTIS fine-tunes SLMs using text-form task supervision, enabling speech-based structured output generation at inference time without task-specific speech-target annotations during adaptation. We evaluate CORTIS on two Qwen2.5-Omni backbones and three task-oriented speech datasets, including an in-house product dataset, and compare it with matched ASR-LLM cascades trained with the same text-form task supervision. Results show that CORTIS performs competitively with matched cascades and offers clearer advantages under acoustic degradation, particularly in preserving high-level task semantics. These findings suggest that text-only fine-tuning of SLMs can serve as a practical adaptation strategy for voice agents when paired speech-target data are costly to collect.
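The abstract's point about transcription errors propagating through a cascade, while high-level task semantics may still survive, can be illustrated with a toy sketch. Everything here is a hypothetical assumption made for illustration: the frame schema (`intent`, `slots`), the keyword-based `toy_parser` standing in for the text-based LLM stage, and the example utterances. It does not reproduce CORTIS, the paper's datasets, or its metrics.

```python
from typing import Dict

Frame = Dict[str, object]


def toy_parser(text: str) -> Frame:
    """Toy stand-in for the text-based LLM stage of a cascade (keyword rules only)."""
    words = text.lower().split()
    intent = "set_alarm" if "alarm" in words else "unknown"
    # Treat everything after "for" as the time slot, if "for" is present.
    time = " ".join(words[words.index("for") + 1:]) if "for" in words else ""
    return {"intent": intent, "slots": {"time": time}}


def intent_match(pred: Frame, gold: Frame) -> bool:
    """High-level check: was the task (intent) recovered?"""
    return pred["intent"] == gold["intent"]


def exact_match(pred: Frame, gold: Frame) -> bool:
    """Strict check: is the whole structured output identical?"""
    return pred == gold


# Hypothetical gold frame for the spoken request "set an alarm for seven am".
gold = {"intent": "set_alarm", "slots": {"time": "seven am"}}

# Clean transcript versus a hypothetical ASR misrecognition ("seven" -> "eleven").
clean = toy_parser("set an alarm for seven am")
noisy = toy_parser("set an alarm for eleven am")

print("clean:", intent_match(clean, gold), exact_match(clean, gold))
print("noisy:", intent_match(noisy, gold), exact_match(noisy, gold))
```

In this toy case the misrecognized transcript keeps the intent but corrupts the slot value, so it passes the high-level check and fails the strict one. That is the kind of gap between task-level and exact structured-output accuracy that the abstract alludes to.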
