Fugu-MT Paper Translation (Abstract): Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

Paper summary: Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

arxiv url: http://arxiv.org/abs/2509.23938v1
Date: Sun, 28 Sep 2025 15:29:44 GMT
Status: Translation complete
System update date: 2025-09-30 22:32:19.544992
Title: Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems
Authors: Guojian Li, Chengyou Wang, Hongfei Xue, Shuiyuan Wang, Dehui Gao, Zihan Zhang, Yuke Lin, Wenjie Li, Longshuai Xiao, Zhonghua Fu, Lei Xie
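Abstract summary: Easy Turn is an open-source, modular turn-taking detection model. It integrates acoustic and linguistic bimodal information to predict dialogue turn states. The data and model will be made publicly available on GitHub.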
Reference score (independently computed attention score): 24.67635563417753
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Full-duplex interaction is crucial for natural human-machine communication, yet remains challenging as it requires robust turn-taking detection to decide when the system should speak, listen, or remain silent. Some existing solutions rely on dedicated turn-taking models, most of which are not open-sourced; the few available ones are limited by their large parameter size or by supporting only a single modality, such as acoustic or linguistic. Alternatively, some approaches finetune LLM backbones to enable full-duplex capability, but this requires large amounts of full-duplex data, which remain scarce in open-source form. To address these issues, we propose Easy Turn, an open-source, modular turn-taking detection model that integrates acoustic and linguistic bimodal information to predict four dialogue turn states: complete, incomplete, backchannel, and wait. The model is accompanied by the release of the Easy Turn trainset, a 1,145-hour speech dataset designed for training turn-taking detection models. Compared to existing open-source models like TEN Turn Detection and Smart Turn V2, our model achieves state-of-the-art turn-taking detection accuracy on our open-source Easy Turn testset. The data and model will be made publicly available on GitHub.

