Fugu-MT paper translation (abstract): Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech

Paper overview: Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech

arxiv url: http://arxiv.org/abs/2509.14627v1
Date: Thu, 18 Sep 2025 05:14:10 GMT
Status: translation complete
Last updated in system: 2025-09-19 17:26:53.076011
Title: Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech
Title (reference translation): Towards a Multimodal Conversational Agent through Speech Generation
Authors: Taesoo Kim, Yongsik Jo, Hyunmin Song, Taehwan Kim
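Abstract summary: This work proposes a human-like agent that generates speech responses based on conversation mood and responsive style information. To enable agents to generate natural speech, the authors build a novel MultiSensory Conversation dataset focused on speech. Experimental results demonstrate the effectiveness of using both visual and audio modalities in conversation to generate engaging speech.
Reference score (attention score computed independently by this site): 10.576716279533404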
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone not fully captured by text alone. While multimodal LLMs focus on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and responsive style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech to enable agents to generate natural speech. We then propose a multimodal LLM-based model for generating text responses and voice descriptions, which are used to generate speech covering paralinguistic information. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC
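Abstract (reference translation): Human conversation includes language, speech, and visual cues, and each medium provides complementary information. For example, speech conveys a vibe or tone that text alone does not fully capture. Multimodal LLMs focus on generating text responses from diverse inputs, but little attention has been paid to generating natural and engaging speech. This work proposes a human-like agent that generates speech responses based on conversation mood and responsive style information. To achieve this, the authors build a novel MultiSensory Conversation dataset focused on speech so that agents can generate natural speech. They then propose a multimodal LLM-based model that generates text responses and voice descriptions. Experimental results demonstrate the effectiveness of using both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC.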

Related papers