Fugu-MT Paper Translation (Abstract): EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

Paper Overview: EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

arxiv url: http://arxiv.org/abs/2604.26417v1
Date: Wed, 29 Apr 2026 08:27:39 GMT
Status: Translation complete
Last updated in system: 2026-04-30 15:59:36.313721
Title: EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses
Title (reference translation): EmoTransCap: A Dataset and Pipeline for Speech Captioning That Accounts for Emotion Transitions in Discourse
Authors: Shuhao Xu, Yifan Hu, Jingjing Wu, Zhihao Du, Zheng Lian, Rui Liu
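Abstract summary: This work proposes Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. The accompanying dataset is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. The work also introduces a controllable, transition-aware emotional speech synthesis system at the discourse level.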
Reference score (independently computed attention score): 25.739767606548313
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.
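Abstract (reference translation): Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. Recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, but existing systems are limited to static, single-emotion characterization within isolated sentences and ignore dynamic emotional transitions at the discourse level. To address this gap, the authors propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To build a dataset rich in emotion transitions while allowing scalable expansion, they design an automated pipeline for dataset creation. The result is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, they incorporate acoustic attributes and temporal cues from discourse-level speech. Their Multi-Task Emotion Transition Recognition (MTETR) model jointly performs emotion transition detection and diarization. Using the semantic analysis capabilities of LLMs, they produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporally dynamic, fine-grained emotion understanding. The authors also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, which enhances anthropomorphic emotional expressiveness and supports emotionally intelligent conversational agents.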


Related Papers