Fugu-MT Paper Translation (Abstract): Evaluating Semantic Fragility in Text-to-Audio Generation Systems Under Controlled Prompt Perturbations

Paper abstract: Evaluating Semantic Fragility in Text-to-Audio Generation Systems Under Controlled Prompt Perturbations

arxiv url: http://arxiv.org/abs/2603.13824v1
Date: Sat, 14 Mar 2026 08:12:40 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.427495
Title: Evaluating Semantic Fragility in Text-to-Audio Generation Systems Under Controlled Prompt Perturbations
Title (reference translation): Evaluating Semantic Fragility in Text-to-Audio Generation Systems Under Controlled Prompt Perturbations
Authors: Jiahui Wu,
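Abstract summary: Small linguistic changes may lead to substantial variation in generated audio, raising concerns about reliability in practical use. The study evaluates the semantic instability of text-to-audio systems under controlled prompt perturbations.
Reference score (independently computed attention metric): 2.2870073664564115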
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in text-to-audio generation enable models to translate natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead to substantial variation in generated audio, raising concerns about reliability in practical use. In this study, we evaluate the semantic fragility of text-to-audio systems under controlled prompt perturbations. We selected MusicGen-small, MusicGen-large, and Stable Audio 2.5 as representative models, and we evaluated them under Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). The proposed dataset contains 75 prompt groups designed to preserve semantic intent while introducing localized linguistic variation. Generated outputs are compared through complementary spectral, temporal, and semantic similarity measures, enabling robustness analysis across multiple representational levels. Experimental results show that larger models achieve improved semantic consistency, with MusicGen-large reaching cosine similarities of 0.77 under MLS and 0.82 under IS. However, acoustic and temporal analyses reveal persistent divergence across all models, even when embedding similarity remains high. These findings indicate that fragility arises primarily during semantic-to-acoustic realization rather than multi-modal embedding alignment. Our study introduces a controlled framework for evaluating robustness in text-to-audio generation and highlights the need for multi-level stability assessment in generative audio systems.
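Abstract (reference translation): Recent advances in text-to-audio generation have made it possible to convert natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead to substantial variation in generated audio, raising concerns about reliability in practical use. This study evaluates the semantic fragility of text-to-audio systems under controlled prompt perturbations. MusicGen-small, MusicGen-large, and Stable Audio 2.5 were selected as representative models and evaluated under Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). The proposed dataset contains 75 prompt groups designed to preserve semantic intent while introducing localized linguistic variation. Generated outputs are compared through complementary spectral, temporal, and semantic similarity measures, enabling robustness analysis across multiple representational levels. Experimental results show that larger models achieve improved semantic consistency, with MusicGen-large reaching cosine similarities of 0.77 under MLS and 0.82 under IS. However, acoustic and temporal analyses reveal persistent divergence across all models, even when embedding similarity remains high. These results suggest that fragility arises primarily during semantic-to-acoustic realization rather than in multi-modal embedding alignment. The study introduces a controlled framework for evaluating robustness in text-to-audio generation and highlights the need for multi-level stability assessment in generative audio systems.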
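The abstract reports semantic consistency as cosine similarity between embeddings of outputs generated from the prompts in one perturbation group. As a rough illustration only, the sketch below averages pairwise cosine similarity within a single group. The abstract does not say which embedding model is used or how pairs are aggregated, so the mean-over-pairs choice, the function names `cosine_similarity` and `group_consistency`, and the toy vectors are all assumptions made for this example, not the authors' method.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def group_consistency(embeddings):
    """Mean pairwise cosine similarity within one prompt group.

    Assumption: each item is the embedding of the audio generated from
    one prompt variant; averaging over all pairs is a hypothetical
    aggregation chosen for illustration.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real audio embeddings.
    group = [
        [1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
    ]
    print(group_consistency(group))
```

A score near 1 would mean the outputs for a prompt group sit close together in embedding space. As the abstract notes, a high embedding similarity does not rule out divergence at the spectral or temporal level, so a measure like this would cover only one of the representational levels.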

Related papers