Fugu-MT Paper Translation (Abstract): Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models

Paper overview: Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models

arxiv url: http://arxiv.org/abs/2509.21749v1
Date: Fri, 26 Sep 2025 01:27:59 GMT
Status: Translation complete
Last updated in system: 2025-09-29 20:57:54.112261
Title: Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models
Title (reference translation): Thinking with Sound: Audio Chain-of-Thought That Enables Multimodal Reasoning in Large Audio-Language Models
Authors: Zhen Xiong, Yujun Cai, Zhecheng Li, Junsong Yuan, Yiwei Wang
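Abstract summary: This paper proposes Thinking-with-Sound (TwS), a framework that equips Large Audio-Language Models (LALMs) with Audio CoT. TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. Experiments show that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than 50% compared to clean audio.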
Reference score (independently computed attention score): 49.097347801692166
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent Large Audio-Language Models (LALMs) have shown strong performance on various audio understanding tasks such as speech translation and Audio Q&A. However, they exhibit significant limitations on challenging audio reasoning tasks in complex acoustic scenarios. These situations would greatly benefit from the use of acoustic tools like noise suppression, source separation, and precise temporal alignment, but current LALMs lack access to such tools. To address this limitation, we introduce Thinking-with-Sound (TwS), a framework that equips LALMs with Audio CoT by combining linguistic reasoning with on-the-fly audio-domain analysis. Unlike existing approaches that treat audio as static input, TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. To evaluate this approach, we construct MELD-Hard1k, a new robustness benchmark created by introducing various acoustic perturbations. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than 50% compared to clean audio. TwS achieves substantial improvements in robustness, demonstrating both effectiveness and scalability: small models gain 24.73% absolute accuracy, with improvements scaling consistently up to 36.61% for larger models. Our findings demonstrate that Audio CoT can significantly enhance robustness without retraining, opening new directions for developing more robust audio understanding systems.
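Abstract (reference translation): Recent Large Audio-Language Models (LALMs) perform well on a range of audio understanding tasks, including speech translation and Audio Q&A. On difficult audio reasoning tasks in complex acoustic scenarios, however, they remain severely limited. Such cases would benefit greatly from acoustic tools such as noise suppression, source separation, and precise temporal alignment, but current LALMs have no access to these tools. To address this limitation, the authors introduce Thinking-with-Sound (TwS), a framework that equips LALMs with Audio CoT by combining linguistic reasoning with on-the-fly audio-domain analysis. Unlike existing approaches that treat audio as static input, TwS lets models actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. To evaluate the approach, the authors construct MELD-Hard1k, a new robustness benchmark built by introducing various acoustic perturbations. Experiments show that state-of-the-art LALMs degrade dramatically on MELD-Hard1k, with accuracy falling by more than 50% relative to clean audio. TwS substantially improves robustness and demonstrates both effectiveness and scalability: small models gain 24.73% absolute accuracy, and the improvement scales consistently up to 36.61% for larger models. These results indicate that Audio CoT can significantly improve robustness without retraining, opening new directions for developing more robust audio understanding systems.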
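The abstract says MELD-Hard1k is built by applying acoustic perturbations to clean audio, but it does not say which ones. As an illustration only, the sketch below shows one common perturbation of this kind: mixing white Gaussian noise into a signal at a chosen signal-to-noise ratio (SNR). The function names `add_noise_at_snr` and `measured_snr_db`, the 5 dB setting, and the synthetic 440 Hz tone are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return a copy of `signal` with white Gaussian noise mixed in at `snr_db` dB.

    Hypothetical helper for illustration; not the benchmark's actual procedure.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose the noise power so that 10*log10(signal_power / noise_power) == snr_db.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Compute the SNR in dB of `noisy` relative to `clean`."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Assumed test signal: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noisy = add_noise_at_snr(tone, snr_db=5.0)
```

Scaling the noise to an exact target power, rather than drawing it with a fixed variance, keeps the perturbation strength reproducible across clips of different loudness, which matters when accuracy on perturbed audio is compared against accuracy on clean audio.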

Related papers