Fugu-MT Paper Translation (Abstract): AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

Paper abstract: AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

arxiv url: http://arxiv.org/abs/2606.14591v1
Date: Fri, 12 Jun 2026 16:09:49 GMT
Status: Translation complete
Last updated in system: 2026-06-15 16:00:42.977826
Title: AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models
Authors: Hui Geng, Yi Su, Han Yin, Tianjiao Wan, Qisheng Xu, Jiaxin Chen, Zijian Gao, Hengzhu Liu, Xie Chen, Kele Xu
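Abstract summary: Large Audio-Language Models (LALMs) perform strongly on a wide range of audio understanding tasks but still struggle with complex audio reasoning. Existing audio-language datasets often contain substantial redundancy, with many samples that are highly similar in acoustic content. The paper proposes a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs.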
Reference score (independently calculated attention score): 42.62457130960257
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quality and diversity of training data. However, existing audio-language datasets often contain substantial redundancy, where many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. Based on these unified annotations, we leverage Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. Based on this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.
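The deduplication step and the per-sample layout described in the abstract can be illustrated with a minimal sketch. The abstract does not state which audio embedding model, similarity measure, or threshold the authors use, so the cosine similarity, the greedy keep-or-drop strategy, the `threshold` value, and all field names below are assumptions chosen for illustration, not the paper's method.

```python
from dataclasses import dataclass

import numpy as np


def deduplicate(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedy near-duplicate filter over precomputed clip embeddings.

    A clip is kept only if its cosine similarity to every previously kept
    clip is below `threshold` (a hypothetical value, not from the paper).
    Returns the indices of the kept clips in their original order.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    kept: list[int] = []
    for i in range(len(unit)):
        # Compare the candidate against all clips kept so far.
        if not kept or np.max(unit[kept] @ unit[i]) < threshold:
            kept.append(i)
    return kept


@dataclass
class AudioDERSample:
    """One record with the five components listed in the abstract.

    Field names are hypothetical; the released dataset may use others.
    """
    audio_path: str                      # audio clip
    question: str                        # multiple-choice question
    choices: tuple[str, str, str, str]   # four answer candidates
    caption: str                         # audio caption
    rationale: str                       # chain-of-thought rationale


if __name__ == "__main__":
    # Toy 2-D "embeddings": clips 0 and 1 are near-duplicates, clip 2 differs.
    toy = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
    print(deduplicate(toy, threshold=0.9))
```

The greedy pass is quadratic in the number of kept clips, which is fine for a sketch; a corpus of the size described here would more plausibly rely on an approximate nearest-neighbor index.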

Related papers