Fugu-MT Paper Translation (Abstract): Few-shot Acoustic Synthesis with Multimodal Flow Matching

Paper summary: Few-shot Acoustic Synthesis with Multimodal Flow Matching

arxiv url: http://arxiv.org/abs/2603.19176v1
Date: Thu, 19 Mar 2026 17:32:06 GMT
Status: Translation complete
Last updated in system: 2026-03-20 17:19:06.300667
Title: Few-shot Acoustic Synthesis with Multimodal Flow Matching
Title (reference translation): Few-shot Acoustic Synthesis with Multimodal Flow Matching
Authors: Amandine Brunetto,
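Abstract summary: This paper introduces flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis. With a single shot, FLAC outperforms state-of-the-art eight-shot baselines on two datasets. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
Reference score (independently computed attention score): 1.0742675209112622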
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
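Abstract (reference translation): Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering, but they remain scene-specific and require dense acoustic measurements and costly training for each environment. Few-shot approaches improve scalability across rooms, yet they still depend on multiple recordings and, being deterministic, cannot capture the inherent uncertainty of scene acoustics under sparse context. This paper introduces flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC uses a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. With a single shot, FLAC outperforms state-of-the-art eight-shot baselines on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, the authors further introduce AGREE, a joint acoustic-geometry embedding that enables geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.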
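The abstract names a flow-matching objective but gives no details of it, so the following is only a minimal sketch of the generic technique, not the FLAC implementation. It assumes the common linear-interpolation path between Gaussian noise and data, and all names here (`flow_matching_batch`, `flow_matching_loss`, `euler_sample`, `velocity_fn`, `cond`) are hypothetical. `velocity_fn` stands in for the conditioned diffusion transformer, and `cond` for the spatial, geometric, and acoustic cues.

```python
import numpy as np


def flow_matching_batch(x1, rng):
    """Build one training batch for linear-path conditional flow matching.

    x1 has shape (batch, dim) and holds data samples (stand-ins for some
    RIR representation; the actual representation is not given in the abstract).
    Returns the interpolated points, the sampled times, and the target velocities.
    """
    x0 = rng.standard_normal(x1.shape)       # noise endpoint of each path
    t = rng.uniform(size=(x1.shape[0], 1))   # one time in [0, 1) per sample
    x_t = (1.0 - t) * x0 + t * x1            # point on the straight noise-to-data path
    target = x1 - x0                         # velocity of that path (constant in t)
    return x_t, t, target


def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Mean squared error between predicted and target velocities."""
    x_t, t, target = flow_matching_batch(x1, rng)
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean((pred - target) ** 2))


def euler_sample(velocity_fn, cond, dim, steps, rng):
    """Draw samples by integrating dx/dt = v(x, t, cond) from t=0 to t=1."""
    x = rng.standard_normal((cond.shape[0], dim))  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_fn(x, t, cond)       # forward Euler step
    return x
```

Because sampling starts from random noise, repeated calls to `euler_sample` with the same `cond` give different outputs, which is the sense in which such a model is probabilistic rather than deterministic.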

Related papers