Fugu-MT Paper Translation (Abstract): SeeingSounds: Learning Audio-to-Visual Alignment via Text

Paper overview: SeeingSounds: Learning Audio-to-Visual Alignment via Text

arxiv url: http://arxiv.org/abs/2510.11738v1
Date: Fri, 10 Oct 2025 18:42:50 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.020366
Title: SeeingSounds: Learning Audio-to-Visual Alignment via Text
Title (reference translation): SeeingSounds: Learning Audio-to-Visual Alignment via Text
Authors: Simone Carnemolla, Matteo Pennisi, Chiara Russo, Simone Palazzo, Daniela Giordano, Concetto Spampinato
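Abstract summary: This paper introduces SeeingSounds, a framework for image generation that leverages the interplay between audio, language, and vision. Audio is projected into a semantic language space via a frozen language encoder and contextually grounded in the visual domain using a vision-language model. The approach is inspired by cognitive neuroscience and reflects the natural cross-modal associations observed in human perception.
Reference score (independently computed attention level): 15.011814561603964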
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce SeeingSounds, a lightweight and modular framework for audio-to-image generation that leverages the interplay between audio, language, and vision, without requiring any paired audio-visual data or training on visual generative models. Rather than treating audio as a substitute for text or relying solely on audio-to-text mappings, our method performs dual alignment: audio is projected into a semantic language space via a frozen language encoder and contextually grounded in the visual domain using a vision-language model. This approach, inspired by cognitive neuroscience, reflects the natural cross-modal associations observed in human perception. The model operates on frozen diffusion backbones and trains only lightweight adapters, enabling efficient and scalable learning. Moreover, it supports fine-grained and interpretable control through procedural text prompt generation, where audio transformations (e.g., volume or pitch shifts) translate into descriptive prompts (e.g., "a distant thunder") that guide visual outputs. Extensive experiments across standard benchmarks confirm that SeeingSounds outperforms existing methods in both zero-shot and supervised settings, establishing a new state of the art in controllable audio-to-visual generation.
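Abstract (reference translation): We introduce SeeingSounds, a lightweight and modular audio-to-image generation framework that leverages the interplay between audio, language, and vision, without requiring paired audio-visual data or training of visual generative models. Rather than treating audio as a substitute for text or relying solely on audio-to-text mappings, the method projects audio into a semantic language space via a frozen language encoder and contextually grounds it in the visual domain using a vision-language model. This approach is inspired by cognitive neuroscience and reflects the natural cross-modal associations observed in human perception. The model operates on frozen diffusion backbones and trains only lightweight adapters, enabling efficient and scalable learning. In addition, it supports fine-grained and interpretable control through procedural text prompt generation, in which audio transformations (e.g., volume or pitch shifts) are converted into descriptive prompts (e.g., "distant thunder") that guide the visual output. Extensive experiments on standard benchmarks confirm that SeeingSounds outperforms existing methods in both zero-shot and supervised settings, establishing a new state of the art in controllable audio-to-visual generation.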
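
The abstract mentions procedural text prompt generation, where an audio transformation such as a volume or pitch shift is turned into a descriptive prompt. The following is only a minimal sketch of that idea, not the authors' implementation: the `AudioTransform` class, the `describe` function, the decibel and semitone thresholds, and the choice of adjectives are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioTransform:
    """Hypothetical description of an edit applied to an audio clip."""
    gain_db: float = 0.0          # volume change in decibels (assumed unit)
    pitch_semitones: float = 0.0  # pitch shift in semitones (assumed unit)


def describe(label: str, t: AudioTransform) -> str:
    """Build a descriptive text prompt from a sound label and a transform.

    Thresholds and adjectives are illustrative assumptions, not values
    taken from the paper.
    """
    modifiers = []
    # Quieter audio is described as farther away, louder audio as closer.
    if t.gain_db <= -6.0:
        modifiers.append("distant")
    elif t.gain_db >= 6.0:
        modifiers.append("nearby")
    # Lower pitch is described as a larger source, higher pitch as smaller.
    if t.pitch_semitones <= -2.0:
        modifiers.append("large")
    elif t.pitch_semitones >= 2.0:
        modifiers.append("small")
    words = modifiers + [label]
    # Pick the indefinite article from the first letter of the first word.
    article = "an" if words[0][0].lower() in "aeiou" else "a"
    return article + " " + " ".join(words)


if __name__ == "__main__":
    print(describe("thunder", AudioTransform(gain_db=-12.0)))
```

Under these assumed thresholds, a clip labeled "thunder" with its volume lowered by 12 dB maps to the prompt "a distant thunder", matching the example given in the abstract. In the framework described there, such a prompt would then guide the visual output; how that conditioning is wired up is not specified in this abstract and is not modeled here.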
