Fugu-MT paper translation (abstract): From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation

Paper overview: From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation

arxiv url: http://arxiv.org/abs/2409.19132v1
Date: Fri, 27 Sep 2024 20:26:34 GMT
Status: translation complete
Last updated in system: 2024-11-06 04:21:02.511528
Title: From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
Title (reference translation): From Vision to Audio: A Unified Model for Video Representation and Generation
Authors: Kun Su, Xiulong Liu, Eli Shlizerman
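Abstract summary: This paper introduces Vision to Audio and Beyond (VAB), a new framework that bridges the gap between audio-visual representation learning and vision-to-audio generation. VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively. The experiments show VAB's efficiency in generating high-quality audio from video and its ability to learn semantic audio-visual features.
Reference score (independently calculated attention score): 17.95017332858846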
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video encompasses both visual and auditory data, creating a perceptually rich experience where these two modalities complement each other. As such, videos are a valuable type of media for the investigation of the interplay between audio and visual elements. Previous studies of audio-visual modalities primarily focused on either audio-visual representation learning or generative modeling of a modality conditioned on the other, creating a disconnect between these two branches. A unified framework that learns representation and generates modalities has not been developed yet. In this work, we introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation. The key approach of VAB is that rather than working with raw video frames and audio data, VAB performs representation learning and generative modeling within latent spaces. In particular, VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively. It then performs the pre-training task of visual-conditioned masked audio token prediction. This training strategy enables the model to engage in contextual learning and simultaneous video-to-audio generation. After the pre-training phase, VAB employs the iterative-decoding approach to rapidly generate audio tokens conditioned on visual features. Since VAB is a unified model, its backbone can be fine-tuned for various audio-visual downstream tasks. Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features, leading to competitive results in audio-visual retrieval and classification.
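Abstract (reference translation): Video contains both visual and auditory data, producing a perceptually rich experience in which the two modalities complement each other. Video is therefore a valuable medium for investigating the interaction between audio and visual elements. Earlier studies of audio-visual modalities focused mainly on either audio-visual representation learning or generative modeling of one modality conditioned on the other, which left the two branches disconnected. A unified framework that both learns representations and generates modalities has not yet been developed. This work introduces Vision to Audio and Beyond (VAB), a new framework that bridges the gap between audio-visual representation learning and vision-to-audio generation. VAB's main approach is to perform representation learning and generative modeling within latent spaces rather than on raw video frames and audio data. Specifically, VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively. It then carries out the pre-training task of visual-conditioned masked audio token prediction. This training strategy enables contextual learning and video-to-audio generation at the same time. After the pre-training phase, VAB adopts an iterative-decoding approach to rapidly generate audio tokens conditioned on visual features. Because VAB is a unified model, its backbone can be fine-tuned for a variety of audio-visual downstream tasks. The experiments show VAB's efficiency in synthesizing high-quality audio from video and its ability to acquire semantic audio-visual features, leading to competitive results in audio-visual retrieval and classification.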
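The abstract mentions iterative decoding of audio tokens conditioned on visual features but does not give the procedure. The following is a minimal sketch of one common form of the idea, not the authors' implementation: all slots start masked, and each round commits the most confident predictions while the rest stay masked. The cosine schedule, the `MASK` id, and the functions `toy_predictor` and `iterative_decode` are assumptions made for illustration; the predictor is a random stand-in for a trained model.

```python
import math
import random

MASK = -1  # hypothetical placeholder id for a masked audio token


def toy_predictor(tokens, visual_features, vocab_size, rng):
    """Stand-in for a trained model: one (token, confidence) guess per position.

    A real model would attend over the visual features and the already
    committed tokens; here the guess is random and only the interface matters.
    """
    guesses = []
    for pos, _ in enumerate(tokens):
        # Mix in a visual feature so the conditioning input is actually used.
        bias = int(abs(visual_features[pos % len(visual_features)]) * 10)
        token = (rng.randrange(vocab_size) + bias) % vocab_size
        guesses.append((token, rng.random()))
    return guesses


def iterative_decode(visual_features, length, vocab_size, steps=4, seed=0):
    """Fill `length` masked slots in at most `steps` rounds, most confident first."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = toy_predictor(tokens, visual_features, vocab_size, rng)
        # Assumed cosine schedule: number of slots left masked after this round.
        keep_masked = math.floor(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the highest-confidence guesses among the masked positions.
        ranked = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


if __name__ == "__main__":
    print(iterative_decode([0.1, 0.5, 0.9], length=16, vocab_size=1024))
```

Because many tokens are committed per round, the number of model calls is `steps` rather than `length`, which is the usual reason this style of decoding is faster than generating one token at a time.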

Related papers