Fugu-MT Paper Translation (Abstract): Sound Sparks Motion: Audio and Text Tuning for Video Editing

Paper overview: Sound Sparks Motion: Audio and Text Tuning for Video Editing

arxiv url: http://arxiv.org/abs/2605.15307v1
Date: Thu, 14 May 2026 18:20:50 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:26.054608
Title: Sound Sparks Motion: Audio and Text Tuning for Video Editing
Authors: AmirHossein Naghi Razlighi, Aryan Mikaeili, Ali Mahdavi-Amiri, Daniel Cohen-Or, Yiorgos Chrysanthou
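Abstract summary: This work introduces Sound Sparks Motion, a training-free framework that enables motion editing in an audio-visual video generation model. Sound Sparks Motion tunes the model's internal multimodal conditioning signals at test time. The results highlight multimodal conditioning tuning as a promising direction for motion-aware video editing.
Reference score (independently computed attention score): 53.136757756110626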
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Motion-centric video editing remains difficult for large generative video models, which often respond well to appearance changes but struggle to produce specific, localized actions or state transitions in an existing clip. We introduce Sound Sparks Motion, a training-free framework that enables motion editing in an audio-visual video generation model by tuning its internal multimodal conditioning signals at test time. Rather than modifying model weights, our method tunes only two lightweight variables: an audio latent derived from the source video and a residual perturbation in the text-conditioning. We find that this combination can encourage motion edits that the underlying model often struggles to realize under prompt-only control. Since there is no direct way to evaluate temporal alignment between text and motion, we guide the tuning process using a vision-language model that provides feedback indicating whether the intended motion appears in the generated video. This simple supervision yields an effective semantic objective for motion editing, while regularization and perceptual-temporal constraints help preserve content and visual quality. Beyond per-video tuning, we show that the learned latent controls are transferable across videos, suggesting that they capture reusable motion-edit directions rather than overfitting to a single example. Our results highlight multimodal conditioning tuning, particularly through the audio pathway, as a promising direction for motion-aware video editing, and suggest that test-time tuning can serve as a lightweight probing mechanism that helps reveal latent motion controls embedded in the model's multimodal conditioning. Code and data are available via our project page: https://amirhossein-razlighi.github.io/Sound_Sparks_Motion/
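The test-time tuning idea in the abstract (frozen weights, only an audio latent and a text-conditioning residual being optimized, with regularization keeping them close to their starting point) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the "model" is a fixed linear readout, the vision-language-model feedback is replaced by a differentiable squared error against a hypothetical `target` motion score, the perceptual-temporal constraints are omitted, and all names (`W_AUDIO`, `W_TEXT`, `motion_feature`, `tune`) are hypothetical.

```python
# Toy sketch only: a frozen linear "model" stands in for the video generator.
# These weights are hypothetical and are never updated during tuning.
W_AUDIO = [0.8, -0.3, 0.5]
W_TEXT = [0.2, 0.6, -0.4]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def motion_feature(audio_latent, text_embedding, text_residual):
    """Scalar 'motion' readout of the frozen toy model.

    The text conditioning is the original embedding plus a residual
    perturbation; the audio latent enters through its own pathway.
    """
    text_cond = [t + r for t, r in zip(text_embedding, text_residual)]
    return dot(W_AUDIO, audio_latent) + dot(W_TEXT, text_cond)


def tune(audio_init, text_embedding, target, steps=200, lr=0.05, reg=0.01):
    """Gradient descent on the audio latent and text residual only.

    Loss (assumed stand-in for VLM feedback plus regularization):
        (motion - target)^2
        + reg * (||audio - audio_init||^2 + ||residual||^2)
    """
    audio = list(audio_init)
    residual = [0.0] * len(text_embedding)
    for _ in range(steps):
        err = motion_feature(audio, text_embedding, residual) - target
        # Analytic gradients of the quadratic loss w.r.t. each variable.
        grad_audio = [2 * err * w + 2 * reg * (a - a0)
                      for w, a, a0 in zip(W_AUDIO, audio, audio_init)]
        grad_residual = [2 * err * w + 2 * reg * r
                         for w, r in zip(W_TEXT, residual)]
        audio = [a - lr * g for a, g in zip(audio, grad_audio)]
        residual = [r - lr * g for r, g in zip(residual, grad_residual)]
    return audio, residual


# Usage: start from a source "audio latent" and a prompt embedding,
# then tune toward a hypothetical target motion score of 1.0.
audio_init = [0.1, 0.2, -0.1]
text_embedding = [0.5, -0.2, 0.3]
tuned_audio, tuned_residual = tune(audio_init, text_embedding, target=1.0)
print(motion_feature(audio_init, text_embedding, [0.0, 0.0, 0.0]))
print(motion_feature(tuned_audio, text_embedding, tuned_residual))
```

The design point the sketch tries to mirror is that the edit lives entirely in two small conditioning variables, so in principle the tuned values could be stored and reapplied to another input, which is the kind of transfer the abstract reports. Whether the real objective is optimized by gradients in this way is not stated in the abstract.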
