Fugu-MT Paper Translation (Abstract): Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

Paper overview: Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

arxiv url: http://arxiv.org/abs/2604.21776v2
Date: Fri, 24 Apr 2026 04:18:46 GMT
Status: Translation complete
Last updated in system: 2026-04-27 13:34:22.119345
Title: Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Title (reference translation): Reshoot-Anything: A Self-Supervised Model for Reshooting In-the-Wild Videos
Authors: Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar,
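Abstract summary: We build a framework that leverages internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets consisting of a source video, a geometric anchor, and a target video. The proposed diffusion transformer uses a 4D point-cloud derived anchor to achieve state-of-the-art temporal consistency.
Reference score (independently computed attention metric): 3.1328424544428852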
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structures by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints from the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer utilizes a 4D point-cloud derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.
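Abstract (reference translation): Precise camera control for reshooting dynamic videos is bottlenecked by a severe shortage of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework that leverages internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthesized by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it implicitly learns 4D spatiotemporal structure, actively routing and re-projecting missing high-fidelity textures across different times and viewpoints to reconstruct the target. At inference, a 4D point-cloud derived anchor is used to achieve temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.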
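
The triplet construction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`random_walk_crop_trajectory`, `crop_video`, `forward_warp`), the moving-average smoothing, the step size, and the nearest-neighbour splatting are all assumptions made for illustration. The abstract states only that smooth random-walk crop trajectories give the source and target views and that the anchor comes from forward-warping the first source frame with a dense tracking field.

```python
import numpy as np


def random_walk_crop_trajectory(num_frames, frame_hw, crop_hw,
                                step_std=4.0, smooth=9, rng=None):
    """Return (num_frames, 2) integer top-left (y, x) crop corners.

    Hypothetical helper: a Gaussian random walk, smoothed with a moving
    average and clipped so the crop always stays inside the frame.
    """
    if smooth % 2 == 0:
        raise ValueError("smooth must be odd")
    rng = np.random.default_rng() if rng is None else rng
    max_off = np.asarray(frame_hw) - np.asarray(crop_hw)  # largest valid corner
    if np.any(max_off < 0):
        raise ValueError("crop is larger than the frame")
    start = rng.random(2) * max_off
    steps = rng.normal(0.0, step_std, size=(num_frames, 2))
    steps[0] = 0.0
    path = start + np.cumsum(steps, axis=0)
    # Moving-average smoothing with edge padding keeps the length unchanged.
    pad = smooth // 2
    padded = np.pad(path, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(smooth) / smooth
    smoothed = np.stack(
        [np.convolve(padded[:, i], kernel, mode="valid") for i in range(2)],
        axis=1,
    )
    return np.clip(np.rint(smoothed), 0, max_off).astype(int)


def crop_video(video, corners, crop_hw):
    """Crop each frame of a (T, H, W, C) video at its own (y, x) corner."""
    ch, cw = crop_hw
    return np.stack([f[y:y + ch, x:x + cw] for f, (y, x) in zip(video, corners)])


def forward_warp(frame, displacement):
    """Splat every pixel of `frame` to (y + dy, x + dx).

    `displacement` has shape (H, W, 2) holding (dy, dx) per pixel.
    Pixels that receive nothing stay zero, which produces the holes and
    distortions a warped anchor is expected to contain. Collisions are
    resolved arbitrarily (assumption: no depth ordering).
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    ty = np.rint(ys + displacement[..., 0]).astype(int)
    tx = np.rint(xs + displacement[..., 1]).astype(int)
    valid = (ty >= 0) & (ty < h) & (tx >= 0) & (tx < w)
    out = np.zeros_like(frame)
    out[ty[valid], tx[valid]] = frame[valid]
    return out
```

Under these assumptions, a training triplet would be formed by calling `random_walk_crop_trajectory` twice on the same video to obtain independent source and target crops, then applying `forward_warp` to the first source frame once per time step, using a per-frame displacement field from a dense tracker (not shown here) to build the anchor sequence.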

Related papers