Fugu-MT Paper Translation (Overview): Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models

Paper overview: Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models

arxiv url: http://arxiv.org/abs/2603.20697v1
Date: Sat, 21 Mar 2026 07:47:33 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.04548
Title: Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models
Authors: Yifan Yang, Lei Zou, Wendy Jepson
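Abstract summary: We introduce two generative strategies for synthesizing post-disaster street views from satellite imagery and benchmark them against general-purpose baselines (Pix2Pix, ControlNet) using a proposed Structure-Aware Evaluation Framework. Experiments show that standard ControlNet achieves the highest semantic accuracy (0.71), whereas the VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity.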
Reference score (site-calculated attention score): 10.715667868976054
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the immediate aftermath of natural disasters, rapid situational awareness is critical. Traditionally, satellite observations are widely used to estimate damage extent. However, they lack the ground-level perspective essential for characterizing specific structural failures and impacts. Meanwhile, ground-level data (e.g., street-view imagery) remains largely inaccessible during time-sensitive events. This study investigates Satellite-to-Street View Synthesis to bridge this data gap. We introduce two generative strategies to synthesize post-disaster street views from satellite imagery: a Vision-Language Model (VLM)-guided approach and a damage-sensitive Mixture-of-Experts (MoE) method. We benchmark these against general-purpose baselines (Pix2Pix, ControlNet) using a proposed Structure-Aware Evaluation Framework. This multi-tier protocol integrates (1) pixel-level quality assessment, (2) ResNet-based semantic consistency verification, and (3) a novel VLM-as-a-Judge for perceptual alignment. Experiments on 300 disaster scenarios reveal a critical realism-fidelity trade-off: while diffusion-based approaches (e.g., ControlNet) achieve high perceptual realism, they often hallucinate structural details. Quantitative results show that standard ControlNet achieves the highest semantic accuracy (0.71), whereas VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity. This work establishes a baseline for trustworthy cross-view synthesis, emphasizing that visually realistic generations may still fail to preserve critical structural information required for reliable disaster assessment.
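The first two tiers of the evaluation protocol can be illustrated with a minimal sketch. It assumes that generated and real post-disaster street views are same-sized uint8 arrays. It uses PSNR as a stand-in pixel-level metric, because the abstract does not name the specific metrics. It treats semantic consistency as the agreement rate between the class labels that a ResNet classifier (not included here) predicts for each generated view and for its real counterpart. The function names `psnr` and `semantic_consistency` are hypothetical, and the VLM-as-a-Judge tier is not sketched.

```python
import numpy as np


def psnr(generated: np.ndarray, reference: np.ndarray, max_val: float = 255.0) -> float:
    """Tier 1 (pixel level, assumed metric): peak signal-to-noise ratio in dB."""
    if generated.shape != reference.shape:
        raise ValueError("generated and reference images must have the same shape")
    # Cast to float before subtracting so uint8 values do not wrap around.
    diff = generated.astype(np.float64) - reference.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


def semantic_consistency(pred_generated, pred_real) -> float:
    """Tier 2 (semantic, hypothetical definition): fraction of scenarios where the
    classifier's label for the generated view matches its label for the real view."""
    gen = np.asarray(pred_generated)
    real = np.asarray(pred_real)
    if gen.shape != real.shape or gen.size == 0:
        raise ValueError("label arrays must be non-empty and have the same shape")
    return float(np.mean(gen == real))


if __name__ == "__main__":
    # Toy 4x4 RGB images that differ by 10 intensity levels everywhere.
    gen_img = np.zeros((4, 4, 3), dtype=np.uint8)
    real_img = np.full((4, 4, 3), 10, dtype=np.uint8)
    print(f"PSNR: {psnr(gen_img, real_img):.2f} dB")
    # Hypothetical damage-class labels for four scenarios.
    print("Semantic consistency:", semantic_consistency([0, 1, 2, 1], [0, 1, 1, 1]))
```

Under this assumed definition, a model can score high on the pixel tier while scoring low on the semantic tier. That is one way to see the realism-fidelity trade-off the abstract describes.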


Related papers