GeoSynth: Contextually-Aware High-Resolution Satellite Image Synthesis
- URL: http://arxiv.org/abs/2404.06637v1
- Date: Tue, 9 Apr 2024 22:16:34 GMT
- Title: GeoSynth: Contextually-Aware High-Resolution Satellite Image Synthesis
- Authors: Srikumar Sastry, Subash Khanal, Aayush Dhakal, Nathan Jacobs
- Abstract summary: We present a model for synthesizing satellite images with global style and image-driven layout control.
We train our model on a large dataset of paired satellite imagery, with automatically generated captions, and OpenStreetMap data.
Results demonstrate that our model can generate diverse, high-quality images and exhibits excellent zero-shot generalization.
- Score: 7.822924588609674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present GeoSynth, a model for synthesizing satellite images with global style and image-driven layout control. The global style control is via textual prompts or geographic location. These enable the specification of scene semantics or regional appearance respectively, and can be used together. We train our model on a large dataset of paired satellite imagery, with automatically generated captions, and OpenStreetMap data. We evaluate various combinations of control inputs, including different types of layout controls. Results demonstrate that our model can generate diverse, high-quality images and exhibits excellent zero-shot generalization. The code and model checkpoints are available at https://github.com/mvrl/GeoSynth.
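The abstract does not spell out the architecture, but the combination of text- or location-driven global style with an image-driven layout condition matches the ControlNet pattern of conditioning a text-to-image diffusion model on a spatial input. The following is a minimal, hypothetical sketch of how such a checkpoint might be driven with Hugging Face diffusers; the checkpoint id MVRL/GeoSynth-OSM, the base model choice, and the file paths are assumptions rather than details confirmed by this abstract, and geographic-location conditioning would require the location-aware variant of the model, which is omitted here.

```python
# Hypothetical sketch: driving a ControlNet-style satellite-image generator with
# an OSM layout image (spatial control) and a text prompt (global style control).
# Checkpoint ids and file paths are assumptions, not verified values.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained("MVRL/GeoSynth-OSM")   # assumed layout-control checkpoint
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",                        # assumed base diffusion model
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

layout = Image.open("osm_tile.png")                                 # OSM tile supplying the scene layout
prompt = "Satellite image of a dense residential neighborhood near a river"
result = pipe(prompt, image=layout, num_inference_steps=50).images[0]
result.save("synthesized_satellite.png")
```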
Related papers
- AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks [23.041812897803034]
We propose AnySynth, a unified framework capable of generating arbitrary types of synthetic data.
We have validated our framework's performance across various tasks, including Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Image Retrieval, and Multi-modal Image Perception and Grounding.
arXiv Detail & Related papers (2024-11-24T04:49:07Z) - CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis [54.852701978617056]
CrossViewDiff is a cross-view diffusion model for satellite-to-street view synthesis.
To address the challenges posed by the large discrepancy across views, we design the satellite scene structure estimation and cross-view texture mapping modules.
To achieve a more comprehensive evaluation of the synthesis results, we additionally design a GPT-based scoring method.
arXiv Detail & Related papers (2024-08-27T03:41:44Z) - GEOBIND: Binding Text, Image, and Audio through Satellite Images [7.291750095728984]
We present a deep-learning model, GeoBind, that can reason about multiple modalities, specifically text, image, and audio, from satellite imagery of a location.
Our training results in a joint embedding space spanning multiple types of data: satellite image, ground-level image, audio, and text (a generic sketch of this kind of cross-modal alignment is given after this list).
arXiv Detail & Related papers (2024-04-17T20:13:37Z) - DiffusionSat: A Generative Foundation Model for Satellite Imagery [63.2807119794691]
We present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets.
Our method produces realistic samples and can be used to solve multiple generative tasks, including temporal generation, super-resolution given multi-spectral inputs, and in-painting.
arXiv Detail & Related papers (2023-12-06T16:53:17Z) - Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching [60.645802236700035]
Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets.
We introduce GeoText-1652, a new natural language-guided geo-localization benchmark.
This dataset is systematically constructed through an interactive human-computer process.
arXiv Detail & Related papers (2023-11-21T17:52:30Z) - CoGS: Controllable Generation and Search from Sketch and Style [35.625940819995996]
We present CoGS, a method for the style-conditioned, sketch-driven synthesis of images.
CoGS enables exploration of diverse appearance possibilities for a given sketched object.
We show that our model, trained on the 125 object classes of our newly created Pseudosketches dataset, is capable of producing a diverse gamut of semantic content and appearance styles.
arXiv Detail & Related papers (2022-03-17T18:36:11Z) - SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing [35.02841064647306]
StyleGANs provide promising prior models for downstream image synthesis and editing tasks.
We present SemanticStyleGAN, where a generator is trained to model local semantic parts separately and synthesizes images in a compositional way.
arXiv Detail & Related papers (2021-12-04T04:17:11Z) - TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning maps the image and text into a common embedding space to learn text-image matching.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z) - Example-Guided Image Synthesis across Arbitrary Scenes using Masked Spatial-Channel Attention and Self-Supervision [83.33283892171562]
Example-guided image synthesis aims to synthesize an image from a semantic label map and an exemplary image.
In this paper, we tackle a more challenging and general task, where the exemplar is an arbitrary scene image that is semantically different from the given label map.
We propose an end-to-end network for joint global and local feature alignment and synthesis.
arXiv Detail & Related papers (2020-04-18T18:17:40Z) - Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs [74.88118535585903]
We propose the Abstract Scene Graph (ASG) structure to represent user intention at a fine-grained level.
From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph.
Our model achieves better controllability when conditioning on ASGs than carefully designed baselines on both the VisualGenome and MSCOCO datasets.
arXiv Detail & Related papers (2020-03-01T03:34:07Z)
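Several of the entries above, GeoBind in particular, rely on aligning additional modalities (ground-level imagery, audio, text) to a shared satellite-image embedding space. The snippet below is a generic, illustrative sketch of the contrastive-alignment step such methods build on; the encoder heads, feature dimensions, and temperature are placeholder assumptions and not the papers' actual architectures.

```python
# Illustrative sketch of CLIP-style contrastive alignment of two modalities into a
# shared embedding space (the general idea behind multi-modal binding approaches).
# Encoders, dimensions, and hyperparameters are placeholders, not published values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Project backbone features into the shared, unit-normalized embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)

def contrastive_loss(sat_emb: torch.Tensor, other_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling paired satellite/other-modality embeddings together."""
    logits = sat_emb @ other_emb.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(sat_emb.size(0), device=sat_emb.device)  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for real satellite and audio encoders.
sat_feats, audio_feats = torch.randn(8, 768), torch.randn(8, 1024)
sat_head, audio_head = ProjectionHead(768), ProjectionHead(1024)
loss = contrastive_loss(sat_head(sat_feats), audio_head(audio_feats))
print(f"contrastive loss: {loss.item():.4f}")
```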