Granular Privacy Control for Geolocation with Vision Language Models
- URL: http://arxiv.org/abs/2407.04952v2
- Date: Thu, 17 Oct 2024 14:58:53 GMT
- Title: Granular Privacy Control for Geolocation with Vision Language Models
- Authors: Ethan Mendes, Yang Chen, James Hays, Sauvik Das, Wei Xu, Alan Ritter
- Abstract summary: We develop a new benchmark, GPTGeoChat, to test the ability of Vision Language Models to moderate geolocation dialogues with users.
We collect a set of 1,000 image geolocation conversations between in-house annotators and GPT-4v.
We evaluate the ability of various VLMs to moderate GPT-4v geolocation conversations by determining when too much location information has been revealed.
- Score: 36.3455665044992
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision Language Models (VLMs) are rapidly advancing in their capability to answer information-seeking questions. As these models are widely deployed in consumer applications, they could lead to new privacy risks due to emergent abilities to identify people in photos, geolocate images, etc. As we demonstrate, somewhat surprisingly, current open-source and proprietary VLMs are very capable image geolocators, making widespread geolocation with VLMs an immediate privacy risk, rather than merely a theoretical future concern. As a first step to address this challenge, we develop a new benchmark, GPTGeoChat, to test the ability of VLMs to moderate geolocation dialogues with users. We collect a set of 1,000 image geolocation conversations between in-house annotators and GPT-4v, which are annotated with the granularity of location information revealed at each turn. Using this new dataset, we evaluate the ability of various VLMs to moderate GPT-4v geolocation conversations by determining when too much location information has been revealed. We find that custom fine-tuned models perform on par with prompted API-based models when identifying leaked location information at the country or city level; however, fine-tuning on supervised data appears to be needed to accurately moderate finer granularities, such as the name of a restaurant or building.
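To make the moderation task concrete, below is a minimal Python sketch (not the paper's released code) of how per-turn granularity annotations might be represented and checked against a user-specified disclosure policy. The granularity labels and the F1 helper are illustrative assumptions based on the abstract.

```python
# Illustrative sketch of GPTGeoChat-style moderation: each dialogue turn is
# annotated with the finest location granularity revealed so far, and a
# moderator decides whether the answer exceeds the allowed granularity.
from dataclasses import dataclass

# Ordered from coarsest to finest; finer levels imply all coarser ones.
# These label names are assumptions, loosely following the abstract.
GRANULARITY_ORDER = ["none", "country", "city", "neighborhood", "exact_location"]


@dataclass
class Turn:
    question: str
    answer: str
    revealed_granularity: str  # annotated finest granularity revealed so far


def should_withhold(turn: Turn, allowed_granularity: str) -> bool:
    """Return True if the answer reveals finer location information than allowed."""
    revealed = GRANULARITY_ORDER.index(turn.revealed_granularity)
    allowed = GRANULARITY_ORDER.index(allowed_granularity)
    return revealed > allowed


def moderation_f1(predictions: list[bool], gold: list[bool]) -> float:
    """Simple F1 over per-turn withhold decisions (illustrative metric only)."""
    tp = sum(p and g for p, g in zip(predictions, gold))
    fp = sum(p and not g for p, g in zip(predictions, gold))
    fn = sum(not p and g for p, g in zip(predictions, gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    turn = Turn(
        question="Which restaurant is this photo from?",
        answer="It looks like a cafe in the Mission District of San Francisco.",
        revealed_granularity="neighborhood",
    )
    print(should_withhold(turn, allowed_granularity="city"))  # True
```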
Related papers
- Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
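A hypothetical sketch of the multi-agent idea: several VLM "agents" each propose a location for an image, optionally consulting retrieved context, and a simple majority vote stands in for inter-agent communication. The agent interface and voting scheme are assumptions for illustration, not smileGeo's actual protocol.

```python
# Minimal sketch of collaborative geolocation with multiple agents.
from collections import Counter
from typing import Callable

# An agent maps (image_path, retrieved_context) to a predicted city name.
Agent = Callable[[str, str], str]


def collaborative_geolocate(image_path: str,
                            agents: list[Agent],
                            retrieve: Callable[[str], str]) -> str:
    context = retrieve(image_path)  # e.g., captions or nearby-landmark text
    votes = Counter(agent(image_path, context) for agent in agents)
    # Majority vote as a crude stand-in for inter-agent communication.
    return votes.most_common(1)[0][0]
```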
arXiv Detail & Related papers (2024-08-21T03:31:30Z)
- Image-Based Geolocation Using Large Vision-Language Models [19.071551941682063]
We introduce a framework that significantly enhances image-based geolocation accuracy.
The framework employs a systematic chain-of-thought (CoT) approach, mimicking human geoguessing strategies.
It achieves an impressive average score of 4550.5 in the GeoGuessr game, with an 85.37% win rate, and delivers highly precise geolocation predictions.
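As an illustration of the chain-of-thought geoguessing strategy, the sketch below prompts a VLM through the OpenAI API to reason about visual clues before committing to coordinates. The prompt wording and the choice of client are assumptions for illustration, not the paper's actual pipeline.

```python
# Illustrative CoT geolocation prompt for a vision-language model.
import base64
from openai import OpenAI

COT_PROMPT = (
    "You are playing a geoguessing game. Reason step by step:\n"
    "1. Describe vegetation, architecture, signage language, and road markings.\n"
    "2. List candidate countries consistent with these clues.\n"
    "3. Narrow down to a city or region and explain why.\n"
    "4. Output your best guess as 'FINAL: <latitude>, <longitude>'."
)


def geolocate_with_cot(image_path: str, model: str = "gpt-4o") -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": COT_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```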
arXiv Detail & Related papers (2024-08-18T13:39:43Z)
- Towards Vision-Language Geo-Foundation Model: A Survey [65.70547895998541]
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2024-06-13T17:57:30Z)
- GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model [6.135404769437841]
This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM).
Existing street-view datasets often contain numerous low-quality images that lack visual clues, and they do not include any reasoning inference.
To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable.
To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities.
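A rough zero-shot proxy for the locatability idea using off-the-shelf CLIP: score a street-view image against text prompts describing locatable versus generic scenes. The paper trains a dedicated CLIP-based network; the prompts and scoring rule below are illustrative assumptions.

```python
# Zero-shot "locatability" scoring with CLIP (illustrative proxy only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

PROMPTS = [
    "a street view with distinctive landmarks, signs, or storefronts",  # locatable
    "a generic road with no readable signs or distinctive features",    # hard to locate
]


def locatability_score(image_path: str) -> float:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=PROMPTS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return probs[0, 0].item()  # probability mass on the "locatable" description
```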
arXiv Detail & Related papers (2024-06-03T18:08:56Z)
- PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs [55.8550939439138]
Vision-Language Models (VLMs) have shown immense potential by integrating large language models with vision systems.
These models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions.
We introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM.
Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads.
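A schematic PyTorch sketch of the positional-insert idea: a small learnable spatial prompt is added to frozen vision features before they enter the frozen language model, and only the prompt is trained. The shapes and the additive combination are assumptions for illustration, not the paper's exact design.

```python
# Schematic learnable spatial prompt for a frozen VLM.
import torch
import torch.nn as nn


class PositionalInsert(nn.Module):
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        # The only trainable parameters: one vector per visual token position.
        self.prompt = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, dim), produced by a frozen encoder.
        return visual_tokens + self.prompt


# Training outline: only `pin` receives gradients; the VLM stays frozen and the
# loss is ordinary next-token prediction on synthetic localisation data.
# pin = PositionalInsert(num_patches=256, dim=1024)
# optimizer = torch.optim.AdamW(pin.parameters(), lr=1e-3)
```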
arXiv Detail & Related papers (2024-02-13T18:39:18Z)
- Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data [7.797577465015058]
We propose a benchmark to gauge the progress of Large Vision-Language Models (VLMs) toward being useful tools for Earth observation data.
Motivated by real-world applications, our benchmark includes scenarios like urban monitoring, disaster relief, land use, and conservation.
Our benchmark will be made publicly available at https://vleo.danielz.ch/ and on Hugging Face at https://huggingface.co/collections/mit-ei/vleo-benchmark-datasets-65b789b0466555489cce0d70.
arXiv Detail & Related papers (2024-01-31T04:57:12Z)
- GeoLocator: a location-integrated large multimodal model for inferring geo-privacy [6.7452045691798945]
This study develops a location-integrated GPT-4 based model named GeoLocator.
Experiments reveal that GeoLocator generates specific geographic details with high accuracy.
We conclude with the broader implications of GeoLocator and our findings for individuals and the community at large.
arXiv Detail & Related papers (2023-11-21T21:48:51Z)
- GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z)
- GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization [61.10806364001535]
Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth.
Existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task.
We propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations.
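A schematic sketch of the image-to-GPS alignment idea: embed images and GPS coordinates into a shared space, train with a CLIP-style contrastive loss, and geolocate by retrieving the nearest GPS embedding from a gallery. The GPS encoder and loss below are simplified placeholders, not GeoCLIP's actual architecture.

```python
# Contrastive image-to-GPS alignment sketch (CLIP-style, simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GPSEncoder(nn.Module):
    """Maps (latitude, longitude) pairs into the shared embedding space."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:  # (batch, 2)
        return F.normalize(self.mlp(coords), dim=-1)


def clip_style_loss(image_emb: torch.Tensor, gps_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pairing each image with its true GPS location."""
    logits = image_emb @ gps_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```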
arXiv Detail & Related papers (2023-09-27T20:54:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.