Are Video Generation Models Geographically Fair? An Attraction-Centric Evaluation of Global Visual Knowledge
- URL: http://arxiv.org/abs/2601.18698v1
- Date: Mon, 26 Jan 2026 17:14:57 GMT
- Title: Are Video Generation Models Geographically Fair? An Attraction-Centric Evaluation of Global Visual Knowledge
- Authors: Xiao Liu, Jiawei Zhang,
- Abstract summary: We investigate the geo-equity and geographically grounded visual knowledge of text-to-video models through an attraction-centric evaluation.<n>We introduce Geo-Attraction Landmark Probing (GAP), a systematic framework for assessing how faithfully models synthesize tourist attractions from diverse regions.<n>GAP integrates complementary metrics that disentangle overall video quality from attraction-specific knowledge, including global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments.
- Score: 10.25143146695869
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in text-to-video generation have produced visually compelling results, yet it remains unclear whether these models encode geographically equitable visual knowledge. In this work, we investigate the geo-equity and geographically grounded visual knowledge of text-to-video models through an attraction-centric evaluation. We introduce Geo-Attraction Landmark Probing (GAP), a systematic framework for assessing how faithfully models synthesize tourist attractions from diverse regions, and construct GEOATTRACTION-500, a benchmark of 500 globally distributed attractions spanning varied regions and popularity levels. GAP integrates complementary metrics that disentangle overall video quality from attraction-specific knowledge, including global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments, all validated against human evaluation. Applying GAP to the state-of-the-art text-to-video model Sora 2, we find that, contrary to common assumptions of strong geographic bias, the model exhibits a relatively uniform level of geographically grounded visual knowledge across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity. These results suggest that current text-to-video models express global visual knowledge more evenly than expected, highlighting both their promise for globally deployed applications and the need for continued evaluation as such systems evolve.
Related papers
- Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales [61.03549470159347]
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions has not been comprehensively evaluated.<n>We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use.
arXiv Detail & Related papers (2025-10-13T01:12:21Z) - GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings [3.43519422766841]
We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation.<n>Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets.
arXiv Detail & Related papers (2025-10-01T20:39:48Z) - GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains [20.788130896943663]
Geo Reason Enhancement (GRE) Suite is a novel framework that augments Visual Language Models with structured reasoning chains for interpretable location inference.<n>First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis.<n>Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision.
arXiv Detail & Related papers (2025-05-24T13:48:57Z) - GaGA: Towards Interactive Global Geolocation Assistant [20.342366228855735]
GaGA is an interactive global geolocation assistant built upon the flourishing large vision-language models (LVLMs)<n>It uncovers geographical clues within images and combines them with the extensive world knowledge embedded in LVLMs to determine the geolocations.<n>GaGA achieves state-of-the-art performance on the GWS15k dataset, improving accuracy by 4.57% at the country level and 2.92% at the city level.
arXiv Detail & Related papers (2024-12-12T03:39:44Z) - Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z) - Towards Vision-Language Geo-Foundation Model: A Survey [65.70547895998541]
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2024-06-13T17:57:30Z) - Measuring Geographic Diversity of Foundation Models with a Natural Language--based Geo-guessing Experiment on GPT-4 [5.534517268996598]
We study GPT-4, a state-of-the-art representative in the family of multimodal large language models, to study its geographic diversity.
Using DBpedia abstracts as a ground-truth corpus for probing, our natural language-based geo-guessing experiment shows that GPT-4 may currently encode insufficient knowledge about several geographic feature types.
arXiv Detail & Related papers (2024-04-11T09:59:21Z) - Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition [24.701574433327746]
We investigate the feasibility of probing a large language model for geography-based object knowledge.
We propose geography knowledge regularization to ensure that soft prompts trained on a source set of geographies generalize to an unseen target set.
Accuracy gains over prompting baselines on DollarStreet are up to +2.8/1.2/1.6 on target data from Africa/Asia/Americas, and +4.6 overall on the hardest classes.
arXiv Detail & Related papers (2024-01-03T01:11:16Z) - GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark [56.08664336835741]
We propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE.
We collect data from open-released geographic resources and introduce six natural language understanding tasks.
We pro vide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
arXiv Detail & Related papers (2023-05-11T03:21:56Z) - GeoNet: Benchmarking Unsupervised Adaptation across Geographies [71.23141626803287]
We study the problem of geographic robustness and make three main contributions.
First, we introduce a large-scale dataset GeoNet for geographic adaptation.
Second, we hypothesize that the major source of domain shifts arise from significant variations in scene context.
Third, we conduct an extensive evaluation of several state-of-the-art unsupervised domain adaptation algorithms and architectures.
arXiv Detail & Related papers (2023-03-27T17:59:34Z) - GIVL: Improving Geographical Inclusivity of Vision-Language Models with
Pre-Training Methods [62.076647211744564]
We propose GIVL, a Geographically Inclusive Vision-and-Language Pre-trained model.
There are two attributes of geo-diverse visual concepts which can help to learn geo-diverse knowledge: 1) concepts under similar categories have unique knowledge and visual characteristics, 2) concepts with similar visual features may fall in completely different categories.
Compared with similar-size models pre-trained with similar scale of data, GIVL achieves state-of-the-art (SOTA) and more balanced performance on geo-diverse V&L tasks.
arXiv Detail & Related papers (2023-01-05T03:43:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.