Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning
- URL: http://arxiv.org/abs/2109.06860v1
- Date: Tue, 14 Sep 2021 17:52:55 GMT
- Title: Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning
- Authors: Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, Kai-Wei Chang
- Abstract summary: We construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense.
We find that the performance of both models for non-Western regions including East Asia, South Asia, and Africa is significantly lower than that for Western region.
- Score: 49.04866469947569
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Commonsense is defined as the knowledge that is shared by everyone. However,
certain types of commonsense knowledge are correlated with culture and
geographic locations and they are only shared locally. For example, the
scenarios of wedding ceremonies vary across regions due to different customs
influenced by historical and religious factors. Such regional characteristics,
however, are generally omitted in prior work. In this paper, we construct a
Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test
vision-and-language models' ability to understand cultural and
geo-location-specific commonsense. In particular, we study two state-of-the-art
Vision-and-Language models, VisualBERT and ViLBERT trained on VCR, a standard
multimodal commonsense benchmark with images primarily from Western regions. We
then evaluate how well the trained models can generalize to answering the
questions in GD-VCR. We find that the performance of both models for
non-Western regions including East Asia, South Asia, and Africa is
significantly lower than that for Western region. We analyze the reasons behind
the performance disparity and find that the performance gap is larger on QA
pairs that: 1) are concerned with culture-related scenarios, e.g., weddings,
religious activities, and festivals; 2) require high-level geo-diverse
commonsense reasoning rather than low-order perception and recognition. Dataset
and code are released at https://github.com/WadeYin9712/GD-VCR.
Related papers
- Towards Vision-Language Geo-Foundation Model: A Survey [65.70547895998541]
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2024-06-13T17:57:30Z) - IndoCulture: Exploring Geographically-Influenced Cultural Commonsense Reasoning Across Eleven Indonesian Provinces [28.21857463550941]
We introduce IndoCulture, aimed at understanding the influence of geographical factors on language model reasoning ability.
We ask local people to manually develop a cultural context and plausible options, across a set of predefined topics.
Open-weight Llama-3 is competitive with GPT-4, while other open-weight models struggle, with accuracies below 50%.
arXiv Detail & Related papers (2024-04-02T11:32:58Z) - Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for
Cross-City Semantic Segmentation using High-Resolution Domain Adaptation
Networks [82.82866901799565]
We build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, SAR) for the study purpose of the cross-city semantic segmentation task.
Beyond the single city, we propose a high-resolution domain adaptation network, HighDAN, to promote the AI model's generalization ability from the multi-city environments.
HighDAN is capable of retaining the spatially topological structure of the studied urban scene well in a parallel high-to-low resolution fusion fashion.
arXiv Detail & Related papers (2023-09-26T23:55:39Z) - Does Progress On Object Recognition Benchmarks Improve Real-World
Generalization? [9.906591021385303]
Researchers have measured progress in object recognition on ImageNet-based generalization benchmarks such as ImageNet-A, -C, and -R for more than a decade.
Recent advances in foundation models, trained on orders of magnitude more data, have begun to saturate these standard benchmarks, but remain brittle in practice.
We propose studying generalization across geography as a more realistic measure of progress using two datasets of objects from households across the globe.
arXiv Detail & Related papers (2023-07-24T21:29:48Z) - GeoNet: Benchmarking Unsupervised Adaptation across Geographies [71.23141626803287]
We study the problem of geographic robustness and make three main contributions.
First, we introduce a large-scale dataset GeoNet for geographic adaptation.
Second, we hypothesize that the major source of domain shifts arise from significant variations in scene context.
Third, we conduct an extensive evaluation of several state-of-the-art unsupervised domain adaptation algorithms and architectures.
arXiv Detail & Related papers (2023-03-27T17:59:34Z) - Where We Are and What We're Looking At: Query Based Worldwide Image
Geo-localization Using Hierarchies and Scenes [53.53712888703834]
We introduce an end-to-end transformer-based architecture that exploits the relationship between different geographic levels.
We achieve state of the art street level accuracy on 4 standard geo-localization datasets.
arXiv Detail & Related papers (2023-03-07T21:47:58Z) - GIVL: Improving Geographical Inclusivity of Vision-Language Models with
Pre-Training Methods [62.076647211744564]
We propose GIVL, a Geographically Inclusive Vision-and-Language Pre-trained model.
There are two attributes of geo-diverse visual concepts which can help to learn geo-diverse knowledge: 1) concepts under similar categories have unique knowledge and visual characteristics, 2) concepts with similar visual features may fall in completely different categories.
Compared with similar-size models pre-trained with similar scale of data, GIVL achieves state-of-the-art (SOTA) and more balanced performance on geo-diverse V&L tasks.
arXiv Detail & Related papers (2023-01-05T03:43:45Z) - Measuring Geographic Performance Disparities of Offensive Language
Classifiers [12.545108947857802]
We ask two questions: Does language, dialect, and topical content vary across geographical regions?'' and If there are differences across the regions, do they impact model performance?''
We find that current models do not generalize across locations. Likewise, we show that while offensive language models produce false positives on African American English, model performance is not correlated with each city's minority population proportions.
arXiv Detail & Related papers (2022-09-15T15:08:18Z) - Investigating Cultural Aspects in the Fundamental Diagram using
Convolutional Neural Networks and Simulation [0.0]
This paper focuses on differences in an important attribute that vary across cultures -- the personal spaces -- in Brazil and Germany.
We use CNNs to detect and track people in video sequences and Voronoi Diagrams to find out the neighbor relation among people.
Based on personal spaces analyses, we found out that people behavior is more similar, in terms of their behaviours, in high dense populations and vary more in low and medium densities.
arXiv Detail & Related papers (2020-09-30T14:44:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.