Towards the Human Global Context: Does the Vision-Language Model Really Judge Like a Human Being?
- URL: http://arxiv.org/abs/2207.08333v1
- Date: Mon, 18 Jul 2022 01:01:43 GMT
- Title: Towards the Human Global Context: Does the Vision-Language Model Really Judge Like a Human Being?
- Authors: Sangmyeong Woh, Jaemin Lee, Ho joong Kim and Jinsuk Lee
- Abstract summary: Vision-Language (VL) is becoming an important area of research.
We propose a quantitative metric, "Equivariance Score", and an evaluation dataset, "Human Puzzle".
We aim to quantitatively measure a model's performance in understanding context.
- Score: 0.8889304968879164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As computer vision and NLP make progress, Vision-Language (VL) is becoming an
important area of research. Despite its importance, the evaluation metrics of the
research domain are still at a preliminary stage of development. In this paper,
we propose a quantitative metric, "Equivariance Score", and an evaluation dataset,
"Human Puzzle", to assess whether a VL model understands an image the way a
human does. We observed that VL models do not interpret the overall context of
an input image but instead show biases toward a specific object or shape that
forms the local context. We aim to quantitatively measure a model's performance
in understanding context. To probe the capabilities of existing VL models,
we sliced the original input image into pieces and placed them at random,
distorting the global context of the image. Our paper discusses each VL model's
level of global-context interpretation and addresses how the models' structural
characteristics influenced the results.
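The abstract does not include implementation details for the "Human Puzzle" perturbation, but the procedure it describes (slicing the input image into pieces and placing them at random, destroying the global context while preserving local content) can be sketched as below. This is a minimal, hypothetical reconstruction, not the authors' code: the grid size, the use of PIL, and the handling of image dimensions that do not divide evenly are all assumptions.

```python
import random
from PIL import Image

def shuffle_image_grid(image, grid=3, seed=None):
    """Slice an image into a grid x grid puzzle and reassemble the tiles in random order.

    Hypothetical sketch of the "Human Puzzle"-style perturbation described in the abstract;
    the tile count and edge handling are assumptions, not details taken from the paper.
    """
    rng = random.Random(seed)
    w, h = image.size
    tile_w, tile_h = w // grid, h // grid

    # Crop the grid of tiles (remainder pixels on the right/bottom edges are dropped).
    tiles = [
        image.crop((c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h))
        for r in range(grid)
        for c in range(grid)
    ]

    # Randomly permute the tiles: local content is kept, global context is destroyed.
    rng.shuffle(tiles)

    # Paste the shuffled tiles onto a fresh canvas of the cropped size.
    shuffled = Image.new(image.mode, (tile_w * grid, tile_h * grid))
    for i, tile in enumerate(tiles):
        r, c = divmod(i, grid)
        shuffled.paste(tile, (c * tile_w, r * tile_h))
    return shuffled
```

Comparing a VL model's predictions on the original image with its predictions on such a shuffled version is, in spirit, how the paper probes whether the model relies on the global context or only on local objects and shapes.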
Related papers
- VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models [2.0718016474717196]
Integrated Vision and Language Models (VLMs) are frequently regarded as black boxes within the machine learning research community.
We present an image-text aligned human visual attention dataset that maps specific associations between image regions and corresponding text segments.
We then compare the internal heatmaps generated by VL models with this dataset, allowing us to analyze and better understand the model's decision-making process.
arXiv Detail & Related papers (2024-10-06T20:11:53Z)
- From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models [7.949705607963995]
Vision-language models (VLMs) have shown considerable advances in robotics applications.
We take advantage of VLMs' zero-shot capabilities and translate this ability to 6D object pose estimation.
We propose a novel framework for promptable zero-shot 6D object pose estimation using language embeddings.
arXiv Detail & Related papers (2024-09-09T08:15:39Z)
- CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding [2.524887615873207]
This study investigates the ability of various vision-language (VL) models to ground context-dependent verb phrases.
We introduce the CV-Probes dataset, containing image-caption pairs with context-dependent verbs.
We employ the MM-SHAP evaluation to assess the contribution of verb tokens towards model predictions.
arXiv Detail & Related papers (2024-09-02T17:39:26Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases [87.65903426052155]
We perform a large-scale transfer learning experiment aimed at discovering latent vision-language skills from data.
We show that generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths.
We present a new dataset, OLIVE, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested.
arXiv Detail & Related papers (2024-04-03T02:40:35Z)
- Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy [27.454549324141087]
We propose a novel VQA benchmark based on well-known visual classification datasets.
We also suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category.
Our contributions aim to lay the foundation for more precise and meaningful assessments.
arXiv Detail & Related papers (2024-02-11T18:26:18Z)
- MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models [74.89629463600978]
In the vision-language domain, most large-scale pre-trained vision-language models do not possess the ability to conduct in-context learning.
In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the vision domain?
arXiv Detail & Related papers (2023-06-02T07:21:03Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense [98.70218717851665]
It is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources.
We present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge.
We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation.
arXiv Detail & Related papers (2022-11-10T21:44:33Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.