Related papers: Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models

Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models

URL: http://arxiv.org/abs/2407.15589v3
Date: Sat, 19 Oct 2024 03:59:31 GMT
Title: Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models
Authors: Amir Mohammad Karimi Mamaghan, Samuele Papa, Karl Henrik Johansson, Stefan Bauer, Andrea Dittadi,
Abstract summary: We conduct an empirical study on representation learning for downstream Visual Question Answering (VQA) We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches.
Score: 24.579822095003685
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Object-centric (OC) representations, which represent the state of a visual scene by modeling it as a composition of objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have not been thoroughly analyzed yet. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains from language to computer vision, marking them as a potential cornerstone of future research for a multitude of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic and real-world data, and demonstrate a viable way to achieve the best of both worlds. The extensiveness of our study, encompassing over 600 downstream VQA models and 15 different types of upstream representations, also provides several additional insights that we believe will be of interest to the community at large.

Related papers

Visual Bridge: Universal Visual Perception Representations Generating [27.034175361589572]
We propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks.<n>Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations.<n>Our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models.
arXiv Detail & Related papers (2025-11-11T06:25:30Z)
Representation Potentials of Foundation Models for Multimodal Alignment: A Survey [39.88306901879684]
Foundation models learn highly transferable representations through large-scale pretraining on diverse data.<n>We investigate the representation potentials of foundation models, defined as the latent capacity of their learned representations to capture task-specific information.
arXiv Detail & Related papers (2025-10-05T21:48:51Z)
A Framework for Generating Artificial Datasets to Validate Absolute and Relative Position Concepts [2.0391237204597368]
The framework focuses on fundamental concepts such as object recognition, absolute and relative positions, and attribute identification.<n>The proposed framework offers a valuable instrument for generating diverse and comprehensive datasets.
arXiv Detail & Related papers (2025-09-17T18:37:24Z)
Explain Before You Answer: A Survey on Compositional Visual Reasoning [74.27548620675748]
Compositional visual reasoning has emerged as a key research frontier in multimodal AI.<n>This survey systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.)<n>We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception.
arXiv Detail & Related papers (2025-08-24T11:01:51Z)
Vision Generalist Model: A Survey [87.49797517847132]
We provide a comprehensive overview of the vision generalist models, delving into their characteristics and capabilities within the field.<n>We take a brief excursion into related domains, shedding light on their interconnections and potential synergies.
arXiv Detail & Related papers (2025-06-11T17:23:41Z)
Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization. We introduce a benchmark comprising eight different synthetic and real-world datasets. We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
Coding for Intelligence from the Perspective of Category [66.14012258680992]
Coding targets compressing and reconstructing data, and intelligence. Recent trends demonstrate the potential homogeneity of these two fields. We propose a novel problem of Coding for Intelligence from the category theory view.
arXiv Detail & Related papers (2024-07-01T07:05:44Z)
A Comprehensive Survey on Underwater Image Enhancement Based on Deep Learning [51.7818820745221]
Underwater image enhancement (UIE) presents a significant challenge within computer vision research. Despite the development of numerous UIE algorithms, a thorough and systematic review is still absent.
arXiv Detail & Related papers (2024-05-30T04:46:40Z)
Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models. Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis. We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data. FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category. We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z)
Unsupervised discovery of Interpretable Visual Concepts [0.0]
We propose two methods to explain a model's decision, enhancing global interpretability. One method is inspired by Occlusion and Sensitivity analysis (incorporating causality) The other method uses a novel metric, called Class-aware Order Correlation (CaOC), to globally evaluate the most important image regions.
arXiv Detail & Related papers (2023-08-31T07:53:02Z)
Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
Provably Learning Object-Centric Representations [25.152680199034215]
We analyze when object-centric representations can provably be learned without supervision. We prove that the ground-truth object representations can be identified by an invertible and compositional inference model. We provide evidence that our theory holds predictive power for existing object-centric models.
arXiv Detail & Related papers (2023-05-23T16:44:49Z)
Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization are becoming the challenges of the existing visual models. Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z)
Deep Partial Multi-View Learning [94.39367390062831]
We propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets) We fifirst provide a formal defifinition of completeness and versatility for multi-view representation. We then theoretically prove the versatility of the learned latent representations.
arXiv Detail & Related papers (2020-11-12T02:29:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.