One-Shot Open Affordance Learning with Foundation Models
- URL: http://arxiv.org/abs/2311.17776v1
- Date: Wed, 29 Nov 2023 16:23:06 GMT
- Title: One-Shot Open Affordance Learning with Foundation Models
- Authors: Gen Li, Deqing Sun, Laura Sevilla-Lara, Varun Jampani
- Abstract summary: We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
- Score: 54.15857111929812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce One-shot Open Affordance Learning (OOAL), where a model is
trained with just one example per base object category, but is expected to
identify novel objects and affordances. While vision-language models excel at
recognizing novel objects and scenes, they often struggle to understand finer
levels of granularity such as affordances. To handle this issue, we conduct a
comprehensive analysis of existing foundation models, to explore their inherent
understanding of affordances and assess the potential for data-limited
affordance learning. We then propose a vision-language framework with simple
and effective designs that boost the alignment between visual features and
affordance text embeddings. Experiments on two affordance segmentation
benchmarks show that the proposed method outperforms state-of-the-art models
with less than 1% of the full training data, and exhibits reasonable
generalization capability on unseen objects and affordances.
Related papers
- Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models [24.579822095003685]
We conduct an empirical study on representation learning for downstream Visual Question Answering (VQA)
We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches.
arXiv Detail & Related papers (2024-07-22T12:26:08Z) - Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification [49.41632476658246]
We discuss the extension of DFKD to Vision-Language Foundation Models without access to the billion-level image-text datasets.
The objective is to customize a student model for distribution-agnostic downstream tasks with given category concepts.
We propose three novel Prompt Diversification methods to encourage image synthesis with diverse styles.
arXiv Detail & Related papers (2024-07-21T13:26:30Z) - Few Shot Class Incremental Learning using Vision-Language models [24.930246674021525]
In this study, we introduce an innovative few-shot class incremental learning (FSCIL) framework that utilizes language regularizer and subspace regularizer.
Our proposed framework not only empowers the model to embrace novel classes with limited data, but also ensures the preservation of performance on base classes.
arXiv Detail & Related papers (2024-05-02T06:52:49Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z) - Low-shot Object Learning with Mutual Exclusivity Bias [27.67152913041082]
This paper introduces Low-shot Object Learning with Mutual Exclusivity Bias (LSME), the first computational framing of mutual exclusivity bias.
We provide a novel dataset, comprehensive baselines, and a state-of-the-art method to enable the ML community to tackle this challenging learning task.
arXiv Detail & Related papers (2023-12-06T14:54:10Z) - Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z) - Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models [39.479912987123214]
Self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks.
We introduce Fusioner, with a lightweight, transformer-based fusion module, that pairs the frozen visual representation with language concept.
We show that, the proposed fusion approach is effective to any pair of visual and language models, even those pre-trained on a corpus of uni-modal data.
arXiv Detail & Related papers (2022-10-27T02:57:26Z) - A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z) - An Empirical Investigation of Representation Learning for Imitation [76.48784376425911]
Recent work in vision, reinforcement learning, and NLP has shown that auxiliary representation learning objectives can reduce the need for large amounts of expensive, task-specific data.
We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation.
arXiv Detail & Related papers (2022-05-16T11:23:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.