Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild
- URL: http://arxiv.org/abs/2404.18459v3
- Date: Thu, 19 Dec 2024 08:47:07 GMT
- Title: Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild
- Authors: Donggyun Kim, Seongwoong Cho, Semin Kim, Chong Luo, Seunghoon Hong
- Abstract summary: Large language models have evolved into data-efficient generalists, benefiting from the universal language interface and large-scale pre-training. However, constructing a data-efficient generalist for dense visual prediction presents a distinct challenge due to the variation in label structures across different tasks. In this study, we explore a universal model that can flexibly adapt to unseen dense label structures with a few examples. We evaluate our model across a spectrum of unseen real-world scenarios where low-shot learning is desirable, including video, 3D, medical, biological, and user-interactive tasks.
- Score: 32.33035216140421
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models have evolved into data-efficient generalists, benefiting from the universal language interface and large-scale pre-training. However, constructing a data-efficient generalist for dense visual prediction presents a distinct challenge due to the variation in label structures across different tasks. Consequently, generalization to unseen dense prediction tasks in the low-data regime is not straightforward and has received less attention from previous vision generalists. In this study, we explore a universal model that can flexibly adapt to unseen dense label structures with a few examples, enabling it to serve as a data-efficient vision generalist in diverse real-world scenarios. To this end, we base our method on a powerful meta-learning framework and explore several axes to improve its performance and versatility for real-world problems, such as flexible adaptation mechanisms and scalability. We evaluate our model across a spectrum of unseen real-world scenarios where low-shot learning is desirable, including video, 3D, medical, biological, and user-interactive tasks. Equipped with a generic architecture and an effective adaptation mechanism, our model flexibly adapts to all of these tasks with at most 50 labeled images, showcasing a significant advancement over existing data-efficient generalist approaches. Code is available at https://github.com/GitGyun/chameleon.
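To make the low-shot adaptation setting concrete, the sketch below shows one plausible way a pre-trained dense predictor could be fitted to an unseen task from at most 50 labeled support images: freeze a large encoder and train only a lightweight task head. This is a minimal illustration under those assumptions, not the Chameleon implementation; the names `DensePredictor` and `adapt_few_shot` and the frozen-encoder design are hypothetical.

```python
# Minimal sketch: adapt a pre-trained dense predictor to an unseen task with <= 50 labels.
# Hypothetical names and design (frozen encoder + lightweight head); not the Chameleon code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DensePredictor(nn.Module):
    def __init__(self, encoder: nn.Module, label_channels: int, feat_dim: int = 256):
        super().__init__()
        self.encoder = encoder                                  # large pre-trained backbone, kept frozen
        self.decoder = nn.Conv2d(feat_dim, label_channels, 1)   # tiny task-specific head

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                                   # only the head is adapted
            feats = self.encoder(images)                        # (B, feat_dim, H', W')
        logits = self.decoder(feats)
        return F.interpolate(logits, size=images.shape[-2:], mode="bilinear", align_corners=False)

def adapt_few_shot(model: DensePredictor, support_images, support_labels,
                   steps: int = 100, lr: float = 1e-3) -> DensePredictor:
    """Fit the task head on a small labeled support set (dense regression labels, e.g. depth)."""
    opt = torch.optim.Adam(model.decoder.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(model(support_images), support_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Usage with a dummy encoder standing in for a real pre-trained backbone:
encoder = nn.Conv2d(3, 256, kernel_size=3, stride=4, padding=1)
model = DensePredictor(encoder, label_channels=1)
images, labels = torch.randn(16, 3, 128, 128), torch.randn(16, 1, 128, 128)
model = adapt_few_shot(model, images, labels, steps=10)
```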
Related papers
- Adapting Vision-Language Models Without Labels: A Comprehensive Survey [74.17944178027015]
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. Recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms.
arXiv Detail & Related papers (2025-08-07T16:27:37Z) - RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping [101.22617426879079]
We build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. We propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network conditioned on an affordance map to grasp the target.
arXiv Detail & Related papers (2025-07-31T17:17:05Z) - Open-Set Semi-Supervised Learning for Long-Tailed Medical Datasets [17.82752126823939]
Real-world generalization requires taking into account the various complexities that can be encountered in the real world. We propose an open-set learning method for highly imbalanced medical datasets using a semi-supervised approach. Our analysis shows that addressing the impact of long-tail data in classification significantly improves the overall performance of the network.
arXiv Detail & Related papers (2025-05-20T19:21:38Z) - Attribute-Based Robotic Grasping with Data-Efficient Adaptation [19.683833436076313]
We present an end-to-end encoder-decoder network to learn attribute-based robotic grasping.
Our approach achieves over 81% instance grasping success rate on unknown objects.
arXiv Detail & Related papers (2025-01-04T00:37:17Z) - Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - GiT: Towards Generalist Vision Transformer through Universal Language Interface [94.33443158125186]
This paper proposes a simple yet effective framework, called GiT, simultaneously applicable to various vision tasks with only a vanilla ViT.
GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-03-14T13:47:41Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - General Object Foundation Model for Images and Videos at Scale [99.2806103051613]
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos.
GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario.
We employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling the model to simultaneously solve various object-centric downstream tasks.
arXiv Detail & Related papers (2023-12-14T17:26:00Z) - Data Factors for Better Compositional Generalization [60.698130703909804]
We conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors.
We show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges.
We explore how training examples of different difficulty levels influence generalization differently.
arXiv Detail & Related papers (2023-11-08T01:27:34Z) - RAHNet: Retrieval Augmented Hybrid Network for Long-tailed Graph Classification [10.806893809269074]
We propose a novel framework called Retrieval Augmented Hybrid Network (RAHNet) to jointly learn a robust feature extractor and an unbiased classifier.
In the feature extractor training stage, we develop a graph retrieval module to search for relevant graphs that directly enrich the intra-class diversity for the tail classes.
We also optimize a category-centered supervised contrastive loss to obtain discriminative representations; a generic sketch of such a loss appears after this list.
arXiv Detail & Related papers (2023-08-04T14:06:44Z) - FedBone: Towards Large-Scale Federated Multi-Task Learning [13.835972363413884]
In real-world applications, visual and natural language tasks typically require large-scale models to extract high-level abstract features.
Existing HFML methods disregard the impact of gradient conflicts on multi-task optimization.
We propose an innovative framework called FedBone, which enables the construction of large-scale models with better generalization.
arXiv Detail & Related papers (2023-06-30T08:19:38Z) - Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks [86.66733026149892]
We propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks.
Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model.
Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
arXiv Detail & Related papers (2022-11-17T18:59:52Z) - SuperCone: Modeling Heterogeneous Experts with Concept Meta-learning for Unified Predictive Segments System [8.917697023052257]
We present SuperCone, our unified predictive segments system.
It builds on top of a flat concept representation that summarizes each user's heterogeneous digital footprints.
It can outperform state-of-the-art recommendation and ranking algorithms on a wide range of predictive segment tasks.
arXiv Detail & Related papers (2022-03-09T04:11:39Z) - Does language help generalization in vision models? [0.0]
We show that a visual model trained on a very large supervised image dataset (ImageNet-21k) can be as efficient for generalization as its multimodal counterpart (CLIP).
When compared to other standard visual or language models, the latent representations of BiT-M were found to be just as "linguistic" as those of CLIP.
arXiv Detail & Related papers (2021-04-16T18:54:14Z)
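As referenced in the RAHNet entry above, a category-centered supervised contrastive loss pulls each embedding toward its own class center and away from the other centers. The sketch below is a generic, hypothetical formulation of such a loss; the exact RAHNet objective may differ.

```python
# Generic class-centered contrastive loss (illustrative; the exact RAHNet objective may differ).
import torch
import torch.nn.functional as F

def center_contrastive_loss(features: torch.Tensor, labels: torch.Tensor,
                            centers: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """features: (N, D) embeddings; labels: (N,) class indices; centers: (C, D) learnable class centers."""
    features = F.normalize(features, dim=1)          # compare directions, not magnitudes
    centers = F.normalize(centers, dim=1)
    logits = features @ centers.t() / temperature    # (N, C) similarity of each sample to every center
    return F.cross_entropy(logits, labels)           # softmax over centers: attract own center, repel the rest

# Usage on random data:
feats, labels = torch.randn(32, 128), torch.randint(0, 10, (32,))
centers = torch.randn(10, 128, requires_grad=True)
loss = center_contrastive_loss(feats, labels, centers)
loss.backward()
```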