Instruct Me More! Random Prompting for Visual In-Context Learning
- URL: http://arxiv.org/abs/2311.03648v1
- Date: Tue, 7 Nov 2023 01:39:00 GMT
- Title: Instruct Me More! Random Prompting for Visual In-Context Learning
- Authors: Jiahao Zhang, Bowen Wang, Liangzhi Li, Yuta Nakashima, Hajime Nagahara
- Abstract summary: Instruct Me More (InMeMo) is a method that augments in-context pairs with a learnable perturbation (prompt) to explore its potential.
Our experiments on mainstream tasks reveal that InMeMo surpasses the current state-of-the-art performance.
Our findings suggest that InMeMo offers a versatile and efficient way to enhance the performance of visual ICL with lightweight training.
- Score: 30.31759752239964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale models trained on extensive datasets have emerged as the
preferred approach due to their high generalizability across various tasks.
In-context learning (ICL), a popular strategy in natural language processing,
uses such models for different tasks by providing instructive prompts but
without updating model parameters. This idea is now being explored in computer
vision, where an input-output image pair (called an in-context pair) is
supplied to the model with a query image as a prompt to exemplify the desired
output. The efficacy of visual ICL often depends on the quality of the prompts.
We thus introduce a method coined Instruct Me More (InMeMo), which augments
in-context pairs with a learnable perturbation (prompt), to explore its
potential. Our experiments on mainstream tasks reveal that InMeMo surpasses the
current state-of-the-art performance. Specifically, compared to the baseline
without a learnable prompt, InMeMo boosts mIoU scores by 7.35 and 15.13 for
foreground segmentation and single object detection tasks, respectively. Our
findings suggest that InMeMo offers a versatile and efficient way to enhance
the performance of visual ICL with lightweight training. Code is available at
https://github.com/Jackieam/InMeMo.
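The abstract describes the core mechanism: an in-context pair and a query image are composed into a single visual prompt for a frozen large model, and InMeMo additionally learns a small pixel-level perturbation on the in-context pair. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the 2x2 canvas layout, the FrozenICLModel placeholder, and the zero-initialised perturbation are illustrative assumptions; the official code is at https://github.com/Jackieam/InMeMo.

```python
# Minimal sketch of the InMeMo idea, NOT the authors' implementation:
# a small learnable pixel perturbation is added to the in-context pair
# while the large visual ICL model stays frozen.
import torch
import torch.nn as nn


class LearnablePrompt(nn.Module):
    """Learnable perturbation (prompt) applied to the in-context pair images."""

    def __init__(self, image_size: int = 224):
        super().__init__()
        # One shared pixel-level perturbation, initialised to zero (assumption).
        self.delta = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # Perturb the image and keep pixel values in a valid range.
        return (img + self.delta).clamp(0.0, 1.0)


class FrozenICLModel(nn.Module):
    """Stand-in for the frozen large-scale visual ICL model (placeholder)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # dummy layer
        for p in self.parameters():
            p.requires_grad_(False)  # the large model is never updated

    def forward(self, canvas: torch.Tensor) -> torch.Tensor:
        return self.backbone(canvas)


def build_canvas(example_in, example_out, query, prompt: LearnablePrompt):
    """Arrange [example_in | example_out] over [query | blank] as one canvas,
    applying the learnable prompt only to the in-context pair."""
    top = torch.cat([prompt(example_in), prompt(example_out)], dim=-1)
    bottom = torch.cat([query, torch.zeros_like(query)], dim=-1)
    return torch.cat([top, bottom], dim=-2)


if __name__ == "__main__":
    prompt, model = LearnablePrompt(), FrozenICLModel()
    x_in, x_out, q = (torch.rand(1, 3, 224, 224) for _ in range(3))
    canvas = build_canvas(x_in, x_out, q, prompt)
    pred = model(canvas)
    loss = pred.square().mean()  # placeholder objective, not the paper's loss
    loss.backward()
    # Only the prompt's parameters receive gradients ("lightweight training").
    print(prompt.delta.grad is not None, canvas.shape)
```

Because only the perturbation receives gradients while the large model's weights are untouched, training stays lightweight, which is the property the abstract highlights.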
Related papers
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate Vision Transformer networks' need for very large annotated datasets.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification [11.562980171753162]
Generalizable person re-identification (Re-ID) is an active research topic in machine learning and computer vision.
Previous methods mainly focus on visual representation learning, while neglecting the potential of semantic features during training.
We propose a Multi-Modal Equivalent Transformer called MMET for more robust visual-semantic embedding learning.
arXiv Detail & Related papers (2023-04-19T08:37:25Z)
- Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning [38.37682598345653]
We introduce a multimodal meta-learning approach to bridge the gap between vision and language models.
We define a meta-mapper network, acting as a meta-learner, to efficiently bridge frozen large-scale vision and language models.
We evaluate our approach on recently proposed multimodal few-shot benchmarks, measuring how rapidly the model can bind novel visual concepts to words.
arXiv Detail & Related papers (2023-02-28T17:46:18Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.