MIVC: Multiple Instance Visual Component for Visual-Language Models
- URL: http://arxiv.org/abs/2312.17109v1
- Date: Thu, 28 Dec 2023 16:33:32 GMT
- Title: MIVC: Multiple Instance Visual Component for Visual-Language Models
- Authors: Wenyi Wu, Qi Li, Wenliang Zhong, Junzhou Huang
- Abstract summary: We propose MIVC, a general multiple instance visual component that bridges the gap between varying image inputs and off-the-shelf vision-language models.
We show that MIVC can be plugged into vision-language models to consistently improve performance on visual question answering, classification and captioning tasks.
- Score: 46.869139462026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models have been widely explored across a wide range of tasks and achieve satisfactory performance. However, how to consolidate entity understanding across a varying number of images, and how to align it with pre-trained language models for generative tasks, remains under-explored. In this paper, we propose MIVC, a general multiple instance visual component that bridges the gap between varying image inputs and off-the-shelf vision-language models by aggregating visual representations in a permutation-invariant fashion through a neural network. We show that MIVC can be plugged into vision-language models to consistently improve performance on visual question answering, classification and captioning tasks on a publicly available e-commerce dataset with multiple images per product. Furthermore, we show that the component provides insight into the contribution of each image to the downstream tasks.
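The core mechanism described in the abstract, aggregating a variable number of per-image representations in a permutation-invariant way while exposing each image's contribution, can be illustrated with a short sketch. The gated attention pooling below (in the style of attention-based multiple instance learning) is an illustrative assumption; the class name, dimensions and pooling form are not taken from the paper, whose exact architecture the abstract does not specify.

    import torch
    import torch.nn as nn

    class MultiInstancePooling(nn.Module):
        """Pool a variable number of image embeddings into one vector, independent of image order."""
        def __init__(self, embed_dim: int = 768, attn_dim: int = 256):
            super().__init__()
            self.value = nn.Linear(embed_dim, attn_dim)   # tanh branch
            self.gate = nn.Linear(embed_dim, attn_dim)    # sigmoid gate branch
            self.score = nn.Linear(attn_dim, 1)           # per-image attention logit

        def forward(self, image_embeddings):
            # image_embeddings: (num_images, embed_dim), one row per product image
            h = torch.tanh(self.value(image_embeddings)) * torch.sigmoid(self.gate(image_embeddings))
            weights = torch.softmax(self.score(h), dim=0)        # (num_images, 1), sums to 1
            pooled = (weights * image_embeddings).sum(dim=0)     # (embed_dim,), permutation-invariant
            return pooled, weights.squeeze(-1)                   # weights reflect per-image contribution

    # Usage: fuse the features of, e.g., five product images into a single visual
    # input for an off-the-shelf vision-language model.
    pooler = MultiInstancePooling()
    fused, contribution = pooler(torch.randn(5, 768))

The returned weights are one way to read off how much each image contributes to the downstream prediction, mirroring the interpretability claim in the abstract.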
Related papers
- OCC-MLLM: Empowering Multimodal Large Language Model For the Understanding of Occluded Objects [2.850097504458451]
We introduce a novel multimodal model that applies a newly designed visual encoder to understand occluded objects in RGB images.
We also introduce a large-scale visual-language pair dataset for training large-scale visual-language multimodal models.
arXiv Detail & Related papers (2024-10-02T06:14:49Z)
- Veagle: Advancements in Multimodal Representation Learning [0.0]
This paper introduces a novel approach to enhance the multimodal capabilities of existing models.
Our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works.
Our results indicate an improvement of 5-6% in performance, with Veagle outperforming existing models by a notable margin.
arXiv Detail & Related papers (2024-01-18T12:45:25Z)
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [55.65727739645824]
Chat-UniVi is a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos.
We employ a set of dynamic visual tokens to uniformly represent images and videos.
We leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details.
arXiv Detail & Related papers (2023-11-14T10:11:36Z)
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
- PaLI-X: On Scaling up a Multilingual Vision and Language Model [166.9837904115951]
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model.
Our model achieves new levels of performance on a wide range of varied and complex tasks.
We observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
arXiv Detail & Related papers (2023-05-29T18:58:38Z)
- Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks [17.97052348690598]
Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms.
Multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models.
We make visual information accessible to the language model using separate verbalisation models.
arXiv Detail & Related papers (2023-05-23T07:50:36Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations over image-caption pairs.
ICMLM consists of predicting masked words in captions by relying on visual cues; a minimal sketch of this kind of proxy task follows this list.
Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
arXiv Detail & Related papers (2020-08-04T08:04:16Z)
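For the ICMLM-style proxy task above, the sketch below illustrates predicting masked caption words with visual conditioning. The prefix-token fusion, module names and dimensions are illustrative assumptions rather than the paper's architecture.

    import torch
    import torch.nn as nn

    class MaskedCaptionModel(nn.Module):
        """Predict masked caption tokens given a pooled image feature as a visual prefix token."""
        def __init__(self, vocab_size: int = 30522, text_dim: int = 512, image_dim: int = 2048):
            super().__init__()
            self.token_embed = nn.Embedding(vocab_size, text_dim)
            self.image_proj = nn.Linear(image_dim, text_dim)    # map pooled visual features into text space
            layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.mlm_head = nn.Linear(text_dim, vocab_size)     # predict the masked word

        def forward(self, caption_ids, image_feat):
            # caption_ids: (batch, seq_len) with mask-token ids at masked positions
            # image_feat:  (batch, image_dim) pooled visual features
            tokens = self.token_embed(caption_ids)                     # (B, L, D)
            visual = self.image_proj(image_feat).unsqueeze(1)          # (B, 1, D) visual prefix token
            fused = self.encoder(torch.cat([visual, tokens], dim=1))   # (B, 1+L, D)
            return self.mlm_head(fused[:, 1:, :])                      # vocabulary logits per caption position

Training would minimize cross-entropy at the masked positions only, so the visual prefix token becomes the model's main route to the missing word.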