MLLMs-Augmented Visual-Language Representation Learning
- URL: http://arxiv.org/abs/2311.18765v3
- Date: Wed, 13 Mar 2024 08:47:32 GMT
- Title: MLLMs-Augmented Visual-Language Representation Learning
- Authors: Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou,
Kaipeng Zhang and Yang You
- Abstract summary: We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to generate multiple diverse extended captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
- Score: 70.5293060238008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual-language pre-training has achieved remarkable success in many
multi-modal tasks, largely attributed to the availability of large-scale
image-text datasets. In this work, we demonstrate that Multi-modal Large
Language Models (MLLMs) can enhance visual-language representation learning by
establishing richer image-text associations for image-text datasets. Our
approach is simple, utilizing MLLMs to generate multiple diverse extended captions for
each image. To prevent the bias introduced by MLLMs' hallucinations and
monotonous language styles, we propose "text shearing" to maintain the quality
and availability of extended captions. In image-text retrieval, without
introducing additional training costs, our method consistently obtains 5.6 to
35.0 and 16.8 to 46.1 point improvements in Recall@1 under the fine-tuning and
zero-shot settings, respectively. Notably, we obtain zero-shot results that are
comparable to fine-tuning on target datasets, which encourages more exploration
of the versatile use of MLLMs.
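A minimal sketch of the data-side pipeline the abstract describes, assuming an external MLLM captioner exposed as a hypothetical `caption_fn` callable and reading "text shearing" as truncating each extended caption to roughly the length of the raw web caption; the names and details here are illustrative, not the authors' exact implementation:

```python
from typing import Callable, Dict, List


def shear_caption(extended: str, raw: str) -> str:
    """Assumed form of "text shearing": cut an MLLM-extended caption down to
    the word budget of the raw web caption, limiting the room for hallucinated
    or stylistically monotonous tails."""
    budget = len(raw.split())
    return " ".join(extended.split()[:budget])


def augment_pairs(
    pairs: List[Dict],                           # [{"image": ..., "caption": str}, ...]
    caption_fn: Callable[[object], List[str]],   # hypothetical MLLM captioning hook
    n_captions: int = 4,
) -> List[Dict]:
    """Keep each raw image-text pair and add several sheared MLLM captions,
    so every image contributes multiple diverse pairs to pre-training."""
    augmented = []
    for pair in pairs:
        augmented.append(pair)  # keep the original caption
        for extended in caption_fn(pair["image"])[:n_captions]:
            augmented.append({
                "image": pair["image"],
                "caption": shear_caption(extended, pair["caption"]),
            })
    return augmented
```

In this sketch the augmentation touches only the caption side of the dataset; the pre-training objective and model are left unchanged.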
Related papers
- Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal Large Language Models (SAM).
By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models [11.683093317651517]
Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification.
We present a simple yet effective approach for zero-shot image classification using multimodal LLMs.
Our results demonstrate its remarkable effectiveness, surpassing benchmark accuracy on multiple datasets.
arXiv Detail & Related papers (2024-05-24T16:05:15Z) - Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID [44.372336186832584]
We study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database.
We obtain substantial training data via Multi-modal Large Language Models (MLLMs).
We introduce a novel method that automatically identifies words in a description that do not correspond with the image.
arXiv Detail & Related papers (2024-05-08T10:15:04Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z) - InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
Multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z) - Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)