MATE: Meet At The Embedding -- Connecting Images with Long Texts
- URL: http://arxiv.org/abs/2407.09541v1
- Date: Wed, 26 Jun 2024 14:10:00 GMT
- Title: MATE: Meet At The Embedding -- Connecting Images with Long Texts
- Authors: Young Kyun Jang, Junmo Kang, Yong Jae Lee, Donghyun Kim,
- Abstract summary: Meet At The Embedding (MATE) is a novel approach that combines the capabilities of Large Language Models (LLMs) with Vision Language Models (VLMs)
We replace the text encoder of the VLM with a pretrained LLM-based encoder that excels in understanding long texts.
We propose two new cross-modal retrieval benchmarks to assess the task of connecting images with long texts.
- Score: 37.27283238166393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While advancements in Vision Language Models (VLMs) have significantly improved the alignment of visual and textual data, these models primarily focus on aligning images with short descriptive captions. This focus limits their ability to handle complex text interactions, particularly with longer texts such as lengthy captions or documents, which have not been extensively explored yet. In this paper, we introduce Meet At The Embedding (MATE), a novel approach that combines the capabilities of VLMs with Large Language Models (LLMs) to overcome this challenge without the need for additional image-long text pairs. Specifically, we replace the text encoder of the VLM with a pretrained LLM-based encoder that excels in understanding long texts. To bridge the gap between VLM and LLM, MATE incorporates a projection module that is trained in a multi-stage manner. It starts by aligning the embeddings from the VLM text encoder with those from the LLM using extensive text pairs. This module is then employed to seamlessly align image embeddings closely with LLM embeddings. We propose two new cross-modal retrieval benchmarks to assess the task of connecting images with long texts (lengthy captions / documents). Extensive experimental results demonstrate that MATE effectively connects images with long texts, uncovering diverse semantic relationships.
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding [63.09928907734156]
AlignVLM is a vision-text alignment method that maps visual features to a weighted average of text embeddings.
Our experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods.
arXiv Detail & Related papers (2025-02-03T13:34:51Z) - HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding [91.0552157725366]
This paper presents a novel high-performance monolithic VLM named HoVLE.
It converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts.
Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks.
arXiv Detail & Related papers (2024-12-20T18:59:59Z) - FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion [7.322448493179106]
Flow Text with Image Insertion task requires LVLMs to simultaneously possess outstanding abilities in image comprehension, instruction understanding, and long-text interpretation.
We introduce the Flow Text with Image Insertion Benchmark (FTII-Bench), which includes 318 high-quality Chinese image-text news articles and 307 high-quality English image-text news articles, covering 10 different news domains.
We evaluate 9 open-source and 2 closed-source LVLMs as well as 2 CLIP-based models.
arXiv Detail & Related papers (2024-10-16T13:38:31Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Wings: Learning Multimodal LLMs without Text-only Forgetting [63.56085426442873]
Wings is a novel MLLM that excels in both text-only dialogues and multimodal comprehension.
Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks.
arXiv Detail & Related papers (2024-06-05T17:59:40Z) - LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation [51.08810811457617]
vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual IO.
We develop a method for instruction-tuning an LLM only on text to gain vision-language capabilities for medical images.
Our model, LLM-CXR, trained in this approach shows better image-text alignment in both CXR understanding and generation tasks.
arXiv Detail & Related papers (2023-05-19T07:44:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.