VIRTUE: Visual-Interactive Text-Image Universal Embedder
- URL: http://arxiv.org/abs/2510.00523v1
- Date: Wed, 01 Oct 2025 05:11:54 GMT
- Title: VIRTUE: Visual-Interactive Text-Image Universal Embedder
- Authors: Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji
- Abstract summary: Existing embedding models lack visual-interactive capabilities that let users specify regions of interest.
We propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning.
- Score: 35.87241345903579
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities that let users specify regions of interest (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and the image scene. VIRTUE consistently achieves state-of-the-art performance with significant improvements across 36 universal MMEB (3.1%-8.5%) and five visual-interactive SCaR (15.2%-20.3%) tasks.
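To make the visual-interaction idea concrete, here is a minimal, hypothetical sketch (not VIRTUE's actual architecture): a visual prompt is reduced to a binary patch mask, pooled into an entity-level embedding, and fused with the global image embedding before caption retrieval. All names, dimensions, and the fixed fusion weight are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def embed_with_visual_prompt(patch_feats, mask, w_entity=0.5):
        """patch_feats: (N, D) patch features; mask: (N,) binary mask from a visual prompt."""
        global_emb = patch_feats.mean(dim=0)                                # scene-level summary
        denom = mask.sum().clamp(min=1.0)
        entity_emb = (patch_feats * mask.unsqueeze(-1)).sum(dim=0) / denom  # region-level summary
        fused = (1.0 - w_entity) * global_emb + w_entity * entity_emb       # assumed fusion rule
        return F.normalize(fused, dim=-1)

    # toy usage: 196 patches of dim 768, a prompt covering patches 10..20
    feats = torch.randn(196, 768)
    mask = torch.zeros(196)
    mask[10:21] = 1.0
    query = embed_with_visual_prompt(feats, mask)
    captions = F.normalize(torch.randn(5, 768), dim=-1)   # candidate caption embeddings
    best = (captions @ query).argmax()                    # SCaR-style nearest-caption lookup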
Related papers
- Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents [99.62178668680578]
We propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer.
VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images.
To capture complex cross-modal relationships in web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments.
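A schematic of the snippet-level contrastive objective, assuming (as the summary suggests) that consecutive segments of the same document form positive pairs; the symmetric InfoNCE form and the temperature value are standard choices, not details confirmed by the abstract.

    import torch
    import torch.nn.functional as F

    def snippet_contrastive_loss(emb_a, emb_b, temperature=0.07):
        """emb_a[i] and emb_b[i] embed consecutive snippets of the same document."""
        a = F.normalize(emb_a, dim=-1)
        b = F.normalize(emb_b, dim=-1)
        logits = a @ b.t() / temperature          # (B, B) cosine similarities
        targets = torch.arange(a.size(0))         # positives sit on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    loss = snippet_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))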
arXiv Detail & Related papers (2025-10-21T14:59:29Z)
- Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains [31.828341309787042]
Vision-language models (VLMs) achieve remarkable success in single-image tasks.
Real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline.
We propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs' perception, comprehension, and reasoning abilities in multi-image scenarios.
arXiv Detail & Related papers (2025-04-28T19:02:18Z)
- Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model [19.091907959433073]
We introduce AffectVLM, a vision-language model designed to integrate multiple views for a semantically rich and visually comprehensive understanding of facial emotions from 3D/4D data.
We propose a joint representation learning framework paired with a novel gradient-friendly loss function that accelerates model convergence towards optimal feature representation.
We also develop a Streamlit app for real-time interactive inference and enable distributed learning for the model.
arXiv Detail & Related papers (2025-04-28T12:36:14Z)
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We study the potential for building universal embeddings capable of handling a wide range of downstream tasks.
We build a series of VLM2Vec models on SoTA VLMs like Phi-3.5-V, LLaVA-1.6 and evaluate them on MMEB's evaluation split.
Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models.
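One common recipe for turning a decoder-only VLM into an embedder, shown here as a hedged sketch (the exact pooling VLM2Vec uses should be checked against the paper): take the final hidden state at each sequence's last non-padding token as the embedding, then train it contrastively.

    import torch

    def last_token_pool(hidden_states, attention_mask):
        """Pool a decoder-only model's final layer at each sequence's last real token."""
        last_idx = attention_mask.sum(dim=1) - 1     # index of the last non-pad token
        rows = torch.arange(hidden_states.size(0))
        return hidden_states[rows, last_idx]         # (B, D), one embedding per input

    h = torch.randn(4, 32, 4096)                     # dummy (batch, seq, dim) states
    mask = torch.ones(4, 32, dtype=torch.long)
    mask[0, 20:] = 0                                 # first sequence padded after 20 tokens
    emb = last_token_pool(h, mask)                   # emb[0] is taken from position 19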
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
- More Pictures Say More: Visual Intersection Network for Open Set Object Detection [4.206612461069489]
We introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO).
VINO constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps.
Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands.
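A toy interpretation of the multi-image visual bank, with an exponential-moving-average update that is an assumption on my part (the abstract only says the bank preserves per-category semantic intersections across time steps):

    import torch

    class VisualBank:
        """Running per-category embeddings accumulated across images / time steps."""
        def __init__(self, num_classes, dim, momentum=0.9):
            self.bank = torch.zeros(num_classes, dim)
            self.momentum = momentum

        def update(self, class_id, embedding):
            # moving average keeps the category's semantic intersection stable
            self.bank[class_id] = (self.momentum * self.bank[class_id]
                                   + (1.0 - self.momentum) * embedding)

        def query(self, class_id):
            return self.bank[class_id]

    bank = VisualBank(num_classes=80, dim=256)
    bank.update(3, torch.randn(256))   # fold a new region embedding into category 3
    proto = bank.query(3)              # prototype for aligning region semantics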
arXiv Detail & Related papers (2024-08-26T05:52:35Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
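The "visual words" idea reduces to a learned projection from visual features to a distribution over the LLM vocabulary; the sketch below uses illustrative dimensions (visual_dim, vocab_size), not the paper's actual configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VisualWordHead(nn.Module):
        """Map continuous visual features to distributions over an LLM vocabulary."""
        def __init__(self, visual_dim=1024, vocab_size=32000):
            super().__init__()
            self.proj = nn.Linear(visual_dim, vocab_size)

        def forward(self, visual_feats):              # (..., visual_dim)
            return F.softmax(self.proj(visual_feats), dim=-1)

    head = VisualWordHead()
    dist = head(torch.randn(2, 16, 1024))   # 16 visual tokens per image -> visual words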
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Veagle: Advancements in Multimodal Representation Learning [0.0]
This paper introduces a novel approach to enhance the multimodal capabilities of existing models.
Our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works.
Our results indicate an improvement of 5-6% in performance, with Veagle outperforming existing models by a notable margin.
arXiv Detail & Related papers (2024-01-18T12:45:25Z)
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [55.65727739645824]
Chat-UniVi is a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos.
We employ a set of dynamic visual tokens to uniformly represent images and videos.
We leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details.
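Chat-UniVi's token merging is clustering-based; as a simpler stand-in that still conveys the multi-scale idea, the sketch below pools a ViT patch grid at several resolutions and concatenates the results into one visual token set. The grid size and scales are assumptions.

    import torch
    import torch.nn.functional as F

    def multi_scale_tokens(patch_feats, grid=14, scales=(1, 2, 7)):
        """Average-pool a (grid x grid) patch map into coarse-to-fine token sets."""
        d = patch_feats.size(-1)
        fmap = patch_feats.view(grid, grid, d).permute(2, 0, 1).unsqueeze(0)  # (1, D, H, W)
        levels = []
        for s in scales:
            pooled = F.adaptive_avg_pool2d(fmap, s)                      # (1, D, s, s)
            levels.append(pooled.flatten(2).transpose(1, 2).squeeze(0))  # (s*s, D)
        return torch.cat(levels, dim=0)

    tokens = multi_scale_tokens(torch.randn(196, 768))   # 1 + 4 + 49 = 54 tokens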
arXiv Detail & Related papers (2023-11-14T10:11:36Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
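Reading "lightweight Cross-Modal MLP" as a projection from the text encoder into SAM's prompt space, a hypothetical version might look like the following; the prompt dimension, number of sparse tokens, and dense map size are assumptions, not RefSAM's published values.

    import torch
    import torch.nn as nn

    class CrossModalMLP(nn.Module):
        """Project a referring-expression embedding into SAM-style prompts."""
        def __init__(self, text_dim=768, prompt_dim=256, n_sparse=2, dense_hw=64):
            super().__init__()
            self.n_sparse, self.prompt_dim, self.dense_hw = n_sparse, prompt_dim, dense_hw
            self.to_sparse = nn.Linear(text_dim, n_sparse * prompt_dim)
            self.to_dense = nn.Linear(text_dim, prompt_dim)

        def forward(self, text_emb):                  # (B, text_dim)
            b = text_emb.size(0)
            sparse = self.to_sparse(text_emb).view(b, self.n_sparse, self.prompt_dim)
            dense = self.to_dense(text_emb)[:, :, None, None].expand(
                -1, -1, self.dense_hw, self.dense_hw)
            return sparse, dense                      # shapes a SAM-style mask decoder expects

    mlp = CrossModalMLP()
    sparse, dense = mlp(torch.randn(2, 768))          # (2, 2, 256) and (2, 256, 64, 64)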
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval-based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
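In skeleton form, both retrieval modules come down to nearest-neighbor search over an embedded knowledge corpus, one index keyed by text and one by image; everything here (names, dimensions, the toy corpus) is illustrative.

    import torch
    import torch.nn.functional as F

    def retrieve(query_emb, corpus_embs, corpus_docs, k=2):
        """Return the k corpus entries nearest to the query (cosine similarity)."""
        sims = F.normalize(corpus_embs, dim=-1) @ F.normalize(query_emb, dim=-1)
        return [corpus_docs[i] for i in sims.topk(k).indices.tolist()]

    docs = ["passage_a", "passage_b", "passage_c", "passage_d"]
    corpus = torch.randn(4, 512)                           # pre-embedded knowledge corpus
    text_hits = retrieve(torch.randn(512), corpus, docs)   # text-index lookup for input text
    image_hits = retrieve(torch.randn(512), corpus, docs)  # image-index lookup for input image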
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning [22.464424641734652]
Cross-modal alignment is essential for vision-language pre-training models.
We propose a novel Semantic Completion Learning task to facilitate global-to-local alignment.
We also present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously.
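As a toy rendering of the global-to-local intuition (the paper's actual Semantic Completion task is more involved), the sketch below masks a fraction of local tokens and penalizes the survivors for failing to reconstruct the global embedding:

    import torch
    import torch.nn.functional as F

    def semantic_completion_loss(token_feats, global_target, mask_ratio=0.3):
        """Mask some local tokens; ask the rest to recover the global embedding."""
        keep = torch.rand(token_feats.size(0)) > mask_ratio   # drop ~30% of tokens
        completed = token_feats[keep].mean(dim=0)             # crude completion step
        return 1.0 - F.cosine_similarity(completed, global_target, dim=0)

    tokens = torch.randn(196, 768)                     # local cross-modal token features
    loss = semantic_completion_loss(tokens, tokens.mean(dim=0))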
arXiv Detail & Related papers (2022-11-24T06:39:16Z)