ImageBind: One Embedding Space To Bind Them All
- URL: http://arxiv.org/abs/2305.05665v2
- Date: Wed, 31 May 2023 04:57:12 GMT
- Title: ImageBind: One Embedding Space To Bind Them All
- Authors: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan
Vasudev Alwala, Armand Joulin, Ishan Misra
- Abstract summary: ImageBind is an approach to learn a joint embedding across six different modalities.
We show that only image-paired data is sufficient to bind the modalities together.
- Score: 41.46167013891263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ImageBind, an approach to learn a joint embedding across six
different modalities - images, text, audio, depth, thermal, and IMU data. We
show that all combinations of paired data are not necessary to train such a
joint embedding, and only image-paired data is sufficient to bind the
modalities together. ImageBind can leverage recent large scale vision-language
models, and extends their zero-shot capabilities to new modalities just by
using their natural pairing with images. It enables novel emergent applications
'out-of-the-box' including cross-modal retrieval, composing modalities with
arithmetic, cross-modal detection and generation. The emergent capabilities
improve with the strength of the image encoder and we set a new
state-of-the-art on emergent zero-shot recognition tasks across modalities,
outperforming specialist supervised models. Finally, we show strong few-shot
recognition results outperforming prior work, and that ImageBind serves as a
new way to evaluate vision models for visual and non-visual tasks.
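The binding recipe can be illustrated with a short contrastive sketch: an image encoder anchors the shared space, and each additional modality is trained only against its image-paired samples with an InfoNCE loss, so no audio-depth, audio-text, or similar pairs are ever needed. The encoders below are toy stand-ins, not the released ImageBind models.

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized image and paired-modality embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = image_emb @ other_emb.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0), device=logits.device)
    # Each image matches its own paired sample; average both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy placeholder encoders; in the paper these are transformer encoders per modality.
image_encoder = torch.nn.Linear(512, 256)   # stands in for a large vision backbone
audio_encoder = torch.nn.Linear(128, 256)   # stands in for an audio spectrogram encoder
depth_encoder = torch.nn.Linear(64, 256)    # stands in for a depth encoder

images, audio, depth = torch.randn(8, 512), torch.randn(8, 128), torch.randn(8, 64)

# Only image-X pairs are needed: each modality is bound to the image space separately.
loss = info_nce(image_encoder(images), audio_encoder(audio)) \
     + info_nce(image_encoder(images), depth_encoder(depth))
loss.backward()

# Emergent "embedding arithmetic": compose an image query with an audio query by
# summing their normalized embeddings, then retrieve by cosine similarity as usual.
composed = F.normalize(F.normalize(image_encoder(images[:1]), dim=-1)
                       + F.normalize(audio_encoder(audio[:1]), dim=-1), dim=-1)
```

Because all modalities land in one space, the emergent applications listed in the abstract (cross-modal retrieval, composition by arithmetic, cross-modal detection and generation) reduce to cosine similarity against these embeddings.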
Related papers
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
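A rough sketch of the Chain-of-Spot region-focusing step described above: the model names a region of interest, that region is cropped from the original-resolution image, and both the global view and the crop are encoded so fine detail is preserved. The crop helper, box, and encoder below are generic placeholders, not the authors' implementation.

```python
import torch
from torchvision.transforms.functional import crop, resize

def focus_on_spot(image, box, patch_size=336):
    """Crop the predicted region of interest and resize it to the encoder's input size,
    so fine details are kept without upscaling the whole image."""
    top, left, height, width = box
    roi = crop(image, top, left, height, width)
    return resize(roi, [patch_size, patch_size])

image = torch.rand(3, 1024, 1024)                            # high-resolution input image
global_view = resize(image, [336, 336])                      # standard low-resolution view
roi_view = focus_on_spot(image, box=(400, 520, 200, 200))    # region the model asked about

# A stand-in vision encoder sees both views; their tokens are concatenated before being
# handed back to the language model for a second round of reasoning.
vision_encoder = torch.nn.Conv2d(3, 64, kernel_size=14, stride=14)
tokens = torch.cat([vision_encoder(global_view.unsqueeze(0)).flatten(2),
                    vision_encoder(roi_view.unsqueeze(0)).flatten(2)], dim=-1)
```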
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
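The meta-prompt idea above can be sketched as a small set of learnable embeddings that query a frozen backbone through cross-attention, with only the prompts and a task head being trained. The modules below are simplified stand-ins, not the paper's pre-trained diffusion UNet.

```python
import torch
import torch.nn as nn

class PromptedFeatureExtractor(nn.Module):
    """Frozen backbone + learnable meta prompts that query its feature maps via cross-attention."""
    def __init__(self, feat_dim=256, num_prompts=16):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)  # stand-in, kept frozen
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.meta_prompts = nn.Parameter(torch.randn(num_prompts, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.task_head = nn.Linear(feat_dim, 1)   # toy head standing in for a depth/segmentation decoder

    def forward(self, x):
        feats = self.backbone(x).flatten(2).transpose(1, 2)       # (B, HW, C) frozen features
        prompts = self.meta_prompts.expand(x.size(0), -1, -1)     # (B, P, C) learnable queries
        pooled, _ = self.cross_attn(prompts, feats, feats)        # prompts attend to the features
        return self.task_head(pooled)                             # task-specific prediction

model = PromptedFeatureExtractor()
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
out = model(torch.rand(2, 3, 224, 224))   # (2, 16, 1)
```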
- Multi-View Unsupervised Image Generation with Cross Attention Guidance [23.07929124170851]
This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets.
We identify object poses by clustering the dataset through comparing visibility and locations of specific object parts.
Our model, MIRAGE, surpasses prior work in novel view synthesis on real images.
arXiv Detail & Related papers (2023-12-07T14:55:13Z)
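The pose-discovery step described above (grouping images by which object parts are visible and where) can be approximated with a plain k-means over part-visibility descriptors; the part detector that would produce those descriptors is assumed and faked here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Suppose a part detector yields, for each image, a visibility flag and an (x, y) location
# for K object parts (e.g. wheels and headlights for cars). Here we fake those outputs.
rng = np.random.default_rng(0)
num_images, num_parts = 500, 8
visibility = rng.integers(0, 2, size=(num_images, num_parts))      # 1 if part is visible
locations = rng.random((num_images, num_parts, 2))                 # normalized (x, y) positions

# Mask out locations of invisible parts so they do not influence the descriptor.
descriptors = np.concatenate(
    [visibility, (locations * visibility[..., None]).reshape(num_images, -1)], axis=1)

# Cluster; each cluster id acts as a discrete pose label for conditioning the diffusion model.
pose_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(descriptors)
print(np.bincount(pose_labels))
```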
- FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction [43.17713130538514]
We introduce a centralized graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss.
FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE and Payment benchmarks with a more compact model size.
arXiv Detail & Related papers (2023-05-04T05:02:04Z)
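As a loose illustration of one contrastive loss covering several modalities on a document graph: build two corrupted views of the same graph by dropping a different modality in each, encode them with a shared message-passing step, and pull corresponding node embeddings together with InfoNCE. This is a generic graph-contrastive sketch, not FormNetV2's architecture.

```python
import torch
import torch.nn.functional as F

def encode(node_feats, adj, w):
    """One round of mean-aggregation message passing followed by a shared projection."""
    h = adj @ node_feats / adj.sum(-1, keepdim=True).clamp(min=1)
    return F.normalize(torch.tanh(h @ w), dim=-1)

def modality_dropout(text, layout, image, drop):
    """Zero out one modality's features to create a corrupted view, then concatenate."""
    parts = {"text": text, "layout": layout, "image": image}
    return torch.cat([torch.zeros_like(v) if k == drop else v for k, v in parts.items()], dim=-1)

N = 32
adj = (torch.rand(N, N) > 0.8).float()                           # toy document graph
text, layout, image = torch.randn(N, 64), torch.randn(N, 8), torch.randn(N, 32)
w = torch.randn(64 + 8 + 32, 128, requires_grad=True)

z1 = encode(modality_dropout(text, layout, image, drop="image"), adj, w)
z2 = encode(modality_dropout(text, layout, image, drop="layout"), adj, w)

logits = z1 @ z2.t() / 0.1                                       # node-wise InfoNCE, one loss for all modalities
loss = F.cross_entropy(logits, torch.arange(N))
loss.backward()
```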
- Neural Congealing: Aligning Images to a Joint Semantic Atlas [14.348512536556413]
We present a zero-shot self-supervised framework for aligning semantically-common content across a set of images.
Our approach harnesses the power of pre-trained DINO-ViT features to learn a joint semantic atlas and the mappings that align each image to it.
We show that our method performs favorably compared to a state-of-the-art method that requires extensive training on large-scale datasets.
arXiv Detail & Related papers (2023-02-08T09:26:22Z)
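A minimal sketch of "harnessing pre-trained DINO-ViT features": extract patch tokens from two images with a frozen DINO ViT and find nearest-neighbour patches, the kind of semantic signal the atlas and alignments are learned from. The model id and helper come from the public facebookresearch/dino repository; the atlas optimization itself is not shown.

```python
import torch
import torch.nn.functional as F

# Frozen self-supervised ViT from the public DINO repo (downloaded via torch.hub).
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits8")
dino.eval()

imgs = torch.rand(2, 3, 224, 224)  # two images assumed to share semantic content

with torch.no_grad():
    tokens = dino.get_intermediate_layers(imgs, n=1)[0]     # (2, 1 + num_patches, dim)
patches = F.normalize(tokens[:, 1:, :], dim=-1)             # drop the CLS token

# Cosine similarity between every patch of image 0 and every patch of image 1;
# row-wise argmax gives nearest-neighbour semantic correspondences.
sim = patches[0] @ patches[1].t()
matches = sim.argmax(dim=1)
print(matches.shape)   # one matching patch index in image 1 per patch of image 0
```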
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve performance on both tasks.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
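A simplified stand-in for the retrieval modules described above: look up related passages for the input text (and, analogously, for the input image) in a knowledge corpus and prepend them as extra context before running the NER/RE model. TF-IDF replaces the paper's retrievers purely for illustration; the corpus and sentence are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_corpus = [
    "Lionel Messi is an Argentine footballer who plays for Inter Miami.",
    "The Eiffel Tower is a landmark in Paris, France.",
    "Apple Inc. is a technology company headquartered in Cupertino.",
]

def retrieve(query, corpus, top_k=1):
    """Return the top-k corpus passages most similar to the query (toy TF-IDF retriever)."""
    vectorizer = TfidfVectorizer().fit(corpus + [query])
    scores = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(corpus))[0]
    return [corpus[i] for i in scores.argsort()[::-1][:top_k]]

sentence = "Messi scored twice for Miami last night."
context = retrieve(sentence, knowledge_corpus)
# The retrieved knowledge is concatenated with the input before tagging entities and relations.
augmented_input = " ".join(context) + " [SEP] " + sentence
print(augmented_input)
```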
- Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval [55.21569389894215]
We propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding them.
Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
arXiv Detail & Related papers (2022-10-19T11:50:14Z)
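The fusion step summarized above, photo and sketch tokens attending to each other before a shared representation is formed, can be sketched with standard multi-head cross-attention; the later distillation into independent per-modality encoders is only noted in comments. Dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Photo tokens attend to sketch tokens and vice versa; outputs are pooled and fused."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.photo_to_sketch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sketch_to_photo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, photo_tokens, sketch_tokens):
        p, _ = self.photo_to_sketch(photo_tokens, sketch_tokens, sketch_tokens)
        s, _ = self.sketch_to_photo(sketch_tokens, photo_tokens, photo_tokens)
        fused = torch.cat([p.mean(dim=1), s.mean(dim=1)], dim=-1)
        return self.proj(fused)   # fused representation that unifies both modalities

fusion = CrossModalFusion()
photo, sketch = torch.randn(4, 196, 256), torch.randn(4, 64, 256)
teacher_emb = fusion(photo, sketch)
# In a second stage, independent photo-only and sketch-only encoders would be trained to
# mimic these fused embeddings via contrastive and relational knowledge distillation.
```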
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
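The "reinforcement learning without direct supervision" idea above can be illustrated with a toy policy that samples caption words and is rewarded by CLIP image-text similarity via REINFORCE. The real system fine-tunes a pretrained language model; the toy vocabulary, five-word policy, and image path below are placeholders, while the CLIP calls follow the openai/CLIP package.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

vocab = ["a", "dog", "cat", "running", "sleeping", "on", "grass", "sofa"]
logits = torch.zeros(5, len(vocab), device=device, requires_grad=True)  # toy 5-word "policy"
optimizer = torch.optim.Adam([logits], lr=0.1)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)   # placeholder image path
with torch.no_grad():
    image_emb = clip_model.encode_image(image).float()
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

dist = torch.distributions.Categorical(logits=logits)
words = dist.sample()                                # sample one word per position
caption = " ".join(vocab[i] for i in words.tolist())

with torch.no_grad():
    text_emb = clip_model.encode_text(clip.tokenize([caption]).to(device)).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
reward = (image_emb * text_emb).sum()                # CLIP similarity acts as the alignment reward

loss = -(reward.detach() * dist.log_prob(words).sum())   # REINFORCE: no caption labels needed
optimizer.zero_grad()
loss.backward()
optimizer.step()
```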
- Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis [39.53519330457627]
We propose a novel task of joint few-shot recognition and novel-view synthesis.
We aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints.
We focus on the interaction and cooperation between a generative model and a discriminative model.
arXiv Detail & Related papers (2020-08-16T19:40:56Z)
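A toy sketch of the generator-classifier cooperation described in the Bowtie Networks summary: a view-conditioned generator synthesizes additional views of the few-shot classes, and a single joint loss over real and synthesized samples trains both networks. All modules, shapes, and the view code are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a view-conditioned generator and a few-shot classifier trained together.
generator = nn.Sequential(nn.Linear(64 + 3, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32))
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 5))
opt = torch.optim.Adam(list(generator.parameters()) + list(classifier.parameters()), lr=1e-3)

few_shot_images = torch.rand(5, 3, 32, 32)          # one image per class (5-way, 1-shot)
labels = torch.arange(5)

# The generator hallucinates the same objects from new viewpoints (a 3-d view code here);
# the synthesized samples augment the classifier, whose loss also updates the generator.
codes = torch.randn(5, 64)
new_views = torch.rand(5, 3)
synth = generator(torch.cat([codes, new_views], dim=-1)).view(5, 3, 32, 32)

loss = F.cross_entropy(classifier(few_shot_images), labels) \
     + F.cross_entropy(classifier(synth), labels)
opt.zero_grad()
loss.backward()
opt.step()
```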
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.