ImageBind: One Embedding Space To Bind Them All
- URL: http://arxiv.org/abs/2305.05665v2
- Date: Wed, 31 May 2023 04:57:12 GMT
- Title: ImageBind: One Embedding Space To Bind Them All
- Authors: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan
Vasudev Alwala, Armand Joulin, Ishan Misra
- Abstract summary: ImageBind is an approach to learn a joint embedding across six different modalities.
We show that only image-paired data is sufficient to bind the modalities together.
- Score: 41.46167013891263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ImageBind, an approach to learn a joint embedding across six
different modalities - images, text, audio, depth, thermal, and IMU data. We
show that all combinations of paired data are not necessary to train such a
joint embedding, and only image-paired data is sufficient to bind the
modalities together. ImageBind can leverage recent large scale vision-language
models, and extends their zero-shot capabilities to new modalities just by
using their natural pairing with images. It enables novel emergent applications
'out-of-the-box' including cross-modal retrieval, composing modalities with
arithmetic, cross-modal detection and generation. The emergent capabilities
improve with the strength of the image encoder and we set a new
state-of-the-art on emergent zero-shot recognition tasks across modalities,
outperforming specialist supervised models. Finally, we show strong few-shot
recognition results outperforming prior work, and that ImageBind serves as a
new way to evaluate vision models for visual and non-visual tasks.
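The binding recipe can be illustrated with a short contrastive sketch: an image encoder anchors the shared space, and each additional modality is trained only against its image-paired samples with an InfoNCE loss, so no audio-depth, audio-text, or similar pairs are ever needed. The encoders below are toy stand-ins, not the released ImageBind models.

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized image and paired-modality embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = image_emb @ other_emb.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0), device=logits.device)
    # Each image matches its own paired sample; average both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy placeholder encoders; in the paper these are transformer encoders per modality.
image_encoder = torch.nn.Linear(512, 256)   # stands in for a large vision backbone
audio_encoder = torch.nn.Linear(128, 256)   # stands in for an audio spectrogram encoder
depth_encoder = torch.nn.Linear(64, 256)    # stands in for a depth encoder

images, audio, depth = torch.randn(8, 512), torch.randn(8, 128), torch.randn(8, 64)

# Only image-X pairs are needed: each modality is bound to the image space separately.
loss = info_nce(image_encoder(images), audio_encoder(audio)) \
     + info_nce(image_encoder(images), depth_encoder(depth))
loss.backward()

# Emergent "embedding arithmetic": compose an image query with an audio query by
# summing their normalized embeddings, then retrieve by cosine similarity as usual.
composed = F.normalize(F.normalize(image_encoder(images[:1]), dim=-1)
                       + F.normalize(audio_encoder(audio[:1]), dim=-1), dim=-1)
```

Because all modalities land in one space, the emergent applications listed in the abstract (cross-modal retrieval, composition by arithmetic, cross-modal detection and generation) reduce to cosine similarity against these embeddings.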
Related papers
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
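A rough sketch of the Chain-of-Spot region-focusing step described above: the model names a region of interest, that region is cropped from the original-resolution image, and both the global view and the crop are encoded so fine detail is preserved. The crop helper, box, and encoder below are generic placeholders, not the authors' implementation.

```python
import torch
from torchvision.transforms.functional import crop, resize

def focus_on_spot(image, box, patch_size=336):
    """Crop the predicted region of interest and resize it to the encoder's input size,
    so fine details are kept without upscaling the whole image."""
    top, left, height, width = box
    roi = crop(image, top, left, height, width)
    return resize(roi, [patch_size, patch_size])

image = torch.rand(3, 1024, 1024)                            # high-resolution input image
global_view = resize(image, [336, 336])                      # standard low-resolution view
roi_view = focus_on_spot(image, box=(400, 520, 200, 200))    # region the model asked about

# A stand-in vision encoder sees both views; their tokens are concatenated before being
# handed back to the language model for a second round of reasoning.
vision_encoder = torch.nn.Conv2d(3, 64, kernel_size=14, stride=14)
tokens = torch.cat([vision_encoder(global_view.unsqueeze(0)).flatten(2),
                    vision_encoder(roi_view.unsqueeze(0)).flatten(2)], dim=-1)
```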
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
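The meta-prompt idea above can be sketched as a small set of learnable embeddings that query a frozen backbone through cross-attention, with only the prompts and a task head being trained. The modules below are simplified stand-ins, not the paper's pre-trained diffusion UNet.

```python
import torch
import torch.nn as nn

class PromptedFeatureExtractor(nn.Module):
    """Frozen backbone + learnable meta prompts that query its feature maps via cross-attention."""
    def __init__(self, feat_dim=256, num_prompts=16):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)  # stand-in, kept frozen
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.meta_prompts = nn.Parameter(torch.randn(num_prompts, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.task_head = nn.Linear(feat_dim, 1)   # toy head standing in for a depth/segmentation decoder

    def forward(self, x):
        feats = self.backbone(x).flatten(2).transpose(1, 2)       # (B, HW, C) frozen features
        prompts = self.meta_prompts.expand(x.size(0), -1, -1)     # (B, P, C) learnable queries
        pooled, _ = self.cross_attn(prompts, feats, feats)        # prompts attend to the features
        return self.task_head(pooled)                             # task-specific prediction

model = PromptedFeatureExtractor()
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
out = model(torch.rand(2, 3, 224, 224))   # (2, 16, 1)
```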
- Multi-View Unsupervised Image Generation with Cross Attention Guidance [23.07929124170851]
This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets.
We identify object poses by clustering the dataset through comparing visibility and locations of specific object parts.
Our model, MIRAGE, surpasses prior work in novel view synthesis on real images.
arXiv Detail & Related papers (2023-12-07T14:55:13Z)
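The pose-discovery step described above (grouping images by which object parts are visible and where) can be approximated with a plain k-means over part-visibility descriptors; the part detector that would produce those descriptors is assumed and faked here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Suppose a part detector yields, for each image, a visibility flag and an (x, y) location
# for K object parts (e.g. wheels and headlights for cars). Here we fake those outputs.
rng = np.random.default_rng(0)
num_images, num_parts = 500, 8
visibility = rng.integers(0, 2, size=(num_images, num_parts))      # 1 if part is visible
locations = rng.random((num_images, num_parts, 2))                 # normalized (x, y) positions

# Mask out locations of invisible parts so they do not influence the descriptor.
descriptors = np.concatenate(
    [visibility, (locations * visibility[..., None]).reshape(num_images, -1)], axis=1)

# Cluster; each cluster id acts as a discrete pose label for conditioning the diffusion model.
pose_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(descriptors)
print(np.bincount(pose_labels))
```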
- FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction [43.17713130538514]
We introduce a centralized graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss.
FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE and Payment benchmarks with a more compact model size.
arXiv Detail & Related papers (2023-05-04T05:02:04Z)
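As a loose illustration of one contrastive loss covering several modalities on a document graph: build two corrupted views of the same graph by dropping a different modality in each, encode them with a shared message-passing step, and pull corresponding node embeddings together with InfoNCE. This is a generic graph-contrastive sketch, not FormNetV2's architecture.

```python
import torch
import torch.nn.functional as F

def encode(node_feats, adj, w):
    """One round of mean-aggregation message passing followed by a shared projection."""
    h = adj @ node_feats / adj.sum(-1, keepdim=True).clamp(min=1)
    return F.normalize(torch.tanh(h @ w), dim=-1)

def modality_dropout(text, layout, image, drop):
    """Zero out one modality's features to create a corrupted view, then concatenate."""
    parts = {"text": text, "layout": layout, "image": image}
    return torch.cat([torch.zeros_like(v) if k == drop else v for k, v in parts.items()], dim=-1)

N = 32
adj = (torch.rand(N, N) > 0.8).float()                           # toy document graph
text, layout, image = torch.randn(N, 64), torch.randn(N, 8), torch.randn(N, 32)
w = torch.randn(64 + 8 + 32, 128, requires_grad=True)

z1 = encode(modality_dropout(text, layout, image, drop="image"), adj, w)
z2 = encode(modality_dropout(text, layout, image, drop="layout"), adj, w)

logits = z1 @ z2.t() / 0.1                                       # node-wise InfoNCE, one loss for all modalities
loss = F.cross_entropy(logits, torch.arange(N))
loss.backward()
```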
- Neural Congealing: Aligning Images to a Joint Semantic Atlas [14.348512536556413]
We present a zero-shot self-supervised framework for aligning semantically-common content across a set of images.
Our approach harnesses the power of pre-trained DINO-ViT features to learn a joint semantic atlas and the mappings that align each image to it.
We show that our method performs favorably compared to a state-of-the-art method that requires extensive training on large-scale datasets.
arXiv Detail & Related papers (2023-02-08T09:26:22Z)
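A minimal sketch of "harnessing pre-trained DINO-ViT features": extract patch tokens from two images with a frozen DINO ViT and find nearest-neighbour patches, the kind of semantic signal the atlas and alignments are learned from. The model id and helper come from the public facebookresearch/dino repository; the atlas optimization itself is not shown.

```python
import torch
import torch.nn.functional as F

# Frozen self-supervised ViT from the public DINO repo (downloaded via torch.hub).
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits8")
dino.eval()

imgs = torch.rand(2, 3, 224, 224)  # two images assumed to share semantic content

with torch.no_grad():
    tokens = dino.get_intermediate_layers(imgs, n=1)[0]     # (2, 1 + num_patches, dim)
patches = F.normalize(tokens[:, 1:, :], dim=-1)             # drop the CLS token

# Cosine similarity between every patch of image 0 and every patch of image 1;
# row-wise argmax gives nearest-neighbour semantic correspondences.
sim = patches[0] @ patches[1].t()
matches = sim.argmax(dim=1)
print(matches.shape)   # one matching patch index in image 1 per patch of image 0
```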
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve performance on both tasks.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
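A simplified stand-in for the retrieval modules described above: look up related passages for the input text (and, analogously, for the input image) in a knowledge corpus and prepend them as extra context before running the NER/RE model. TF-IDF replaces the paper's retrievers purely for illustration; the corpus and sentence are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_corpus = [
    "Lionel Messi is an Argentine footballer who plays for Inter Miami.",
    "The Eiffel Tower is a landmark in Paris, France.",
    "Apple Inc. is a technology company headquartered in Cupertino.",
]

def retrieve(query, corpus, top_k=1):
    """Return the top-k corpus passages most similar to the query (toy TF-IDF retriever)."""
    vectorizer = TfidfVectorizer().fit(corpus + [query])
    scores = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(corpus))[0]
    return [corpus[i] for i in scores.argsort()[::-1][:top_k]]

sentence = "Messi scored twice for Miami last night."
context = retrieve(sentence, knowledge_corpus)
# The retrieved knowledge is concatenated with the input before tagging entities and relations.
augmented_input = " ".join(context) + " [SEP] " + sentence
print(augmented_input)
```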
- Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval [55.21569389894215]
We propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding them.
Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
arXiv Detail & Related papers (2022-10-19T11:50:14Z)
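The fusion step summarized above, photo and sketch tokens attending to each other before a shared representation is formed, can be sketched with standard multi-head cross-attention; the later distillation into independent per-modality encoders is only noted in comments. Dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Photo tokens attend to sketch tokens and vice versa; outputs are pooled and fused."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.photo_to_sketch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sketch_to_photo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, photo_tokens, sketch_tokens):
        p, _ = self.photo_to_sketch(photo_tokens, sketch_tokens, sketch_tokens)
        s, _ = self.sketch_to_photo(sketch_tokens, photo_tokens, photo_tokens)
        fused = torch.cat([p.mean(dim=1), s.mean(dim=1)], dim=-1)
        return self.proj(fused)   # fused representation that unifies both modalities

fusion = CrossModalFusion()
photo, sketch = torch.randn(4, 196, 256), torch.randn(4, 64, 256)
teacher_emb = fusion(photo, sketch)
# In a second stage, independent photo-only and sketch-only encoders would be trained to
# mimic these fused embeddings via contrastive and relational knowledge distillation.
```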
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
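The "reinforcement learning without direct supervision" idea above can be illustrated with a toy policy that samples caption words and is rewarded by CLIP image-text similarity via REINFORCE. The real system fine-tunes a pretrained language model; the toy vocabulary, five-word policy, and image path below are placeholders, while the CLIP calls follow the openai/CLIP package.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

vocab = ["a", "dog", "cat", "running", "sleeping", "on", "grass", "sofa"]
logits = torch.zeros(5, len(vocab), device=device, requires_grad=True)  # toy 5-word "policy"
optimizer = torch.optim.Adam([logits], lr=0.1)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)   # placeholder image path
with torch.no_grad():
    image_emb = clip_model.encode_image(image).float()
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

dist = torch.distributions.Categorical(logits=logits)
words = dist.sample()                                # sample one word per position
caption = " ".join(vocab[i] for i in words.tolist())

with torch.no_grad():
    text_emb = clip_model.encode_text(clip.tokenize([caption]).to(device)).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
reward = (image_emb * text_emb).sum()                # CLIP similarity acts as the alignment reward

loss = -(reward.detach() * dist.log_prob(words).sum())   # REINFORCE: no caption labels needed
optimizer.zero_grad()
loss.backward()
optimizer.step()
```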
- Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis [39.53519330457627]
We propose a novel task of joint few-shot recognition and novel-view synthesis.
We aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints.
We focus on the interaction and cooperation between a generative model and a discriminative model.
arXiv Detail & Related papers (2020-08-16T19:40:56Z)
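A toy sketch of the generator-classifier cooperation described in the Bowtie Networks summary: a view-conditioned generator synthesizes additional views of the few-shot classes, and a single joint loss over real and synthesized samples trains both networks. All modules, shapes, and the view code are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a view-conditioned generator and a few-shot classifier trained together.
generator = nn.Sequential(nn.Linear(64 + 3, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32))
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 5))
opt = torch.optim.Adam(list(generator.parameters()) + list(classifier.parameters()), lr=1e-3)

few_shot_images = torch.rand(5, 3, 32, 32)          # one image per class (5-way, 1-shot)
labels = torch.arange(5)

# The generator hallucinates the same objects from new viewpoints (a 3-d view code here);
# the synthesized samples augment the classifier, whose loss also updates the generator.
codes = torch.randn(5, 64)
new_views = torch.rand(5, 3)
synth = generator(torch.cat([codes, new_views], dim=-1)).view(5, 3, 32, 32)

loss = F.cross_entropy(classifier(few_shot_images), labels) \
     + F.cross_entropy(classifier(synth), labels)
opt.zero_grad()
loss.backward()
opt.step()
```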
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.