Equivariant Similarity for Vision-Language Foundation Models
- URL: http://arxiv.org/abs/2303.14465v2
- Date: Mon, 9 Oct 2023 16:55:08 GMT
- Title: Equivariant Similarity for Vision-Language Foundation Models
- Authors: Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang,
Hanwang Zhang, Zicheng Liu, Lijuan Wang
- Abstract summary: This study focuses on the multimodal similarity function that is not only the major training objective but also the core delivery to support downstream tasks.
We propose EqSim, a regularization loss that can be efficiently calculated from any two matched training pairs.
Compared to the existing evaluation sets, EqBen is the first to focus on "visual-minimal change".
- Score: 134.77524524140168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study explores the concept of equivariance in vision-language foundation
models (VLMs), focusing specifically on the multimodal similarity function that
is not only the major training objective but also the core delivery to support
downstream tasks. Unlike the existing image-text similarity objective which
only categorizes matched pairs as similar and unmatched pairs as dissimilar,
equivariance also requires similarity to vary faithfully according to the
semantic changes. This allows VLMs to generalize better to nuanced and unseen
multimodal compositions. However, modeling equivariance is challenging as the
ground truth of semantic change is difficult to collect. For example, given an
image-text pair about a dog, it is unclear to what extent the similarity
should change when the pixels are edited from dog to cat. To this end, we propose
EqSim, a regularization loss that can be efficiently calculated from any two
matched training pairs and is easily plugged into existing image-text retrieval
fine-tuning. Meanwhile, to further diagnose the equivariance of VLMs, we
present a new challenging benchmark EqBen. Compared to the existing evaluation
sets, EqBen is the first to focus on "visual-minimal change". Extensive
experiments show the lack of equivariance in current VLMs and validate the
effectiveness of EqSim. Code is available at https://github.com/Wangt-CN/EqBen.
Related papers
- Relaxed Equivariance via Multitask Learning [7.905957228045955]
We introduce REMUL, a training procedure for approximating equivariance with multitask learning.
We show that unconstrained models can learn approximate symmetries by minimizing an additional simple equivariance loss.
Our method achieves competitive performance compared to equivariant baselines while being $10\times$ faster at inference and $2.5\times$ faster at training.
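As a concrete, hypothetical illustration of such a "simple equivariance loss", one common soft-equivariance penalty compares f(g·x) against g·f(x); the sketch below uses a 90-degree rotation as the transformation and is not REMUL's actual implementation.

```python
import torch

def soft_equivariance_loss(model, x: torch.Tensor) -> torch.Tensor:
    """Generic soft-equivariance penalty ||f(g.x) - g.f(x)||^2 with
    g = 90-degree rotation. Illustrative only; REMUL's multitask
    weighting and choice of transformations may differ. Assumes
    `model` maps images to images, e.g. (N, C, H, W) -> (N, C, H, W).
    """
    rot = lambda t: torch.rot90(t, k=1, dims=(-2, -1))
    out_of_rotated = model(rot(x))   # f(g . x)
    rotated_output = rot(model(x))   # g . f(x)
    return torch.mean((out_of_rotated - rotated_output) ** 2)

# Multitask-style total objective (lam is a hypothetical weight):
# total = task_loss + lam * soft_equivariance_loss(model, x)
```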
arXiv Detail & Related papers (2024-10-23T13:50:27Z)
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- CARL: A Framework for Equivariant Image Registration [17.976933318883333]
Image registration estimates spatial correspondences between a pair of images.
Formally, the estimator should be equivariant to a desired class of image transformations.
We show how to achieve multi-step $[W,U]$ equivariance via a coordinate-attention mechanism combined with displacement-predicting refinement layers.
arXiv Detail & Related papers (2024-05-27T01:06:58Z)
- Self-Supervised Learning for Group Equivariant Neural Networks [75.62232699377877]
Group equivariant neural networks are models whose structure is constrained to commute with the transformations of the input.
We propose two concepts for self-supervised tasks: equivariant pretext labels and invariant contrastive loss.
Experiments on standard image recognition benchmarks demonstrate that the equivariant neural networks exploit the proposed self-supervised tasks.
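The two concepts are only named in this summary. As a generic illustration of the second, an invariant contrastive loss in the standard InfoNCE style can be sketched as follows; this is the textbook formulation, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def invariant_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """Standard InfoNCE-style invariant contrastive loss (a generic
    sketch of the named concept, not necessarily the paper's loss).
    z1, z2: (N, D) features of two augmented views of the same images;
    matching rows are positives, all other rows are negatives.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (N, N) similarities
    labels = torch.arange(z1.shape[0], device=z1.device)  # diagonal
    return F.cross_entropy(logits, labels)
```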
arXiv Detail & Related papers (2023-03-08T08:11:26Z)
- The Lie Derivative for Measuring Learned Equivariance [84.29366874540217]
We study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures.
We find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities.
Surprisingly, transformers can be more equivariant than convolutional neural networks after training.
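To make "measuring learned equivariance" concrete, below is a toy finite-difference probe in the spirit of the Lie derivative, specialized to one-pixel horizontal translation; the paper's estimator handles continuous transformations and is considerably more careful than this sketch.

```python
import torch

def translation_equivariance_error(model, x: torch.Tensor) -> torch.Tensor:
    """Toy finite-difference probe inspired by the Lie derivative under
    horizontal translation (NOT the paper's estimator). For a model whose
    output should be invariant to translation (e.g. classifier logits),
    a value near zero means the symmetry is respected after training.
    """
    shifted = torch.roll(x, shifts=1, dims=-1)   # one-pixel translation
    # Difference quotient over a unit step of the group action
    # (step size is 1 pixel, so no extra division is needed).
    return ((model(shifted) - model(x)) ** 2).mean()
```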
arXiv Detail & Related papers (2022-10-06T15:20:55Z)
- Quantised Transforming Auto-Encoders: Achieving Equivariance to Arbitrary Transformations in Deep Networks [23.673155102696338]
Convolutional Neural Networks (CNNs) are equivariant to image translation.
We propose an auto-encoder architecture whose embedding obeys an arbitrary set of equivariance relations simultaneously.
We demonstrate results of successful re-rendering of transformed versions of input images on several datasets.
arXiv Detail & Related papers (2021-11-25T02:26:38Z)
- Semantic Distribution-aware Contrastive Adaptation for Semantic Segmentation [50.621269117524925]
Domain adaptive semantic segmentation refers to making predictions on a certain target domain with only annotations of a specific source domain.
We present a semantic distribution-aware contrastive adaptation algorithm that enables pixel-wise representation alignment.
We evaluate SDCA on multiple benchmarks, achieving considerable improvements over existing algorithms.
arXiv Detail & Related papers (2021-05-11T13:21:25Z)
- Unsupervised Feature Learning by Cross-Level Instance-Group Discrimination [68.83098015578874]
We integrate between-instance similarity into contrastive learning, not directly by instance grouping, but by cross-level discrimination.
CLD effectively brings unsupervised learning closer to natural data and real-world applications.
CLD sets a new state of the art on self-supervision, semi-supervision, and transfer learning benchmarks, and beats MoCo v2 and SimCLR on every reported metric.
arXiv Detail & Related papers (2020-08-09T21:13:13Z)
- Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases [34.02639091680309]
Recent gains in performance come from training instance classification models, treating each image and its augmented versions as samples of a single class.
First, we demonstrate that approaches like MoCo and PIRL learn occlusion-invariant representations.
Second, these approaches obtain further gains from access to a clean object-centric training dataset like ImageNet.
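The occlusion-invariance finding suggests a simple, hypothetical probe like the one below: compare features of an image before and after masking a patch. The paper's actual evaluation protocol is different; this only illustrates the claim.

```python
import torch
import torch.nn.functional as F

def occlusion_invariance_score(encoder, x: torch.Tensor,
                               patch: int = 64) -> torch.Tensor:
    """Toy probe for occlusion invariance (illustrative; not the
    paper's protocol). Returns the mean cosine similarity between
    features of the original images and copies with a corner patch
    zeroed out; values near 1 suggest occlusion-invariant features.
    """
    x_occluded = x.clone()
    x_occluded[..., :patch, :patch] = 0.0    # mask the top-left patch
    z = encoder(x).flatten(1)                # (N, D) features
    z_occ = encoder(x_occluded).flatten(1)
    return F.cosine_similarity(z, z_occ, dim=1).mean()
```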
arXiv Detail & Related papers (2020-07-28T00:11:31Z)
- Scale Equivariance Improves Siamese Tracking [1.7188280334580197]
Siamese trackers turn tracking into similarity estimation between a template and the candidate regions in the frame.
Non-translation-equivariant architectures induce a positional bias during training.
We present SE-SiamFC, a scale-equivariant variant of SiamFC built according to this recipe.
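For context, the similarity estimation at the core of SiamFC-style trackers is a cross-correlation between template and search-region features, as in the generic sketch below; SE-SiamFC's contribution, the scale-equivariant feature extractor, is not shown here.

```python
import torch
import torch.nn.functional as F

def siamese_similarity_map(feat_template: torch.Tensor,
                           feat_search: torch.Tensor) -> torch.Tensor:
    """Cross-correlation response map used by SiamFC-style trackers
    (generic sketch, not the SE-SiamFC implementation).
    feat_template: (C, h, w) template features used as the kernel.
    feat_search:   (C, H, W) search-region features.
    Returns a (1, 1, H-h+1, W-w+1) similarity map whose peak locates
    the template inside the search region.
    """
    return F.conv2d(feat_search.unsqueeze(0), feat_template.unsqueeze(0))
```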
arXiv Detail & Related papers (2020-07-17T16:55:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.