Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?
- URL: http://arxiv.org/abs/2511.19200v2
- Date: Tue, 25 Nov 2025 10:49:13 GMT
- Title: Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?
- Authors: Itay Cohen, Ethan Fetaya, Amir Rosenfeld
- Abstract summary: We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval.
- Score: 10.10422216411379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix captioner.
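As a rough illustration of the two ideas in the abstract, the sketch below implements a paired "real"/"lookalike" prompt baseline and a simple mean-difference estimate of a real-to-lookalike direction in CLIP's embedding space. This is not the authors' released code: the checkpoint name, the prompt templates, and the mean-difference estimator are assumptions made for illustration only.

```python
# Minimal sketch, assuming a Hugging Face CLIP checkpoint and simple prompt templates;
# the paper's actual prompts and direction estimator may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_texts(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# (1) Prompt-based baseline: score an image against paired "real"/"lookalike" prompts.
def real_or_lookalike(image_path, category):
    prompts = [f"a photo of a real {category}",
               f"a photo of a lookalike of a {category}"]
    img = embed_images([image_path])      # (1, d)
    txt = embed_texts(prompts)            # (2, d)
    sims = (img @ txt.T).squeeze(0)       # cosine similarities to the two prompts
    return "real" if sims[0] > sims[1] else "lookalike"

# (2) A direction between real and lookalike: one simple estimate is the mean
# difference between lookalike and real exemplar embeddings (an assumption here).
def estimate_direction(real_paths, lookalike_paths):
    d = embed_images(lookalike_paths).mean(0) - embed_images(real_paths).mean(0)
    return d / d.norm()

# Shifting an image or text embedding along this direction (and renormalizing)
# can then be used before cross-modal retrieval, as the abstract describes.
def shift_towards_real(embedding, direction, alpha=1.0):
    shifted = embedding - alpha * direction
    return shifted / shifted.norm(dim=-1, keepdim=True)
```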
Related papers
- Discovering Divergent Representations between Text-to-Image Models [87.40710629963264]
We investigate when and how visual representations learned by two different generative models diverge. We introduce CompCon, an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other. We use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets.
arXiv Detail & Related papers (2025-09-10T19:07:55Z)
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that text descriptions of differences between images correspond to their difference in image embedding space. Our approach yields significantly improved capabilities in ranking images by a certain attribute, and improved zero-shot classification performance on many downstream image classification tasks.
arXiv Detail & Related papers (2024-09-15T13:02:14Z)
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias [34.005902280160356]
We propose a novel framework to generate synthetic counterfactual images that can be used to fine-tune CLIP.
We show that our fine-tuned CLIP model, $CF_\alpha$, improves key fairness metrics such as MaxSkew, MinSkew, and NDKL by 40-66% for image retrieval tasks.
arXiv Detail & Related papers (2024-06-17T08:42:19Z)
- Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization [40.5076868823241]
We introduce a new dataset of adjacent image triplets obtained from a viewpoint trajectory.
We benchmark both semantic classification and pose estimation accuracies on the same visual feature.
Our experiments demonstrate that this approach helps develop a visual representation that encodes object identity.
arXiv Detail & Related papers (2024-03-22T06:04:11Z)
- Seeing the Unseen: Visual Common Sense for Semantic Placement [71.76026880991245]
Given an image and the name of an object, a vision system is asked to predict semantically meaningful regions (masks or bounding boxes) in the image where that object could be placed or is likely to be placed by humans.
We call this task Semantic Placement (SP) and believe that such common-sense visual understanding is critical for assistive robots (tidying a house) and AR devices (automatically rendering an object in the user's space).
arXiv Detail & Related papers (2024-01-15T15:28:30Z)
- Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z)
- PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts [33.109305627550405]
This paper draws inspiration from the human visual perception process.
We propose a training-free, two-step zero-shot classification method PerceptionCLIP.
Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability.
arXiv Detail & Related papers (2023-08-02T17:57:25Z)
- Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning.
We study generative VLMs that are trained for next-word generation given an image.
We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
arXiv Detail & Related papers (2023-06-02T19:19:43Z)
- Siamese Image Modeling for Self-Supervised Vision Representation Learning [73.78790119050056]
Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks.
Two mainstream SSL frameworks have been proposed: Instance Discrimination (ID) and Masked Image Modeling (MIM).
We propose Siamese Image Modeling (SIM), which predicts the dense representations of an augmented view.
arXiv Detail & Related papers (2022-06-02T17:59:58Z)
- Adaptive Semantic-Visual Tree for Hierarchical Embeddings [67.01307058209709]
We propose a hierarchical adaptive semantic-visual tree to depict the architecture of merchandise categories.
The tree evaluates semantic similarities between different semantic levels and visual similarities within the same semantic class simultaneously.
At each level, we set different margins based on the semantic hierarchy and incorporate them as prior information to learn a fine-grained feature embedding.
arXiv Detail & Related papers (2020-03-08T03:36:42Z)
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss.
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.