fruit-SALAD: A Style Aligned Artwork Dataset to reveal similarity perception in image embeddings
- URL: http://arxiv.org/abs/2406.01278v1
- Date: Mon, 3 Jun 2024 12:47:48 GMT
- Title: fruit-SALAD: A Style Aligned Artwork Dataset to reveal similarity perception in image embeddings
- Authors: Tillmann Ohm, Andres Karjus, Mikhail Tamm, Maximilian Schich
- Abstract summary: We introduce Style Aligned Artwork Datasets (SALADs).
This combined semantic category and style benchmark comprises 100 instances each of 10 easy-to-recognize fruit categories, across 10 easily distinguishable styles.
The SALAD framework allows comparing how these models perform semantic category and style recognition tasks, moving beyond anecdotal knowledge.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The notion of visual similarity is essential for computer vision and for applications and studies revolving around vector embeddings of images. However, the scarcity of benchmark datasets poses a significant hurdle in exploring how these models perceive similarity. Here we introduce Style Aligned Artwork Datasets (SALADs), exemplified by fruit-SALAD, a set of 10,000 images of fruit depictions. This combined semantic category and style benchmark comprises 100 instances each of 10 easy-to-recognize fruit categories, across 10 easily distinguishable styles. Leveraging a systematic pipeline of generative image synthesis, this visually diverse yet balanced benchmark demonstrates salient differences in semantic category and style similarity weights across various computational models, including machine learning models, feature extraction algorithms, and complexity measures, as well as conceptual models for reference. This meticulously designed dataset offers a controlled and balanced platform for the comparative analysis of similarity perception. The SALAD framework allows comparing how these models perform semantic category and style recognition tasks, moving beyond anecdotal knowledge toward robust quantification and qualitative interpretation.
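As a rough sketch of the kind of analysis such a benchmark enables, the snippet below compares mean within-category against mean within-style cosine similarity over embeddings arranged in the fruit-SALAD grid of 10 categories x 10 styles x 100 instances. The random vectors stand in for the output of whichever embedding model is under study; the function names are illustrative, not from the paper's code.

```python
import numpy as np

def mean_cosine(a: np.ndarray) -> float:
    """Mean pairwise cosine similarity among the rows of an (n, d) matrix."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    sims = a @ a.T
    n = len(a)
    return (sims.sum() - n) / (n * (n - 1))  # exclude the n self-similarities

def category_vs_style_weights(embeddings: np.ndarray) -> tuple[float, float]:
    """embeddings: (10 categories, 10 styles, 100 instances, d), mirroring
    the fruit-SALAD layout of 10,000 images."""
    c, s, k, d = embeddings.shape
    within_category = np.mean([mean_cosine(embeddings[i].reshape(-1, d))
                               for i in range(c)])
    within_style = np.mean([mean_cosine(embeddings[:, j].reshape(-1, d))
                            for j in range(s)])
    return within_category, within_style

# Random vectors standing in for real model embeddings:
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 10, 100, 64))
print(category_vs_style_weights(emb))
```

A model that weights semantics over style would score higher on the first value than the second; comparing these two numbers across models is the comparison the benchmark is designed to support.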
Related papers
- Safire: Similarity Framework for Visualization Retrieval [7.468307805375097]
We introduce the Similarity Framework for Visualization Retrieval (Safire), a conceptual model that frames visualization similarity along two dimensions: comparison criteria and representation modalities.
We analyze several visualization retrieval systems using Safire to demonstrate its practical value in clarifying similarity considerations.
arXiv Detail & Related papers (2025-10-18T23:11:40Z)
- Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation [120.23172120151821]
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models.
We introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences.
We propose a new metric, Visual Semantic Matching, that quantifies visual inconsistencies in subject-driven image generation.
arXiv Detail & Related papers (2025-09-26T07:11:55Z)
- Multimodal Representation Alignment for Cross-modal Information Retrieval [12.42313654539524]
Different machine learning models can represent the same underlying concept in different ways.
This variability is particularly valuable for in-the-wild multimodal retrieval, where the objective is to identify the corresponding representation in one modality given another as input.
In this work, we first investigate the geometric relationships between visual and textual embeddings derived from both vision-language models and combined unimodal models.
We then align these representations using four standard similarity metrics as well as two learned ones, implemented via neural networks.
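A minimal sketch of the learned-alignment idea, assuming a small MLP trained to map text embeddings into the image-embedding space with a cosine objective; the paper evaluates several standard and learned similarity metrics, and this is just one plausible instantiation.

```python
import torch
import torch.nn as nn

class AlignmentMLP(nn.Module):
    """Illustrative learned map from text-embedding space to image-embedding
    space; the architecture and loss are assumptions, not the paper's setup."""
    def __init__(self, d_text: int, d_image: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_text, hidden), nn.ReLU(), nn.Linear(hidden, d_image))

    def forward(self, x):
        return self.net(x)

def train_step(model, optimizer, text_emb, image_emb):
    """Maximize cosine similarity between aligned text and image embeddings."""
    optimizer.zero_grad()
    aligned = model(text_emb)
    loss = 1 - nn.functional.cosine_similarity(aligned, image_emb).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```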
arXiv Detail & Related papers (2025-06-10T13:16:26Z)
- DiffSim: Taming Diffusion Models for Evaluating Visual Similarity [19.989551230170584]
This paper introduces the DiffSim method to measure visual similarity in generative models.
By aligning features in the attention layers of the denoising U-Net, DiffSim evaluates both appearance and style similarity.
We also introduce the Sref and IP benchmarks to evaluate visual similarity at the level of style and instance.
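A simplified stand-in for this kind of feature-level scoring: cosine similarity between two intermediate feature maps, averaged over spatial positions. DiffSim's actual alignment operates inside the attention layers of the denoising U-Net; here the feature maps are assumed to come from some intermediate layer of any model.

```python
import torch

def feature_cosine_similarity(feat_a: torch.Tensor, feat_b: torch.Tensor) -> float:
    """Cosine similarity between two (C, H, W) feature maps, averaged over
    spatial positions -- a simplified stand-in for DiffSim's attention-layer
    feature alignment."""
    a = feat_a.flatten(1).T  # (H*W, C): one feature vector per position
    b = feat_b.flatten(1).T
    sims = torch.nn.functional.cosine_similarity(a, b, dim=1)
    return sims.mean().item()
```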
arXiv Detail & Related papers (2024-12-19T07:00:03Z)
- Training objective drives the consistency of representational similarity across datasets [19.99817888941361]
The Platonic Representation Hypothesis claims that recent foundation models are converging to a shared representation space as a function of their downstream task performance.
Here, we propose a systematic way to measure how representational similarity between models varies with the set of stimuli used to construct the representations.
We find that the objective function is the most crucial factor in determining the consistency of representational similarities across datasets.
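Representational similarity between two models over a common stimulus set is often quantified with linear centered kernel alignment (CKA); the sketch below implements that standard metric as an illustration of the measurement involved, without claiming it is the paper's exact protocol.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between representations X (n, d1) and Y (n, d2) of the
    same n stimuli; 1.0 means identical up to linear transformation."""
    X = X - X.mean(axis=0)  # center features
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```

Computing this score for the same model pair over several stimulus sets is one way to operationalize "consistency of representational similarity across datasets".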
arXiv Detail & Related papers (2024-11-08T13:35:45Z)
- Visual Motif Identification: Elaboration of a Curated Comparative Dataset and Classification Methods [4.431754853927668]
In cinema, visual motifs are recurrent iconographic compositions that carry artistic or aesthetic significance.
Our goal is to recognise and classify these motifs using a new machine learning model trained on a custom dataset built for that purpose.
We show how features extracted from a CLIP model can be leveraged by using a shallow network and an appropriate loss to classify images into 20 different motifs, with surprisingly good results.
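A minimal sketch of that setup, assuming frozen 512-dimensional CLIP image features and a two-layer head trained with cross-entropy over the 20 motif classes; the paper's exact head and loss may differ.

```python
import torch
import torch.nn as nn

NUM_MOTIFS = 20
# Shallow head over frozen CLIP image features (512-d for ViT-B/32).
head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, NUM_MOTIFS))
criterion = nn.CrossEntropyLoss()

def train_step(optimizer, clip_features, labels):
    """clip_features: (batch, 512) precomputed CLIP embeddings; labels: (batch,)."""
    optimizer.zero_grad()
    loss = criterion(head(clip_features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```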
arXiv Detail & Related papers (2024-10-21T10:50:00Z)
- Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
- Self-similarity Driven Scale-invariant Learning for Weakly Supervised Person Search [66.95134080902717]
We propose a novel one-step framework, named Self-similarity driven Scale-invariant Learning (SSL).
We introduce a Multi-scale Exemplar Branch to guide the network in concentrating on the foreground and learning scale-invariant features.
Experiments on PRW and CUHK-SYSU databases demonstrate the effectiveness of our method.
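One simplified reading of scale-invariant feature learning is a consistency objective between an image and a rescaled copy of itself, sketched below; the paper's Multi-scale Exemplar Branch is considerably more involved, so treat this as an illustration of the invariance idea only.

```python
import torch
import torch.nn.functional as F

def scale_consistency_loss(encoder, images: torch.Tensor, scale: float = 0.5):
    """Encourage scale-invariant embeddings by matching an image with a
    rescaled copy of itself (cosine distance between normalized features)."""
    small = F.interpolate(images, scale_factor=scale, mode="bilinear",
                          align_corners=False)
    z1 = F.normalize(encoder(images), dim=1)
    z2 = F.normalize(encoder(small), dim=1)
    return (1 - (z1 * z2).sum(dim=1)).mean()
```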
arXiv Detail & Related papers (2023-02-25T04:48:11Z)
- Learning an Adaptation Function to Assess Image Visual Similarities [0.0]
We focus here on the specific task of learning visual image similarities when analogy matters.
We propose to compare different supervised, semi-supervised and self-supervised networks, pre-trained on datasets of different scales and contents.
Our experiments on the Totally Looks Like image dataset highlight the value of our method, increasing the best model's retrieval score @1 by a factor of 2.25.
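A hedged sketch of the general recipe: a small adaptation layer over frozen pre-trained features, trained with a triplet loss and evaluated by retrieval @1 on index-matched query/gallery pairs, as in Totally Looks Like; the dimensions and loss are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Adaptation layer over frozen backbone features (assumed 2048-d).
adapter = nn.Linear(2048, 256)
triplet = nn.TripletMarginLoss(margin=0.2)

def train_step(optimizer, anchor, positive, negative):
    """One triplet update on frozen backbone features of shape (batch, 2048)."""
    optimizer.zero_grad()
    loss = triplet(adapter(anchor), adapter(positive), adapter(negative))
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def recall_at_1(query_feats, gallery_feats):
    """Fraction of queries whose nearest gallery item (by cosine) sits at the
    same index, i.e. is the intended match."""
    q = F.normalize(adapter(query_feats), dim=1)
    g = F.normalize(adapter(gallery_feats), dim=1)
    nearest = (q @ g.T).argmax(dim=1)
    return (nearest == torch.arange(len(q))).float().mean().item()
```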
arXiv Detail & Related papers (2022-06-03T07:15:00Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
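The generic scoring step behind zero-shot action recognition is sketched below: a video embedding is matched against text embeddings of unseen class names by cosine similarity. The paper's contribution is the joint video-text Transformer that produces these embeddings end-to-end; this snippet only shows the final classification rule.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(video_emb: torch.Tensor, label_embs: torch.Tensor) -> int:
    """Pick the unseen action class whose text embedding is closest (cosine)
    to the video embedding. video_emb: (d,), label_embs: (num_classes, d)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(label_embs, dim=-1)
    return int(torch.argmax(t @ v))
```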
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- IMACS: Image Model Attribution Comparison Summaries [16.80986701058596]
We introduce IMACS, a method that combines gradient-based model attributions with aggregation and visualization techniques.
IMACS extracts salient input features from an evaluation dataset, clusters them based on similarity, then visualizes differences in model attributions for similar input features.
We show how our technique can uncover behavioral differences caused by domain shift between two models trained on satellite images.
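A much-simplified sketch of the pipeline's shape: gradient attributions per model, clustering of similar attributions, and a per-cluster comparison between two models. IMACS itself supports richer attribution, aggregation, and visualization methods, so everything below is illustrative.

```python
import torch
from sklearn.cluster import KMeans

def input_gradients(model, images: torch.Tensor, targets: torch.Tensor):
    """Simple gradient-based saliency: |d score_target / d input|, averaged
    over channels; returns a (batch, H, W) attribution map per image."""
    images = images.clone().requires_grad_(True)
    scores = model(images).gather(1, targets[:, None]).sum()
    scores.backward()
    return images.grad.abs().mean(dim=1)

def cluster_and_compare(attr_a: torch.Tensor, attr_b: torch.Tensor, k: int = 5):
    """Cluster model A's flattened attributions, then report the mean
    attribution gap to model B within each cluster."""
    flat_a = attr_a.flatten(1).numpy()
    labels = torch.from_numpy(
        KMeans(n_clusters=k, n_init=10).fit_predict(flat_a))
    gaps = (attr_a - attr_b).abs().flatten(1).mean(dim=1)
    return {c: gaps[labels == c].mean().item() for c in range(k)}
```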
arXiv Detail & Related papers (2022-01-26T21:35:14Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
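The image-level contrastive component can be illustrated with the standard InfoNCE objective, where matched image pairs are positives and all other pairings in the batch serve as negatives; the paper's multi-level scheme adds further losses on top of this generic form.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07):
    """Image-level InfoNCE: (z1[i], z2[i]) are positive pairs, every other
    pairing in the batch is a negative. z1, z2: (batch, d) embeddings."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau          # (batch, batch) similarity matrix
    targets = torch.arange(len(z1))   # the diagonal holds the positives
    return F.cross_entropy(logits, targets)
```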
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Image Synthesis via Semantic Composition [74.68191130898805]
We present a novel approach to synthesize realistic images based on their semantic layouts.
It hypothesizes that objects with similar appearance share similar representations.
Our method establishes dependencies between regions according to their appearance correlation, yielding both spatially variant and associated representations.
arXiv Detail & Related papers (2021-09-15T02:26:07Z)
- Towards Visually Explaining Similarity Models [29.704524987493766]
We present a method to generate gradient-based visual attention for image similarity predictors.
By relying solely on the learned feature embedding, we show that our approach can be applied to any kind of CNN-based similarity architecture.
We show that our resulting attention maps serve more than just interpretability; they can be infused into the model learning process itself with new trainable constraints.
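A hedged sketch of gradient-based attention for a similarity predictor: backpropagate the cosine similarity of two pooled embeddings to the feature maps of one input and weight the channels by their gradients, Grad-CAM style. The pooling and weighting choices here are assumptions, not the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def similarity_attention(feature_extractor, img_a, img_b):
    """Attention map for img_a under a similarity score with img_b:
    gradients of the score w.r.t. img_a's feature maps weight the channels."""
    feats = feature_extractor(img_a)          # (1, C, H, W), requires grad
    feats.retain_grad()
    emb_a = feats.mean(dim=(2, 3))            # simple spatially-pooled embedding
    emb_b = feature_extractor(img_b).mean(dim=(2, 3))
    score = F.cosine_similarity(emb_a, emb_b).sum()
    score.backward()
    weights = feats.grad.mean(dim=(2, 3), keepdim=True)   # channel importances
    return F.relu((weights * feats).sum(dim=1)).detach()  # (1, H, W) map
```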
arXiv Detail & Related papers (2020-08-13T17:47:41Z)
- Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
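The hypercolumn-composition step can be sketched as upsampling a chosen subset of layer feature maps to a common resolution and concatenating them along channels; what Dynamic Hyperpixel Flow actually learns is the per-image-pair layer selection, which is passed in explicitly below.

```python
import torch
import torch.nn.functional as F

def compose_hypercolumn(layer_feats: list[torch.Tensor],
                        selected: list[int], size: tuple[int, int]):
    """Build a hypercolumn from the selected layers' (batch, C_i, H_i, W_i)
    feature maps: upsample each to `size`, then concatenate along channels.
    The learned part of Dynamic Hyperpixel Flow is choosing `selected`."""
    maps = [F.interpolate(layer_feats[i], size=size, mode="bilinear",
                          align_corners=False) for i in selected]
    return torch.cat(maps, dim=1)  # (batch, sum of C_i, H, W)
```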
arXiv Detail & Related papers (2020-07-21T04:03:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.