Synthesis and Perceptual Scaling of High Resolution Naturalistic Images Using Stable Diffusion
- URL: http://arxiv.org/abs/2410.13034v2
- Date: Wed, 17 Sep 2025 16:19:18 GMT
- Title: Synthesis and Perceptual Scaling of High Resolution Naturalistic Images Using Stable Diffusion
- Authors: Leonardo Pettini, Carsten Bogler, Christian Doeller, John-Dylan Haynes
- Abstract summary: We create a stimulus set of photorealistic images characterized by gradual transitions. For each object scene, we generate 10 variants that are ordered along a perceptual continuum. This ordering is also predictive of the confusability of stimuli in a working memory experiment.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Naturalistic scenes are of key interest for visual perception, but controlling their perceptual and semantic properties is challenging. Previous work on naturalistic scenes has frequently focused on collections of discrete images with considerable physical differences between stimuli. However, it is often desirable to assess representations of naturalistic images that vary along a continuum. Traditionally, perceptually continuous variations of naturalistic stimuli have been obtained by morphing a source image into a target image. This produces transitions driven mainly by low-level physical features and can result in semantically ambiguous outcomes. More recently, generative adversarial networks (GANs) have been used to generate continuous perceptual variations within a stimulus category. Here we extend and generalize this approach using a different machine learning approach, a text-to-image diffusion model (Stable Diffusion XL), to generate a freely customizable stimulus set of photorealistic images that are characterized by gradual transitions, with each image representing a unique exemplar within a prompted category. We demonstrate the approach by generating a set of 108 object scenes from 6 categories. For each object scene, we generate 10 variants that are ordered along a perceptual continuum. This ordering was first estimated using a machine learning model of perceptual similarity (LPIPS) and then subsequently validated with a large online sample of human participants. In a subsequent experiment we show that this ordering is also predictive of confusability of stimuli in a working memory experiment. Our image set is suited for studies investigating the graded encoding of naturalistic stimuli in visual perception, attention, and memory.
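The abstract's core procedure, ordering generated variants along a perceptual continuum by their similarity to a reference exemplar, can be sketched as follows. This is a minimal illustration, not the authors' code: the paper uses LPIPS as the similarity model, while here a plain mean-squared pixel distance stands in so the example runs without deep-learning dependencies, and the helper names are hypothetical.

```python
def pixel_distance(img_a, img_b):
    """Mean squared difference between two equally sized grayscale images.

    A crude stand-in for a perceptual metric such as LPIPS: lower values
    mean the images are more similar.
    """
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    return sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)

def order_variants(reference, variants):
    """Sort variants from most to least similar to the reference image,
    yielding an ordering along a (here, pixel-based) continuum."""
    return sorted(variants, key=lambda v: pixel_distance(reference, v))
```

In the paper itself the distances come from LPIPS and the resulting ordering was subsequently validated against human similarity judgments; the sorting logic is the same.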
Related papers
- Detecting Generated Images by Fitting Natural Image Distributions [75.31113784234877]
We propose a novel framework that exploits geometric differences between the data manifold of natural and generated images.
We employ a pair of functions engineered to yield consistent outputs for natural images but divergent outputs for generated ones.
An image is identified as generated if a transformation along its data manifold induces a significant change in the loss value of a self-supervised model pre-trained on natural images.
arXiv Detail & Related papers (2025-11-03T07:20:38Z)
- MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models [73.20126092411776]
We conduct the first systematic study of hallucinations in multi-image MLLMs.
We propose MIHBench, a benchmark specifically tailored for evaluating object-related hallucinations across multiple images.
MIHBench comprises three core tasks: Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination.
arXiv Detail & Related papers (2025-08-01T15:49:29Z)
- Hidden Bias in the Machine: Stereotypes in Text-to-Image Models [0.0]
Text-to-Image (T2I) models have transformed visual content creation, producing highly realistic images from natural language prompts.
We curated a diverse set of prompts spanning thematic categories such as occupations, traits, actions, ideologies, emotions, family roles, place descriptions, spirituality, and life events.
For each of the 160 unique topics, we crafted multiple prompt variations to reflect a wide range of meanings and perspectives.
Our analysis reveals significant disparities in the representation of gender, race, age, somatotype, and other human-centric factors across generated images.
arXiv Detail & Related papers (2025-06-09T23:06:04Z)
- An Image-like Diffusion Method for Human-Object Interaction Detection [13.951650101149188]
The output of HOI detection for each human-object pair can be recast as an image.
In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images.
arXiv Detail & Related papers (2025-03-23T16:30:16Z)
- Estimating the distribution of numerosity and non-numerical visual magnitudes in natural scenes using computer vision [0.08192907805418582]
We show that in natural visual scenes the frequency of appearance of different numerosities follows a power law distribution.
We show that the correlational structure for numerosity and continuous magnitudes is stable across datasets and scene types.
arXiv Detail & Related papers (2024-09-17T09:49:29Z)
- Image Segmentation via Divisive Normalization: dealing with environmental diversity [0.8796261172196743]
We evaluate segmentation U-nets augmented with Divisive Normalization far from their training conditions.
We categorize scenes according to their radiance level and dynamic range (day/night), and according to their achromatic/chromatic contrasts.
Results show that neural networks with Divisive Normalization perform better in all of these scenarios.
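The Divisive Normalization operation this paper builds on divides each unit's response by a pooled measure of its neighbours' activity. The following is an illustrative sketch under simplifying assumptions (uniform pooling weights, a scalar saturation constant), not the paper's actual network layer; the function name and defaults are hypothetical.

```python
def divisive_normalization(responses, sigma=1.0, weights=None):
    """Divisively normalize a list of unit responses.

    Each response r_i is divided by (sigma + sum_j w_ij * r_j), i.e. a
    semi-saturation constant plus a weighted pool of all responses.
    Defaults to uniform pooling weights 1/n (an assumption for this sketch).
    """
    n = len(responses)
    if weights is None:
        weights = [[1.0 / n] * n for _ in range(n)]
    return [
        r / (sigma + sum(w, ) if False else r / (sigma + sum(w * x for w, x in zip(weights[i], responses))))
        for i, (r,) in enumerate(zip(responses))
    ] if False else [
        r / (sigma + sum(w * x for w, x in zip(weights[i], responses)))
        for i, r in enumerate(responses)
    ]
```

With uniform weights and identical inputs, every output equals r / (sigma + r); stronger surrounding activity suppresses each unit's normalized response.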
arXiv Detail & Related papers (2024-07-25T07:38:27Z)
- Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images.
Most existing methods directly capture the difference between the two images, risking error-prone difference features.
We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
arXiv Detail & Related papers (2024-07-16T13:00:33Z)
- FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior [50.0535198082903]
We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image.
We showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition.
arXiv Detail & Related papers (2024-07-06T03:35:43Z)
- StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images [5.529078451095096]
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision.
Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics.
Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks.
arXiv Detail & Related papers (2024-06-19T17:59:40Z)
- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process.
We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z)
- Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images [34.02058539403381]
We leverage human semantic knowledge to investigate whether it can be incorporated into frameworks for fake image detection.
A preliminary statistical analysis is conducted to explore the distinctive patterns in how humans perceive genuine and altered images.
arXiv Detail & Related papers (2024-03-13T19:56:30Z)
- Describing Images $\textit{Fast and Slow}$: Quantifying and Predicting the Variation in Human Signals during Visuo-Linguistic Processes [4.518404103861656]
We study the nature of variation in visuo-linguistic signals, and find that they correlate with each other.
Given this result, we hypothesize that variation stems partly from the properties of the images, and explore whether image representations encoded by pretrained vision encoders can capture such variation.
Our results indicate that pretrained models do so to a weak-to-moderate degree, suggesting that the models lack biases about what makes a stimulus complex for humans and what leads to variations in human outputs.
arXiv Detail & Related papers (2024-02-02T12:11:16Z)
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- Diversity and Diffusion: Observations on Synthetic Image Distributions with Stable Diffusion [6.491645162078057]
Text-to-image (TTI) systems have made it possible to create realistic images with simple text prompts.
In all of the experiments performed to date, classifiers trained solely with synthetic images perform poorly at inference.
We find four issues that limit the usefulness of TTI systems for this task: ambiguity, adherence to prompt, lack of diversity, and inability to represent the underlying concept.
arXiv Detail & Related papers (2023-10-31T18:05:15Z)
- Cones 2: Customizable Image Synthesis with Multiple Subjects [50.54010141032032]
We study how to efficiently represent a particular subject as well as how to appropriately compose different subjects.
By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image.
arXiv Detail & Related papers (2023-05-30T18:00:06Z)
- Multi-Domain Norm-referenced Encoding Enables Data Efficient Transfer Learning of Facial Expression Recognition [62.997667081978825]
We propose a biologically-inspired mechanism for transfer learning in facial expression recognition.
Our proposed architecture provides an explanation for how the human brain might innately recognize facial expressions on varying head shapes.
Our model achieves a classification accuracy of 92.15% on the FERG dataset with extreme data efficiency.
arXiv Detail & Related papers (2023-04-05T09:06:30Z)
- A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, which impacts several cognitive functions for humans.
The proposed new architecture ingests images and returns scanpaths, a sequence of points featuring a high likelihood of catching viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
- Ensembling with Deep Generative Views [72.70801582346344]
Generative models can synthesize "views" of artificial images that mimic real-world variations, such as changes in color or pose.
Here, we investigate whether such views can be applied to real images to benefit downstream analysis tasks such as image classification.
We use StyleGAN2 as the source of generative augmentations and investigate this setup on classification tasks involving facial attributes, cat faces, and cars.
arXiv Detail & Related papers (2021-04-29T17:58:35Z)
- Self-Supervised Linear Motion Deblurring [112.75317069916579]
Deep convolutional neural networks are state-of-the-art for image deblurring.
We present a differentiable reblur model for self-supervised motion deblurring.
Our experiments demonstrate that self-supervised single image deblurring is feasible.
arXiv Detail & Related papers (2020-02-10T20:15:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.