DiffAugment: Diffusion based Long-Tailed Visual Relationship Recognition
- URL: http://arxiv.org/abs/2401.01387v2
- Date: Fri, 1 Mar 2024 06:38:28 GMT
- Title: DiffAugment: Diffusion based Long-Tailed Visual Relationship Recognition
- Authors: Parul Gupta, Tuan Nguyen, Abhinav Dhall, Munawar Hayat, Trung Le and
Thanh-Toan Do
- Abstract summary: We introduce DiffAugment -- a method which augments the tail classes in the linguistic space by making use of WordNet.
We demonstrate the effectiveness of hardness-aware diffusion in generating visual embeddings for the tail classes.
We also propose a novel subject and object based seeding strategy for diffusion sampling which improves the discriminative capability of the generated visual embeddings.
- Score: 43.01467525231004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of Visual Relationship Recognition (VRR) aims to identify
relationships between two interacting objects in an image and is particularly
challenging due to the widely-spread and highly imbalanced distribution of
<subject, relation, object> triplets. To overcome the resultant performance
bias in existing VRR approaches, we introduce DiffAugment -- a method which
first augments the tail classes in the linguistic space by making use of
WordNet and then utilizes the generative prowess of Diffusion Models to expand
the visual space for minority classes. We propose a novel hardness-aware
component in diffusion which is based upon the hardness of each <S,R,O> triplet
and demonstrate the effectiveness of hardness-aware diffusion in generating
visual embeddings for the tail classes. We also propose a novel subject and
object based seeding strategy for diffusion sampling which improves the
discriminative capability of the generated visual embeddings. Extensive
experimentation on the GQA-LT dataset shows favorable gains in the
subject/object and relation average per-class accuracy using Diffusion
augmented samples.
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z) - DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets.
We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability.
Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z) - Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement [58.9768112704998]
Disentangled representation learning strives to extract the intrinsic factors within observed data.
We introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can serve as a powerful inductive bias.
This is the first work to reveal the potent disentanglement capability of diffusion models with cross-attention, requiring no complex designs.
arXiv Detail & Related papers (2024-02-15T05:07:54Z) - Bridging Generative and Discriminative Models for Unified Visual
Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions [11.121652649243119]
Diffusion models have been widely adopted in data augmentation due to their outstanding diversity in data generation.
We propose a novel approach termed the detail reinforcement diffusion model(DRDM)
It leverages the rich knowledge of large models for fine-grained data augmentation and comprises two key components including discriminative semantic recombination (DSR) and spatial knowledge reference(SKR)
arXiv Detail & Related papers (2023-09-15T01:28:59Z) - InfoDiffusion: Representation Learning Using Information Maximizing
Diffusion Models [35.566528358691336]
InfoDiffusion is an algorithm that augments diffusion models with low-dimensional latent variables.
InfoDiffusion relies on a learning objective regularized with the mutual information between observed and hidden variables.
We find that InfoDiffusion learns disentangled and human-interpretable latent representations that are competitive with state-of-the-art generative and contrastive methods.
arXiv Detail & Related papers (2023-06-14T21:48:38Z) - DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery [20.787180028571694]
DiffusionSeg is a synthesis-exploitation framework containing two-stage strategies.
We synthesize abundant images, and propose a novel training-free AttentionCut to obtain masks in the first stage.
In the second exploitation stage, to bridge the structural gap, we use the inversion technique, to map the given image back to diffusion features.
arXiv Detail & Related papers (2023-03-17T07:47:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.