Alignment-Uniformity aware Representation Learning for Zero-shot Video
Classification
- URL: http://arxiv.org/abs/2203.15381v1
- Date: Tue, 29 Mar 2022 09:21:22 GMT
- Title: Alignment-Uniformity aware Representation Learning for Zero-shot Video
Classification
- Authors: Shi Pu and Kaili Zhao and Mao Zheng
- Abstract summary: This paper presents an end-to-end framework that preserves alignment and uniformity properties for representations on both seen and unseen classes.
Experiments show that our method significantly outperforms SoTA by relative improvements of 28.1% on UCF101 and 27.0% on HMDB51.
- Score: 3.6954802719347413
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most methods tackle zero-shot video classification by aligning
visual-semantic representations within seen classes, which limits
generalization to unseen classes. To enhance model generalizability, this paper
presents an end-to-end framework that preserves alignment and uniformity
properties for representations on both seen and unseen classes. Specifically,
we formulate a supervised contrastive loss to simultaneously align
visual-semantic features (i.e., alignment) and encourage the learned features
to distribute uniformly (i.e., uniformity). Unlike existing methods that
consider only alignment, we enforce uniformity to preserve the maximal
information of existing features, which increases the probability that
unobserved features fall near observed data. Further, we synthesize features of
unseen classes with a class generator that interpolates and extrapolates the
features of seen classes. In addition, we introduce two metrics, closeness and
dispersion, to quantify the two properties and serve as new measurements of model
generalizability. Experiments show that our method significantly outperforms
SoTA by relative improvements of 28.1% on UCF101 and 27.0% on HMDB51. Code is
available.
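The alignment and uniformity properties described in the abstract can be made concrete with a minimal NumPy sketch. This is an illustrative reconstruction, not the paper's exact supervised contrastive loss: it assumes L2-normalized visual and semantic features, and the temperature `t` is a conventional choice rather than a value taken from the paper.

```python
import numpy as np

def normalize(x):
    """L2-normalize each row (feature vector) onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment_loss(visual, semantic):
    """Mean squared distance between paired visual/semantic features.

    Lower is better: a visual feature and its class semantic should
    coincide on the hypersphere (the alignment property).
    """
    return np.mean(np.sum((visual - semantic) ** 2, axis=1))

def uniformity_loss(feats, t=2.0):
    """Log of the mean pairwise Gaussian potential over distinct pairs.

    Lower is better: features should spread uniformly over the
    hypersphere, preserving maximal information (the uniformity property).
    """
    n = feats.shape[0]
    sq_dists = np.sum((feats[:, None, :] - feats[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(n, k=1)  # indices of distinct pairs only
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))
```

In the paper the two properties are optimized jointly inside one supervised contrastive loss; they are split into separate functions here only to make each term visible.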
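The class generator is described only at the level of interpolating and extrapolating seen-class features. One plausible sketch is an affine combination of two seen-class feature vectors; the coefficient ranges and sampling scheme below are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def synthesize_feature(feat_a, feat_b, alpha):
    """Affine combination of two seen-class features.

    alpha in (0, 1) interpolates between the two classes; alpha outside
    [0, 1] extrapolates beyond them, yielding candidate features for
    unseen classes.
    """
    return alpha * feat_a + (1.0 - alpha) * feat_b

def synthesize_batch(seen_feats, n_samples, low=-0.5, high=1.5, seed=0):
    """Draw random seen-class pairs and mixing coefficients (assumed ranges)."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_samples):
        i, j = rng.choice(len(seen_feats), size=2, replace=False)
        alpha = rng.uniform(low, high)
        out.append(synthesize_feature(seen_feats[i], seen_feats[j], alpha))
    return np.stack(out)
```

With `alpha = 0.5` this returns the midpoint of the two class features; with `alpha = 1.2` it pushes past the first class, which is where novel-class candidates come from.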
Related papers
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z) - Dual Feature Augmentation Network for Generalized Zero-shot Learning [14.410978100610489]
Zero-shot learning (ZSL) aims to infer novel classes without training samples by transferring knowledge from seen classes.
Existing embedding-based approaches for ZSL typically employ attention mechanisms to locate attributes on an image.
We propose a novel Dual Feature Augmentation Network (DFAN), which comprises two feature augmentation modules.
arXiv Detail & Related papers (2023-09-25T02:37:52Z) - Uniformly Distributed Category Prototype-Guided Vision-Language
Framework for Long-Tail Recognition [11.110124286206467]
We propose a uniformly distributed category prototype-guided vision-language framework to effectively mitigate the feature-space bias caused by data imbalance.
Our method outperforms previous vision-language methods for long-tailed learning by a large margin and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-08-24T03:21:28Z) - CAD: Co-Adapting Discriminative Features for Improved Few-Shot
Classification [11.894289991529496]
Few-shot classification is a challenging problem that aims to learn a model that can adapt to unseen classes given a few labeled samples.
Recent approaches pre-train a feature extractor, and then fine-tune for episodic meta-learning.
We propose a strategy to cross-attend and re-weight discriminative features for few-shot classification.
arXiv Detail & Related papers (2022-03-25T06:14:51Z) - Dual Prototypical Contrastive Learning for Few-shot Semantic
Segmentation [55.339405417090084]
We propose a dual prototypical contrastive learning approach tailored to the few-shot semantic segmentation (FSS) task.
The main idea is to make the prototypes more discriminative by increasing inter-class distance while reducing intra-class distance in the prototype feature space.
We demonstrate that the proposed dual contrastive learning approach outperforms state-of-the-art FSS methods on PASCAL-5i and COCO-20i datasets.
arXiv Detail & Related papers (2021-11-09T08:14:50Z) - Concurrent Discrimination and Alignment for Self-Supervised Feature
Learning [52.213140525321165]
Existing self-supervised learning methods learn by means of pretext tasks that are either (1) discriminative, explicitly specifying which features should be separated, or (2) aligning, precisely indicating which features should be drawn together.
In this work, we combine the positive aspects of the discriminating and aligning methods, and design a hybrid method that addresses the above issue.
Our method specifies the repulsion and attraction mechanisms explicitly: repulsion via a discriminative predictive task, and attraction by concurrently maximizing mutual information between paired views.
Our experiments on nine established benchmarks show that the proposed model consistently outperforms existing state-of-the-art self-supervised and transfer learning results.
arXiv Detail & Related papers (2021-08-19T09:07:41Z) - CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action
Recognition [52.66360172784038]
We propose a clustering-based model, which considers all training samples at once, instead of optimizing for each instance individually.
We call the proposed method CLASTER and observe that it consistently improves over the state-of-the-art in all standard datasets.
arXiv Detail & Related papers (2021-01-18T12:46:24Z) - Entropy-Based Uncertainty Calibration for Generalized Zero-Shot Learning [49.04790688256481]
The goal of generalized zero-shot learning (GZSL) is to recognise both seen and unseen classes.
Most GZSL methods typically learn to synthesise visual representations from semantic information on the unseen classes.
We propose a novel framework that leverages dual variational autoencoders with a triplet loss to learn discriminative latent features.
arXiv Detail & Related papers (2021-01-09T05:21:27Z) - Bidirectional Mapping Coupled GAN for Generalized Zero-Shot Learning [7.22073260315824]
Bidirectional mapping-based generalized zero-shot learning (GZSL) methods rely on the quality of synthesized features to recognize seen and unseen data.
Learning a joint distribution of seen and unseen domains while preserving domain distinction is crucial for these methods.
In this work, we utilize the available unseen class semantics alongside seen class semantics and learn joint distribution through a strong visual-semantic coupling.
arXiv Detail & Related papers (2020-12-30T06:11:29Z) - Learning and Evaluating Representations for Deep One-class
Classification [59.095144932794646]
We present a two-stage framework for deep one-class classification.
We first learn self-supervised representations from one-class data, and then build one-class classifiers on learned representations.
In experiments, we demonstrate state-of-the-art performance on visual domain one-class classification benchmarks.
arXiv Detail & Related papers (2020-11-04T23:33:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.