CLIP Brings Better Features to Visual Aesthetics Learners
- URL: http://arxiv.org/abs/2307.15640v1
- Date: Fri, 28 Jul 2023 16:00:21 GMT
- Title: CLIP Brings Better Features to Visual Aesthetics Learners
- Authors: Liwu Xu, Jinjin Xu, Yuzhe Yang, Yijie Huang, Yanchun Xie, Yaqian Li
- Abstract summary: Image aesthetics assessment (IAA) is one of the ideal application scenarios for such methods due to its subjective and expensive labeling procedure.
In this work, a unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation paradigm, namely CSKD, is proposed.
- Score: 12.0962117940694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of pre-training approaches on a variety of downstream tasks has
revitalized the field of computer vision. Image aesthetics assessment (IAA) is
one of the ideal application scenarios for such methods due to its subjective and
expensive labeling procedure. In this work, a unified and flexible two-phase
CLIP-based Semi-supervised Knowledge Distillation paradigm, namely CSKD, is proposed.
Specifically, we first integrate and leverage a multi-source unlabeled dataset
to align rich features between a given visual encoder and an off-the-shelf CLIP
image encoder via a feature alignment loss. Notably, the given visual encoder is
not limited by size or structure and, once well trained, it can seamlessly serve
as a better visual aesthetics learner for both the student and the teacher. In the
second phase, the unlabeled data is also utilized in semi-supervised IAA
learning to further boost the student model's performance when applied in
latency-sensitive production scenarios. By analyzing the attention distance and
entropy before and after feature alignment, we observe an alleviation of the
feature collapse issue, which in turn showcases the necessity of feature alignment
rather than training directly on top of the CLIP image encoder. Extensive
experiments indicate the superiority of CSKD, which achieves state-of-the-art
performance on multiple widely used IAA benchmarks.
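The abstract describes the two training phases only at a high level. As a rough illustration of how such a pipeline could be wired together, below is a minimal PyTorch-style sketch; the module and function names, the projection head, the cosine alignment loss, the MSE objectives, and the weighting `lam` are assumptions made for readability, not the paper's actual implementation.

```python
# Minimal sketch of the two-phase idea described above. Everything here
# (module names, cosine alignment loss, MSE losses, `lam`) is an
# illustrative assumption, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAlignment(nn.Module):
    """Phase 1: align a given visual encoder with a frozen CLIP image encoder."""

    def __init__(self, encoder, encoder_dim, clip_dim):
        super().__init__()
        self.encoder = encoder                        # any size/structure
        self.proj = nn.Linear(encoder_dim, clip_dim)  # assumed projection head

    def forward(self, images, clip_image_encoder):
        feats = self.proj(self.encoder(images))
        with torch.no_grad():                         # CLIP stays frozen
            clip_feats = clip_image_encoder(images)
        # Feature alignment loss: 1 - cosine similarity (an assumption; an L2
        # or contrastive objective would plug in the same way).
        return 1.0 - F.cosine_similarity(feats, clip_feats, dim=-1).mean()


def phase2_step(student, teacher, head_s, head_t,
                labeled_images, labels, unlabeled_images, lam=1.0):
    """Phase 2: supervised IAA loss plus distillation on unlabeled data."""
    # Supervised aesthetic prediction on labeled images (MSE is an assumption;
    # IAA work often predicts score distributions instead).
    sup_loss = F.mse_loss(head_s(student(labeled_images)).squeeze(-1), labels)

    # Semi-supervised distillation: the lightweight student mimics the frozen
    # teacher on unlabeled images; both start from the phase-1 aligned encoder.
    with torch.no_grad():
        t_out = head_t(teacher(unlabeled_images))
    distill_loss = F.mse_loss(head_s(student(unlabeled_images)), t_out)

    return sup_loss + lam * distill_loss
```

Under this reading, phase 1 trains the alignment module on the multi-source unlabeled pool, and the resulting encoder then initializes both the teacher and the lighter-weight student distilled in phase 2.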
Related papers
- Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation [14.998239253285394]
We propose to replace the visual prior representation with the visual-text alignment capacity to capture more reliable guidance.
We show that our method obtains a clearly substantial improvement and reaches the new state-of-the-art performance.
arXiv Detail & Related papers (2024-05-14T09:28:25Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in how to create a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation [20.880942041889444]
We propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from image to pixel.
Specifically, we introduce Spectral Prompt Tuning (SPT), incorporating spectral prompts into the CLIP visual encoder's shallow layers.
We demonstrate the superiority of our method over state-of-the-art approaches, performing well across all classes and particularly excelling in handling unseen classes.
arXiv Detail & Related papers (2023-12-20T04:27:13Z)
- ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation [26.25673603166731]
Recent work shows that transferring the knowledge from CLIP to semantic segmentation via prompt learning can achieve promising performance.
We focus on improving the quality of vision-text alignment from two aspects of prompting design and loss function.
We propose an align-guided contrastive loss to refine the alignment of vision and text embeddings.
arXiv Detail & Related papers (2023-08-14T11:21:47Z)
- Bilevel Fast Scene Adaptation for Low-Light Image Enhancement [50.639332885989255]
Enhancing images in low-light scenes is a challenging but widely studied task in computer vision.
The main obstacle lies in the modeling conundrum arising from the distribution discrepancy across different scenes.
We introduce the bilevel paradigm to model the above latent correspondence.
A bilevel learning framework is constructed to endow the scene-irrelevant generality of the encoder towards diverse scenes.
arXiv Detail & Related papers (2023-06-02T08:16:21Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention [31.84299688413136]
Contrastive Language-Image Pre-training has been shown to learn visual representations with great transferability.
Existing works propose additional learnable modules on top of CLIP and fine-tune them on few-shot training sets.
We introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free Attention module.
arXiv Detail & Related papers (2022-09-28T15:22:11Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification. A minimal sketch of the pseudo-labeling idea appears after this list.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance with 10,800X faster inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art results on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Visual Alignment Constraint for Continuous Sign Language Recognition [74.26707067455837]
Vision-based Continuous Sign Language Recognition (CSLR) aims to recognize unsegmented gestures from image sequences.
In this work, we revisit the overfitting problem in recent CTC-based CSLR works and attribute it to the insufficient training of the feature extractor.
We propose a Visual Alignment Constraint (VAC) to enhance the feature extractor with more alignment supervision.
arXiv Detail & Related papers (2021-04-06T07:24:58Z)
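As referenced in the MUST entry above, the following is a hedged sketch of the pseudo-labeling ingredient that the summary mentions (the masked-image-modeling half of MUST is omitted). The prompt template, the confidence threshold, and the function name are illustrative assumptions rather than MUST's actual implementation.

```python
# Hedged sketch: CLIP zero-shot predictions used as pseudo-labels for
# self-training. Prompt template and threshold are illustrative assumptions.
import torch
import torch.nn.functional as F


@torch.no_grad()
def clip_pseudo_labels(clip_model, tokenize, images, class_names, threshold=0.7):
    text = tokenize([f"a photo of a {c}" for c in class_names])
    text_feats = F.normalize(clip_model.encode_text(text), dim=-1)
    image_feats = F.normalize(clip_model.encode_image(images), dim=-1)
    probs = (100.0 * image_feats @ text_feats.t()).softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= threshold          # only confident predictions become labels
    return labels[keep], keep         # pseudo-labels plus a mask over the batch
```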