HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification
- URL: http://arxiv.org/abs/2407.16244v1
- Date: Tue, 23 Jul 2024 07:31:42 GMT
- Title: HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification
- Authors: Shuyi Ouyang, Hongyi Wang, Ziwei Niu, Zhenjia Bai, Shiao Xie, Yingying Xu, Ruofeng Tong, Yen-Wei Chen, Lanfen Lin,
- Abstract summary: Tight visual-linguistic interactions play a vital role in improving classification performance.
Recent Transformer-based methods have achieved great success in multi-label image classification.
We propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs.
- Score: 15.129037250680582
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The task of multi-label image classification involves recognizing multiple objects within a single image. Considering both valuable semantic information contained in the labels and essential visual features presented in the image, tight visual-linguistic interactions play a vital role in improving classification performance. Moreover, given the potential variance in object size and appearance within a single image, attention to features of different scales can help to discover possible objects in the image. Recently, Transformer-based methods have achieved great success in multi-label image classification by leveraging the advantage of modeling long-range dependencies, but they have several limitations. Firstly, existing methods treat visual feature extraction and cross-modal fusion as separate steps, resulting in insufficient visual-linguistic alignment in the joint semantic space. Additionally, they only extract visual features and perform cross-modal fusion at a single scale, neglecting objects with different characteristics. To address these issues, we propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs: (1) A hierarchical multi-scale architecture that involves a Cross-Scale Aggregation module, which leverages joint multi-modal features extracted from multiple scales to recognize objects of varying sizes and appearances in images. (2) Interactive Visual-Linguistic Attention, a novel attention mechanism module that tightly integrates cross-modal interaction, enabling the joint updating of visual, linguistic and multi-modal features. We have evaluated our method on three benchmark datasets. The experimental results demonstrate that HSVLT surpasses state-of-the-art methods with lower computational cost.
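To make the abstract's two designs concrete, below is a minimal PyTorch sketch of what jointly updating visual, linguistic, and multi-modal tokens in one attention block, followed by a cross-scale fusion step, might look like. All module names, tensor shapes, and the concatenate-then-attend scheme are illustrative assumptions, not the authors' exact HSVLT formulation (the paper defines its own Interactive Visual-Linguistic Attention and Cross-Scale Aggregation modules).

```python
# Hypothetical sketch only; not the authors' published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveVLAttention(nn.Module):
    """One joint attention block over visual, linguistic, and
    multi-modal tokens, so all three streams update together."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vis, lng, mm):
        # vis: (B, Nv, D) patch tokens; lng: (B, C, D) label embeddings;
        # mm: (B, Nm, D) joint multi-modal tokens.
        nv, nc = vis.size(1), lng.size(1)
        x = torch.cat([vis, lng, mm], dim=1)   # one shared token sequence
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)              # tight cross-modal interaction
        x = x + h                              # residual update of all streams
        x = x + self.mlp(self.norm2(x))
        # split back so each stream comes out jointly updated
        return x[:, :nv], x[:, nv:nv + nc], x[:, nv + nc:]

def cross_scale_aggregate(mm_maps, target_hw):
    """Fuse per-scale joint feature maps (each (B, D, H_s, W_s)) by
    resizing to a common resolution and summing -- a stand-in for
    the paper's Cross-Scale Aggregation module."""
    fused = 0
    for m in mm_maps:
        fused = fused + F.interpolate(
            m, size=target_hw, mode="bilinear", align_corners=False)
    return fused
```

In a full model, blocks like this would run at several spatial scales, with the fused output feeding a per-label classification head; treat the sketch purely as a reading aid for the abstract.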
Related papers
- PSVMA+: Exploring Multi-granularity Semantic-visual Adaption for Generalized Zero-shot Learning [116.33775552866476]
Generalized zero-shot learning (GZSL) endeavors to identify unseen classes using knowledge from the seen domain.
GZSL suffers from insufficient visual-semantic correspondences due to attribute diversity and instance diversity.
We propose a multi-granularity progressive semantic-visual adaption network, where sufficient visual elements can be gathered to remedy the inconsistency.
arXiv Detail & Related papers (2024-10-15T12:49:33Z)
- Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval [37.775529830620016]
Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain.
Current multi-scale RSITR approaches typically align multi-scale fused image features with text features, but overlook aligning image-text pairs at distinct scales separately.
We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation.
arXiv Detail & Related papers (2024-05-29T10:19:11Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition [24.406654146411682]
This paper builds on the Vision Transformer (ViT).
Our goal is to leverage ViT's patch tokens and self-attention mechanism to mine rich instances in multi-label images.
We propose a weakly supervised object localization-based approach to extract multi-scale local features.
arXiv Detail & Related papers (2022-04-22T14:38:40Z)
- Multi-level Second-order Few-shot Learning [111.0648869396828]
We propose a Multi-level Second-order (MlSo) few-shot learning network for supervised or unsupervised few-shot image classification and few-shot action recognition.
We leverage so-called power-normalized second-order base learner streams combined with features that express multiple levels of visual abstraction.
We demonstrate respectable results on standard datasets such as Omniglot, mini-ImageNet, tiered-ImageNet, and Open MIC; fine-grained datasets such as CUB Birds, Stanford Dogs, and Cars; and action recognition datasets such as HMDB51, UCF101, and mini-MIT.
arXiv Detail & Related papers (2022-01-15T19:49:00Z)
- w-Net: Dual Supervised Medical Image Segmentation Model with Multi-Dimensional Attention and Cascade Multi-Scale Convolution [47.56835064059436]
A multi-dimensional attention segmentation model with cascade multi-scale convolution is proposed to accurately segment small objects in medical images.
The proposed method is evaluated on three datasets: KiTS19, Pancreas CT of Decathlon-10, and MICCAI 2018 LiTS Challenge.
arXiv Detail & Related papers (2020-11-15T13:54:22Z)
- Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric, which is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
- Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)