Dynamic MLP for Fine-Grained Image Classification by Leveraging
Geographical and Temporal Information
- URL: http://arxiv.org/abs/2203.03253v1
- Date: Mon, 7 Mar 2022 10:21:59 GMT
- Title: Dynamic MLP for Fine-Grained Image Classification by Leveraging
Geographical and Temporal Information
- Authors: Lingfeng Yang, Xiang Li, Renjie Song, Borui Zhao, Juntian Tao, Shihao
Zhou, Jiajun Liang, Jian Yang
- Abstract summary: Fine-grained image classification is a challenging computer vision task where various species share similar visual appearances.
It is helpful to leverage additional information, e.g., the locations and dates at which the images were taken, which are easily accessible but rarely exploited.
We propose a dynamic MLP on top of the image representation, which interacts with multimodal features at a higher and broader dimension.
- Score: 19.99135128298929
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Fine-grained image classification is a challenging computer vision task in which
various species share similar visual appearances, leading to misclassification
when relying on visual cues alone. It is therefore helpful to leverage
additional information, e.g., the locations and dates at which the images were
taken, which are easily accessible but rarely exploited. In this paper, we
first demonstrate that existing multimodal methods fuse multiple features along
only a single dimension, which provides limited help for feature
discrimination. To fully exploit the potential of multimodal information, we
propose a dynamic MLP on top of the image representation, which interacts with
multimodal features at a higher and broader dimension. The dynamic MLP is an
efficient structure parameterized by the learned embeddings of the varying
locations and dates. It can be regarded as an adaptive nonlinear projection
that generates more discriminative image representations for visual tasks. To
the best of our knowledge, this is the first attempt to apply the idea of
dynamic networks to exploiting multimodal information in fine-grained image
classification tasks.
Extensive experiments demonstrate the effectiveness of our method. t-SNE
visualizations show that our technique improves the separability of image
representations for samples that are visually similar but belong to different categories.
Furthermore, among published works across multiple fine-grained datasets,
dynamic MLP consistently achieves SOTA results
(https://paperswithcode.com/dataset/inaturalist) and takes third place in the
iNaturalist challenge at FGVC8
(https://www.kaggle.com/c/inaturalist-2021/leaderboard). Code is available at
https://github.com/ylingfeng/DynamicMLP.git
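For a concrete picture of the idea described in the abstract, the following is a minimal PyTorch sketch of a metadata-conditioned projection: a small weight matrix is generated per sample from the location/date embedding and applied to the image feature. The layer sizes, the residual connection, and the single generated weight matrix are illustrative assumptions, not the authors' released implementation (see the repository above for the official code).

```python
import torch
import torch.nn as nn

class DynamicMLPSketch(nn.Module):
    """Sketch of an MLP whose projection weights are generated from metadata embeddings."""
    def __init__(self, img_dim=2048, meta_dim=32, hidden_dim=64):
        super().__init__()
        # Reduce the image feature to a lower-dimensional channel space.
        self.reduce = nn.Linear(img_dim, hidden_dim)
        # Generate a (hidden_dim x hidden_dim) weight matrix from the metadata embedding.
        self.weight_gen = nn.Linear(meta_dim, hidden_dim * hidden_dim)
        # Map the refined feature back to the original image-feature dimension.
        self.expand = nn.Linear(hidden_dim, img_dim)
        self.act = nn.ReLU(inplace=True)
        self.hidden_dim = hidden_dim

    def forward(self, img_feat, meta_embed):
        # img_feat:   (B, img_dim)  image representation from the backbone
        # meta_embed: (B, meta_dim) learned embedding of location/date metadata
        x = self.act(self.reduce(img_feat))               # (B, hidden_dim)
        w = self.weight_gen(meta_embed)                   # (B, hidden_dim * hidden_dim)
        w = w.view(-1, self.hidden_dim, self.hidden_dim)  # (B, hidden_dim, hidden_dim)
        # Per-sample matrix multiplication: the projection depends on the metadata.
        x = torch.bmm(x.unsqueeze(1), w).squeeze(1)       # (B, hidden_dim)
        x = self.act(x)
        return img_feat + self.expand(x)                  # residual refinement of the image feature

# Usage (shapes only):
model = DynamicMLPSketch()
img_feat = torch.randn(4, 2048)
meta_embed = torch.randn(4, 32)
out = model(img_feat, meta_embed)  # (4, 2048), metadata-conditioned image representation
```

The key design point is that the weights applied to the image feature are not fixed parameters but are produced at inference time from the metadata embedding, so the same image can be projected differently depending on where and when it was captured.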
Related papers
- What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models [11.683093317651517]
Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification.
We present a simple yet effective approach for zero-shot image classification using multimodal LLMs.
Our results demonstrate its remarkable effectiveness, surpassing benchmark accuracy on multiple datasets.
arXiv Detail & Related papers (2024-05-24T16:05:15Z) - Pink: Unveiling the Power of Referential Comprehension for Multi-modal
LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z) - Fine-grained Recognition with Learnable Semantic Data Augmentation [68.48892326854494]
Fine-grained image recognition is a longstanding computer vision challenge.
We propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem.
Our method significantly improves the generalization performance on several popular classification networks.
arXiv Detail & Related papers (2023-09-01T11:15:50Z) - Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z) - Self-attention on Multi-Shifted Windows for Scene Segmentation [14.47974086177051]
We explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features.
We propose three different strategies to aggregate these feature maps to decode the feature representation for dense prediction.
Our models achieve very promising performance on four public scene segmentation datasets.
arXiv Detail & Related papers (2022-07-10T07:36:36Z) - Facing the Void: Overcoming Missing Data in Multi-View Imagery [0.783788180051711]
We propose a novel technique for multi-view image classification robust to this problem.
The proposed method, based on state-of-the-art deep learning-based approaches and metric learning, can be easily adapted and exploited in other applications and domains.
Results show that the proposed algorithm provides improvements in multi-view image classification accuracy when compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-05-21T13:21:27Z) - Multi-level Second-order Few-shot Learning [111.0648869396828]
We propose a Multi-level Second-order (MlSo) few-shot learning network for supervised or unsupervised few-shot image classification and few-shot action recognition.
We leverage so-called power-normalized second-order base learner streams combined with features that express multiple levels of visual abstraction.
We demonstrate respectable results on standard datasets such as Omniglot, mini-ImageNet, tiered-ImageNet, Open MIC, fine-grained datasets such as CUB Birds, Stanford Dogs and Cars, and action recognition datasets such as HMDB51, UCF101, and mini-MIT.
arXiv Detail & Related papers (2022-01-15T19:49:00Z) - An Image Patch is a Wave: Phase-Aware Vision MLP [54.104040163690364]
The multilayer perceptron (MLP) is a new kind of vision model with an extremely simple architecture, built only by stacking fully-connected layers.
We propose to represent each token as a wave function with two parts, amplitude and phase.
Experiments demonstrate that the proposed Wave-MLP is superior to the state-of-the-art architectures on various vision tasks.
arXiv Detail & Related papers (2021-11-24T06:25:49Z) - Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.