Dynamic MLP for Fine-Grained Image Classification by Leveraging
Geographical and Temporal Information
- URL: http://arxiv.org/abs/2203.03253v1
- Date: Mon, 7 Mar 2022 10:21:59 GMT
- Title: Dynamic MLP for Fine-Grained Image Classification by Leveraging
Geographical and Temporal Information
- Authors: Lingfeng Yang, Xiang Li, Renjie Song, Borui Zhao, Juntian Tao, Shihao
Zhou, Jiajun Liang, Jian Yang
- Abstract summary: Fine-grained image classification is a challenging computer vision task where various species share similar visual appearances.
It is helpful to leverage additional information, e.g., the locations and dates at which the images were taken, which are easily accessible but rarely exploited.
We propose a dynamic MLP on top of the image representation, which interacts with multimodal features at a higher and broader dimension.
- Score: 19.99135128298929
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Fine-grained image classification is a challenging computer vision task in which
various species share similar visual appearances, leading to misclassification
when relying on visual cues alone. It is therefore helpful to leverage
additional information, e.g., the locations and dates at which the images were
taken, which are easily accessible but rarely exploited. In this paper, we
first demonstrate that existing multimodal methods fuse multiple features along
only a single dimension, which provides limited help for feature
discrimination. To fully exploit the potential of multimodal information, we
propose a dynamic MLP on top of the image representation, which interacts with
multimodal features at a higher and broader dimension. The dynamic MLP is an
efficient structure parameterized by the learned embeddings of the varying
locations and dates. It can be regarded as an adaptive nonlinear projection
that generates more discriminative image representations for visual tasks. To
the best of our knowledge, this is the first attempt to apply the idea of
dynamic networks to exploiting multimodal information in fine-grained image
classification tasks.
Extensive experiments demonstrate the effectiveness of our method. t-SNE
visualizations show that our technique improves the separability of image
representations for samples that are visually similar but belong to different categories.
Furthermore, among published works across multiple fine-grained datasets,
dynamic MLP consistently achieves SOTA results
(https://paperswithcode.com/dataset/inaturalist) and takes third place in the
iNaturalist challenge at FGVC8
(https://www.kaggle.com/c/inaturalist-2021/leaderboard). Code is available at
https://github.com/ylingfeng/DynamicMLP.git
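For a concrete picture of the idea described in the abstract, the following is a minimal PyTorch sketch of a metadata-conditioned projection: a small weight matrix is generated per sample from the location/date embedding and applied to the image feature. The layer sizes, the residual connection, and the single generated weight matrix are illustrative assumptions, not the authors' released implementation (see the repository above for the official code).

```python
import torch
import torch.nn as nn

class DynamicMLPSketch(nn.Module):
    """Sketch of an MLP whose projection weights are generated from metadata embeddings."""
    def __init__(self, img_dim=2048, meta_dim=32, hidden_dim=64):
        super().__init__()
        # Reduce the image feature to a lower-dimensional channel space.
        self.reduce = nn.Linear(img_dim, hidden_dim)
        # Generate a (hidden_dim x hidden_dim) weight matrix from the metadata embedding.
        self.weight_gen = nn.Linear(meta_dim, hidden_dim * hidden_dim)
        # Map the refined feature back to the original image-feature dimension.
        self.expand = nn.Linear(hidden_dim, img_dim)
        self.act = nn.ReLU(inplace=True)
        self.hidden_dim = hidden_dim

    def forward(self, img_feat, meta_embed):
        # img_feat:   (B, img_dim)  image representation from the backbone
        # meta_embed: (B, meta_dim) learned embedding of location/date metadata
        x = self.act(self.reduce(img_feat))               # (B, hidden_dim)
        w = self.weight_gen(meta_embed)                   # (B, hidden_dim * hidden_dim)
        w = w.view(-1, self.hidden_dim, self.hidden_dim)  # (B, hidden_dim, hidden_dim)
        # Per-sample matrix multiplication: the projection depends on the metadata.
        x = torch.bmm(x.unsqueeze(1), w).squeeze(1)       # (B, hidden_dim)
        x = self.act(x)
        return img_feat + self.expand(x)                  # residual refinement of the image feature

# Usage (shapes only):
model = DynamicMLPSketch()
img_feat = torch.randn(4, 2048)
meta_embed = torch.randn(4, 32)
out = model(img_feat, meta_embed)  # (4, 2048), metadata-conditioned image representation
```

The key design point is that the weights applied to the image feature are not fixed parameters but are produced at inference time from the metadata embedding, so the same image can be projected differently depending on where and when it was captured.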
Related papers
- What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models [11.683093317651517]
Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification.
We present a simple yet effective approach for zero-shot image classification using multimodal LLMs.
Our results demonstrate its remarkable effectiveness, surpassing benchmark accuracy on multiple datasets.
arXiv Detail & Related papers (2024-05-24T16:05:15Z) - Pink: Unveiling the Power of Referential Comprehension for Multi-modal
LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z) - Fine-grained Recognition with Learnable Semantic Data Augmentation [68.48892326854494]
Fine-grained image recognition is a longstanding computer vision challenge.
We propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem.
Our method significantly improves the generalization performance on several popular classification networks.
arXiv Detail & Related papers (2023-09-01T11:15:50Z) - Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z) - Self-attention on Multi-Shifted Windows for Scene Segmentation [14.47974086177051]
We explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features.
We propose three different strategies to aggregate these feature maps to decode the feature representation for dense prediction.
Our models achieve very promising performance on four public scene segmentation datasets.
arXiv Detail & Related papers (2022-07-10T07:36:36Z) - Facing the Void: Overcoming Missing Data in Multi-View Imagery [0.783788180051711]
We propose a novel technique for multi-view image classification robust to this problem.
The proposed method, based on state-of-the-art deep learning-based approaches and metric learning, can be easily adapted and exploited in other applications and domains.
Results show that the proposed algorithm provides improvements in multi-view image classification accuracy when compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-05-21T13:21:27Z) - Multi-level Second-order Few-shot Learning [111.0648869396828]
We propose a Multi-level Second-order (MlSo) few-shot learning network for supervised or unsupervised few-shot image classification and few-shot action recognition.
We leverage so-called power-normalized second-order base learner streams combined with features that express multiple levels of visual abstraction.
We demonstrate respectable results on standard datasets such as Omniglot, mini-ImageNet, tiered-ImageNet, Open MIC, fine-grained datasets such as CUB Birds, Stanford Dogs and Cars, and action recognition datasets such as HMDB51, UCF101, and mini-MIT.
arXiv Detail & Related papers (2022-01-15T19:49:00Z) - An Image Patch is a Wave: Phase-Aware Vision MLP [54.104040163690364]
The multilayer perceptron (MLP) is a new kind of vision model with an extremely simple architecture, built only by stacking fully-connected layers.
We propose to represent each token as a wave function with two parts, amplitude and phase.
Experiments demonstrate that the proposed Wave-MLP is superior to the state-of-the-art architectures on various vision tasks.
arXiv Detail & Related papers (2021-11-24T06:25:49Z) - Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.