Multi-branch and Multi-scale Attention Learning for Fine-Grained Visual
Categorization
- URL: http://arxiv.org/abs/2003.09150v3
- Date: Tue, 21 Jul 2020 14:15:33 GMT
- Title: Multi-branch and Multi-scale Attention Learning for Fine-Grained Visual
Categorization
- Authors: Fan Zhang, Meng Li, Guisheng Zhai, Yizhao Liu
- Abstract summary: ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is one of the most authoritative academic competitions in the field of Computer Vision (CV) in recent years.
Applying the annual ILSVRC champion model directly to fine-grained visual categorization (FGVC) tasks does not achieve good performance.
Our approach can be trained end-to-end while providing short inference time.
- Score: 6.415792312027131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is one of the most
authoritative academic competitions in the field of Computer Vision (CV) in
recent years. However, applying ILSVRC's annual champion model directly to
fine-grained visual categorization (FGVC) tasks does not achieve good
performance. FGVC tasks are challenging because of their small inter-class
variations and large intra-class variations. Our attention object location
module (AOLM) can predict the position of the object, and our attention part
proposal module (APPM) can propose informative part regions, without the need
for bounding-box or part
annotations. The obtained object images not only contain almost the entire
structure of the object but also more details; the part images span many
different scales and capture finer-grained features; and the raw images contain
the complete object. The three kinds of training images are supervised by our
multi-branch network. Therefore, our multi-branch and multi-scale learning
network (MMAL-Net) has good classification ability and robustness for images of
different scales. Our approach can be trained end-to-end while providing short
inference time. Comprehensive experiments demonstrate that our approach
achieves state-of-the-art results on CUB-200-2011, FGVC-Aircraft
and Stanford Cars datasets. Our code will be available at
https://github.com/ZF1044404254/MMAL-Net
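To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical sketch of its two key ideas: AOLM-style object localization by thresholding an aggregated activation map, and fusing the class scores of the raw, object, and part branches. Function names, the mean-threshold choice, and the averaging fusion are illustrative assumptions, not the paper's exact implementation; see the linked repository for the authors' code.

```python
import numpy as np

def aolm_bounding_box(activation_map, threshold=None):
    """Hypothetical AOLM-style localization sketch: threshold an
    aggregated feature activation map (here at its mean) and return
    the bounding box covering all above-threshold cells. The paper
    operates on CNN feature maps; a plain 2-D array stands in here."""
    if threshold is None:
        threshold = activation_map.mean()
    ys, xs = np.nonzero(activation_map > threshold)
    return xs.min(), ys.min(), xs.max(), ys.max()  # (x0, y0, x1, y1)

def combine_branch_logits(raw_logits, object_logits, part_logits_list):
    """Fuse per-branch class scores. The network is trained on three
    kinds of images (raw, object, parts); at test time the branch
    predictions can be fused, e.g. by simple averaging (an assumed
    fusion rule for this sketch)."""
    part_avg = np.mean(part_logits_list, axis=0)
    return (raw_logits + object_logits + part_avg) / 3.0
```

For example, feeding a toy activation map whose high responses cluster in the lower-right yields a box around that cluster, and averaging three branch logit vectors yields the fused class scores used for the final prediction.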
Related papers
- Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models [0.6149772262764599]
We introduce the Vision-Instructed and Evaluation (VISE) method, which transforms the FS-CS problem into a Visual Question Answering (VQA) problem.
Our approach achieves state-of-the-art performance on the Pascal-5i and COCO-20i datasets.
arXiv Detail & Related papers (2024-03-15T13:29:41Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- General Object Foundation Model for Images and Videos at Scale [99.2806103051613]
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos.
GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario.
We employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, enabling the model to simultaneously solve various object-centric downstream tasks.
arXiv Detail & Related papers (2023-12-14T17:26:00Z)
- Self-attention on Multi-Shifted Windows for Scene Segmentation [14.47974086177051]
We explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features.
We propose three different strategies to aggregate these feature maps to decode the feature representation for dense prediction.
Our models achieve very promising performance on four public scene segmentation datasets.
arXiv Detail & Related papers (2022-07-10T07:36:36Z)
- Integrative Few-Shot Learning for Classification and Segmentation [37.50821005917126]
We introduce the integrative task of few-shot classification and segmentation (FS-CS)
FS-CS aims to classify and segment target objects in a query image when the target classes are given with a few examples.
We propose the integrative few-shot learning framework for FS-CS, which trains a learner to construct class-wise foreground maps.
arXiv Detail & Related papers (2022-03-29T16:14:40Z)
- PartImageNet: A Large, High-Quality Dataset of Parts [16.730418538593703]
We propose PartImageNet, a high-quality dataset with part segmentation annotations.
PartImageNet is unique because it offers part-level annotations on a general set of classes with non-rigid, articulated objects.
It can be utilized in multiple vision tasks including but not limited to: Part Discovery, Few-shot Learning.
arXiv Detail & Related papers (2021-12-02T02:12:03Z)
- Unsupervised Object-Level Representation Learning from Scene Images [97.07686358706397]
Object-level Representation Learning (ORL) is a new self-supervised learning framework towards scene images.
Our key insight is to leverage image-level self-supervised pre-training as the prior to discover object-level semantic correspondence.
ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks.
arXiv Detail & Related papers (2021-06-22T17:51:24Z)
- KiU-Net: Overcomplete Convolutional Architectures for Biomedical Image and Volumetric Segmentation [71.79090083883403]
"Traditional" encoder-decoder based approaches perform poorly in detecting smaller structures and are unable to segment boundary regions precisely.
We propose KiU-Net which has two branches: (1) an overcomplete convolutional network Kite-Net which learns to capture fine details and accurate edges of the input, and (2) U-Net which learns high level features.
The proposed method achieves better performance than recent methods, with the additional benefits of fewer parameters and faster convergence.
arXiv Detail & Related papers (2020-10-04T19:23:33Z)
- Multiple instance learning on deep features for weakly supervised object detection with extreme domain shifts [1.9336815376402716]
Weakly supervised object detection (WSOD) using only image-level annotations has attracted growing attention over the past few years.
We show that a simple multiple instance approach applied to pre-trained deep features yields excellent performance on non-photographic datasets.
arXiv Detail & Related papers (2020-08-03T20:36:01Z)
- Self-Supervised Viewpoint Learning From Image Collections [116.56304441362994]
We propose a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint aware manner.
We show that our approach performs competitively to fully-supervised approaches for several object categories like human faces, cars, buses, and trains.
arXiv Detail & Related papers (2020-04-03T22:01:41Z)
- Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.