Learning feature representation from discriminative local regions plays a key
role in fine-grained visual classification. Employing attention mechanisms to
extract part features has become a trend. However, there are two major
limitations in these methods: First, they often focus on the most salient part
while neglecting other inconspicuous but distinguishable parts. Second, they
treat different part features in isolation while neglecting their
relationships. To handle these limitations, we propose to locate multiple
different distinguishable parts and explore their relationships in an explicit
way. In this pursuit, we introduce two lightweight modules that can be easily
plugged into existing convolutional neural networks. On one hand, we introduce
a feature boosting and suppression module that boosts the most salient part of
feature maps to obtain a part-specific representation and suppresses it to
force the following network to mine other potential parts. On the other hand,
we introduce a feature diversification module that learns semantically
complementary information from the correlated part-specific representations.
Our method does not need bounding boxes/part annotations and can be trained
end-to-end. Extensive experimental results show that our method achieves
state-of-the-art performances on several benchmark fine-grained datasets.
Feature Boosting, Suppression, and Diversification for Fine-Grained Visual Classification
Jianwei Song, Ruoyu Yang
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
songjianwei@smail.nju.edu.cn, yangry@nju.edu.cn
arXiv:2103.02782v1 [cs.CV] 4 Mar 2021
Index Terms—fine-grained, visual classification, attention, feature diversification, part-specific feature
I. INTRODUCTION
Fine-grained visual classification (FGVC) focuses on distinguishing subtle visual differences within a basic-level category, e.g., species of birds [1] and dogs [2], and models of aircraft [3] and cars [4].
Recently, convolutional neural networks (CNNs) have made great progress on many vision tasks, such as image captioning [5], semantic segmentation [6], object detection [7] [8], etc. However, traditional CNNs are not powerful enough to capture the subtle discriminative features due to the large intra-class and small inter-class variations shown in Fig. 1, which makes FGVC still a challenging task.
The images with large variations in each row belong to the same class.
However, the images with small variations in each column belong to different classes.
This situation is opposite to generic visual classification.
One is based on attention mechanisms and is composed of two different subnetworks.
Specifically, a localization subnetwork with attention mechanisms is designed to locate discriminative parts, and a classification subnetwork follows for recognition. Dedicated loss functions are designed to optimize both subnetworks.
The limitation of these methods is that they are difficult to optimize because of the specially designed attention modules and loss functions.
The other is based on high-order information: these methods [18] [19] [20] [21] [22] argue that first-order information is not sufficient to model the differences and instead use high-order information to encode the discrimination. The limitation of these methods is that they consume a large amount of GPU resources and have poor interpretability.
We propose feature boosting, suppression, and diversification to achieve both efficiency and interpretability.
We argue that attention-based methods tend to focus on the most salient part, so other inconspicuous but distinguishable parts have no chance to stand out.
Based on this simple and effective idea, we introduce a feature boosting and suppression module (FBSM), which highlights the most salient part of feature maps at the current stage to obtain a part-specific representation and suppresses it to force the following stage to mine other potential parts.
By inserting FBSMs into the middle layers of CNNs, we can get multiple part-specific feature representations that are explicitly concentrated on different object parts.
Our contributions are summarized as follows:
• We propose a feature boosting and suppression module, which can explicitly force the network to focus on multiple discriminative parts.
• We propose a feature diversification module, which can model part interaction and diversify each part-specific representation.
II. RELATED WORK
Below, we review the most representative methods related to our method.
A. Fine-Grained Feature Learning
Ding et al. [17] proposed selective sparse sampling learning to obtain both discriminative and complementary regions.
Sun et al. [24] proposed a one-squeeze multi-excitation module to learn multiple parts, then applied a multi-attention multi-class constraint on these parts.
Zhuang et al. [25] proposed to discover contrastive clues by comparing image pairs.
Yang et al. [15] introduced a navigator-teacher-scrutinizer network to obtain discriminative regions.
Luo et al. [26] proposed Cross-X learning to explore the relationships between different images and different layers.
Gao et al. [27] proposed to model channel interaction to capture subtle differences.
Li et al. [20] proposed to capture the discrimination by matrix square root normalization and introduced an iterative method for fast end-to-end training.
Our method utilizes feature boosting and suppression to learn different part representations in an explicit way, which is significantly different from previous methods.
Our FDM is similar to [29] and [27], but there are essential differences: (1) SG-Net tends to explore positive correlations to capture long-range dependencies, while FDM tends to explore negative correlations to diversify the feature representation. (2) CIN mines complementary information along the channel dimension, whereas FDM does so along the spatial dimension.
III. METHODOLOGY
In this section, we will detail the proposed method.
An overview of the framework is shown in Fig 2.
Our model consists of two lightweight modules: (1) a feature boosting and suppression module (FBSM), which aims at learning multiple discriminative part-specific representations that are as different as possible, and (2) a feature diversification module (FDM), which aims at learning semantically complementary information from the correlated part-specific representations.
A. Feature Boosting and Suppression Module
Given feature maps X ∈ R^{C×W×H} from a specific layer, where C, W, H represent the number of channels, the width, and the height, respectively.
We split X evenly into k parts along the width dimension [30] and denote each striped part as X^{(i)} ∈ R^{C×(W/k)×H}, i ∈ [1, k].
Then we employ a 1×1 convolution φ to explore the importance of each part (Eq. (1)); the nonlinear function ReLU [31] is applied to remove the negative activations.
We then obtain the boosting feature X_b by boosting the most salient part:

X_b = X + α ∗ (B ⊗ X)    (4)

where α is a hyper-parameter that controls the extent of boosting and ⊗ denotes element-wise multiplication. A convolutional layer h is applied on X_b to get a part-specific representation X_p:

X_p = h(X_b)    (5)

By suppressing the most salient striped part, we can obtain the suppression feature X_s:

X_s = S ⊗ X    (6)

s_i = 1 − β if b_i = max(B), and s_i = 1 otherwise    (7)

where S = (s_1, ···, s_k)^T and β is a hyper-parameter that controls the extent of suppressing.

Fig. 2. The overview of our method.
Fig. 3. The diagram of the FBSM.
In short, the functionality of the FBSM can be expressed as: FBSM(X) = (X_p, X_s).
Given feature maps X, the FBSM outputs the part-specific feature X_p and the potential feature maps X_s. Since X_s suppresses the most salient part in the current stage, other potential parts will stand out after feeding X_s into the following stage.
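To make the boosting and suppression steps concrete, below is a minimal PyTorch-style sketch of an FBSM. It is only a sketch under assumptions: the per-stripe importance (Eqs. (1)-(3), not reproduced above) is approximated by a 1×1 convolution, ReLU, and global average pooling, B is taken to be a hard mask over the most salient stripe, and the choices of h, k, α, and β are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FBSM(nn.Module):
    """Sketch of a feature boosting and suppression module (Eqs. (4)-(7))."""

    def __init__(self, channels, k=4, alpha=0.5, beta=0.5):
        super().__init__()
        self.k = k              # number of stripes along the width
        self.alpha = alpha      # extent of boosting, Eq. (4)
        self.beta = beta        # extent of suppression, Eq. (7)
        self.phi = nn.Conv2d(channels, 1, kernel_size=1)                  # importance conv (assumed form)
        self.h = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # part-specific head h, Eq. (5)

    def forward(self, x):       # x: (N, C, H, W), W assumed divisible by k
        n, _, _, wid = x.shape
        stripe_w = wid // self.k
        stripes = x.chunk(self.k, dim=3)                                  # striped parts X^(i)
        # per-stripe importance scores b_i (assumed form of Eqs. (1)-(3))
        scores = torch.stack(
            [F.relu(self.phi(s)).mean(dim=(1, 2, 3)) for s in stripes], dim=1)  # (N, k)
        top = scores.argmax(dim=1)                                        # most salient stripe per image
        boost = torch.zeros(n, 1, 1, wid, device=x.device, dtype=x.dtype)
        supp = torch.ones(n, 1, 1, wid, device=x.device, dtype=x.dtype)
        for i in range(n):
            lo = int(top[i]) * stripe_w
            hi = lo + stripe_w
            boost[i, ..., lo:hi] = 1.0              # B: 1 on the most salient stripe, 0 elsewhere
            supp[i, ..., lo:hi] = 1.0 - self.beta   # S: 1 - beta on that stripe, 1 elsewhere, Eq. (7)
        xb = x + self.alpha * (boost * x)           # Eq. (4)
        xp = self.h(xb)                             # Eq. (5), part-specific representation
        xs = supp * x                               # Eq. (6), fed to the following stage
        return xp, xs
```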
B. Feature Diversification Module
We first discuss how two part-specific features diversify each other with the pairwise complement module (PCM).
A simple illustration of the PCM is shown in Fig. 4.
Without loss of generality, we denote two different part-specific features as X_{p1} ∈ R^{C×W1H1} and X_{p2} ∈ R^{C×W2H2}, where C denotes the number of channels, and W1H1 and W2H2 denote their spatial sizes, respectively.
The affinity matrices A^{p2}_{p1} and A^{p1}_{p2} between the two features are computed by Eqs. (9)-(11), where softmax is performed column-wise. Then we can get the complementary information:

Y^{p2}_{p1} = X_{p2} A^{p2}_{p1}    (12)

Y^{p1}_{p2} = X_{p1} A^{p1}_{p2}    (13)

where Y^{pj}_{pi} denotes the complementary information of X_{pi}. Each pixel of Y^{p2}_{p1} takes all pixels of X_{p2} as references, i.e., the higher the complementarity between pixel(X_{p1}, i) and pixel(X_{p2}, j) is, the greater the contribution of pixel(X_{p2}, j) to pixel(Y^{p2}_{p1}, i) is. In this way, every pixel in these two part-specific features can mine semantically complementary information from each other.

Formally, given a collection of part-specific features P = {X_{p1}, X_{p2}, X_{p3}, ···, X_{pn}}, the complementary information of X_{pi} is:

Y_{pi} = Σ_{X_{pj} ∈ P ∧ i≠j} Y^{pj}_{pi}    (15)

where Y^{pj}_{pi} can be obtained by applying X_{pi} and X_{pj} on (9), (10), and (12).
At training time, we compute the classification loss for each enhanced part-specific feature Z_{pi}:

L^i_{cls} = −y^T log(p_i),  p_i = softmax(cls_i(Z_{pi}))    (17)

where y is the ground-truth label of the input image, represented by a one-hot vector, cls_i is a classifier for the i-th part, p_i ∈ R^N is the prediction score vector, and N is the number of object categories. The final optimization objective is:

L = Σ_{i=1}^{T} L^i_{cls}    (18)

where T = 3 is the number of enhanced part-specific features. At inference time, we take the average of the prediction scores of all enhanced part-specific features as the final prediction result.
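A short sketch of this training objective (Eqs. (17)-(18)) and of the score averaging used at inference, assuming T = 3 pooled part-specific features of dimension 2048 and one linear classifier per part; the feature dimension and the classifier form are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 200   # e.g., CUB-200-2011
classifiers = nn.ModuleList([nn.Linear(2048, num_classes) for _ in range(3)])

def training_loss(z_parts, target):
    """Eq. (18): sum of the cross-entropy losses, one per enhanced part feature."""
    # z_parts: list of T tensors of shape (N, 2048); target: (N,) class indices
    return sum(F.cross_entropy(cls(z), target)           # L^i_cls = -y^T log(p_i)
               for cls, z in zip(classifiers, z_parts))

def predict(z_parts):
    """Average the prediction scores of all enhanced part-specific features."""
    probs = [F.softmax(cls(z), dim=1) for cls, z in zip(classifiers, z_parts)]
    return torch.stack(probs).mean(dim=0).argmax(dim=1)
```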
IV. EXPERIMENTS
A. Datasets and Baselines
We evaluate our model on four commonly used datasets: CUB-200-2011 [1], FGVC-Aircraft [3], Stanford Cars [4], and Stanford Dogs [2]. The compared baselines include:
• NTS [15]: guides the region proposal network by forcing the consistency between the informativeness of the regions and their probabilities of being the ground-truth class.
• Cross-X [26]: proposes to learn multi-scale feature representations between different layers and different images.
B. Implementation Details
We validate the performance of our method on Resnet50 and Resnet101 [33], both of which are pre-trained on the ImageNet dataset [37].
The learning rate of the backbone layers is set to 0.002, and that of the newly added layers is set to 0.02.
The learning rate is adjusted by a cosine annealing scheduler [38].
We use PyTorch to implement our experiments.
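A minimal sketch of this optimization setup in PyTorch; the optimizer type (SGD), momentum, and epoch count are assumptions, while the two learning rates and the cosine annealing schedule follow the text above.

```python
import torch
import torchvision

backbone = torchvision.models.resnet50(pretrained=True)   # ImageNet pre-trained backbone
new_layers = torch.nn.Linear(2048, 200)                    # stands in for the newly added FBSM/FDM/classifier layers

optimizer = torch.optim.SGD(
    [{"params": backbone.parameters(), "lr": 0.002},       # backbone layers
     {"params": new_layers.parameters(), "lr": 0.02}],     # newly added layers
    momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one training epoch over the fine-grained dataset ...
    scheduler.step()
```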
C. Comparison with State-of-the-Art
The top-1 classification accuracies on the CUB-200-2011 [1], FGVC-Aircraft [3], Stanford Cars [4], and Stanford Dogs [2] datasets are reported in Table II.
Results on CUB-200-2011: CUB-200-2011 is the most challenging benchmark in FGVC; our models based on Resnet50 and Resnet101 both achieve the best performance on this dataset. Compared with the two-stage methods RA-CNN, NTS, MGE-CNN, S3N, and FDL, which all use the first stage to explore informative regions and take them as input at the second stage, our model is 4.0%, 1.8%, 0.8%, 0.8%, and 0.7% higher, respectively. ISQRT-COV and DTB-Net explore high-order information to capture the subtle differences; our method outperforms them by large margins.
Compared with API-Net and Cross-X, which both take image pairs as input and model the discrimination by part interaction, our model obtains a 1.6% improvement.
The accuracy of our method is 3.1% higher than that of MAMC, which formulates part mining as a metric learning problem.
Compared with MA-CNN, CIN, and LIO, our method is 2.8%, 1.9%, and 1.3% higher, respectively.
Our method exceeds RA-CNN and MA-CNN, which use VGG [39] as the backbone, by large margins.
With the Resnet50 backbone, our method is higher than ISQRT-COV, MAMC, NTS, MGE-CNN, DTB-Net, CIN, and FDL, but lower than DCL, LIO, Cross-X, S3N, and API-Net. We suspect that features extracted from shallow layers (stage3 and stage4) lack rich semantic information, which may degrade recognition performance. When deepening the network and taking Resnet101 as the backbone, we obtain the best result of 95.0%.
Results on Stanford Dogs: Most previous methods do not report results on this dataset because of the computational complexity.
Our method obtains a competitive result on this dataset and surpasses RA-CNN, MAMC, and FDL by large margins.
Compared with Cross-X and API-Net, which take image pairs as input, our method does not need to consider how to design a non-trivial data sampler to sample inter-class and intra-class image pairs [40]. With the Resnet50 backbone, API-Net and Cross-X obtain the best results on Stanford Cars and Stanford Dogs, respectively, but both perform poorly on CUB-200-2011.
D. Ablation Studies
We perform ablation studies to understand the contribution of each proposed module.
We take experiments on the four datasets with Resnet50 as the backbone. The results are reported in Table III.

TABLE III
ABLATION STUDIES ON FOUR BENCHMARK DATASETS

Methods            | Bird | Aircraft | Car  | Dog
Resnet50           | 85.5 | 90.3     | 89.8 | 81.1
Resnet50+FBSM      | 88.9 | 92.4     | 94.0 | 87.5
Resnet50+FBSM+FDM  | 89.3 | 92.7     | 94.4 | 88.2
The effect of FBSM: To obtain multiple discriminative part-specific feature representations, we insert FBSMs at the end of stage3, stage4 and stage5 of Resnet50.
With this module, the accuracy of Bird, Aircraft, Car, and Dog increased by 3.4%, 2.1%, 4.2%, and 6.4% respectively, which reflects the effectiveness of the FBSM.
The effect of FDM: When introducing FDM into our approach to model part interaction, the classification results on Bird, Aircraft, Car, and Dog datasets increased by 0.4%, 0.3%, 0.4%, and 0.7% respectively, which indicates the effectiveness of the FDM.
As shown in Fig. 5, for each raw image sampled from the four datasets, the activation maps in the first to third columns correspond to the third to fifth stages of Resnet50, respectively. Taking the bird as an example, without FBSMs, the features at different stages all focus on the wing.
When there are FBSMs, the features in stage3 focus on the wing, the features in stage4 focus on the head, and the features in stage5 focus on the tail. The visualization experiments demonstrate the capability of FBSMs to mine multiple different discriminative object parts.
V. CONCLUSION
In this paper, we propose to learn feature boosting, suppression, and diversification for fine-grained visual classification.
Specifically, we introduce two lightweight modules:
One is the feature boosting and suppression module, which boosts the most salient part of the feature maps to obtain the part-specific feature and suppresses it to explicitly force the following stages to mine other potential parts. The other is the feature diversification module, which aggregates semantically complementary information from other object parts into each part-specific representation.
REFERENCES [1] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-UCSD Birds 200,” California Institute of Technology, Tech.
参考 [1] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona, “Caltech-UCSD Birds 200”, California Institute of Technology, Tech。
0.73
Rep. CNS-TR-2010-001, 2010.
CNS-TR-2010-001、2010年。
0.48
[2] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, “Novel dataset for fine-grained image categorization,” in First Workshop on FineGrained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
[2] A. Khosla, N. Jayadevaprakash, B. Yao, L. Fei-Fei, “Novel dataset for fine-grained image categorization” は,2011年6月,コロラドスプリングスで開催されたIEEE Conference on Computer Vision and Pattern Recognition on Colorado Springsの初ワークショップである。
0.86
[3] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi, “Finegrained visual classification of aircraft,” Tech.
3] S. Maji、J. Kannala、E. Rahtu、M. Blaschko、A. Vedaldi、「航空機の粒度のビジュアル分類」技術。
0.82
Rep., 2013. [4] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
2013年、退社。 J. Krause, M. Stark, J. Deng, L. Fei-Fei, “3d object representations for fine-fine categorization” in 4th International IEEE Workshop on 3D Representation and Recognition, Sydney, Australia, 2013
0.73
[5] X. Li, X. Yin, C. Li, X. Hu, P. Zhang, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y. Choi, and J. Gao, “Oscar: Objectsemantics aligned pre-training for vision-language tasks,” arXiv preprint arXiv:2004.06165, 2020.
5] X. Li, X. Yin, C. Li, X. Hu, P. Zhang, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y. Choi, J. Gao, “Oscar: Objectsemanticsalign ed pre-training for vision-language tasks” arXiv:2004.06165, 2020。
0.90
[6] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, 2017.
H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, “Pyramid scene parsing network” in CVPR, 2017
0.78
[7] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y.
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y]
0.92
Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” Lecture Notes in Computer Science, p. 21–37, 2016.
Fu, and A.C. Berg, "Ssd: Single shot multibox detector", Lecture Notes in Computer Science, p. 21–37, 2016
0.90
[8] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
8] T.Y。 Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, S. Belongie, “Feature pyramid network for object detection” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017”. 2017年5月1日閲覧。
0.84
[9] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” Lecture Notes in Computer Science, p. 834–849, 2014.
N. Zhang, J. Donahue, R. Girshick, T. Darrell, “Part-based r-cnns for fine-fine category detection”, Lecture Notes in Computer Science, pp. 834–849, 2014
0.91
[10] D. Lin, X. Shen, C. Lu, and J. Jia, “Deep lac: Deep localization, alignment and classification for fine-grained recognition,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp.
10] D. Lin, X. Shen, C. Lu, and J. Jia, "Deep lac: Deep Localization, alignment and classification for fine-grained recognition" 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015 pp。 訳抜け防止モード: [10 ]D. Lin, X. Shen, C. Lu, J. Jia, “Deep lac : Deep Localization, alignment” 2015 IEEE Conference on Computer Vision における「微粒化認識のための分類」 and Pattern Recognition (CVPR ) , 2015 , pp。
0.92
1666–1674.
1666–1674.
0.71
[11] S. Huang, Z. Xu, D. Tao, and Y. Zhang, “Part-stacked cnn for finegrained visual categorization,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.
11] S. Huang, Z. Xu, D. Tao, Y. Zhang, “Part-stacked cnn for finegrained visual categorization”, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016年6月
0.88
[12] S. Branson, G. V. Horn, S. Belongie, and P. Perona, “Bird species categorization using pose normalized deep convolutional nets,” 2014.
12] S. Branson, G. V. Horn, S. Belongie, P. Perona, “Bird species categorization using pose normalized Deep Convolutional nets” 2014年。
0.85
[13] J. Fu, H. Zheng, and T. Mei, “Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp.
13] J. Fu、H. Zheng、T. Meiは、2017年のIEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017 pp.で、"Look more to see better: Recurrent attention convolutional neural network for fine-grained image recognition"と題した講演を行った。
0.79
4476–4484.
4476–4484.
0.71
[14] C. Liu, H. Xie, Z. Zha, L. Ma, L. Yu, and Y. Zhang, “Filtration and distillation: Enhancing region attention for fine-grained visual categorization,” in Proceedings of the AAAI Conference on Artificial Intelligence.
14] C. Liu, H. Xie, Z. Zha, L. Ma, L. Yu, and Y. Zhang, “Filtration and distillation: Enhancing region attention for fine-grained visual categorization” 人工知能に関するAAAI会議の進行。
0.87
AAAI Press, 2020, pp.
AAAI Press, 2020, pp。
0.82
11 555–11 562.
11 555–11 562.
0.84
[15] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang, “Learning to navigate for fine-grained classification,” Lecture Notes in Computer Science, p. 438–454, 2018.
15] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, L. Wang, “Learning to navigate for fine-grained classification”, Lecture Notes in Computer Science, p. 438–454, 2018
0.96
[16] L. Zhang, S. Huang, W. Liu, and D. Tao, “Learning a mixture of granularity-specific experts for fine-grained categorization,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp.
16] L. Zhang、S. Huang、W. Liu、D. Taoは、2019 IEEE/CVF International Conference on Computer Vision (ICCV)、2019、pp. 2019で、「きめ細かいカテゴリ化のための粒度固有の専門家の混合を学ぶ。
0.70
8330–8339.
8330–8339.
0.71
[17] Y. Ding, Y. Zhou, Y. Zhu, Q. Ye, and J. Jiao, “Selective sparse sampling for fine-grained image recognition,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp.
Y.Ding, Y. Zhou, Y. Zhu, Q. Ye, J. Jiao, “Selective sparse sample for fine-fine image recognition” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp.
0.86
6598–6607.
6598–6607.
0.71
[18] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ser.
18] T.Y。 Lin, A. RoyChowdhury, S. Maji, “Bilinear cnn models for fine-fine visual recognition” in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) Ser。
0.79
ICCV ’15. USA: IEEE Computer Society, 2015, p. 1449–1457.
ICCV ’15。 USA: IEEE Computer Society, 2015, pp. 1449–1457。
0.84
[19] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell, “Compact bilinear pooling,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.
Y. Gao, O. Beijbom, N. Zhang, T. Darrell, “Compact bilinear pooling” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016年6月 訳抜け防止モード: [19 ] Y. Gao, O. Beijbom, N. Zhang, T. Darrell, “Compact Bilinear pooling, ” 2016 IEEE Conference on Computer Vision” And Pattern Recognition (CVPR ) , Jun 2016 。
0.91
[20] P. Li, J. Xie, Q. Wang, and Z. Gao, “Towards faster training of global covariance pooling networks by iterative matrix square root normalization,” in IEEE Int.
P. Li, J. Xie, Q. Wang, Z. Gao, “Towards faster training of global covariance pooling network by repeaterative matrix square root normalization” in IEEE Int。
0.75
Conf. on Computer Vision and Pattern Recognition (CVPR), June 2018.
Conf computer vision and pattern recognition (cvpr) 2018年6月号。
0.62
[21] S. Cai, W. Zuo, and L. Zhang, “Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp.
21] s. cai, w. zuo, l. zhang, “higher-order integration of hierarchical convolutional activations for fine-grained visual categorization” in 2017 ieee international conference on computer vision (iccv), 2017 pp。
0.76
511–520. [22] S. Kong and C. Fowlkes, “Low-rank bilinear pooling for fine-grained classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.
511–520. s. kong and c. fowlkes, “low-rank bilinear pooling for fine-grained classification” in the proceedings of the ieee conference on computer vision and pattern recognition, 2017 pp. ^ (英語)
0.74
365–374. [23] G. Sun, H. Cholakkal, S. Khan, F. Khan, and L. Shao, “Fine-grained recognition: Accounting for subtle differences between similar classes,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol.
365–374. G.Sun, H. Cholakkal, S. Khan, F. Khan, L. Shao, "Fine-grained recognition: Accounting for subtle difference between similar classes" とAAAI Conference on Artificial Intelligence, vol. のProceedingsに記されている。
0.77
34, no. 07, 2020, pp.
34、いいえ。 07, 2020, pp。
0.78
12 047–12 054.
12 047–12 054.
0.84
[24] M. Sun, Y. Yuan, F. Zhou, and E. Ding, “Multi-attention multi-class constraint for fine-grained image recognition,” Lecture Notes in Computer Science, p. 834–850, 2018.
24] M. Sun, Y. Yuan, F. Zhou, and E. Ding, "Multi-attention multi-class constraint for fine-grained image recognition", Lecture Notes in Computer Science, p. 834–850, 2018
0.90
[25] P. Zhuang, Y. Wang, and Y. Qiao, “Learning attentive pairwise interaction for fine-grained classification,” Proceedings of the AAAI Conference on Artificial Intelligence, vol.
25] p. zhuang, y. wang, y. qiao, “learning attentive pairwise interaction for fine-grained classification”, aaai conference on artificial intelligence, vol. の議事録。
0.79
34, no. 07, p. 13130–13137, Apr 2020.
34、いいえ。 07,p.13130-13137,Apr 2020。
0.78
[26] W. Luo, X. Yang, X. Mo, Y. Lu, L. S. Davis, and S.-N. Lim, “Cross-x learning for fine-grained visual categorization,” in ICCV, 2019.
W. Luo, X. Yang, X. Mo, Y. Lu, L. S. Davis, S.-N. Lim, “Cross-x learning for fine-grained visual categorization” in ICCV, 2019。
0.88
[27] Y. Gao, X. Han, X. Wang, W. Huang, and M. Scott, “Channel interaction networks for fine-grained image categorization.” in AAAI, 2020, pp.
27] Y. Gao, X. Han, X. Wang, W. Huang, M. Scott, “Channel Interaction Network for fine-grained image categorization.” (AAAI, 2020, pp.)。
0.87
10 818–10 825.
10 818–10 825.
0.84
[28] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp.
[28] x. wang, r. girshick, a. gupta, k. he, “non-local neural networks” in the proceedings of the ieee conference on computer vision and pattern recognition, 2018, pp。
0.84
7794–7803.
7794–7803.
0.71
and [29] X. Chen, C. Fu, Y. Zhao, F. Zheng, Y. Yang, “Salience-guided cascaded suppression network for person reidentification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp.
X. Chen, C. Fu, Y. Zhao, F. Zheng, Y. Yang, “Salience-guided cascadedpression network for person reidentification” in Proceeds of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp.
0.86
3300–3310.
3300–3310.
0.71
[30] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline),” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp.
[30] Y。 Sun, L. Zheng, Y. Yang, Q. Tian, S. Wang, “Beyond part models: person search with refined part pooling (and a strong convolutional baseline)” in Proceedings of the European Conference on Computer Vision (ECCV) 2018, pp。
0.78
480–496. [31] A. F. Agarap, “Deep learning using rectified linear units (relu),” arXiv preprint arXiv:1803.08375, 2018.
480–496. A.F. Agarap, “Deep Learning using rectified linear units (relu)” arXiv preprint arXiv:1803.08375, 2018。
0.77
[32] H. Zheng, J. Fu, T. Mei, and J. Luo, “Learning multi-attention convolutional neural network for fine-grained image recognition,” in Proceedings of the IEEE international conference on computer vision, 2017, pp.
32] h. zheng, j. fu, t. mei, j. luo, “learning multi-attention convolutional neural network for fine-grained image recognition” in the proceedings of the ieee international conference on computer vision, 2017 pp. (英語)
0.86
5209–5217.
5209–5217.
0.71
[33] K. He, X. Zhang, S. Ren, and J.
[33]K.He,X.Zhang,S. Ren,J.
0.80
Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.
Sun, “Deep Residial Learning for Image Recognition” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016年6月
0.77
[34] Y. Chen, Y. Bai, W. Zhang, and T. Mei, “Destruction and construction learning for fine-grained image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp.
34] Y. Chen, Y. Bai, W. Zhang, T. Mei, "Destruction and Construction Learning for fine-grained Image Recognition" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp。
0.87
5157–5166.
5157–5166.
0.71
[35] H. Zheng, J. Fu, Z.-J.
35] H. Zheng、J. Fu、Z.-J。
0.85
Zha, and J. Luo, “Learning deep bilinear transformation for fine-grained image representation,” in Advances in Neural Information Processing Systems, 2019, pp.
Zha, and J. Luo, “Learning Deep Bilinear transformation for fine-fine image representation” in Advances in Neural Information Processing Systems, 2019, pp。
0.87
4277–4286.
4277–4286.
0.71
[36] M. Zhou, Y. Bai, W. Zhang, T. Zhao, and T. Mei, “Look-into-object: Self-supervised structure modeling for object recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp.
[36] M. Zhou, Y. Bai, W. Zhang, T. Zhao, T. Mei, "Look-in-to-object: Self-supervised Structure Model for Object Recognition" は、IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020, pp.) の議題である。
0.78
11 774–11 783.
11 774–11 783.
0.84
[37] J. Deng, W. Dong, R. Socher, L.-J.
[37] J. Deng, W. Dong, R. Socher, L.-J.
0.88
Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database” in CVPR09, 2009
0.95
[38] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016.
38] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient down with warm restarts” arXiv preprint arXiv:1608.03983, 2016
0.92
[39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
39] K. Simonyan and A. Zisserman, “Very Deep Convolutional Network for Large-Scale Image Recognition” arXiv preprint arXiv:1409.1556, 2014
0.94
[40] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, “Sampling matters in deep embedding learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp.
40] C.-Y。 Wu, R. Manmatha, A. J. Smola, P. Krahenbuhl, “Sampling matters in deep embedded learning” in Proceedings of the IEEE International Conference on Computer Vision, 2017 pp。