Abstract

Recently, crowd density estimation has received increasing attention. The
main challenge for this task is to achieve high-quality manual annotations on a
large amount of training data. To avoid reliance on such annotations, previous
works apply unsupervised domain adaptation (UDA) techniques by transferring
knowledge learned from easily accessible synthetic data to real-world datasets.
However, current state-of-the-art methods either rely on external data for
training an auxiliary task or apply an expensive coarse-to-fine estimation. In
this work, we aim to develop a new adversarial learning based method, which is
simple and efficient to apply. To reduce the domain gap between the synthetic
and real data, we design a bi-level alignment framework (BLA) consisting of (1)
task-driven data alignment and (2) fine-grained feature alignment. In contrast
to previous domain augmentation methods, we introduce AutoML to search for an
optimal transform on source, which serves the downstream task well. On the
other hand, we do fine-grained alignment for foreground and background
separately to alleviate the alignment difficulty. We evaluate our approach on
five real-world crowd counting benchmarks, where we outperform existing
approaches by a large margin. Also, our approach is simple, easy to implement
and efficient to apply. The code is publicly available at
https://github.com/Yankeegsj/BLA.
(a) Style Transfer (b) Domain Randomization and Task-Driven Data Alignment

Figure 1. Comparison of three different ways for source domain augmentation.
(a). Style transfer translates images from source to a target-like domain based on target style priors, but the translation is usually limited to color changes, and blind to the task objective.
(b). Domain randomization augments the source domain randomly in a more diverse manner (colors, scales, etc.) but without any priors from target. Our proposed task-driven data alignment is more similar to domain randomization, but instead of random selection, we pick the most suitable augmentation based on the task objective, which enables a model that is more dynamic and robust to the target domain.
Thus, it is necessary to investigate how to adapt the models trained on the synthetic domain to the real domain, without requiring annotations on the latter, i.e. via unsupervised domain adaptation (UDA).
There are a few UDA methods proposed for crowd counting.
For instance, SE Cycle-GAN [26] translates synthetic images to the real domain with improved CycleGAN and then trains purely on the translated images; Gaussian Process-based iterative learning (GP) [24] generates pseudo-labels on the target images via a Gaussian process to allow for supervised training on the target domain.
More recently, better performance has been achieved by employing an adversarial framework to align features from both source and target domains [6, 8].
However, FSC [8] introduces an auxiliary task of semantic segmentation, relying on external labeled human body segmentation datasets for pre-training; FADA [6] performs a coarse-to-fine estimation, making the inference less efficient.
In this paper, we aim to develop a new adversarial learning based method, which is more effective and flexible.
We investigate the key components to boost performance.
Previous methods employed either domain randomization or style transfer for source domain augmentation, using no priors or the target style priors only.
In contrast, our task-driven data alignment is able to control the domain augmentation based on both the target style priors and the task objective, such that it is optimized for our crowd counting task on the given target domain.
We show a comparison of three different source domain augmentation methods in Fig. 1.
On the other hand, since the foreground and background regions differ significantly in semantics, we propose a fine-grained feature alignment to handle them separately.
To summarize, the contributions of our work are as follows: (1) For more effective and efficient synthetic to real adaptive crowd counting, we propose a novel adversarial learning based method, consisting of bi-level alignments: task-driven data alignment and fine-grained feature alignment.
(2) To the best of our knowledge, it is the first UDA approach to search for the optimal source data transform based on the downstream task performance on the target domain.
(3) Experimental results on various real datasets show that our method achieves state-of-the-art results for synthetic-to-real domain adaptation; also, our method is simple and efficient to apply.
2. Related Works

Since we solve the problem of domain adaptive crowd counting, we first review recent works; our major contribution is a novel domain augmentation method via AutoML, so we also discuss related works in the above two areas.
2.1. Domain Adaptive Crowd Counting
There are two groups of domain adaptation works in crowd counting: real-to-real and synthetic-to-real.
Real-to-real adaptation aims to generalize models across real scenarios [9, 16], but since one real-world dataset is taken as the source domain, manual annotations are still needed.
In this work, we focus on synthetic-to-real adaptation.
One direct way is to translate the labeled synthetic images to the style of the real images and then train on the translated images [4, 26], but it is limited by the performance of the translation method.
More recently, the adversarial framework has been leveraged to achieve better performance via feature alignment between source and target domains [6, 8].
However, previous works are not efficient, requiring external training data or additional inference time.
For instance, FSC [8] introduces an auxiliary task of semantic segmentation, relying on external labeled human body segmentation datasets for pre-training; FADA [6] performs a coarse-to-fine estimation, making the inference less efficient.
In this work, we leverage adversarial training but aim to develop a simple yet effective method.
2.2. Domain Augmentation
Previously, there are two ways to augment the existing source domain: one is domain randomization, randomly changing the style of source images; the other is style transfer, translating the source images to the target style.
On the other hand, AdaIN [13] has demonstrated that the mean and variance of convolutional feature maps can be used to represent the image style, making the domain randomization easier [21].
Alternatively, low-level statistics of source images can be replaced with the target ones, e.g. the amplitude of the Fourier transform [30].
In this work, we propose a novel domain augmentation method named task-driven data alignment, which is superior to domain randomization and style transfer.
The major difference is that the augmentation is controlled by both the target style priors and by verifying the counting performance on a target-like domain.
Auto machine learning (AutoML) aims to free human practitioners and researchers from selecting the optimal values for each hyperparameter, such as learning rate, weight decay, and dropout [2], or designing well-performing network architectures [1].
Pioneers in this field develop optimization methods to guide the search process based on reinforcement learning (RL) [7], evolutionary algorithm (EA) [29] and Bayesian optimization [18].
These works are often impractical because of the required computational overhead.
In contrast, a differentiable controller [20] converts the selection into a continuous hidden space optimization problem, allowing for an efficient search process performed by a gradient-based optimizer.
In this section, we will introduce our bi-level alignment method for cross-domain crowd counting.
Our core idea is to perform alignment between the source and target domains at both data-level and feature-level via two components namely task-driven data alignment and fine-grained feature alignment.
The overall pipeline is depicted in Fig 2 and detailed descriptions are provided in the following.
3.1. Problem Formulation
For UDA crowd counting, we have an annotated synthetic dataset $S = \{(x_S^i, y_S^i)\}_{i=1}^{N_S}$ as source and an unlabeled real-world dataset $T = \{x_T^i\}_{i=1}^{N_T}$ as target, where $x_S^i, x_T^i \in \mathbb{R}^{3 \times H \times W}$ denote arbitrary images from the source and target domains, and $y_S^i \in \mathbb{R}^{H \times W}$ represents a ground truth density map in the source domain.
Our goal is to obtain a model that performs well on the target domain via reducing the large domain gap between the source and target.
At training time, the source dataset S is first transformed to $S^+$ with the same labels; a pair of images $(x_{S^+}, x_T)$ from the augmented source and target domains is fed into F, obtaining corresponding feature maps $(F_S, F_T)$ with $F_S, F_T \in \mathbb{R}^{C \times h \times w}$; D performs feature alignment by passing reversed gradients to F; in the end, E predicts the density map $\tilde{y}_S$ based on $F_S$, supervised by $y_S$.
At test time, the inference is rather simple: each target image $x_T^i$ is fed through F and E to obtain the predicted density map $\tilde{y}_T$.
Following previous works, we employ VGG16 [22] as our feature extractor F. For E, we stack a series of convolution and deconvolution layers, inspired by [4].
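To make the architecture concrete, the following PyTorch sketch shows one plausible instantiation of F and E; the slicing point of VGG16, the channel widths and the decoder depth are assumptions made for illustration, since the exact configuration is given only in the supplementary material.

```python
# Minimal sketch (not the exact architecture from the supplementary): a VGG16-based
# feature extractor F and a conv/deconv density estimator E.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # Assumption: keep the first 23 VGG16 layers, yielding a 1/8-resolution map.
        self.backbone = nn.Sequential(*list(vgg16(pretrained=True).features[:23]))

    def forward(self, x):          # x: (B, 3, H, W)
        return self.backbone(x)    # F: (B, 512, H/8, W/8)

class DensityEstimator(nn.Module):
    def __init__(self, in_ch=512):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),   # single-channel density map
        )

    def forward(self, feat):
        return self.decoder(feat)  # (B, 1, H, W)
```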
In contrast to previous domain augmentation methods that are blind to the downstream task, our method searches for the most suitable augmentation based on both the target styles and the task performance on target via AutoML.
It is notable that the transform is not limited to the above three units and can be easily extended to different and more types, given proper manual definitions.
Table 1. Each transform consists of three different units, each represented by two parameters: one for split ratio and another for attribute.
* marks those parameters we search while others are fixed.
A full transform set is generated by iterating each parameter.
Figure 2. Overview of our proposed bi-level alignment framework (BLA), which mainly consists of four components: feature extractor (F), density estimator (E), task-driven data alignment and local fine-grained discriminator (D). At training time, the source dataset S is transformed to $S^+$ with the optimal transform searched via task-driven data alignment (Alg. 1), during which the validation feature generator provides target-like features for candidate transform validation. At test time, each target image is fed to F and E to obtain the predicted density map $\tilde{y}_T$.

Given a transform, we split the whole source set into several subsets via a transform tree, as shown in Fig. 3.
At the 1st level, the whole source dataset is split into two subsets with a ratio of pG, i.e. some images are converted to gray scale images (along path Y), while others are kept the same (along path N).
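As a rough illustration of this first split, the sketch below converts a fraction $p_G$ of the source images to gray scale (path Y) and leaves the rest unchanged (path N). The function name and the use of PIL are our own choices; deeper levels of the tree would split each subset further with the remaining units of Table 1.

```python
# Illustrative sketch of the first level of the transform tree (assumption: images
# are given as file paths and loaded with PIL).
import random
from PIL import Image, ImageOps

def apply_first_level(image_paths, p_gray, seed=0):
    rng = random.Random(seed)
    rng.shuffle(image_paths)
    n_gray = int(len(image_paths) * p_gray)
    augmented = []
    for i, path in enumerate(image_paths):
        img = Image.open(path).convert("RGB")
        if i < n_gray:                                    # path Y: RGB -> gray
            img = ImageOps.grayscale(img).convert("RGB")  # keep 3 channels
        augmented.append(img)                             # path N: unchanged
    return augmented
```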
The search process is iterated via multiple rounds, each of which is described in Alg. 1. At each round, we first transform the source data given some transform candidates, and then obtain the reward of each transform via validation on a generated target-like set; after that, we learn the mapping function from transforms to corresponding rewards via training a differentiable controller; finally, we update the transform candidates based on the controller and go to the next search round.
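A high-level sketch of one such round is given below, following the description above; the injected callables (apply_transform, train_counter, validate) and the controller interface (fit, propose) are placeholders of our own, not the actual implementation.

```python
# Sketch of one search round of task-driven data alignment (our reading of Alg. 1).
def one_search_round(candidates, source_set, target_set,
                     apply_transform, train_counter, validate, controller):
    rewards = []
    for d_k in candidates:
        s_plus_k = apply_transform(source_set, d_k)              # data-level alignment
        model = train_counter(s_plus_k, target_set)              # optimize Eq. 5
        rewards.append(validate(model, source_set, target_set))  # reward p_k on target-like set
    controller.fit(candidates, rewards)          # learn transform -> reward mapping (Eq. 3)
    return controller.propose(candidates), rewards  # updated candidate set for next round
```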
Candidate Transform Validation. Based on each new source dataset $S^+_k$, we train the whole network as shown in Fig. 2 with the learning objective from Eq. 5, and then validate the resulting model; ideally, this should be done by measuring the counting performance on T.
Unfortunately, we do not have labels for T .
To address this problem, we propose a validation feature generator, which takes the features of a pair of source and target image features (FS, FT ) from the feature extractor as input and generate a new feature FV via AdaIN [13], which is a mixture of source contents and target style, namely a target-like image feature.
Specifically, we first compute the source style representation with channel-wise mean and standard deviation µ(FS), σ(FS) ∈ RC and the target style representation µ(FT ), σ(FT ) ∈ RC.
Then we replace the style of FS with that of FT and obtain FV :
$$F^c_V = \mu(F^c_T) + \sigma(F^c_T) \cdot \frac{F^c_S - \mu(F^c_S)}{\sigma(F^c_S)}, \qquad (2)$$

where $c \in \{1, 2, \ldots, C\}$ is the channel index.
After that, we feed $F_V$ to the density estimator and get $\tilde{y}_V$.
Because $F_S$ and $F_V$ share the same contents, we evaluate $\tilde{y}_V$ based on $y_S$. In this way, we obtain the evaluated validation performance $p_k$ as the reward for transform $d_k$.
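Eq. 2 is a channel-wise AdaIN-style re-normalization, which could be implemented as in the following sketch (the tensor layout and the eps constant are assumptions):

```python
# Sketch of the validation feature generator (Eq. 2): re-stylize source features
# F_S with the channel-wise statistics of target features F_T.
import torch

def validation_features(f_s: torch.Tensor, f_t: torch.Tensor, eps: float = 1e-5):
    # f_s, f_t: (C, h, w) feature maps from the feature extractor
    mu_s = f_s.mean(dim=(1, 2), keepdim=True)
    std_s = f_s.std(dim=(1, 2), keepdim=True) + eps
    mu_t = f_t.mean(dim=(1, 2), keepdim=True)
    std_t = f_t.std(dim=(1, 2), keepdim=True) + eps
    # F_V^c = mu(F_T^c) + sigma(F_T^c) * (F_S^c - mu(F_S^c)) / sigma(F_S^c)
    return mu_t + std_t * (f_s - mu_s) / std_s
```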
Candidate Transform Update.
After obtaining the reward for each transform in D, we then train a differentiable controller and let it learn the mapping function from a transform to its corresponding reward.
The controller is of encoder-decoder structure.
The encoder takes a transform as input, maps it to a hidden state, and predicts its performance as $\tilde{p}_k$.
The decoder reconstructs the transform $d_k$ as $\tilde{d}_k$ from the hidden state.
The loss function of our controller is defined as:

$$L_C = \left\| d_k - \tilde{d}_k \right\|^2 + \left\| p_k - \tilde{p}_k \right\|^2. \qquad (3)$$
As in NAO [20], we then update the hidden state towards the gradient direction of improved performance and obtain a new transform set $D'$ for better alignment.
After several rounds, we choose the optimal transform from all validated transforms based on their rewards.
Please refer to [20] for more details regarding the update procedure.
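A minimal sketch of such an encoder-decoder controller and the loss of Eq. 3 is shown below; the hidden size and layer choices are assumptions, and NAO's hidden-state update is only indicated in a comment.

```python
# Sketch of a differentiable controller in the spirit of NAO [20]; not the exact
# architecture listed in the supplementary.
import torch
import torch.nn as nn

class TransformController(nn.Module):
    def __init__(self, dim_transform, dim_hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_transform, dim_hidden), nn.ReLU())
        self.predictor = nn.Linear(dim_hidden, 1)            # reward head, predicts p~_k
        self.decoder = nn.Linear(dim_hidden, dim_transform)  # reconstruction head, d~_k

    def forward(self, d):              # d: (B, dim_transform) encoded transforms
        h = self.encoder(d)
        return self.predictor(h).squeeze(-1), self.decoder(h), h

def controller_loss(d, p, model):
    p_hat, d_hat, _ = model(d)
    # Eq. 3: reconstruction term + reward-prediction term
    return ((d - d_hat) ** 2).sum(dim=1).mean() + ((p - p_hat) ** 2).mean()

# New candidates are obtained by taking a gradient-ascent step on the hidden state h
# towards higher predicted reward and decoding it back to a transform (as in NAO).
```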
3.4. Fine-Grained Feature Alignment
To perform feature alignment, we employ adversarial learning via a discriminator and a gradient reverse layer.
Inspired by the success of using segmentation as an auxiliary task for crowd counting [23], we propose a fine-grained discriminator, with two separated classification heads for foreground and background regions.
Given the grid size $G = (g_h, g_w)$, we feed a pair of feature maps $(F_S, F_T)$ to D and obtain two pairs of patch-level discrimination maps: $(O_{FS}, O_{BS})$ for source and $(O_{FT}, O_{BT})$ for target, separating foreground and background regions.
Algorithm 1. Pseudo code of the one-round search procedure of data-level alignment (input: source and target domain training sets S, T; ...).
We use the same back-propagation optimizing scheme with the gradient reverse layer [3] for adversarial learning.
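The gradient reverse layer and a patch-wise discriminator with separate foreground/background heads could be sketched as follows; the pooling-based patch extraction and the 1x1 heads are our simplification, not the discriminator architecture from the supplementary.

```python
# Sketch of a gradient reverse layer [3] and a fine-grained (patch-wise) discriminator.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # reverse gradients flowing back to F

class FineGrainedDiscriminator(nn.Module):
    def __init__(self, in_ch=512, grid=(16, 16)):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(grid)     # patch-level features on the G grid
        self.fg_head = nn.Conv2d(in_ch, 1, 1)      # foreground domain logits
        self.bg_head = nn.Conv2d(in_ch, 1, 1)      # background domain logits

    def forward(self, feat, lambd=1.0):
        feat = GradReverse.apply(feat, lambd)
        patches = self.pool(feat)
        return self.fg_head(patches), self.bg_head(patches)   # (O_F, O_B)
```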
3.5. Optimization

The optimization objective of the whole method is:
$$L = L_E + \lambda L_D, \qquad (5)$$

where $\lambda$ is a weight factor to balance the task loss $L_E$ and the domain adaptation loss $L_D$.
The whole network is optimized via two steps.
For data alignment, we first optimize all the parameters in the network, including the feature alignment component, for each transform to obtain the corresponding reward during the search process.
4. Experiments

We first introduce the datasets, evaluation metrics and implementation details; then we provide comparisons with state-of-the-art methods, followed by analysis on data alignment; finally, we perform some ablation studies.
To evaluate the proposed method, the experiments are conducted under adaptation scenarios from GCC [26] to five large-scale real-world datasets, i.e. ShanghaiTech Part A/B (SHA/SHB) [33], QNRF [15], UCF-CC-50 [14] and WorldExpo’10 [32].
Following previous works, we adopt Mean Absolute Error (MAE) and Mean Squared Error (MSE) as evaluation metrics. They are formulated as

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\Big|\sum y_i - \sum \tilde{y}_i\Big|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\Big|\sum y_i - \sum \tilde{y}_i\Big|^2},$$

where $N$ is the number of test images, and $\sum y_i$, $\sum \tilde{y}_i$ represent the ground truth and predicted counts on the $i$-th image respectively.
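Both metrics are computed directly from the summed density maps; a minimal sketch:

```python
# Sketch of the MAE / MSE counting metrics; counts are the sums of density maps.
import numpy as np

def counting_metrics(gt_maps, pred_maps):
    gt_counts = np.array([m.sum() for m in gt_maps])
    pred_counts = np.array([m.sum() for m in pred_maps])
    err = np.abs(gt_counts - pred_counts)
    mae = err.mean()
    mse = np.sqrt((err ** 2).mean())
    return mae, mse
```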
4.2. Implementation Details

The architectures of the feature extractor (F), density estimator (E), fine-grained discriminator (D) and controller (C) are listed in the supplementary material.
We input 4 pairs of source and target images with a uniform size of 576 × 768 at each iteration.
Following the previous work [4], we generate the ground truth density map using Gaussian kernel with a kernel size of 15× 15 and a fixed standard deviation of 4.
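A minimal sketch of this density map generation, assuming head annotations are given as (x, y) coordinates; the truncation of SciPy's Gaussian filter is set so that the kernel support approximates a 15x15 window.

```python
# Sketch of ground-truth density map generation with a fixed Gaussian kernel
# (kernel size 15x15, standard deviation 4), following [4].
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(head_points, height, width, sigma=4.0, ksize=15):
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in head_points:                      # annotated head coordinates
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            density[yi, xi] += 1.0
    # truncate the Gaussian so its support is roughly ksize x ksize
    return gaussian_filter(density, sigma=sigma, truncate=(ksize // 2) / sigma)
```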
We compare our method BLA with previously published unsupervised domain adaptive crowd counting methods under the adaptation scenarios from the synthetic GCC dataset to five different real-world datasets.
From the results in Tab. 3, we have the following observations: (1) Our proposed method outperforms all existing domain adaptation methods by a large margin across different datasets and on WorldExpo’10 we achieve comparable results with DACC.
In particular, on SHA our proposed method achieves 99.3 MAE and 145.0 MSE, outperforming previous best results by 13.1 pp w.r.t. MAE and 31.9 pp w.r.t. MSE.
For instance, GCC contains highly-saturated color images while UCF-CC-50 and WorldExpo’10 contain lots of images with low saturations, so the RGB2Gray ratios on them are rather high (0.85 and 0.98) as gray scale images are of 0 saturation; in contrast, SHB is closer to GCC in terms of saturation, so the RGB2Gray ratio on SHB is rather low (0.16).
Similarly, since UCF-CC-50 is denser than other datasets, its scaling factor is particularly smaller, such that denser regions with small-scale heads will be generated.
It is necessary to search for the most suitable transform on each dataset in an automatic way so as to avoid tedious manual designs.
Additionally, Fig 4 shows some qualitative results on the SHA dataset.
From Column 3, we can see that without adaptation, the model either fails to detect the presence of people in some areas (top row), or fails to get a correct estimate of the local density (middle and bottom rows).
As shown in Tab. 6, the performance is significantly improved by ∼25 pp w.r.t. MAE (from 134.7 to 109.1) when task-driven data alignment is employed.
On the other hand, we also observe a large improvement of ∼14 pp w.r.t. MAE (from 134.7 to 121.1) by replacing global feature alignment with fine-grained feature alignment. Moreover, we obtain a total gain of ∼35 pp w.r.t. MAE by adding both alignments.
These results indicate the effects of two levels of alignment and the complementarity between them.
Task-Driven Data Alignment   Fine-Grained Feature Alignment   MAE     MSE
×                            ×                                134.7   210.9
✓                            ×                                109.1   153.8
×                            ✓                                121.1   200.8
✓                            ✓                                 99.3   145.0

Table 6. Effects of two levels of alignment.
Effect of Task-Driven Data Alignment. To evaluate the effectiveness of task-driven data alignment, we replace it with domain randomization and style transfer.
In Fig 5, we compare the counting performance on our generated validation set and the real target training set w.r.t. MAE, where the index indicates different combinations of transform parameters.
We can see that the two curves go in a similar trend, i.e. the worst performance happens at index 0, the best performance happens at index 7, and there is fluctuation in between.
This comparison demonstrates that our generated validation set is of high similarity to the real target set, allowing us to do effective validation without relying on target annotations.
On the other hand, we observe the performance varies a lot along the choice of transform parameters, showing that different transforms highly affect the performance.
Thus it is of great importance to search for an optimal transform for a given target set.
Impact of Grid Size in Fine-Grained Feature Alignment.
Our fine-grained feature alignment strategy is conducted in a patch-wise style.
Figure 5. Comparison of validation performance on the generated set (left) and real target set (right) across different transforms w.r.t. MAE. The similar trend verifies the effect of our validation feature generator.

We analyze how the grid size G affects the performance.
As shown in Tab. 8, if the size G is too small or too large, there will be a data imbalance between the numbers of background and foreground patches, which results in poor feature alignment.
Since the patch size of 16 performs the best, we use 16 as the default size in all experiments.
Grid size G    MAE     MSE
(2,2)          138.1   222.9
(4,4)          130.5   205.4
(8,8)          129.0   206.3
(16,16)        121.1   200.8
(32,32)        141.4   223.6

Table 8. Impact of local grid size G used in fine-grained feature alignment.
In the supplementary material, we provide more ablation studies on the impact of segmentation threshold, effect of additional style transfer from S+ to T , effect of using more transformations, and a comparison to grid search.
5. Conclusion

In this work, we propose a bi-level alignment framework for synthetic-to-real UDA crowd counting.
On one hand, we propose task-driven data alignment to search for a specific transform given the target set, which is applied on the source data to narrow down the domain gap at the data level.
On the other hand, to alleviate the alignment difficulty on the entire image, we propose to perform fine-grained feature alignment on foreground and background patches separately.
Extensive experiments on five real-world crowd counting benchmarks have demonstrated the effectiveness of our contributions.
Acknowledgements This work was supported in part by the “111” Program B13022, Fundamental Research Funds for the Central Universities (No. 30920032201) and the National Natural Science Foundation of China (Grant No. 62172225).
[22] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv, 2014.
[23] Vishwanath A Sindagi and Vishal M Patel. Inverse attention guided deep crowd counting network. pages 1–8, 2019.
[24] Vishwanath A Sindagi, Rajeev Yasarla, Deepak Sam Babu, R Venkatesh Babu, and Vishal M Patel. Learning to count in the crowd from limited labeled data. In ECCV, pages 212–229, 2020.
[25] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel.