We present TIPSy-GAN, a new approach to improving the accuracy and stability of
unsupervised adversarial 2D-to-3D human pose estimation. In our work we
demonstrate that the human kinematic skeleton should not be assumed to be a
single spatially codependent structure. In fact, we believe that when a full 2D
pose is provided during training, an inherent bias is learned whereby the 3D
coordinate of a keypoint is spatially codependent on the 2D locations of all
other keypoints. To investigate our theory we follow previous adversarial
approaches but train two generators on spatially independent parts of the
kinematic skeleton, the torso and the legs. We find that improving the 2D
reprojection self-consistency cycle is key to lowering the evaluation error and
therefore introduce new consistency constraints during training. A TIPSy model
is produced via knowledge distillation from these generators, which can
predict the 3D coordinates for the entire 2D pose with improved results.
Furthermore, we address the question left unanswered in prior work of how long
to train for in a truly unsupervised scenario. We show that two independent
generators training adversarially have improved stability compared to a solo
generator, which will collapse due to the adversarial network becoming
unstable. TIPSy decreases the average error by 18% compared to a baseline solo
generator. TIPSy improves upon other unsupervised approaches while also
performing strongly against supervised and weakly-supervised approaches during
evaluation on both the Human3.6M and MPI-INF-3DHP datasets.
Code and weights of our model will be made available.
1 Introduction

The ability to generate accurate 3D human skeletons from images and video has extensive applications in security, human-robot interaction, interactive media and healthcare [41] [8] [22].
Estimating a 3D pose from a single monocular image, however, is an ill-posed inverse problem, as multiple different 3D poses can correspond to the same 2D projection.
Unsupervised adversarial approaches [5] [16] [47] have sought to remedy this by exploiting the abundance of readily available 2D image and video data of humans.
Through the use of self or temporal consistency and a 2D pose discriminator, they help to reduce the barrier of entry for 3D HPE while improving generalisability to in-the-wild scenarios.
Our research aims to reduce this discrepancy and address what we believe is a flaw in all adversarial 2D-to-3D HPE models: the assumption that the human kinematic skeleton should be treated as one spatially codependent structure.
We believe that minimising the predictive error over the entire 2D skeleton induces correlations between a keypoint's 3D coordinate and the 2D coordinates of all other keypoints in the skeleton.
Thus, for example, the 3D prediction for the left wrist would contain some component correlating to the 2D coordinate of the right knee.
We instead train multiple generators on spatially independent parts of the 2D kinematic skeleton.
The knowledge acquired is then distilled to an end-to-end model which can predict the 3D coordinates for an entire 2D pose, giving the framework its name "Teaching Independent Parts Separately" (TIPSy).
In this paper we build upon [5] and [48], as well as showing the stability of our model during training, highlighting that in a truly unsupervised scenario using spatially independent generators would allow a more optimal model to be created, even when no 3D data is accessible.
Additionally, we introduce three new self-consistency constraints to the adversarial learning cycle, which we found help improve evaluation metrics.
2 Related Work

2.1 3D Human Pose Estimation
There currently exist two main avenues of deep learning for 3D HPE.
The first avenue learns the mapping to 3D joints directly from a 2D image [29] [19] [24] [18] [35] [36] [37]; the second lifts previously detected 2D keypoints into 3D space.
2.2 Fully Supervised

Fully supervised approaches seek to learn mappings from paired 2D-3D data which contain ground truth 2D locations of keypoints and their corresponding 3D coordinates.
Jiang et al [14] introduced an exemplar approach that split their 3D dictionary into torso and legs aiming to speed up the nearest-neighbour search process, whereas we split up our poses during training to reduce bias and learn a better 3D mapping.
Pavllo et al [31] used temporal convolutions over 2D keypoints in order to predict the pose of the central or end frame in a time series, whereas Mehta et al [25] utilised multi-task learning to combine a convolutional pose regressor with kinematic skeleton fitting for real time 3D pose estimation.
Luo et al [21] introduced a fully convolutional approach which modelled 3D joint orientations with 2D keypoint detections.
Park et al [28] and Zeng et al [48] introduced the concept of splitting a pose into localised groups during learning, where they assume that an unseen pose may be composed of local joint configurations that appear in different poses within a training set.
Unlike our approach, however, they still treat an entire 2D pose as one codependent structure via feature sharing or averaging between localised groups throughout their network.
We argue that no feature sharing is required and these localised groups can be assumed to be completely independent from one another.
Additionally, we distill the knowledge from our sub-networks into an end-to-end network, which is more computationally efficient, and our approach generalises better to unseen poses.
2.3 Weakly Supervised

Weakly-supervised approaches do not use explicit 2D-3D correspondences but use either augmented 3D data during training or unpaired 2D-3D data to learn human body priors (shape or articulation).
Pavlakos et al [30] and Ronchi et al [33] proposed the learning of 3D poses from 2D with ordinal depth relationships between keypoints (e.g. the right wrist is behind the right elbow).
Wandt et al [39] introduced a weakly-supervised adversarial approach which transformed their predicted and ground truth 3D poses into a kinematic chain space prior to being seen by a Wasserstein critic [10].
Yang et al [46] lifted wild 2D poses where no ground truth data is available with a critic network that compared these against existing 3D skeletons.
Zhou et al [49] utilised transfer learning, using mixed 2D and 3D labels in a unified network.
Drover et al [6] investigated if 3D poses can be learned through 2D self-consistency alone, where they found a 2D pose critic network was also needed.
2.4 Unsupervised

Unsupervised approaches do not utilise any 3D data during training, unpaired or otherwise.
Kudo et al. [16] introduced one of the first unsupervised adversarial networks utilising random re-projections and a 2D critic network, under the assumption that any predicted 3D pose once rotated and reprojected should still produce a believable 2D pose.
Yu et al [47] built upon Chen et al [5], highlighting that temporal constraints may hinder a model's performance due to balancing multiple training objectives simultaneously, and proposed splitting the problem into a lifting module and a scale estimation module.
They also found that adding temporal motion consistency can boost the performance of their model by 6%.
Similar to [47] we highlight that another issue may lie within the lifting network which could also benefit from being split into two sub-networks that predict upper and lower body keypoints.
2.5 Knowledge Distillation

Knowledge distillation is a model compression technique where knowledge is transferred from one or multiple large models (teacher) to a smaller model (student) [11].
Tripathi et al [38] investigated if knowledge could be distilled across 3D representations, where a teacher network would learn 3D kinematic skeletons from 2D poses, then distill this knowledge to a student network that would predict skinned multi-person linear model (SMPL) [20] representations of 3D poses.
Lastly, Xu et al. [44] proposed an unsupervised approach where a self-consistent teacher with 2D pose-dictionary based modelling would distill knowledge to a student utilising graphical convolutions to improve estimation accuracy and flexibility.
3 Method

In this section we describe both our adversarial approach to train our 2D-to-3D generators, as well as our knowledge distillation approach for our final TIPSy model.
Therefore, we used max-normalisation on each of our 2D poses to scale their 2D coordinates between -1 and 1.
This also constrains the range of possible 3D coordinates for these keypoints between -1 and 1, allowing the final function of our generators to be a bounded activation function which helps improve adversarial learning [32].
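A minimal sketch of this normalisation step, assuming each pose is stored as an (N, 2) array of root-relative keypoints (the array layout and function name are illustrative rather than taken from our implementation):

```python
import numpy as np

def max_normalise_pose(pose_2d):
    """Scale an (N, 2) array of 2D keypoints so every coordinate lies in [-1, 1].

    The scale factor is returned as well, so predictions can later be scaled
    back up by their original normalising factor before evaluation.
    """
    scale = np.abs(pose_2d).max()   # largest absolute coordinate in the pose
    return pose_2d / scale, scale
```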
Though feature selection [12] [43] can be used to find an optimal amount of spatially independent segments to split a 2D pose into, for simplicity we split our pose up into two during training, the torso and legs.
The leg generator similarly accepted a vector of 2D keypoints consisting of the ankles, knees and hips, with the root keypoint omitted during training as this was a constant.
Once both of our generators had made their predictions they were concatenated and combined with the original 2D keypoints to create our final predicted 3D pose (x, y, ˆz).
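A sketch of how this split and recombination could be implemented; the keypoint index sets below are hypothetical placeholders for a 16-keypoint, root-relative skeleton, and the two generators are assumed to output one depth value per input keypoint:

```python
import torch

# Hypothetical index sets for a 16-keypoint skeleton (root excluded).
LEG_IDX = [0, 1, 2, 3, 4, 5]                       # hips, knees, ankles
TORSO_IDX = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]   # spine, head, shoulders, elbows, wrists

def predict_full_pose(pose_2d, torso_gen, leg_gen):
    """pose_2d: (B, 16, 2) normalised keypoints. Returns (B, 16, 3) poses (x, y, z_hat)."""
    z_legs = leg_gen(pose_2d[:, LEG_IDX].flatten(1))          # (B, 6) depths for the legs
    z_torso = torso_gen(pose_2d[:, TORSO_IDX].flatten(1))     # (B, 10) depths for the torso
    z_hat = torch.cat([z_legs, z_torso], dim=1).unsqueeze(-1) # (B, 16, 1), matching index order
    return torch.cat([pose_2d, z_hat], dim=-1)                # concatenate depths onto the 2D input
```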
Our final TIPSy generator by contrast accepts all N keypoints as input and would predict the 3D locations for the full human pose.
Similar to prior work [5] [6] [47], we utilise a self-consistency cycle through random 3D rotations to reproject our predicted 3D poses to new synthetic 2D viewpoints.
Let Y ∈ RN×2 be a matrix containing the 2D keypoints from which our generators G will predict.
Once a prediction G(Y) is made and a 3D pose obtained, a random rotation matrix R will be created by uniformly sampling an azimuth angle between [−π, π] and an elevation angle between [−π/18, π/18].
The predicted 3D pose will be rotated by this matrix and reprojected back via projection P into a new synthetic viewpoint, obtaining the new 2D matrix ˜Y where ˜Y = PR[G(Y)].
Provided our model is consistent, if we now provide ˜Y as input to our generators, perform the inverse rotation R−1 on the newly predicted 3D pose G(˜Y) and reproject it back into 2D, we should obtain our original matrix of 2D keypoints Y. This cycle allows our model to learn self-consistency during training, where it seeks to minimise the following component of the loss function:
L2D = (1/N) ||Y − P R−1[G(˜Y)]||2

where || · ||2 is the sum of the squares of all matrix entries and N is the number of keypoints predicted.
Note that as we are training two generators independently from one another, both generators will receive their own L2D loss based on the error between the keypoints that they predicted for.
As an example, part of the L2D loss for our torso generator would include the difference between the original 2D keypoint location of the right wrist and its 2D location once the newly predicted 3D pose G(˜Y) was inversely rotated and reprojected.
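The cycle can be sketched as follows, assuming a callable G that maps a (B, N, 2) batch of 2D poses to (B, N, 3) poses (e.g. the combined torso-and-leg prediction sketched above) and a projection P that simply drops the depth; these simplifications, and computing the loss over the full pose rather than per generator, are illustrative assumptions:

```python
import math
import torch

def rot_y(a):   # (B,) angles -> (B, 3, 3) rotations about the y axis
    c, s = torch.cos(a), torch.sin(a)
    z, o = torch.zeros_like(a), torch.ones_like(a)
    return torch.stack([c, z, s, z, o, z, -s, z, c], dim=-1).view(-1, 3, 3)

def rot_x(a):   # (B,) angles -> (B, 3, 3) rotations about the x axis
    c, s = torch.cos(a), torch.sin(a)
    z, o = torch.zeros_like(a), torch.ones_like(a)
    return torch.stack([o, z, z, z, c, -s, z, s, c], dim=-1).view(-1, 3, 3)

def self_consistency_loss(G, y2d):
    """y2d: (B, N, 2). Returns the reprojection self-consistency loss L2D."""
    B, N, _ = y2d.shape
    az = (torch.rand(B, device=y2d.device) * 2 - 1) * math.pi        # azimuth in [-pi, pi]
    el = (torch.rand(B, device=y2d.device) * 2 - 1) * math.pi / 18   # elevation in [-pi/18, pi/18]
    R = rot_y(az) @ rot_x(el)

    pose3d = G(y2d)                                    # (B, N, 3)
    y_tilde = (pose3d @ R.transpose(1, 2))[..., :2]    # rotate, project to a synthetic view
    pose3d_tilde = G(y_tilde)                          # lift the synthetic view
    y_cycle = (pose3d_tilde @ R)[..., :2]              # inverse rotation, project again
    return ((y2d - y_cycle) ** 2).sum(dim=(1, 2)).mean() / N
```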
Let (x, y, ˆz) be the predicted 3D pose from our model.
If we assume a fixed camera position and rotate our pose by 90◦, then the depth component of our pose (ˆz) prior to rotation will now lie on the x axis from our camera's viewpoint.
A visual example of this can be seen in Figure 2.
Figure 2: A 90◦ rotation of a 3D pose around the y axis with a fixed camera position results in the x axis values of the pose prior to rotation representing the z axis values of the pose after the rotation, and vice versa.
These constraints are summed in the final loss function to produce L90◦.
Similar to reprojection consistency, as our generators are making predictions independent from one another, each will receive its own version of L90◦ based on the keypoints that they predicted for.
Although we could have included three similar constraints for 90◦ rotations around the x axis, we found that these hindered the performance of the model.
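One way such a 90◦ constraint could be written, following Figure 2: rotate the predicted pose by 90◦ about the y axis, reproject it, lift it again and require the new depth to recover the original x values. The sign convention and the exact set of angles used (e.g. 180◦ and 270◦ variants) are assumptions here rather than details taken from the text:

```python
import torch

def ninety_degree_loss(G, y2d):
    """y2d: (B, N, 2); G returns (B, N, 3) poses (x, y, z_hat)."""
    pose3d = G(y2d)
    x, y, z_hat = pose3d[..., 0], pose3d[..., 1], pose3d[..., 2]
    # After a 90-degree rotation about the y axis the depth becomes the
    # horizontal coordinate, so the rotated 2D view is (z_hat, y).
    y2d_rot = torch.stack([z_hat, y], dim=-1)
    z_hat_rot = G(y2d_rot)[..., 2]
    # The depth predicted for the rotated view should recover -x (cf. Figure 2);
    # the sign depends on the chosen rotation direction.
    return ((z_hat_rot + x) ** 2).mean()
```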
Therefore we utilise a 2D discriminator D that takes a 2D pose as input and outputs a value between 0 and 1, representing the probability that the pose is believable.
The architecture of our discriminator was a fully connected neural network with the same structure as our generators, but containing one fewer residual block and a softmax function in place of Tanh.
Our discriminator utilised the standard GAN loss [9]:
min_G max_D Ladv = E(log(D(Y))) + E(log(1 − D(˜Y)))    (6)
Unlike the consistency constraints, we do not provide a unique version of Ladv to the torso and leg generators; instead we provide the same loss (with a different weight) to both generators.
This is for two reasons. Firstly, we wanted our generators to produce a believable pose together, which in turn allows TIPSy to produce a believable pose by itself once the knowledge is distilled.
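A sketch of a discriminator of the kind described above, i.e. a fully connected network with residual blocks mirroring the generators; the layer width and block count are assumptions, and a sigmoid is used here to produce the single probability in equation (6) rather than the softmax mentioned above:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                                nn.Linear(width, width), nn.ReLU())

    def forward(self, x):
        return x + self.fc(x)

class PoseDiscriminator(nn.Module):
    """Fully connected 2D pose discriminator; width and depth are illustrative."""
    def __init__(self, n_keypoints=16, width=1024, n_blocks=2):
        super().__init__()
        self.inp = nn.Sequential(nn.Linear(n_keypoints * 2, width), nn.ReLU())
        self.blocks = nn.Sequential(*[ResBlock(width) for _ in range(n_blocks)])
        self.out = nn.Linear(width, 1)

    def forward(self, pose_2d):                     # pose_2d: (B, N, 2)
        h = self.blocks(self.inp(pose_2d.flatten(1)))
        return torch.sigmoid(self.out(h))           # probability the pose is believable

def adversarial_losses(D, y_real, y_fake, eps=1e-7):
    """Equation (6): D maximises Ladv, the generators minimise the fake term."""
    d_loss = -(torch.log(D(y_real) + eps).mean()
               + torch.log(1 - D(y_fake.detach()) + eps).mean())
    g_loss = torch.log(1 - D(y_fake) + eps).mean()
    return d_loss, g_loss
```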
3.4 Knowledge Distillation

The final step of our process, and the production of our TIPSy model, is to distill the knowledge of our leg and torso generators into an end-to-end generator.
We use MSE for knowledge distillation, as we found that training GTIPSy adversarially while including a divergence metric as an additional constraint led to worse performance than simply trying to match the leg and torso generators' predictions.
3.5 Training

As discussed, our torso and leg generators were trained adversarially with rotational and 90◦ consistency, and our TIPSy generator was trained using knowledge distillation.
The network parameters are then updated to optimise the total loss for each generator given by:
Lleg = w1 Ladv + w2 Lleg2D + w3 Lleg90◦    (8)

Ltorso = w4 Ladv + w2 Ltorso2D + w3 Ltorso90◦    (9)

LTIPSy = (1/N) Σ_{i=1}^{N} (GTIPSy(xi, yi) − ˆzi)²    (10)
where w1 = 0.05, w2 = 10, w3 = 3 and w4 = 0.08 are the relative weights for the leg generator's adversarial loss, both generators' self-consistency loss, both generators' 90◦ consistency loss and the torso generator's adversarial loss, respectively.
The discrepancy between w1 and w4 was due to how many points each generator predicted.
Our torso generator predicted 10 of the 16 z values in the full pose, meaning it predicted 10/16 of the entire pose.
Therefore any change in adversarial loss is more likely to be due to the torso generator than the leg generator, and its weight is higher to reflect this.
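Putting the pieces together, equations (8)-(10) could be computed as in the sketch below, with the individual loss terms coming from functions like those sketched earlier; the weights are the values given above and z_teacher denotes the depths produced by the trained leg and torso generators (names and signatures are illustrative):

```python
w1, w2, w3, w4 = 0.05, 10.0, 3.0, 0.08

def generator_losses(adv_loss, leg_2d, leg_90, torso_2d, torso_90):
    """Equations (8) and (9): the same adversarial term, weighted differently."""
    loss_leg = w1 * adv_loss + w2 * leg_2d + w3 * leg_90
    loss_torso = w4 * adv_loss + w2 * torso_2d + w3 * torso_90
    return loss_leg, loss_torso

def distillation_loss(tipsy_gen, y2d, z_teacher):
    """Equation (10): MSE between TIPSy's predicted depths and the teachers' depths."""
    z_student = tipsy_gen(y2d)[..., 2]           # (B, N)
    return ((z_student - z_teacher.detach()) ** 2).mean()
```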
We trained our model completely unsupervised following [5].
For all models we used a batch size of 8192 and the Adam optimiser with a learning rate of 0.0002.
Our experiments use N = 16 keypoints.
For evaluation we show results of a solo generator as a baseline, both the leg and torso generator working together, and our final TIPSy model trained via knowledge distillation.
It consists of both video and motion capture (MoCap) data from 4 viewpoints of 5 female and 6 male subjects performing specific actions (e.g. talking on the phone, taking a photo, eating, etc.).
There are two main evaluation protocols for the Human3.6M dataset, which use subjects 1, 5, 6, 7 and 8 for training and subjects 9 and 11 for evaluation.
Both protocols report the Mean Per Joint Position Error (MPJPE), which is the Euclidean distance in millimeters between the predicted and ground truth 3D coordinates.
We report the protocol-II performance of our model which employs rigid alignment between the ground truth and predicted pose prior to evaluation.
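For reference, a sketch of the protocol-II metric, i.e. MPJPE after a rigid (Kabsch) alignment of the predicted pose to the ground truth; the implementation details here are illustrative (some protocols additionally fit a uniform scale):

```python
import numpy as np

def protocol2_mpjpe(pred, gt):
    """pred, gt: (N, 3) keypoints in millimetres. MPJPE after rigid alignment."""
    p = pred - pred.mean(axis=0)                 # remove translation
    g = gt - gt.mean(axis=0)
    U, _, Vt = np.linalg.svd(p.T @ g)            # Kabsch: best rotation mapping p onto g
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt          # guard against reflections
    return np.linalg.norm(p @ R - g, axis=1).mean()
```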
Our results can be seen in Table 1.
As we can see, by interpreting a 2D pose as multiple spatially independent sections for the purpose of 3D pose estimation, we can significantly improve results.
Additionally, TIPSy managed to achieve the highest performance in both the photo-taking and sitting-down actions, as well as joint second highest in the sitting action.
By analysing videos of these actions, we believe that TIPSy improved in these scenarios specifically because the subjects move their arms freely throughout the scene while keeping a fairly neutral stance with their legs (examples can be seen in Appendix A.2), highlighting the benefit of treating them as independent from one another.
(GT) denotes providing 2D ground truth keypoints to a lifting model.
(T) denotes the use of temporal information.
All results are taken from their respective papers.
Lower is better, best in bold, second best underlined.
Method | Approach
Martinez et al [23] | Supervised
Pavllo et al [31] (GT) | Supervised
Cai et al [2] (GT) | Supervised
Yang et al [46] (+) | Weakly-Supervised
Pavlakos et al [30] (+) | Weakly-Supervised
Ronchi et al [33] | Weakly-Supervised
Wandt et al [39] (GT) | Weakly-Supervised
Drover et al [6] (GT)(+) | Weakly-Supervised
Kudo et al [16] (GT) | Unsupervised
Chen et al [5] (GT)(T) | Unsupervised
Solo Generator (Ours)(GT) | Unsupervised
Leg and Torso Generator (Ours)(GT) | Unsupervised
TIPSy (Ours)(GT) | Unsupervised
As our predicted poses are normalised, we scale them up by their original normalising factor prior to evaluation.
Additionally [39] found there are ambiguities between multiple cameras and 3D pose rotations, causing the potential for inverted predictions as seen in [16].
To remove this ambiguity we assume that the direction the person is facing with respect to the camera is known.
Our results can be seen in Table 2.
Comparing TIPSy against [48], we can see that although the PCK3D at a threshold of 150mm is similar, TIPSy achieves an 11% improvement in AUC (threshold 0mm-150mm).
This highlights that feature sharing between localised groups during training may dampen the generalisability of a model, and that improved results may be achieved by treating them as independent.
Similarly, TIPSy achieves higher performance than other unsupervised approaches, and than supervised approaches even when they were trained on the MPI-INF-3DHP dataset itself.
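For context, the two MPI-INF-3DHP metrics as they are commonly defined: PCK3D is the fraction of keypoints within 150mm of the ground truth, and AUC averages the PCK over thresholds from 0mm to 150mm (the threshold grid below is an assumption):

```python
import numpy as np

def pck3d(pred, gt, threshold=150.0):
    """pred, gt: (N, 3) keypoints in mm; fraction of joints within the threshold."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return (dists <= threshold).mean()

def auc(pred, gt, max_threshold=150.0, steps=31):
    """Area under the PCK curve for thresholds from 0mm to max_threshold."""
    return float(np.mean([pck3d(pred, gt, t)
                          for t in np.linspace(0.0, max_threshold, steps)]))
```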
Legend: (3DHP) denotes the model being trained on the MPI-INF-3DHP dataset.
(H36M) denotes the model being trained on the Human3.6M dataset.
(+) denotes additional training data.
(*) denotes the use of transfer learning from 2DPoseNet during training.
(T) denotes the use of temporal information during training.
All results are taken from their respective papers.
Higher is better, best in bold, second best underlined.
Method | Approach
Mehta et al [24] (3DHP + H36M)(*) | Supervised
Zeng et al [48] (H36M) | Supervised
Yang et al [46] (H36M)(+) | Weakly-Supervised
Wandt et al [39] (H36M) | Weakly-Supervised
Kanazawa et al [15] (3DHP) | Weakly-Supervised
Chen et al [5] (3DHP)(T) | Unsupervised
Kundo et al [17] (H36M) | Unsupervised
Solo Generator (Ours)(H36M) | Unsupervised
Leg and Torso Generator (Ours)(H36M) | Unsupervised
TIPSy (Ours)(H36M) | Unsupervised
Therefore, we assume that prior work, including our own, trained for a set number of epochs and picked the weights from the epoch which performed best on an evaluation set.
Firstly, we could monitor the discriminator's loss and stop training when it becomes too weak or too strong.
Though there is intuition behind this approach, in practice a strong discriminator can cause a generator to fail due to vanishing gradients [1], and a weak discriminator provides poor feedback to a generator, reducing its performance.
Secondly, we could visualise the predictions per epoch and decide by eye which pose is the best.
With potentially hundreds of epochs and thousands of poses, however, this is not an efficient solution.
Lastly, and more realistically, we could pick the final weight during the training of our model or average the weights between a certain range of epochs to use.
For this scenario we show the stability of our leg and torso generators during adversarial training when compared against a solo generator which can be seen in Figure 3.
As shown, by having a leg and torso generator training together not only is the MPJPE lower, but it is stable over a longer period of time.
Furthermore, these models were trained for 800 epochs.
Had we chosen the final epoch's weights for evaluation, the average error of the leg and torso generators would have been 45.2mm, while the solo generator's average error would have been 70.2mm.
From epoch 400 to 800, the average error and standard deviation of our leg and torso generators was 45.4 ± 1.1mm; by comparison, the average error and standard deviation of a solo generator was 64.3 ± 6.2mm.
Figure 3: Evaluation error (MPJPE) of the leg and torso generators compared against a solo generator on the Human3.6M dataset, for each training epoch.

5 Conclusions

This paper presented TIPSy, an unsupervised training method for 3D human pose estimation which learns improved 3D poses by learning how to lift independent segments of the 2D kinematic skeleton separately. We proposed additional constraints to improve the adversarial self-consistency cycle and highlighted that, in a truly unsupervised scenario, TIPSy allows a more optimal model to be created through increased GAN stability.
By exploiting the spatial independence of the torso and legs we are able to reduce the evaluation error by 18%, and although we achieve the best performance in certain actions, we are aware that TIPSy is currently unable to completely beat supervised and weakly-supervised approaches.
We do believe however that a TIPSy training approach may carry over to other supervised and weakly-supervised approaches which could improve their results.
Additionally, our high AUC performance in the MPI-INF-3DHP dataset demonstrates that TIPSy can generalise well to unseen poses, improving upon prior supervised models that assume a 2D pose should be treated as codependent localised groups.
References

[1] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. ArXiv, abs/1701.04862, 2017.
[2] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham, Junsong Yuan, and Nadia Magnenat Thalmann. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2272–2281, 2019. doi: 10.1109/ICCV.2019.00236.
[3] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(01):172–186, jan 2021. ISSN 1939-3539. doi: 10.1109/TPAMI.2019.2929257.
[4] Ching-Hang Chen and Deva Ramanan. 3d human pose estimation = 2d pose estimation + matching. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5759–5767, 2017.
[5] Ching-Hang Chen, Ambrish Tyagi, Amit Agrawal, Dylan Drover, M. V. Rohith, Stefan Stojanov, and James M. Rehg. Unsupervised 3d pose estimation with geometric self-supervision. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5707–5717, 2019.
[6] Dylan Drover, Rohith M. V, Ching-Hang Chen, Amit Agrawal, Ambrish Tyagi, and Cong Phuoc Huynh. Can 3d pose be learned from 2d projections alone? In Laura Leal-Taixé and Stefan Roth, editors, Computer Vision – ECCV 2018 Workshops, pages 78–94, Cham, 2019.

[7] Learning pose grammar to encode human body configuration for 3d pose estimation. In AAAI, 2018.
[8] David A Forsyth, Okan Arikan, Leslie Ikemoto, Deva Ramanan, and James O'Brien. Computational studies of human motion: Tracking and motion synthesis. 2006.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27.
[10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/892c3b1c6dccd52936e27cbd0ff683d6-Paper.pdf.
[23] Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple yet effective baseline for 3d human pose estimation. pages 2659–2668, 10 2017. doi: 10.1109/ICCV.2017.288.
[24] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3D Vision (3DV), 2017 Fifth International Conference on. IEEE, 2017. doi: 10.1109/3dv.2017.00064. URL http://gvv.mpi-inf.mpg.de/3dhp_dataset.
[25] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. Vnect: Real-time 3d human pose estimation with a single rgb camera. volume 36, 2017. doi: 10.1145/3072959.3073596. URL http://gvv.mpi-inf.mpg.de/projects/VNect/.
[26] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 483–499, Cham, 2016.

In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31.

In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2621–2630, Los Alamitos, CA, USA, oct 2017.
[36] Bugra Tekin, Isinsu Katircioglu, Mathieu Salzmann, Vincent Lepetit, and Pascal Fua. Structured prediction of 3d human pose with deep neural networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 130.1–130.11. BMVA Press, September 2016. ISBN 1-901725-59-6. doi: 10.5244/C.30.130. URL https://dx.doi.org/10.5244/C.30.130.
[37] D Tome, Christopher Russell, and L Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image, 2017.
[38] S. Tripathi, S. Ranade, A. Tyagi, and A. Agrawal. Posenet3d: Learning temporally consistent 3d human pose via knowledge distillation. In 2020 International Conference on 3D Vision (3DV), pages 311–321, Los Alamitos, CA, USA, nov 2020.

Towards alleviating the modeling ambiguity of unsupervised monocular 3d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8651–8660, October 2021.

[48] Ailing Zeng, Xiao Sun, Fuyang Huang, Minhao Liu, Qiang Xu, and Stephen Lin.
Note that the arms are performing a specific movement, whereas the legs in both images are taking a neutral stance for their particular action.
A.3 Improving the consistency cycle
As self-consistency is a constrained optimisation problem, we are able to achieve a better quantitative error by minimising it, even if this leads to poses that are easy to discriminate against.
This can be seen clearly in Table 3 where we see a noticeable decrease in MPJPE between the results in [6] and our recreation with the additional consistency constraints mentioned within our work.
Table 3: Results of [6] and our recreation with additional consistency constraints (per-action MPJPE).
Model: Drover et al [6]; Ours (Improved Consistency)
Direct. Discuss: 33.5 28.3, 39.3 30.6
Eat Greet: 32.9 37.0, 35.7 37.1
Phone: 35.8 41.7
Photo: 42.7 33.0
Pose: 39.0 38.1
Purchase: 38.2 30.5
Sit: 42.1 31.1
SitDown: 52.3 30.6
Smoke Wait Walk WalkD. WalkT. Avg.: 36.9 38.2 33.5 34.9, 39.4 46.2 36.8 40.2, 33.2 32.7 34.9 33.9
Because of this, we sought to replace the random rotation self-consistency cycle with something more efficient.
This was due to a random rotation lending itself to long training times, where the longer a model is trained the more random rotations it will see and therefore the more consistent it will become.
By contrast, our 90◦ consistency constraints allow 3 specified angles of consistency to be learned per training iteration, while also being more computationally efficient than randomly rotating a 3D object and re-projecting it.
These by themselves, however, are not sufficient to learn self-consistency, as the model only learns 3 specific angles during training and in the wild many more viewpoints exist.
perform Taylor series expansion while ignoring terms of power 2 and above for the small angle θ:

xiθ + ˆzi = G(xi, yi, w) − ˆziθ (∂/∂xi)G(xi, yi, w)    (14)

cancel ˆzi with G(xi, yi, w) and remove θ:

xi = −ˆzi (∂/∂xi)G(xi, yi, w)    (15)

this leaves us with our final consistency constraint, which must be true for all angles:

xi + ˆzi (∂/∂xi)G(xi, yi, w) = 0    (16)
In practice, however, implementing the above is difficult.
This is due to two factors. Firstly, ˆzi multiplied by the derivative component provides a Jacobian matrix, which is computationally inefficient to calculate numerically within current deep learning frameworks, requiring over 100 minutes to train one epoch.
Secondly, as we are finding the derivative with respect to the inputs, all batch-norm layers have to be removed from our model to maintain gradient independence, as these normalise across the batch dimension.
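For illustration, the constraint in equation (16) could in principle be evaluated with automatic differentiation as sketched below; this is not part of our training procedure, and the per-keypoint loop makes explicit the Jacobian cost discussed above (it also assumes the network contains no batch-norm layers):

```python
import torch

def gradient_consistency(G, y2d):
    """y2d: (B, N, 2). Penalises x_i + z_hat_i * d(z_hat_i)/d(x_i), equation (16)."""
    y2d = y2d.clone().requires_grad_(True)
    z_hat = G(y2d)[..., 2]                        # (B, N) predicted depths
    cols = []
    for i in range(z_hat.shape[1]):               # one backward pass per keypoint
        g = torch.autograd.grad(z_hat[:, i].sum(), y2d,
                                retain_graph=True, create_graph=True)[0]
        cols.append(g[:, i, 0])                   # d(z_hat_i)/d(x_i)
    dz_dx = torch.stack(cols, dim=1)              # diagonal of the Jacobian
    x = y2d[..., 0]
    return ((x + z_hat * dz_dx) ** 2).mean()
```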