We propose a novel optimization-based paradigm for 3D human model fitting on
images and scans. In contrast to existing approaches that directly regress the
parameters of a low-dimensional statistical body model (e.g. SMPL) from input
images, we train an ensemble of per-vertex neural fields. The network
predicts, in a distributed manner, the vertex descent direction towards the
ground truth, based on neural features extracted at the current vertex
projection. At inference, we employ this network, dubbed LVD, within a
gradient-descent optimization pipeline until its convergence, which typically
occurs in a fraction of a second even when initializing all vertices into a
single point. An exhaustive evaluation demonstrates that our approach is able
to capture the underlying body of clothed people with very different body
shapes, achieving a significant improvement over the state of the art. LVD
is also applicable to 3D model fitting of humans and hands, for which we show a
significant improvement over the SOTA with a much simpler and faster method.
Enric Corona1, Gerard Pons-Moll2,3, Guillem Alenyà1, and Francesc Moreno-Noguer1
1Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain
2University of Tübingen, Germany, 3Max Planck Institute for Informatics, Germany
Shape recovery then entails estimating these parameters from data.
There exist two main paradigms for doing so.
On the one side, optimization-based methods iteratively search for the model parameters that best match available image cues, like 2D keypoints [56,11,6,37], silhouettes [41,75] or dense correspondences [28].
On the other side, data-driven regression methods for mesh recovery leverage deep neural networks to directly predict the model parameters from the input [34,28,57,18,25,4,52].
Regardless of the inference method (optimization or regression) and input modality (2D evidence based on the entire image, keypoints, silhouettes, point clouds), all these previous methods aim at estimating the parameters of a low-dimensional model (typically based on SMPL [44]).
Learned Vertex Descent (LVD) is a novel optimization strategy in which a network leverages local image or volumetric features to iteratively predict per-vertex directions towards an optimal body/hand surface.
The proposed approach is directly applicable to different tasks with minimal changes on the network, and we show it can fit a much larger variability of body shapes than previous state-of-the-art.
The figure depicts results on the three tasks where we have evaluated LVD: body shape reconstruction from a single image, and 3D fitting of body and hand scans.
As we show in the experimental section, these models struggle to capture detailed body shape, especially for morphotypes departing from the mean (overweight or skinny people) or when the person is wearing loose clothing.
Global shape regression methods lack the error-feedback loop of optimization methods (comparing the current estimate against image / scan input), and hence exhibit an even more pronounced bias towards mean shapes.
To recover more detail, recent works regress or optimize a set of displacements on top of SMPL global shape [3,1,4,10,55], local surface elements [45] or points [47].
Like us, [38] bypass the regression of global shape parameters and regress model vertices directly.
However, similar to displacement-based methods [1,4], the proposed regression scheme [38] predicts the position of all points in one single pass and lacks an error-feedback loop.
Being more local, they require less training data.
However, these methods do not produce surfaces with a coherent parameterization (e.g. SMPL vertices), and hence control is only possible with subsequent model fitting, which is hard if correspondences are not known [8,9,32,30].
In this paper, we propose a significantly different approach to all prior model fitting methods.
Inspired by classical model-based fitting, where image gradients drive the direction of vertices and in turn global shape parameters, we propose to iteratively learn where 3D vertices should move based on neural features.
For that purpose, we devise a novel data-driven optimization in which an ensemble of per-vertex neural fields is trained to predict the optimal 3D vertex displacement towards the ground truth, based on local neural features extracted at the current vertex location. We dub this network LVD, from 'Learned Vertex Descent'.

Learned Vertex Descent: A New Direction for 3D Human Model Fitting
At inference, given an input image or scan, we initialize all mesh vertices into a single point and iteratively query LVD to estimate the vertex displacement in a gradient descent manner.
We conduct a thorough evaluation of the proposed learning-based optimization approach.
Our experiments reveal that LVD combines the advantages of classical optimization and learning-based methods.
LVD captures off-mean shapes significantly more accurately than all prior work; unlike optimization approaches, it does not suffer from local minima, and it converges in just 6 iterations.
We attribute the better performance to the distributed per-vertex predictions and to the error feedback loop – the current vertex estimate is iteratively verified against the image evidence, a feature present in all optimization schemes but missing in learning-based methods for human shape estimation.
This optimization is fast, does not require gradients or hand-crafted objective functions, and is not sensitive to initialization.
– We empirically show that our approach achieves state-of-the-art results in the task of human shape recovery from a single image.
– The LVD formulation can be readily adapted to the problem of 3D scan fitting. We also demonstrate state-of-the-art results on fitting 3D scans of full bodies and hands.
– By analysing the variance of the learned vertex gradient in local neighborhoods we can extract uncertainty information about the reconstructed shape.
This might be useful for subsequent downstream applications that require confidence measures on the estimated body shape.
2 Related work

2.1 Parametric models for 3D body reconstruction
The de-facto approach for reconstructing human shape and pose is to estimate the parameters of a low-rank generative model [44,56,81,67], with SMPL [44] and SMPL-X [56] being the most well known.
We next describe the approaches to perform model fitting from images.
Optimization. Early approaches on human pose and shape estimation from images used optimization-based approaches to estimate the model parameters from 2D image evidence.
This is typically accompanied by additional pose priors to ensure anthropomorphism of the retrieved pose [11,56].
Subsequent works have devised approaches to obtain better initialization from image cues [37], more efficient optimization pipelines [77], focused on multiple people [23] or extended the approach to multi-view scenarios [42,23].
While optimization-based approaches do not require images with 3D annotation for training and achieve relatively good registration of details to 2D observations, they tend to suffer from the non-convexity of the problem, being slow and falling into local minima unless provided with a good initialization and accurate 2D observations.
On the other side, the learned vertex displacements help the optimizer to converge to good solutions (as we empirically observe) in just a few iterations.
On the downside, our approach requires 3D training data, but as we will show in the experimental section, by using synthetic data we manage to generalize well to real images.
Regression. Most current approaches on human body shape recovery consider the direct regression of the shape and pose parameters of the SMPL model [28,6,36,38,57,25,39, 68,73,74].
As in optimization-based methods, different sorts of 2D image evidence have been used, e.g. keypoints [41], keypoints plus silhouette [58] or part segmentation maps [52].
More recently, SMPL parameters have been regressed directly from entire images encoded by pre-trained deep networks (typically ResNet-like) [34,28,57,18,25].
Instead, we propose a novel optimization framework that leverages a pre-learned prior mapping image evidence to vertex displacements towards the body shape. We will show that despite its simplicity, this approach surpasses all prior work by considerable margins, and provides smooth yet accurate meshes without any post-processing.
Integrating additional knowledge such as pre-computed 3D joints, facial key points [2] and body part segmentation [8] significantly improves the registration quality but these pre-processing steps are prone to error and often require human supervision.
However, despite providing the level of detail that parametric models do not have, they are computationally expensive and difficult to integrate within pose-driven applications given the lack of correspondences.
Recent works have already explored possible integrations between implicit and parametric representations for the tasks of 3D reconstruction [32], clothed human modeling [71,45,46], or human rendering [59].
3 Method

We next present our new paradigm for fitting 3D human models.
For clarity, we will describe our approach in the problem of 3D human shape reconstruction from a single image.
Yet, the formulation we present here is generalizable to the problem of fitting 3D scans, as we shall demonstrate in the experimental section.
3.1 Problem formulation

Given a single-view image I ∈ R^{H×W} of a person, our goal is to reconstruct his/her full body.
We represent the body using a 3D mesh V ∈ RN×3 with N vertices.
For convenience (and compatibility with SMPL-based downstream algorithms) the mesh topology will correspond to that of the SMPL model, with N = 6,890 vertices and triangular connectivity (13,776 faces).
In particular, we do not use the low dimensional pose and shape parameterizations of such models.
Fig. 2. LVD is a novel framework for estimation of the 3D human body where local features drive the direction of vertices iteratively by predicting a per-vertex neural field.
At each step t, g takes an input vertex v_i^t with its corresponding local features, to predict the direction towards its ground-truth position.
The surface initialization here follows a T-Posed body, but the proposed approach is very robust to initialization.
3.2 LVD: Learning Vertex Descent
We solve the model fitting problem via an iterative optimization approach with learned vertex descent.
Concretely, let v_i^t be the i-th vertex of the estimated mesh V at iteration t.
Let us also denote by F ∈ R^{H'×W'×F} the pixel-aligned image features, and by f_i the F-dimensional vector of the specific features extracted at the projection of v_i^t. We learn a function g(·) that, given the current 3D vertex position and the image features at its 2D projection, predicts the magnitude and direction of steepest descent towards the ground-truth location of the i-th vertex, which we shall denote as v̂_i:

g : (v_i^t, f_i) ↦ Δv_i ,    (1)

where Δv_i = v̂_i − v_i^t. The vertices are then updated at every iteration as:

v_i^{t+1} = v_i^t + Δv_i .    (2)

The reconstruction problem then entails iterating over Eq. 2 until the convergence of Δv_i.
Fig 2 depicts an overview of the approach.
Note that in essence we are replacing the standard gradient descent rule with a learned update that is locally computed at every vertex.
As we will empirically demonstrate in the results section, despite its simplicity, the proposed approach allows for fast and remarkable convergence rates, typically requiring only 4 to 6 iterations no matter how the mesh vertices are initialized.
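To make the procedure concrete, the update loop of Eq. 2 can be sketched as follows. The callable standing in for the learned field g(·) is a toy assumption (it moves each vertex half-way towards a fixed target mesh); in the real method g is a neural field conditioned on local image features:

```python
import numpy as np

def lvd_fit(g, V_init, n_iters=6):
    """Learned Vertex Descent sketch: repeatedly apply the learned
    per-vertex update V <- V + g(V) (Eq. 2) for a fixed number of steps."""
    V = V_init.copy()
    for _ in range(n_iters):
        V = V + g(V)  # g predicts one displacement per vertex
    return V

# Toy stand-in for g(.): step half-way towards a known target mesh.
target = np.array([[0.0, 1.0, 0.5], [1.0, 0.0, 0.2]])
toy_g = lambda V: 0.5 * (target - V)

# All vertices initialized to a single point, as in the paper.
V_fit = lvd_fit(toy_g, np.zeros((2, 3)), n_iters=6)
print(np.abs(V_fit - target).max())  # → 0.015625: the initial error halved six times
```

Convergence here is geometric by construction; in LVD the analogous behaviour is learned, which is why a handful of iterations suffice regardless of initialization.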
Uncertainty estimation. An interesting outcome of our approach is that it allows estimating the uncertainty of the estimated 3D shape, which could be useful in downstream applications that require a confidence measure.
After this process, we obtain the displacements Δx_j^i between the perturbed points x_j and the initially predicted mesh vertex v_i.
We then define the uncertainty of v_i as:

U(v_i) = std({x_j + Δx_j^i}_{j=1}^M) .    (3)

In Figs. 1 and 4 we represent the uncertainty of the meshes in dark blue.
Note that the most uncertain regions are typically localized on the feet and hands.
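A minimal sketch of this uncertainty measure (Eq. 3). The per-vertex predictor below is a toy assumption that snaps every query point straight back onto the vertex, so the resulting uncertainty is numerically zero; with a real network, inconsistent predictions around v_i yield a larger spread:

```python
import numpy as np

def vertex_uncertainty(predict_disp, v_pred, M=64, radius=0.05, seed=0):
    """Eq. 3 sketch: perturb M query points around a predicted vertex, let
    the network map each one back to the vertex, and measure their spread."""
    rng = np.random.default_rng(seed)
    xs = v_pred + rng.uniform(-radius, radius, size=(M, 3))  # perturbed x_j
    converged = xs + predict_disp(xs)                        # x_j + Δx_j^i
    return converged.std(axis=0).mean()                      # scalar U(v_i)

# Toy predictor: every point maps back exactly onto the vertex.
v = np.array([0.2, 0.5, 0.1])
consistent = lambda xs: v - xs
print(vertex_uncertainty(consistent, v))  # ~0: perfectly consistent predictions
```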
3.3 Network architecture
The LVD architecture has two main modules: one responsible for extracting local image features, and another for learning the optimal vertex displacements.
Given a vertex v_i^t = (x_i^t, y_i^t, z_i^t) and the input image I, these features are estimated as:

f : (I, π(v_i^t), z_i^t) ↦ f_i ,    (4)

where π(v) is a weak perspective projection of v onto the image plane. We condition f(·) with the depth z_i^t of the vertex to generate depth-aware local features.
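The feature lookup of Eq. 4 amounts to bilinear sampling of the feature map at the projected vertex, with the depth appended. The map size, weak-perspective scale, and principal point below are illustrative assumptions:

```python
import numpy as np

def bilinear_sample(F_map, u, v):
    """Bilinearly interpolate an HxWxF feature map at continuous pixel (u, v)."""
    H, W, _ = F_map.shape
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    a, b = u - u0, v - v0
    return ((1 - a) * (1 - b) * F_map[v0, u0] + a * (1 - b) * F_map[v0, u1]
            + (1 - a) * b * F_map[v1, u0] + a * b * F_map[v1, u1])

def pixel_aligned_features(F_map, vertex, scale=32.0, center=(32.0, 32.0)):
    """Eq. 4 sketch: weak-perspective projection pi(v) = s*(x, y) + c,
    then the depth-conditioned local feature f_i = [F(pi(v)), z]."""
    x, y, z = vertex
    u, v = scale * x + center[0], scale * y + center[1]
    return np.concatenate([bilinear_sample(F_map, u, v), [z]])

F_map = np.random.default_rng(0).normal(size=(64, 64, 8))  # assumed 8-dim features
fi = pixel_aligned_features(F_map, np.array([0.1, -0.2, 0.7]))
print(fi.shape)  # → (9,): 8 image features plus the vertex depth
```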
A key component of LVD is predicting vertex displacements based on local features, which have been shown to produce better geometric detail, even from small training sets [69,70,16].
These methods learn a mapping from a full image to global shape parameters (two disjoint spaces), which is hard to learn, and therefore they are unable to capture the local details.
This results in poor overlap between the recovered shape and the image, as can be seen in Fig. 1.
Neural field.
In order to implement the function g(·) in Eq. 1 we follow recent neural field approaches [48,54] and use a simple 3-layer MLP that takes as input the current estimate of each vertex v_i^t plus its local F-dimensional feature f_i, and predicts the displacement Δv_i.
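A sketch of such a network field as a plain 3-layer MLP; the hidden width (64) and the local feature dimension (F = 8) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(dims):
    """Initialize a small MLP; dims = [in, hidden, ..., out]."""
    return [(rng.normal(scale=0.1, size=(i, o)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    """ReLU MLP; the last layer is linear so displacements can be signed."""
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# g(.) sketch: input = 3D vertex position + F-dim local feature (F = 8 assumed),
# output = the 3D displacement Δv_i of Eq. 1.
g_params = make_mlp([3 + 8, 64, 64, 3])  # 3 layers
delta_v = mlp_forward(g_params, np.concatenate([np.zeros(3), np.ones(8)]))
print(delta_v.shape)  # → (3,)
```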
3.4 Training LVD
Training the proposed model entails learning the parameters of the functions f (·) and g(·) described above.
For this purpose, we will leverage a synthetic dataset of images of people under different clothing and body poses paired with the corresponding SMPL 3D body registrations.
We will describe this dataset in the experimental section.
In order to train the network, we proceed as follows. Let us assume we are given a ground-truth body mesh V̂ = [v̂_1, . . . , v̂_N] and its corresponding image I.
We then randomly sample M 3D points X = {x1, . . . , xM}, using a combination of points uniformly sampled in space and points distributed near the surface.
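The sampling strategy can be sketched as a mixture of points uniform in a bounding volume and Gaussian-perturbed surface points; the 50/50 split, noise scale, and bounding box are assumptions:

```python
import numpy as np

def sample_training_points(verts, M=1024, ratio_uniform=0.5, sigma=0.05,
                           bounds=(-1.0, 1.0), seed=0):
    """Sample M query points: a share uniformly in the volume, the rest
    near the surface (random mesh vertices plus Gaussian noise)."""
    rng = np.random.default_rng(seed)
    n_uni = int(M * ratio_uniform)
    uniform = rng.uniform(bounds[0], bounds[1], size=(n_uni, 3))
    idx = rng.integers(0, len(verts), size=M - n_uni)
    near = verts[idx] + rng.normal(scale=sigma, size=(M - n_uni, 3))
    return np.concatenate([uniform, near], axis=0)

verts = np.random.default_rng(1).uniform(-0.5, 0.5, size=(6890, 3))  # SMPL-sized mesh
X = sample_training_points(verts, M=1024)
print(X.shape)  # → (1024, 3)
```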
Each of these points, jointly with the input image I is fed to the LVD model which predicts its displacement w.r.t. all ground truth SMPL vertices.
We found that this simple loss was sufficient to learn smooth but accurate body prediction.
Remarkably, no additional regularization losses enforcing geometry consistency or anthropomorphism were required.
The reader is referred to the Supplemental Material for additional implementation and training details.

3.5 Application to 3D scan registration
The pipeline we have just described can be readily applied to the problem of fitting the SMPL mesh to 3D scans of clothed people or fitting the MANO model [67] to 3D scans of hands.
The only difference will be in the feature extractor f (·) of Eq 4, which will have to account for volumetric features.
That is, if X is a 3D voxelized input scan, the feature extractor for a vertex v_i will be defined as:
f^{3D} : (X, v_i) ↦ f_i ,    (6)

where again, f_i will be an F-dimensional feature vector.
For the MANO model, the number of vertices of the mesh is N = 778.
In the experimental section, we will show the adaptability of LVD to this scan registration problem.
4 Connection to classical model-based fitting
Beyond its good performance, we find the connection of LVD to classical optimization-based methods interesting; understanding this relationship can be important for future improvements and extensions of LVD. Optimization methods for human shape recovery optimize model parameters to match image features such as correspondences [27,11,56] or silhouettes [75,76].
This approach fails to generalize to novel poses and shapes.
We also compare LVD to Sengupta et al [74], which performs well on real images, even though the predicted shapes do not fit the silhouettes of the people perfectly.
See also quantitative results in Table 1.
Let e_i ∈ R^d denote the d-dimensional residual for the i-th of the N mesh vertices, which typically corresponds to measuring how well the projected i-th vertex fits the image evidence (e.g., matching the color of the rendered mesh vs. the image color).
To minimize e one can use a gradient descent, Gauss-Newton or Levenberg-Marquardt (LM) optimizer to find a descent direction for the human parameters p, but ultimately the direction is obtained from local image gradients, as we will show.
Without loss of generality, we can look at the individual residual incurred by one vertex ei ∈ Rd, although bear in mind that an optimization routine considers all residuals simultaneously (the final gradient will be the sum of individual residual gradients or step directions in the case of LM type optimizers).
The gradient of a single residual can be computed as

∇_p e_i = ∂(e_i^T e_i)/∂p = 2 [∂v_i/∂p]^T [∂e_i/∂v_i]^T e_i ,    (7)

where the matrices that play a critical role in finding a good direction are the error itself, e_i, and ∂e_i/∂v_i, the Jacobian of the i-th residual with respect to the i-th vertex (the Jacobian of the vertex with respect to the parameters p is computed from the body model and typically helps to restrict (small) vertex displacements to remain within the space of human shapes).
When residuals are based on pixel differences (common for rendering losses and silhouette terms), obtaining ∂e_i/∂v_i requires computing image gradients via finite differences. Such a classical gradient is only meaningful once we are close to the solution.
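To illustrate why such gradients are local, here is a finite-difference gradient of a toy silhouette-style image; it only sees pixels one step away, so far from the boundary it carries no signal (the image and query points are assumptions for illustration):

```python
import numpy as np

def finite_diff_image_grad(image, u, v):
    """Central finite differences of a scalar image at integer pixel (u, v):
    the classical gradient only looks one pixel away in each direction."""
    du = (image[v, u + 1] - image[v, u - 1]) / 2.0
    dv = (image[v + 1, u] - image[v - 1, u]) / 2.0
    return np.array([du, dv])

# Toy silhouette: 1 inside a square, 0 outside.
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0
print(finite_diff_image_grad(img, 16, 30))  # non-zero: pixel sits on the boundary
print(finite_diff_image_grad(img, 2, 2))    # zero: far from the silhouette, no signal
```

This locality is exactly what the learned vertex direction avoids, since its neural features have a much larger receptive field.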
Learned Vertex Descent.
In stark contrast, our neural fields compute a learned vertex direction, with image features that have a much higher receptive field than a classical gradient.
Table 1. Single-view SMPL estimation baselines [56,37,68,18,39,74] on the BUFF dataset [84]. The experiments take into account front, side and back views from the original scans and show that LVD outperforms all baselines in all scenarios and metrics except for back views. *We also report the results of PIFu, although note that this is a model-free approach, in contrast to ours and the rest of the baselines, which recover the SMPL model.

We first obtain SMPL registrations and manually annotate them.
Then, we perform an aggressive data augmentation by synthetically changing body pose, shape and rendering several images per mesh from different views and illuminations.
By doing this, we collect a synthetic dataset of ∼600k images which we use for training and validation.
Test will be performed on real datasets.
Please see Suppl. Mat. for more details about the construction of this dataset.
5.1 3D Body shape estimation from a single image
We evaluate LVD in the task of body shape estimation and compare it against Sengupta et al [74], which uses 2D edges and joints to extract features that are used to predict SMPL parameters.
We also compare it against a model that estimates SMPL pose and shape parameters given an input image.
We use a pre-trained ResNet-18 [29] that is trained on the exact same data as LVD.
This approach fails to capture the variability of body shapes and does not generalize well to new poses.
We attribute this to the limited amount of data (only a few hundred 3D scans), with every image being a training data point, while in LVD every sampled 3D point counts as one training example.
Figure 3 shows qualitative results on in-the-wild images.
The predictions of LVD also capture the body shape better than those of Sengupta et al [74] and project better to the silhouette of the input person.
Even though our primary goal is not pose estimation, we also compare LVD against several recent state-of-the-art model-based methods [56,37,68,18,39] on the BUFF dataset, which has 9612 textured scans of clothed people.
The table also reports the results of PIFu [69], although we should take this merely as a reference, as this is a model-free approach, while the rest of the methods in the Table are model-based.
Figure 4 shows qualitative results on in-the-wild images.
With this experiment, we want to show that pre-
Fig. 5. Left: Variability of predicted body shape parameters (x-axis) with respect to vertex error (y-axis, lower is better) for works that fit SMPL to images.
Previous approaches have mostly focused on the task of pose estimation.
LVD, instead, aims to represent a more realistic distribution of predicted body shapes.
Right: Convergence analysis of the proposed optimization, showing the distance from each SMPL vertex to the groundtruth scan during optimization, averaged for 200 examples of the BUFF dataset.
Generalizing LVD to complex poses will most likely require self-supervised frameworks with in-the-wild 2D images like current SOTA [34,39,18,37], but this is out of the scope of this paper and we leave it for future work. Finally, it is worth pointing out that some of the baselines [56,68,37,18] require 2D keypoint predictions, for which we use the publicly available code of OpenPose [14].
In any event, we noticed that our model is not particularly sensitive to the quality of input masks, and can still generate plausible body shapes with noisy masks (see Supp. Mat.).
The initial SMPL estimation from LVD is already very competitive against baselines [9,8].
By using these predictions as initialization for SMPL/SMPL+D registration, we obtain ∼28.4% and ∼37.7% relative improvements with respect to the second-best method [8] in joint and SMPL vertex distances, respectively.
Other regions that hardly become occluded, such as the torso or head, have the lowest error.
The average vertex error is represented with a thicker black line.
Finally, we measure the sensitiveness of the convergence to different initializations of the body mesh.
We uniformly sampled 1K different initializations and analyzed the deviation of the converged reconstructions.
We obtain a standard deviation of the SMPL surface vertices of only σ = 1.2mm across all reconstructions.
We credit this robustness to the dense supervision during training, which takes input points from a volume on the 3D space, as well as around the groundtruth body surface.
LVD is designed to be general and directly applicable for different tasks.
We analyze the performance of LVD on the task of SMPL and SMPL+D registration on 3D point-clouds of humans.
This task consists in initially estimating the SMPL mesh (which we do iterating our approach) and then running a second minimization of the Chamfer distance to fit SMPL and SMPL+D.
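The second stage of this registration minimizes a Chamfer objective between the scan and the current model surface. A minimal brute-force numpy sketch of a bidirectional Chamfer distance (nearest neighbours computed naively; real pipelines use KD-trees or GPU ops):

```python
import numpy as np

def chamfer_distance(A, B):
    """Bidirectional Chamfer distance between point sets A (n,3) and B (m,3):
    mean nearest-neighbour distance from A to B plus from B to A."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # (n, m) squared distances
    return np.sqrt(d2.min(axis=1)).mean() + np.sqrt(d2.min(axis=0)).mean()

scan = np.random.default_rng(0).normal(size=(200, 3))
print(chamfer_distance(scan, scan))  # → 0.0 for identical point sets
```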
Results are reported in Table 2, where we compare against LoopReg [9], IP-Net [8], and also against the simple baseline of registering SMPL with no correspondences starting from a T-posed SMPL.
Besides the V2V and V2S metrics (bi-directional), we also report the Joint error (predicted using SMPL’s joint regressor), and the distance between ground truth SMPL vertices and their correspondences in the registered mesh (Vertex distance).
Note that again, LVD consistently outperforms the rest of the baselines.
This is also qualitatively shown in Fig. 6.
5.4 3D Hand Registration
The proposed approach is directly applicable to any statistical model, thus we also test it in the task of registration of MANO [67] from input point-clouds of hands, some of them incomplete.
6 Conclusion

We have introduced Learned Vertex Descent, a novel framework for human shape recovery where vertices are iteratively displaced towards the predicted body surface.
The proposed method is lightweight, can work in real time, and surpasses the previous state of the art in the tasks of body shape estimation from a single view and 3D scan registration, of both the full body and hands. Future work will focus on self-supervised training formulations of LVD for predicting body shape in difficult poses and scenes, and on tackling multi-person scenes efficiently.
Acknowledgements. This work is partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans) and the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A.
Gerard Pons-Moll is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645.
: Loopreg: Self-supervised learning of implicit surface correspondences, pose and shape for 3d human mesh registration. NeurIPS 33 (2020)
10. Bhatnagar, B.L., Tiwari, G., Theobalt, C., Pons-Moll, G.: Multi-garment net: Learning to dress 3d people from images. In: ICCV (2019)
11. Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it smpl: Automatic estimation of 3d human pose and shape from a single image.
19. Corona, E., Hodan, T., Vo, M., Moreno-Noguer, F., Sweeney, C., Newcombe, R., Ma, L.: Lisa: Learning implicit shape and appearance of hands. arXiv preprint arXiv:2204.01695 (2022)
20. Corona, E., Pumarola, A., Alenya, G., Pons-Moll, G., Moreno-Noguer, F.: Smplicit: Topology-aware generative model for clothed people. In: CVPR. pp. 11875–11885 (2021)
21. Deng, B., Lewis, J.P., Jeruzalski, T., Pons-Moll, G., Hinton, G., Norouzi, M., Tagliasacchi, A.
: Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In: ICCV (2019)
38. Kolotouros, N., Pavlakos, G., Daniilidis, K.: Convolutional mesh regression for single-image human shape reconstruction. In: CVPR (2019)
39. Kolotouros, N., Pavlakos, G., Jayaraman, D., Daniilidis, K.: Probabilistic modeling for human mesh recovery. In: ICCV. pp. 11605–11614 (2021)
40. Lahner, Z., Cremers, D., Tung, T.: Deepwrinkles: Accurate and realistic clothing modeling. In: ECCV (2018)
41. Lassner, C., Romero, J., Kiefel, M., Bogo, F., Black, M.J., Gehler, P.V.: Unite the people: Closing the loop between 3d and 2d human representations. In: CVPR (2017)
42. Li, Z., Oskarsson, M., Heyden, A.: 3d human pose and shape estimation through collaborative learning and multi-view model-fitting. In: WACV. pp. 1888–1897 (2021)
43. Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformers. In: CVPR. pp. 1954–1963 (2021)
44. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: A skinned multi-person linear model. ToG (2015)
45. Ma, Q., Saito, S., Yang, J., Tang, S., Black, M.J.: Scale: Modeling clothed humans with a surface codec of articulated local elements.
: Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In: 3DV. IEEE (2018)
53. Pan, J., Han, X., Chen, W., Tang, J., Jia, K.: Deep mesh reconstruction from single rgb images via topology modification networks. In: ICCV. pp. 9964–9973 (2019)
54. Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: Deepsdf: Learning continuous signed distance functions for shape representation. In: CVPR (2019)
55. Patel, C., Liao, Z., Pons-Moll, G.: Tailornet: Predicting clothing in 3d as a function of human pose, shape and garment style. In: CVPR. IEEE (Jun 2020)
56. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image.
In our experiments, H = W = 256, and f is a stacked hourglass network [50] trained from scratch with 4 stacks and with batch normalization replaced by group normalization [80].
The feature embeddings have size 128× 128 with 256 channels each.
Therefore, query points have a feature size of F = 256 × 4 = 1024.
The MLP is formed by 3 fully connected layers with Weight Normalization [72]; deeper architectures or positional encoding did not improve performance.
The networks are trained end-to-end with batch size 4 and learning rate 0.001 for 500 epochs, followed by 500 more epochs with linear learning-rate decay.
Implementation-wise, f has an output dimension of N = 6890.
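A minimal PyTorch sketch of this prediction head (the hidden width of 512 is an assumption on our part; only the layer count, the normalization, and the input/output sizes are stated above):

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

F_DIM = 256 * 4   # per-query feature size: 256 channels from each of the 4 stacks
N = 6890          # SMPL vertex count

# 3 fully connected layers with Weight Normalization, as described above.
mlp = nn.Sequential(
    weight_norm(nn.Linear(F_DIM, 512)), nn.ReLU(),
    weight_norm(nn.Linear(512, 512)), nn.ReLU(),
    weight_norm(nn.Linear(512, N * 3)),
)

feats = torch.randn(8, F_DIM)      # features sampled at 8 query points
pred = mlp(feats).view(8, N, 3)    # per-query directions towards all N vertices
```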
When estimating an SMPL shape, we input a surface of 6890 × 3 vertices and obtain a prediction tensor of shape 6890 × 6890 × 3, from which we sample the diagonal to obtain per-vertex displacements (6890 × 3) and move each vertex in the predicted direction.
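The diagonal read-out can be sketched as follows (NumPy, with a toy vertex count so the tensors stay small; the paper uses N = 6890):

```python
import numpy as np

def diagonal_displacements(pred):
    """pred: (N, N, 3) prediction tensor, one direction per (query, vertex) pair.
    Returns the (N, 3) diagonal: the displacement each vertex predicts for itself."""
    idx = np.arange(pred.shape[0])
    return pred[idx, idx, :]

N = 8  # toy size for the sketch
pred = np.random.randn(N, N, 3)
verts = np.zeros((N, 3))
verts = verts + diagonal_displacements(pred)  # move each vertex along its own direction
```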
For the task of registration of the MANO model [67], we instead predict 778 vertices.
To compare LVD against other baselines, we used their available code.
For SMPL-X, we fitted the SMPL model for better comparison with ours and previous works, using their most recent code (SMPLify-X) with the variational prior.
Fig. 1. Convergence plot of the proposed optimization for voxel-based experiments, in comparison to image-based reconstruction.
In comparison with the reported results on image-based reconstruction (which are also shown in the main paper), volumetric reconstruction takes almost a second to converge with our settings.
Experiments were run on a single GeForce GTX 1080 Ti GPU.
The black line represents the average of all vertex errors, while the remaining colors show how the error is distributed among different body parts.
For the task of human reconstruction from images, we then render each augmentation by rotating around the yaw axis to gather views with different illuminations.
Note that the original data consisted only of a few hundred 3D scans, all with very average body shapes.
The augmentation let the model represent more diverse shapes and avoid overfitting, but the proposed Learned Vertex Descent paradigm was necessary to represent them well.
We also show more qualitative examples of 3D reconstruction from a single view in-the-wild in Fig 3, and Fig 4 shows comparisons with the rest of the methods that are not shown in the main document.
In particular, we noted several differences between optimization-based and learning-based body pose/shape estimation methods.
On one hand, optimization-based methods [11,56] are often accurate, but have severe failure cases and are slow.
On the other hand, learning based methods [37,68,18,39] regress global parameters from the full image.
Hence, the shape estimates have a strong bias towards the mean.
Moreover, learning-based methods are not able to verify their initial estimates against the image.
Our goal in this paper is to combine the advantages of both methods.
LVD produces varied shape estimates thanks to the learned per-vertex descent directions, which are conditioned on local image evidence, and can run in real time.
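The inference procedure described throughout the paper, i.e. repeatedly querying the network and displacing vertices until convergence, can be sketched as follows (NumPy; `predict_directions` is a hypothetical stand-in for the learned network, and the step size and tolerance are illustrative):

```python
import numpy as np

def predict_directions(verts, target):
    """Stand-in for the learned network: here it simply points toward
    a fixed target surface so the loop is runnable."""
    return target - verts

def lvd_fit(init_verts, target, step=0.5, iters=50, tol=1e-4):
    verts = init_verts.copy()
    for _ in range(iters):
        delta = predict_directions(verts, target)
        verts += step * delta                       # displace every vertex
        if np.linalg.norm(delta, axis=-1).max() < tol:
            break                                   # converged
    return verts

# All vertices initialised at a single point, as in the paper's experiments.
target = np.random.rand(6890, 3)
fitted = lvd_fit(np.zeros((6890, 3)), target)
```

With the learned network in place of the stand-in, each iteration re-extracts features at the updated vertex projections, which is what lets the estimate be checked against the image evidence as it is refined.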
In this direction, Fig. 5 includes more results on the task of 3D scan registration, and Fig. 6 shows 3D registration results of MANO with LVD in comparison to those of IP-
Failure cases from LVD in body shape estimation from single view images (first row), 3D registration of humans from point clouds (second row - left) and 3D registration from hands (second row - right).
For the task of body shape estimation from a single view (first row), the body shapes we can generate are limited by the SMPL model and the training data, and cannot accurately reproduce body shapes of e.g. pregnant women (second example).
For instance, examples in Fig 7 top-left and top-right show scenarios that are rare in the train data, and the predicted body does not correctly adjust to the input image.