We introduce 3D Moments, a new computational photography effect. As input we
take a pair of near-duplicate photos, i.e., photos of moving subjects from
similar viewpoints, common in people's photo collections. As output, we produce
a video that smoothly interpolates the scene motion from the first photo to the
second, while also producing camera motion with parallax that gives a
heightened sense of 3D. To achieve this effect, we represent the scene as a
pair of feature-based layered depth images augmented with scene flow. This
representation enables motion interpolation along with independent control of
the camera viewpoint. Our system produces photorealistic space-time videos with
motion parallax and scene dynamics, while plausibly recovering regions occluded
in the original views. We conduct extensive experiments demonstrating superior
performance over baselines on public datasets and in-the-wild photos. Project
page: https://3d-moments.github.io/
Qianqian Wang1,2 Zhengqi Li1 David Salesin1 Noah Snavely1,2 Brian Curless1,3
Janne Kontkanen1
1Google Research 2Cornell Tech, Cornell University 3University of Washington
arXiv:2205.06255v1 [cs.CV] 12 May 2022
Figure 1. People often take many near-duplicate photos in an attempt to capture the perfect expression.
Given a pair of these photos, taken from nearby viewpoints (left), our proposed approach brings these photos to life as 3D Moments, producing space-time videos with cinematic camera motions and interpolated scene motion (right).
We call this new effect 3D Moments: given a pair of near-duplicate photos depicting a dynamic scene from nearby (perhaps indistinguishable) viewpoints, such as the images in Fig 1 (left), our goal is to simultaneously enable cinematic camera motion with 3D parallax (including novel, extrapolated viewpoints) while faithfully interpolating scene motion to synthesize short space-time videos like the one shown in Fig 1 (right).
3D Moments combine both camera and scene motion in a compelling way, but involve very challenging vision problems: we must jointly infer 3D geometry, scene dynamics, and content that becomes newly disoccluded during the animation.
Despite great progress towards each of these individual problems, tackling all of them jointly is non-trivial, especially with image pairs with unknown camera poses as input.
To address these challenges, we propose a novel approach for creating 3D Moments by explicitly modeling time-varying geometry and appearance from two uncalibrated, near-duplicate photos.
We represent the scene as a pair of feature-based layered depth images (LDIs) augmented with scene flow. To render a novel view at a novel time, we lift these feature LDIs into a pair of 3D point clouds, and employ a depth-aware, bidirectional splatting and rendering module that combines the splatted features from both directions.
We extensively test our method on both public multi-view dynamic scene datasets and in-the-wild photos in terms of rendering quality, and demonstrate superior performance compared to state-of-the-art baselines.
In summary, our main contributions include: (1) the new task of creating 3D Moments from near-duplicate photos of dynamic scenes, and (2) a new representation based on feature LDIs augmented with scene flows, and a model that can be trained for creating 3D Moments.
Recent neural rendering methods achieve impressive synthesis results [17, 20, 43, 44, 47, 54], but typically assume many views as input and thus do not suit our task.
We focus here on methods that take just one or two views.
Many single-view synthesis methods involve estimating dense monocular depths and filling in occluded regions [7, 11, 14, 25, 34, 38, 48], while others seek to directly regress to a scene representation in a single step [30, 35, 45, 46, 53].
We draw on ideas from several works in this vein: SynSin learns a feature 3D point cloud for each input image and projects it to the target view where the missing regions are inpainted [48].
We build on these ideas but extend them to the case of dynamic scenes.
Like our method, some prior view synthesis methods operate on two views.
For instance, Stereo Magnification [56] and related work [40] take two narrow-baseline stereo images and predict a multi-plane image that enables real-time novel view synthesis.
However, unlike our approach, these methods assume that there is some parallax from camera motion, and again can only model static scenes, not ones where there is scene motion between the two input views.
Frame interpolation methods do not distinguish between camera and scene motion: all object motions are interpolated in 2D image space.
Moreover, most frame interpolators assume a linear motion model [2, 6, 8, 12, 21–24, 26, 39] although some recent works consider quadratic motion [18, 50].
Most of the interpolators use image warping with optical flow, although as a notable exception, Niklaus et al. [23, 24] synthesize intermediate frames by blending the inputs with kernels predicted by a neural network.
However, frame interpolation alone cannot generate 3D Moments, since it does not recover the 3D geometry or allow control over camera motion in 3D.
Space-time view synthesis. A number of methods have sought to synthesize novel views for dynamic scenes in both space and time by modeling time-varying 3D geometry and appearance.
Recently, several neural rendering approaches [15,27–29,49,52] have shown promising results on space-time view synthesis from monocular dynamic videos.
To interpolate both viewpoints and time, recent works either directly interpolate learned latent codes [27, 28], or apply splatting with estimated 3D scene flow fields [15].
Our goal is to create 3D Moments by independently controlling the camera viewpoint while simultaneously interpolating scene motion to render arbitrary nearby novel views at arbitrary intermediate times t ∈ [0, 1].
Our output is a space-time video with cinematic camera motions and interpolated scene motion.
To this end, we propose a new framework that enables efficient and photorealistic space-time novel view synthesis from near-duplicate photos.
A 2D feature extractor is applied to each color layer of the inpainted LDIs to obtain feature layers, resulting in feature LDIs (F0,F1), where colors in the inpainted LDIs have been replaced with features.
To render a novel view at intermediate time t, we lift the feature LDIs to a pair of 3D point clouds (P0, P1) and bidirectionally move points along their scene flows to time t.
We then project and splat these 3D feature points to form forward and backward 2D feature maps (from P0 and P1, respectively) and their corresponding depth maps.
We linearly blend these maps with weight map Wt derived from spatio-temporal cues, and pass the result to an image synthesis network to produce the final image.
The key to our approach is building a feature LDI from each of the inputs, where each pixel in the feature LDI carries a depth, a scene flow vector, and a learnable feature.
Finally, to render a novel view at intermediate time t, we lift the feature LDIs into a pair of point clouds (P0,P1) and propose a scene-flow-based bidirectional splatting and rendering module to combine the features from two directions and synthesize the final image.
Our method first computes the underlying 3D scene geometry.
As near-duplicates typically have scene dynamics and very little camera motion, standard Structure from Motion (SfM) and stereo reconstruction methods fail to produce reliable results.
Instead, we found that the state-of-the-art monocular depth estimator DPT [31] can produce sharp and plausible dense depth maps for images in the wild.
Therefore, we rely on DPT to obtain the geometry for each image.
To account for small camera pose changes between the views, we compute optical flow between the views using RAFT [42], estimate a homography between the images using the flow, and then warp I1 to align with I0.
Because we only want to align the static background of two images, we mask out regions with large optical flow, which often correspond to moving objects, and compute the homography using the remaining mutual correspondences given by the flow.
Once I1 is warped to align with I0, we treat their camera poses as identical.
To simplify notation, we henceforth re-use I0 and I1 to denote the aligned input images.
We then apply DPT [31] to predict a depth map for each image.
To align the depth range of I1 with I0 we estimate a global scale and shift for I1’s disparities (i.e., 1/depth), using flow correspondences in the static regions.
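A minimal sketch of this scale-and-shift fit, assuming disp0/disp1 are the predicted disparity maps, flow_01 maps I0 pixels into I1, and static_mask marks static pixels with mutual correspondences (all names illustrative):

```python
import numpy as np

def align_disparity(disp0, disp1, flow_01, static_mask):
    """Solve disp0 ≈ a * disp1(warped) + b in the least-squares sense and apply it."""
    H, W = disp0.shape
    ys, xs = np.mgrid[0:H, 0:W]
    x1 = np.clip(np.round(xs + flow_01[..., 0]).astype(int), 0, W - 1)
    y1 = np.clip(np.round(ys + flow_01[..., 1]).astype(int), 0, H - 1)

    d0 = disp0[static_mask]                       # reference disparities in I0
    d1 = disp1[y1[static_mask], x1[static_mask]]  # corresponding disparities in I1

    A = np.stack([d1, np.ones_like(d1)], axis=-1)
    (a, b), *_ = np.linalg.lstsq(A, d0, rcond=None)
    return a * disp1 + b                          # I1's disparities in I0's scale
```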
Next, we convert the aligned photos and their dense depths to an LDI representation [37], in which layers are separated according to depth discontinuities, and apply RGBD inpainting in occluded regions as described below.
Figure 2. Method overview: the inpainted depth and color layers of I0 and I1 are converted into feature layers by a 2D feature extractor, yielding feature LDIs; these are lifted to point clouds, interpolated to time t along their scene flows, splatted into the target view, and passed to an image synthesis network to produce the novel view at time t.
We found prior LDI construction and inpainting pipelines [38] to be computationally expensive and their output difficult to feed into a training pipeline.
More recently, Jampani et al. [7] employ a two-layer approach that would otherwise suit our requirements, but it is restricted in the number of layers.
We therefore propose a simple yet effective strategy for creating and inpainting LDIs that flow well into our learning-based pipeline.
We apply agglomerative clustering [19] to the disparities of both images to obtain their LDIs, L0 ≜ {(C_0^l, D_0^l)}_{l=1}^{L0} and L1 ≜ {(C_1^l, D_1^l)}_{l=1}^{L1}, where C^l and D^l denote the l-th color and depth layer, and L0 and L1 denote the number of layers constructed from I0 and I1, respectively.
Each color layer is an RGBA image, with the alpha channel indicating valid pixels in this layer.
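For illustration, the clustering step could be sketched as follows; the fixed number of clusters and the use of scikit-learn are choices made purely for concreteness (the actual criterion for choosing the number of layers may differ):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def disparity_to_layers(disp, n_layers=3, subsample=8):
    """Assign every pixel to a depth layer by clustering disparities.

    Returns an (H, W) label map with 0 = farthest layer, n_layers-1 = nearest.
    Clustering runs on a subsampled set of disparities for tractability; all
    remaining pixels are assigned to the closest cluster center.
    """
    samples = disp[::subsample, ::subsample].reshape(-1, 1)
    clust = AgglomerativeClustering(n_clusters=n_layers, linkage="ward").fit(samples)

    # Order centers from far (small disparity) to near (large disparity).
    centers = np.sort([samples[clust.labels_ == k].mean() for k in range(n_layers)])

    # Full-resolution assignment to the nearest center.
    return np.abs(disp[..., None] - centers[None, None, :]).argmin(axis=-1)
```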
Next, we apply depth-aware inpainting to each color and depth LDI layer in occluded regions.
To inpaint missing contents in layer l, we treat all the pixels between the lth layer and the farthest layer as the context region (i.e., the region used as reference for inpainting), and exclude all irrelevant foreground pixels in layers nearer than layer l.
We keep only inpainted pixels whose depths are smaller than the maximum depth of layer l so that inpainted regions do not mistakenly occlude layers farther than layer l.
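The context and validity rules above can be made concrete with boolean masks; the label convention (0 = farthest layer) and the function names below are illustrative choices:

```python
import numpy as np

def inpainting_masks(labels, l):
    """Context/hole masks for inpainting LDI layer l (labels: 0 = farthest layer).

    Context = pixels at layer l or farther; pixels in layers nearer than l are
    excluded from the context and form the region to be filled behind them.
    """
    context = labels <= l
    hole = labels > l
    return context, hole

def valid_inpainted(inpainted_depth, hole, max_depth_l):
    """Keep only inpainted pixels nearer than layer l's maximum depth."""
    return hole & (inpainted_depth < max_depth_l)
```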
We adopt the pre-trained inpainting network from Shih et al. [38] to inpaint color and depth at each layer.
Fig 3 (b) shows an example of LDI layers after inpainting.
Note that we choose to inpaint the two LDIs up front rather than performing perframe inpainting for each rendered novel view, as the latter would suffer from multi-view inconsistency due to the lack of a global representation for disoccluded regions.
We now have inpainted color LDIs L0 and L1 for novel view synthesis.
From each individual LDI, we could synthesize new views of the static scene.
However, the LDIs alone do not model the scene motion between the two photos.
To enable motion interpolation, we estimate 3D motion fields between the images.
To do so, we first compute 2D optical flow between the two aligned images and perform a forward and backward consistency check to identify pixels with mutual correspondences.
Given 2D mutual correspondences, we use their associated depth values to compute their 3D locations and lift the 2D optical flow to 3D scene flow, i.e., 3D translation vectors that displace each 3D point from one time to another.
This process gives the scene flow for mutually visible pixels of the LDIs.
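A sketch of lifting 2D flow to 3D scene flow, assuming a shared intrinsic matrix K for the aligned views (in practice a default focal length can be assumed); names are illustrative:

```python
import numpy as np

def lift_to_scene_flow(flow_01, depth0, depth1, K):
    """Return per-pixel 3D translations u0 with X1 = X0 + u0 for mutually visible pixels."""
    H, W = depth0.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    K_inv = np.linalg.inv(K)

    def backproject(x, y, depth):
        pix = np.stack([x, y, np.ones_like(x)], axis=-1)   # homogeneous pixel coords
        return (pix @ K_inv.T) * depth[..., None]           # 3D points in camera space

    # Corresponding pixel in I1 and its depth.
    x1 = np.clip(xs + flow_01[..., 0], 0, W - 1)
    y1 = np.clip(ys + flow_01[..., 1], 0, H - 1)
    d1 = depth1[np.round(y1).astype(int), np.round(x1).astype(int)]

    X0 = backproject(xs, ys, depth0)
    X1 = backproject(x1, y1, d1)
    return X1 - X0   # keep only where the forward/backward consistency check passes
```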
However, for pixels that do not have mutual correspondences, such as those occluded in the other view or those in inpainted regions, the scene flow is initially undefined; we therefore inpaint it as described below.
Figure 3. From an image to an inpainted LDI.
Given an input image and its estimated monocular depth [31], we first apply agglomerative clustering [19] to separate the RGBD image into multiple (in this example 3) RGBDA layers, as shown in (a); (b) shows the LDI layers after inpainting.
In particular, for each pixel in L0 with a corresponding point in L1, we store its associated scene flow at its pixel location, resulting in scene flow layers initially containing only well-defined values for mutually visible pixels.
To inpaint the remaining scene flow, we perform a diffusion operation that iteratively applies a masked blur filter to each scene flow layer until all pixels in L0 have scene flow vectors.
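A minimal sketch of such a masked-blur diffusion (the iteration count and kernel size are arbitrary choices here):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def diffuse_scene_flow(flow3d, known, n_iters=100, size=5):
    """Fill undefined scene flow by repeatedly applying a masked box blur.

    flow3d: (H, W, 3) scene flow layer; known: boolean mask of pixels whose flow
    came from mutual correspondences. Known values are re-imposed every iteration.
    """
    flow = np.where(known[..., None], flow3d, 0.0)
    weight = known.astype(np.float32)
    for _ in range(n_iters):
        num = uniform_filter(flow, size=(size, size, 1))
        den = uniform_filter(weight, size=size)
        filled = num / np.maximum(den[..., None], 1e-6)
        flow = np.where(known[..., None], flow3d, filled)   # keep known values fixed
        weight = np.maximum(weight, (den > 1e-6).astype(np.float32))
    return flow
```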
To render an image from a novel camera viewpoint and time with these two scene-flow-augmented LDIs, one simple approach is to directly interpolate the LDI point locations to the target time according to their scene flow and splat RGB values to the target view.
We therefore correct for such errors by training a 2D feature extraction network that takes each inpainted LDI color layer Cl as input and produces a corresponding 2D feature map Fl.
These features encode local appearance of the scene and are trained to mitigate rendering artifacts introduced by inaccurate depth or scene flow and to improve overall rendering quality.
This step converts our inpainted color LDIs to feature LDIs F0 ≜ {F_0^l}_{l=1}^{L0} and F1 ≜ {F_1^l}_{l=1}^{L1}, both of which are augmented with scene flows.
Finally, we lift all valid pixels of these feature LDIs into a pair of point clouds P0 ≜ {(x0, f0, u0)} and P1 ≜ {(x1, f1, u1)}, where each point is defined by its 3D location x, appearance feature f, and 3D scene flow u.
Given a pair of 3D feature point clouds P0 and P1, we wish to interpolate and render them to produce the image at a novel view and time t.
Inspired by prior work [2, 21], we propose a depth-aware bidirectional splatting technique.
In particular, we first obtain the 3D location of every point (in both point clouds) at time t by displacing it according to its associated scene flow scaled by t: x0→t = x0 + tu0, x1→t = x1 + (1 − t)u1.
The displaced points and their associated features from each direction (0 → t or 1 → t) are then separately splatted into the target viewpoint using differentiable point-based rendering [48], which results in a pair of rendered 2D feature maps F0→t, F1→t and depth maps D0→t, D1→t.
The splatted feature and depth maps from the two directions are linearly blended with a weight map Wt that depends on both the target time t and the splatted depths, yielding a fused feature map Ft and depth map Dt. Here β ∈ R+ is a learnable parameter that controls contributions based on relative depth.
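Since the exact blending equation is not reproduced above, the sketch below shows one plausible depth-aware instantiation of Wt; the exponential depth weighting, tensor shapes, and names are assumptions rather than the exact definition:

```python
import torch

def blend_bidirectional(F0t, D0t, F1t, D1t, t, beta):
    """Blend splatted features/depths from both directions with a depth-aware weight.

    F0t, F1t: (B, C, H, W) splatted feature maps; D0t, D1t: (B, 1, H, W) depth maps;
    t in [0, 1]. The weight favors the temporally closer input and, through
    exp(-beta * depth), points that are closer to the camera.
    """
    w0 = (1.0 - t) * torch.exp(-beta * D0t)
    w1 = t * torch.exp(-beta * D1t)
    Wt = w0 / (w0 + w1 + 1e-8)
    Ft = Wt * F0t + (1.0 - Wt) * F1t
    Dt = Wt * D0t + (1.0 - Wt) * D1t
    return Ft, Dt
```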
Finally, Ft and Dt are fed to a network that synthesizes the final color image.
3.5. Training
We train the feature extractor, image synthesis network, and the parameter β on two video datasets to optimize the rendering quality, as described below.
To train our system, we ideally would use image triplets with known camera parameters, where each triplet depicts a dynamic scene from a moving camera, so that we can use two images as input and the third one (at an intermediate time and novel viewpoint) as ground truth.
However, such data is difficult to collect at scale, since it either requires capturing dynamic scenes with synchronized multi-view camera systems, or running SfM on dynamic videos shot from moving cameras.
The former requires a time-consuming setup and is difficult to scale to in-the-wild scenarios, while the latter cannot guarantee the accuracy of estimated camera parameters due to moving objects. We therefore combine two sources of training data.
For the first source, we use Vimeo-90K [51], a widely used dataset for learning frame interpolation.
For the second source, we use the MannequinChallenge dataset [14], which contains over 170K video frames of humans pretending to be statues captured from moving cameras, with corresponding camera poses estimated through SfM [56].
We could conceptually train this whole system, but in practice we train only modules (c), (d), and (e), and use pretrained state-of-the-art models [31, 38] for (a) and (b). This makes training less computationally expensive, and also avoids the need for the large-scale direct supervision required for learning high-quality depth estimation and RGBD inpainting networks.
Training losses. We train our system using image reconstruction losses.
In particular, we minimize perceptual loss [9,55] and l1 loss between the predicted and ground-truth images to supervise our networks.
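One way to implement this supervision (using the LPIPS package as the perceptual term; the loss weights are illustrative):

```python
import torch.nn.functional as F
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net="vgg")  # VGG-based perceptual distance

def reconstruction_loss(pred, gt, w_l1=1.0, w_perc=1.0):
    """l1 + perceptual loss between predicted and ground-truth images in [-1, 1]."""
    return w_l1 * F.l1_loss(pred, gt) + w_perc * perceptual(pred, gt).mean()
```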
4. Experiments
4.1. Implementation details
For the feature extractor, we use ResNet34 [5] truncated after layer3 followed by two additional up-sampling layers to extract feature maps for each RGB layer, which we augment with a binary mask to indicate which pixels are covered (observed or inpainted) in that layer.
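A sketch of such an extractor in PyTorch/torchvision; the output width and the design of the up-sampling head are assumptions:

```python
import torch
import torch.nn as nn
import torchvision

class LayerFeatureExtractor(nn.Module):
    """ResNet34 truncated after layer3, plus two up-sampling stages back to full resolution."""

    def __init__(self, out_dim=32):
        super().__init__()
        resnet = torchvision.models.resnet34(weights=None)
        # First conv accepts RGB + binary coverage mask (4 channels).
        resnet.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,   # stop after layer3 (1/16 res, 256 ch)
        )
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(128, out_dim, 3, padding=1),
        )

    def forward(self, rgb, mask):
        return self.up(self.backbone(torch.cat([rgb, mask], dim=1)))
```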
We splat each point with a radius large enough to prevent the rendered point clouds from becoming semi-transparent due to gaps between samples when the camera zooms in.
We train our system using Adam [10], with base learning rates set to 10−4 for the feature extractor and image synthesis network, and 10−6 for the optical flow network [42].
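With hypothetical module names, an optimizer setup matching the learning rates above could look like the following (the rate for β is an assumption):

```python
import torch

optimizer = torch.optim.Adam([
    {"params": feature_extractor.parameters(), "lr": 1e-4},
    {"params": image_synthesis_net.parameters(), "lr": 1e-4},
    {"params": flow_net.parameters(), "lr": 1e-6},  # fine-tuned optical flow network
    {"params": [beta], "lr": 1e-4},                 # depth-weighting parameter (rate assumed)
])
```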
We discard sequences with large alignment errors during training.
Please refer to the supplement for additional details.
4.2. Baselines
To our knowledge, there is no prior work that serves as a direct baseline for our new task of space-time view synthesis from the near-duplicate photos.
We then use 2D optical flows generated by RAFT [42] to find pixels with mutual correspondences and compute their scene flows in the forward and backward directions.
Specifically, to synthesize an image at the novel time and viewpoint, we first adopt a state-of-the-art frame interpolation method, XVFI [39], to synthesize a frame at the intermediate time.
We then apply 3D photo inpainting [38] to turn the interpolated frame into an inpainted LDI and render it from a desired viewpoint through a constructed mesh.
For a fair comparison, we upgrade the 3D photo method to use the state-of-the-art monocular depth backbone DPT [31], i.e., the same monocular depth predictor we use in our approach.
This baseline reverses the order of operations in the aforementioned method.
First, we apply the 3D photo [38] to each of the near-duplicates and render them to the target viewpoint separately.
We then apply XVFI [39] to these two rendered images to obtain a final view at intermediate time t.
4.3. Comparisons on public benchmarks
Evaluation datasets. We evaluate our method and baselines on two public multi-view dynamic scene datasets: the NVIDIA Dynamic Scenes Dataset [52] and the UCSD Multi-View Video Dataset [16].
The videos are recorded by 10 synchronized action cameras at 120FPS.
We run COLMAP [36] on each of the multi-view videos (masking out dynamic components using provided motion masks) to obtain camera parameters and sparse point clouds of the static scene contents.
In each triplet, we select the two input views to be at the same camera viewpoint and two frames apart, and the target view to be the middle frame at a nearby camera viewpoint.
To properly render images into the target viewpoint and compare with the ground truth, we need to obtain aligned depth maps that are consistent with the reconstructed scenes.
Note that all the methods have relatively low PSNR/SSIM because these metrics are sensitive to pixel misalignment, and inaccurate geometry from monocular depth networks can cause the rendered images to not fully align with the ground truth.
For “No inpainting”, we train the system without inpainting color and depth in our LDIs and rely on the image synthesis network to fill in disoccluded regions in each rendered view separately (prone to temporal inconsistency).
For “No bidirectional warping”, we use only single-directional scene flow from time 0 to time 1.
Performance. Our method can be applied to new near-duplicate photo pairs without requiring test-time optimization.
These operations are performed once for each duplicate pair.
The projection-and-image-synthesis stage takes 0.71s to render each output frame.
5. Discussion and Conclusion
We presented a new task of creating 3D Moments from near-duplicate photos, allowing simultaneous view extrapolation and motion interpolation for a dynamic scene.
By training on both posed and unposed video datasets, our method is able to produce photorealistic space-time videos from the near-duplicate pairs without substantial visual artifacts or temporal inconsistency.
Applying frame interpolation and then 3D Photos leads to strong flickering artifacts due to inconsistent inpainting in each frame (see supplement video).
We refer readers to the supplementary video for better visual comparisons of these generated 3D Moments.
Figure 5. Qualitative comparisons on in-the-wild photos.
Compared with the baselines, our approach produces more realistic views with significantly fewer visual artifacts, especially in moving or disoccluded regions.
Please refer to the supplementary video for failure cases.
Future work includes designing an automatic selection scheme for photo pairs suitable for 3D Moment creation, automatically detecting failures, better modeling of large or non-linear motions, and extending the current method to handle more than two near-duplicate photos.
References
[4] Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew Duvall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec.
Asymmetric bilateral motion estimation for video frame interpolation. In ICCV, 2021.
[7] V. Jampani, Huiwen Chang, Kyle Sargent, Abhishek Kar, Richard Tucker, Michael Krainin, Dominik Philemon Kaeser, William T. Freeman, D. Salesin, Brian Curless, and Ce Liu. SLIDE: Single image 3d photography with soft layering and depth-aware inpainting. In ICCV, 2021.
[8] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik G. Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In CVPR, pages 9000–9008, 2018.
[9] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711.
[11] Johannes Kopf, Kevin Matzen, Suhib Alsisan, Ocean Quigley, Francis Ge, Yangming Chong, Josh Patterson, Jan-Michael Frahm, Shu Wu, Matthew Yu, Peizhao Zhang, Zijian He, Péter Vajda, Ayush Saraf, and Michael F. Cohen. One shot 3d photography. ACM Transactions on Graphics (TOG), 39:76:1–76:13, 2020.
[12] Hyeongmin Lee, Taeoh Kim, Tae-young Chung, Daehyun Pak, Yuseok Ban, and Sangyoun Lee. Adacof: Adaptive collaboration of flows for video frame interpolation. In CVPR, pages 5316–5325, 2020.
[13] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, S. Lovegrove, Michael Goesele, and Zhaoyang Lv. Neural 3d video synthesis. ArXiv, abs/2103.02597, 2021.
[14] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T Freeman.
[20] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. ECCV, 2020.
[21] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, pages 5436–5445, 2020.
[22] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In CVPR, pages 2270–2279, 2017.
[23] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In ICCV, pages 261–270, 2017.
[24] Simon Niklaus, Long Mai, and Oliver Wang. Revisiting adaptive convolutions for video frame interpolation. arXiv preprint arXiv:2011.01280, 2020.
[25] Simon Niklaus, Long Mai, Jimei Yang, and F. Liu. 3d ken burns effect from a single image. ACM TOG, 38:1–15, 2019.
[26] Junheum Park, Keunsoo Ko, Chul Lee, and Chang-Su Kim. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In ECCV, pages 109–125. Springer, 2020.
[27] Keunhong Park, U. Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martín Brualla. Deformable neural radiance fields. In ICCV, 2021.
[28] Keunhong Park, U. Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. SIGGRAPH Asia, abs/2106.13228, 2021.
[29] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer.
[38] 3d photography using context-aware layered depth inpainting. In CVPR, pages 8028–8038, 2020.
[39] Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. Xvfi: extreme video frame interpolation. In ICCV, 2021.
[40] Pratul P Srinivasan, Richard Tucker, Jonathan T Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In CVPR, pages 175–184, 2019.
[41] Timo Stich, Christian Linz, Georgia Albuquerque, and Marcus Magnor. View and time interpolation in image space. In Computer Graphics Forum, volume 27, pages 1781–1787. Wiley Online Library, 2008.
[42] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, pages 402–419. Springer, 2020.
[43] A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi, K. Sunkavalli, R. Martin-Brualla, T. Simon, J. Saragih, M. Nießner, R. Pandey, S. Fanello, G. Wetzstein, J.-Y. Zhu, C. Theobalt, M. Agrawala, E. Shechtman, D. B Goldman, and M. Zollhöfer. State of the Art on Neural Rendering. Computer Graphics Forum (EG STAR 2020), 2020.
[44] Ayush Tewari, Justus Thies, Ben Mildenhall, Pratul Srinivasan, Edgar Tretschk, Yifan Wang, Christoph Lassner, Vincent Sitzmann, Ricardo Martin-Brualla, Stephen Lombardi, Tomas Simon, Christian Theobalt, Matthias Niessner, Jonathan T. Barron, Gordon Wetzstein, Michael Zollhoefer, and Vladislav Golyanik.
Layer-structured 3d scene inference via view synthesis. In ECCV, pages 302–317, 2018.
[47] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser.
[52] Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In CVPR, pages 5336–5345, 2020.
[53] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In CVPR, pages 4578–4587, 2021.
[54] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
[55] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018.
[56] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely.