Deep learning algorithms are rapidly changing the way in which audiovisual
media can be produced. Synthetic audiovisual media generated with deep learning
- often subsumed colloquially under the label "deepfakes" - have a number of
impressive characteristics; they are increasingly trivial to produce, and can
be indistinguishable from real sounds and images recorded with a sensor. Much
attention has been dedicated to ethical concerns raised by this technological
development. Here, I focus instead on a set of issues related to the notion of
synthetic audiovisual media, its place within a broader taxonomy of audiovisual
media, and how deep learning techniques differ from more traditional approaches
to media synthesis. After reviewing important etiological features of deep
learning pipelines for media manipulation and generation, I argue that
"deepfakes" and related synthetic media produced with such pipelines do not
merely offer incremental improvements over previous methods, but challenge
traditional taxonomical distinctions, and pave the way for genuinely novel
kinds of audiovisual media.
Keywords: deepfakes, deep learning, media synthesis, disinformation, art
1 Introduction
Recent research in artificial intelligence has been dominated by deep learning (DL), a class of algorithms inspired by biological neural networks that can learn automatically to perform certain tasks from large amounts of data.
Much work has been dedicated to developing DL algorithms capable of perceiving and exploring (real or virtual) environments, processing and understanding natural language, and even reasoning.
However, a significant amount of recent work in DL has also gone towards crafting algorithms that can synthesize novel audiovisual media, with remarkable success.
Nonetheless, the recent progress of DL in the domain of audiovisual media synthesis is rapidly changing the way in which we approach media creation, whether for communication, entertainment, or artistic purposes.
An impressive and salient example of this progress can be found in so-called “deepfakes”, a portmanteau word formed from “deep learning” and “fake” (Tolosana et al [2020]).
This term originated in 2017, from the name of a Reddit user who developed a method based on DL to substitute the face of an actor or actress in pornographic videos with the face of a celebrity.
However, since its introduction, the term “deepfake” has been generically applied to videos in which faces have been replaced or otherwise digitally altered with the help of DL algorithms, and even more broadly to any DL-based manipulations of sound, image and video.
While deepfakes have recently garnered attention in philosophy, the discussion has mostly focused on their potentially harmful uses, such as impersonating identities and disseminating false information (Floridi [2018], de Ruiter [2021]), undermining the epistemic and testimonial value of photographic media and videos (Fallis [2020], Rini [2020]), and damaging reputations or furthering gender inequality through fake pornographic media (Öhman [2020]).
These ethical and epistemic concerns are significant, and warranted.
However, deepfakes and similar techniques also raise broader issues about the notion of synthetic audiovisual media, as DL algorithms appear to challenge traditional distinctions between different kinds of media synthesis.
Given the lack of a single clear definition of deepfakes, it is helpful to focus on the more explicit notion of DL-based synthetic audiovisual media, or DLSAM for short.
Drawing upon the sub-categories of DLSAM informed by this taxonomy, I will discuss the extent to which they represent a qualitative change in media creation that should lead us to expand our understanding of synthetic audiovisual media, or whether they constitute merely incremental progress over traditional approaches (§4).
The term “media” may refer to physical materials (e.g., tape, disk, or paper) used for recording or reproducing data (storage media); or, by extension, to the format in which data is stored, such as JPEG or MP3 for digital image and sound respectively (media format).
It may also refer more broadly to the kinds of data that can be recorded or reproduced in various materials and formats (media type); in that sense, text, image, and sound are distinct media, even though they may be stored in the same physical substrate (e.g., a hard drive).1
By “audiovisual media”, I will generally refer to artifacts or events involving sound, still images, moving images, or a combination of the above, produced to deliver auditory and/or visual information for a variety of purposes, including communication, art, and entertainment.
For convenience, I will refer to media involving sound only as “auditory media”, media involving still images only as “static visual media”, and media involving moving images (with or without sound) as “dynamic visual media”.
Audiovisual media fall into two broad etiological categories: hand-made media, produced by hand or with the help of manual tools (e.g., paint brushes), and machine-made media, produced with the help of more sophisticated devices whose core mechanism is not, or not merely, hand-operated (e.g., cameras and computers) (fig. 1).
Archival media are brought about by real objects and events in a mechanical manner.
They can be said to “record” reality in so far as they capture it through a process that is not directly mediated by the producer’s desires and beliefs.
More specifically, they are counterfactually dependent upon the real objects or events that bring them about, even if the intentional attitudes of any human involved in producing them are held fixed.
What someone believes they are capturing when pressing a button on a camera or microphone is irrelevant to what the camera or microphone will in fact record.
This property of archival audiovisual media corresponds more or less to what Kendall Walton has characterized as “transparency” with respect to photography in particular (Walton [1984]).
Paintings, no matter how realistic, do not put us in perceptual contact with reality in this way, because they are not mechanically caused by objects in the painter’s environment; rather, they might be caused by them only indirectly, through the mediation of the painter’s intentional attitudes (e.g., the painter’s belief that the object they are attempting to depict looks a certain way).2
While Walton focused on the case of archival visual media, it has been argued that raw audio recordings are also transparent in that sense (Mizrahi [2020]).
By contrast with archival audiovisual media, synthetic audiovisual media do not merely record real objects and events; instead, their mode of production intrinsically involves a generative component.
In turn, these media can be partially or totally synthetic (fig. 1).
The former involve the modification – through distortion, combination, addition, or subtraction – of archival media: while they involve a generative component, they also involve source material that has not been generated but recorded.
The latter are entirely generative: they involve the creation of new sounds, images, or videos that do not directly incorporate archival media, even if they might be inspired by them.
More specifically, there are two ways in which audiovisual media can be only partially synthetic, depending on whether they result from a global or local manipulation of archival media.
Global manipulations involve applying an effect to an entire sound recording, photograph, or video.
For example, one can apply a filter to an audio signal to modify its loudness, pitch, frequency ranges, and reverberation, or thoroughly distort it (which is traditionally done with effects pedals and amplifiers in some music genres).
1 ... particular medium is partially constituted by that medium.
2 There is some debate about whether photographs are actually transparent in Walton’s sense (e.g., Currie [1991]).
In the visual domain, one can likewise adjust various parameters such as hue, brightness, contrast, and saturation, or apply uniform effects like Gaussian blur or noise.
By contrast, local manipulations involve modifying, removing, or replacing proper parts of archival audiovisual media instead of adjusting global parameters.
In the visual domain, image editing software like Adobe Photoshop can be used to manipulate parts of images, and VFX software can be used to manipulate parts of video recordings through techniques like rotoscoping, compositing, or the integration of computer-generated imagery (CGI).
Totally synthetic audiovisual media are not produced by modifying pre-existing archival media, but consist instead in generating entirely novel sound or imagery.
Traditional forms of synthetic media include electronic music and sound effects generated with synthesizers or computers, computer-generated 3D rendering or digital illustration, and animated videos.
Note that the general distinction between archival and synthetic audiovisual media is orthogonal to the distinction between analog and digital signals.
Analog recording methods store continuous signals directly in or on the media, as a physical texture (e.g., phonograph recording), as a fluctuation in the field strength of a magnetic recording (e.g., tape recording), or through a chemical process that captures a spectrum of color values (e.g., film camera).
Digital recording methods involve quantizing an analog signal and representing it as discrete numbers on a machine-readable data storage.
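As a minimal illustration of the digital case, the sketch below samples and quantizes a continuous signal into discrete integer values; the signal, sample rate, and bit depth are arbitrary choices made for the example:

```python
import numpy as np

# A continuous "analog" signal: a 440 Hz sine wave lasting 10 ms.
sample_rate = 44_100                              # samples per second (CD-quality rate)
t = np.arange(0, 0.01, 1 / sample_rate)           # discrete sampling times
analog = np.sin(2 * np.pi * 440 * t)              # amplitude values in [-1, 1]

# Quantization: map each amplitude to one of 2**16 discrete integer levels.
levels = 2 ** 15 - 1
digital = np.round(analog * levels).astype(np.int16)

print(digital[:8])  # the recording is now just a sequence of machine-readable numbers
```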
Archival media can be produced through either analog (e.g., tape recorder, film camera) or digital (e.g., digital microphone and camera) recording methods.
Likewise, some synthetic media can be produced through analog means, as shown by the famous 1860 composite portrait of Abraham Lincoln produced with lithographs of Lincoln’s head and of John Calhoun’s body.
There are also a few edge cases in which the distinction between archival and synthetic media becomes less obvious, such as artworks involving collages of photographs in which no part of the source material is removed or occluded.
DL is currently the most prominent method in research on artificial intelligence, where it has surpassed more traditional techniques in various domains including computer vision and natural language processing.
It is part of a broader family of machine learning methods using so-called artificial neural networks, loosely inspired by the mammalian brain, that can learn to represent features of data for various downstream tasks such as detection or classification.
Deep learning specifically refers to machine learning methods using deep artificial neural networks, whose units are organized in multiple processing layers between input and output, enabling them to efficiently learn representations of data at several levels of abstraction (LeCun et al [2015], Buckner [2019]).
Deep neural networks can be trained end-to-end: given a large enough training dataset, they can learn automatically how to perform a given task with a high success rate, either through labeled samples (supervised learning) or from raw untagged data (unsupervised learning).
Given enough training data and computational power, DL methods have proven remarkably effective at classification tasks, such as labeling images using many predetermined classes like “African elephant” or “burrito” (Krizhevsky et al [2017]).
However, the recent progress of DL has also expanded to the manipulation and synthesis of sound, image, and video, with so-called “deepfakes” (Tolosana et al [2020]).
As mentioned at the outset, I will mostly leave this label aside to focus on the more precise category of DL-based synthetic audiovisual media, or DLSAM for short.
Any kind of observed data, such as speech or images, can be thought of as a finite set of samples from an underlying probability distribution in a (typically high-dimensional) space.
For example, the space of possible color images made of 512x512 pixels has no less than 786,432 dimensions – three dimensions per pixel, one for each of the three channels of the RGB color space.
Any given 512x512 image can be thought of as a point within that high-dimensional space.
Thus, all 512x512 images of a given class, such as dog photographs, or real-world photographs in general, have a specific probability distribution within $\mathbb{R}^{786432}$.
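As a concrete (if trivial) sketch of this way of thinking, the snippet below treats a 512x512 RGB image as a single point in $\mathbb{R}^{786432}$; the random array merely stands in for a real photograph:

```python
import numpy as np

height, width, channels = 512, 512, 3
image = np.random.randint(0, 256, size=(height, width, channels), dtype=np.uint8)

# Flattening the image yields its coordinates as a single point in a
# 786,432-dimensional space (one dimension per pixel per color channel).
point = image.reshape(-1).astype(np.float32)
print(point.shape)                  # (786432,)
print(height * width * channels)    # 786432
```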
Figure 2: The simplified architectures of a VAE and a GAN.
Deep generative models learn to approximate this kind of distribution; they can be used to estimate the likelihood of a given sample, and, importantly, to create new samples that are similar to samples from the learned probability distribution (this is the generative component of the model).
More precisely, deep generative models learn an intractable probability distribution X defined over $\mathbb{R}^n$, where X is typically complicated (e.g., disjoint), and n is typically large.
A large but finite number of independent samples from X are used as the model’s training data.
The goal of training is to obtain a generator that maps samples from a tractable probability distribution Z in $\mathbb{R}^q$ to points in $\mathbb{R}^n$ that resemble samples from X, where q is typically smaller than n.
Z is called the latent space of the model.
After training, the generator can generate new samples in X (e.g., 512x512 images) from the latent space Z. There are two main types of deep generative models: variational autoencoders (VAEs), used for example in most traditional “deepfakes” (Kingma [2013], Rezende et al [2014]); and generative adversarial networks (GANs), used in other forms of audiovisual media synthesis (Goodfellow et al [2014]).
VAEs have two parts: an encoder and decoder (fig. 2, top).
They learn the probability distribution of the data by encoding training samples into a low-dimensional latent space, then decoding the resulting latent representations to reconstruct them as outputs, while minimizing the difference between real input and reconstructed output.
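The following is a minimal PyTorch sketch of this encoder/decoder structure and its training objective; the fully connected layers, layer sizes, and random input batch are illustrative placeholders rather than a description of any production pipeline:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal variational autoencoder operating on flattened images."""
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(input_dim, 256)
        self.to_mu = nn.Linear(256, latent_dim)      # mean of the latent code
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of the latent code
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # sample a latent code
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term: distance between the real input and its reconstruction.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Regularization term pulling the latent codes towards a standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

x = torch.rand(8, 784)               # stand-in batch of flattened training images
model = VAE()
x_hat, mu, logvar = model(x)
vae_loss(x, x_hat, mu, logvar).backward()   # one gradient step would follow
```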
By contrast, GANs have a game theoretic design that includes two different sub-networks, a generator and a discriminator, competing with each other during training (fig. 2, bottom).
The generator is trained to generate new samples, while the discriminator is trained to classify samples as either real (from the training data) or fake (produced by the generator).
The generator’s objective is to “fool” the discriminator into classifying its outputs as real, that is, to increase the error rate of the discriminator.
Over time, the discriminator gets better at detecting fakes, and in return samples synthesized by the generator get better at fooling the discriminator.
After a sufficient number of training events, the generator can produce realistic outputs that capture the statistical properties of the dataset well enough to look convincing to the discriminator, and often to humans.
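A correspondingly minimal sketch of one adversarial training step is shown below; again, the tiny fully connected networks and the random “real” batch are placeholders for a realistic architecture and dataset:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 784
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())          # generator
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))                            # discriminator (logits)

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(64, data_dim) * 2 - 1       # stand-in batch of real samples
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

# Discriminator step: label real samples as real (1) and generated samples as fake (0).
fake = G(torch.randn(64, latent_dim)).detach()
loss_d = bce(D(real), ones) + bce(D(fake), zeros)
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make the discriminator classify generated samples as real.
fake = G(torch.randn(64, latent_dim))
loss_g = bce(D(fake), ones)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```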
Figure 3: An example of super-resolution (adapted from Chan et al [2021]).
The intricacies of deep generative models have significant implications for our understanding of the nature of DLSAM, as well as future possibilities for synthetic media.
Before discussing these implications, I will give an overview of the main kinds of DLSAM that can be produced with existing DL algorithms.
Using the taxonomy introduced in §2, we can distinguish three categories of DLSAM:
(a) global partially synthetic DLSAM,
(b) local partially synthetic DLSAM, and
(c) totally synthetic DLSAM (fig. 1).
As we shall see, the original “deepfakes”, consisting in replacing faces in videos, can be viewed as instances of the second category of DLSAM, and hardly span the full spectrum of DL-based methods to alter or generate audiovisual media.
While it is useful, at a first approximation, to locate different types of DLSAM within the traditional taxonomy of audiovisual media, it will become apparent later that a deeper understanding of their etiology challenges some categorical distinctions upon which this taxonomy is premised.
3.1 Global partially synthetic DLSAM
Examples in the auditory domain include audio enhancement and voice conversion.
Audio enhancement straightforwardly consists in enhancing the perceived quality of an audio file, which is especially useful for noisy speech recordings (Hu et al [2020]), while voice conversion consists in modifying the voice of the speaker in a recording to make them sound like that of another speaker without altering speech content (Huang et al [2020]).
In the visual domain, this category includes both static and dynamic visual enhancement and style transfer.
Like its auditory counterpart, visual enhancement consists in improving the perceived quality of images and videos.
It encompasses “denoising”, or removing noise from images/videos (Zhang et al [2016]); reconstituting bright images/videos from sensor data in very dark environments (Chen et al [2018]); “deblurring”, or removing visual blur (Kupyn et al [2018]); restoring severely degraded images/videos (Wan et al [2020]); “colorization”, or adding colors to black-and-white images/videos (Kumar et al [2020]); and “super-resolution”, or increasing the resolution of images/videos to add missing detail (Ledig et al [2017], see fig. 3).
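To make the super-resolution case concrete, the snippet below shows only the naive, non-learned baseline (bicubic interpolation in PyTorch); a trained model such as those cited above would replace the interpolation with a deep network that synthesizes plausible missing detail (the `sr_model` name in the comment is a hypothetical placeholder):

```python
import torch
import torch.nn.functional as F

low_res = torch.rand(1, 3, 64, 64)   # stand-in for a low-resolution RGB image

# Naive 4x upscaling: interpolation adds pixels but invents no new detail.
bicubic = F.interpolate(low_res, scale_factor=4, mode="bicubic", align_corners=False)
print(bicubic.shape)                 # torch.Size([1, 3, 256, 256])

# A learned super-resolution network (hypothetical `sr_model`) would instead
# hallucinate plausible high-frequency detail from patterns seen during training:
# high_res = sr_model(low_res)
```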
Visual style transfer consists in changing the style of an image/video in one domain, such as a photograph, to the style of an image/video in another domain, such as a painting, while roughly preserving its compositional structure (Gatys et al [2015]).
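The core idea behind this approach (Gatys et al [2015]) can be sketched as a loss that matches content features directly and style features through their Gram matrices; the sketch below assumes feature maps already extracted from a pretrained convolutional network (in practice, different layers are used for the content and style terms) and omits the optimization loop over the generated image:

```python
import torch

def gram_matrix(features):
    # features: a (channels, height, width) feature map from a convolutional layer.
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.t() / (c * h * w)     # channel-to-channel correlations encode "style"

def transfer_loss(gen_feats, content_feats, style_feats, style_weight=1e4):
    # Content term: keep the generated image's features close to the photograph's.
    content_loss = torch.mean((gen_feats - content_feats) ** 2)
    # Style term: match the correlation statistics of the painting's features.
    style_loss = torch.mean((gram_matrix(gen_feats) - gram_matrix(style_feats)) ** 2)
    return content_loss + style_weight * style_loss

# Stand-in feature maps (in practice, activations of a pretrained network such as VGG).
gen_feats = torch.rand(64, 32, 32, requires_grad=True)
loss = transfer_loss(gen_feats, torch.rand(64, 32, 32), torch.rand(64, 32, 32))
loss.backward()   # gradients would be used to iteratively update the generated image
```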
3.2 Local partially synthetic DLSAM
Audiovisual media produced by altering local properties of existing media with DL algorithms can be subsumed under this second category of DLSAM.
In the auditory domain, this concerns in particular audio files produced through source separation.
Speech source separation consists in extracting overlapping speech sources in a given mixed speech signal as separate signals (Subakan et al [2021]), while music source separation consists in decomposing musical recordings into their constitutive components, such as generating separate tracks for the vocals, bass, and drums of a song (Hennequin et al [2020]).
Auditory media produced through source separation are instances of local partially synthetic media, insofar as they involve removing parts of a recording while preserving others, instead of applying a global transformation to the recording as a whole.
In the visual domain, this category encompasses images and videos produced through “deepfakes” in the narrow sense, as well as inpainting, and attribute manipulation.
In this context, “deepfake” refers to face swapping, head puppetry, or lip syncing.
Face swapping is the method behind the original meaning of the term (Tolosana et al [2020]).
It consists in replacing a subject’s face in images or videos with someone else’s (fig. 4).
State-of-the-art pipelines for face swapping are fairly complex, involving three steps: an extraction step to retrieve faces from source images and from the target image or video (this requires a mixture of face detection, facial landmark extraction to align detected faces, and face segmentation to crop them from images); a training step that uses an autoencoder architecture to create latent representations of the source and target faces with a shared encoder; and a conversion step to re-align the decoded (generated) faces with the target, blend them, and sharpen them (Perov et al [2021]).
Figure 4: Face swapping “deepfake” (adapted from Perov et al [2021]).
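In schematic form, the training and conversion steps rely on a shared encoder with one decoder per identity; the sketch below is a bare-bones illustration of that idea in PyTorch (flattened face crops, tiny fully connected networks), not the pipeline described by Perov et al [2021]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(sizes):
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU()]
    return nn.Sequential(*layers[:-1])   # drop the trailing activation

face_dim, latent_dim = 64 * 64 * 3, 128
encoder = mlp([face_dim, 512, latent_dim])       # shared between both identities
decoder_a = mlp([latent_dim, 512, face_dim])     # reconstructs faces of identity A
decoder_b = mlp([latent_dim, 512, face_dim])     # reconstructs faces of identity B

params = (list(encoder.parameters())
          + list(decoder_a.parameters()) + list(decoder_b.parameters()))
opt = torch.optim.Adam(params, lr=1e-4)

faces_a = torch.rand(16, face_dim)   # stand-in aligned, cropped faces of identity A
faces_b = torch.rand(16, face_dim)   # stand-in aligned, cropped faces of identity B

# Training: each decoder learns to reconstruct its own identity from the shared code.
loss = (F.mse_loss(decoder_a(encoder(faces_a)), faces_a)
        + F.mse_loss(decoder_b(encoder(faces_b)), faces_b))
opt.zero_grad(); loss.backward(); opt.step()

# Conversion: encode a face of identity A, then decode it with identity B's decoder,
# yielding B's likeness with A's pose and expression (before blending into the frame).
swapped = decoder_b(encoder(faces_a[:1]))
```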
Head puppetry or “talking head generation” is the task of generating a plausible video of a talking head from a source image or video by mimicking the movements and facial expressions of a reference video (Zakharov et al [2019]), while lip syncing consists in synchronizing lip movements on a video to match a target speech segment (Prajwal et al [2020]).
Head puppetry and lip syncing are both forms of motion transfer, which refers more broadly to the task of mapping the motion of a given individual in source video to the motion of another individual in a target image or video (Zhu et al [2021], Kappel et al [2021]).
Face swapping, head puppetry, and lip syncing are commonly referred to as “deepfakes” because they can be used to usurp someone’s identity in a video; however, they involve distinct generation pipelines.
Inpainting involves reconstructing missing regions in an image or video sequence with contents that are spatially and temporally coherent (Yu et al [2019], Xu et al [2019]).
Finally, attribute manipulation refers to a broad range of techniques designed to manipulate local features of images and videos.
Semantic face editing or facial manipulation consists in manipulating various attributes in a headshot, including gender, age, race, pose, expression, presence of accessories (eyewear, headgear, jewelry), hairstyle, hair/skin/eye color, makeup, as well as the size and shape of any part of the face (ears, nose, eyes, mouth, etc.) (Lee et al [2020], Shen et al [2020], Viazovetskyi et al [2020]).
Similar techniques can be used to manipulate the orientation, size, color, texture, and shape of objects in an image more generally (Shen and Zhou [2021]).
As we shall see, this can even be done by using a linguistic description to guide the modification of high-level and abstract properties of persons or objects in an image, e.g., adding glasses to a photograph of a face with the caption “glasses” (Patashnik et al [2021], fig. 5).
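In outline, such text-guided manipulation optimizes a latent code so that the generated image's CLIP embedding moves towards the embedding of the caption while staying close to the original code. The sketch below assumes OpenAI's open-source `clip` package and uses a toy placeholder generator in place of a pretrained face GAN; it is a bare-bones version of the latent-optimization idea, not the StyleCLIP implementation:

```python
import torch
import torch.nn as nn
import clip  # OpenAI's open-source CLIP package, assumed installed

device = "cpu"                                   # kept on CPU for simplicity
clip_model, _ = clip.load("ViT-B/32", device=device)
text_emb = clip_model.encode_text(clip.tokenize(["a face with glasses"]).to(device)).detach()

# Toy placeholder generator mapping a latent code to a 224x224 image; a real
# pipeline would use a pretrained face GAN here.
latent_dim = 64
G = nn.Sequential(nn.Linear(latent_dim, 3 * 224 * 224), nn.Sigmoid(),
                  nn.Unflatten(1, (3, 224, 224)))

z_init = torch.randn(1, latent_dim)              # e.g., the inverted photograph's latent code
z = z_init.clone().requires_grad_(True)
opt = torch.optim.Adam([z], lr=0.05)

for _ in range(50):
    image_emb = clip_model.encode_image(G(z))
    # Pull the image's CLIP embedding towards the caption's embedding,
    # while keeping the edited code close to the original one.
    loss = (1 - torch.cosine_similarity(image_emb, text_emb).mean()
            + 0.1 * torch.mean((z - z_init) ** 2))
    opt.zero_grad(); loss.backward(); opt.step()

edited_image = G(z)   # the (toy) edited output corresponding to the optimized code
```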
3.3 Totally synthetic DLSAM
In this last category are audiovisual media entirely synthesized with the help of DL algorithms, rather than produced by altering pre-existing media.
Such synthesis can be conditional, when samples are generated conditionally on labels from the dataset used for training, or unconditional, when samples are generated unconditionally from the dataset.
In the auditory domain, totally synthetic DLSAM include speech synthesis, which consists in generating speech from some other modality like text (text-to-speech) or lip movements that can be conditioned on the voice of a specific speaker (Shen et al [2018]); and music generation, which consists in generating a musical piece that can be conditioned on specific lyrics, musical style or instrumentation (Dhariwal et al [2020]).
In the visual domain, this category includes image and video generation.
These can also be unconditional (Vahdat et al [2021], Tian et al [2021]), or conditioned for example on a specific class of objects (e.g., dogs) from a dataset (Brock et al [2019]), on a layout describing the location of the objects to be included in the output image/video (Sylvain et al [2020]), or on a text caption describing the output image/video (Ramesh et al [2021]).
The progress of image generation has been remarkable since the introduction of GANs in 2014.
State-of-the-art GANs trained on domain-specific datasets, such as human faces, can now generate high-resolution photorealistic images of non-existent people, scenes, and objects.3
The resulting images are increasingly difficult to discriminate from real photographs, even for human faces, on which we are well-attuned to detecting anomalies.4
3 See https://www.thispersondoesnotexist.com for random samples of non-existent faces generated with StyleGAN2 (Karras et al [2020]), and https://thisxdoesnotexist.com for more examples in other domains. By “photorealistic images”, I mean images that the average viewer cannot reliably distinguish from genuine photographs.
Figure 5: Semantic face editing of a photograph of Bertrand Russell with text prompts, produced with StyleCLIP (Patashnik et al [2021]).
Other methods now achieve equally impressive results for more varied classes or higher-resolution outputs, such as diffusion models (Dhariwal and Nichol [2021]) and Transformer models (Esser et al [2021]).
It has also become increasingly easy to guide image generation directly with text.
DALL-E, a new multimodal Transformer model trained on a dataset of text–image pairs, is capable of generating plausible images in a variety of styles simply from a text description of the desired output (Ramesh et al [2021]).
DALL-E’s outputs can exhibit complex compositional structure corresponding to that of the input text sequences, such as “An armchair in the shape of an avocado”, “a small red block sitting on a large green block”, or “an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants”.
DALL-E has been developed jointly with another multimodal model, called CLIP, capable of producing a natural language caption for any input image (Radford et al [2021]).
Using CLIP to steer the generation process, it is also possible to produce images with GANs from natural language descriptions of the desired output (Galatolo et al [2021], Patashnik et al [2021]; see fig. 5).
If these trends continue – and there is no reason for them to slow down significantly as hardware improvement and architectural breakthroughs continue to spur larger and more efficient models – it is only a matter of time before DL algorithms allow us to generate high-resolution stylized or photorealistic samples of arbitrary scenes that are consistently indistinguishable from human-made outputs.
In the domain of static visual media, that goal is already within sight for medium to large resolutions (around 1024x1024 pixels at the time of writing).
4 ... that the model used for generating these images is no longer the state-of-the-art for image generation.
The synthesis of dynamic visual media presents a significantly greater challenge, as the spatiotemporal consistency of the scene needs to be taken into account.
Nonetheless, it is plausible that we will be able to synthesize realistic and coherent video scenes at relatively high resolution in the short to medium term, beyond mere face swapping in existing videos.
DL has expanded the limits of synthetic audiovisual media beyond what was possible with previously available methods.
Nonetheless, the extent to which DLSAM really differ from traditional synthetic audiovisual media is not immediately clear, aside from the obvious fact that they are produced with deep artificial neural networks.
After all, each of the three categories of DLSAM distinguished in the previous section – global partially synthetic, local partially synthetic, and totally synthetic – has traditional counterparts that do not involve DL.
In fact, it is not implausible that many DLSAM, from visual enhancement to face swapping, could be copied rather closely with more traditional methods, given enough time, skills and resources.
But one might wonder whether the difference between DLSAM and traditional media is merely one of performance and convenience, lowering the barrier to entry for media creation; or whether there are additional differences that warrant giving a special status to DLSAM in the landscape of audiovisual media.
More specifically, the question is whether DLSAM simply make it easier, faster, and/or cheaper to produce audiovisual media that may have otherwise been produced through more traditional means; or whether they also enable the production of new forms of audiovisual media that challenge traditional categories.
On an epistemic reading, DLSAM may not threaten the taxonomy itself, but simply make it more difficult for media consumers to tell where a specific instance of DLSAM lies within that taxonomy.
The epistemic reading of the discontinuity claim is clearly correct if one insists on making all instances of DLSAM fit within the traditional taxonomy of audiovisual media.
For example, it is increasingly easy to mistake photorealistic GAN-generated images of human faces (in the category of totally synthetic media) for actual photographs of human faces (in the category of archival media).
It should be noted, however, that traditional methods of media manipulation also enable such confusions; DL-based techniques merely make it easier to fool media consumers into misjudging the source of an item.
In what follows, I will focus instead on the ontological reading of the Continuity Question.
I will review a number of ways in which DLSAM differ from other audiovisual media, and ask whether any of these differences genuinely threatens the traditional taxonomy.
Answering this question largely depends on the choice of criteria one deems relevant to distinguishing kinds of audiovisual media.
So far, I have mostly considered etiological criteria, which form the basis of the taxonomy illustrated in fig. 1. If we leave etiology aside, and simply consider the auditory and pictorial properties of DLSAM, it seems difficult to see how they really differ from traditional media.
It is also doubtful that DLSAM can be squarely distinguished from traditional media with respect to their intended role, be it communication, deception, art, or entertainment – all of which are also fulfilled by traditional methods.
However, I will argue that the etiology of DLSAM does play a crucial role in setting them apart from other kinds of audiovisual media, for a few different reasons, beyond the surface-level observation that they are produced with DL algorithms.
Once we have a better understanding of how DLSAM are produced, it will become clear that they challenge some categorical distinctions of the traditional taxonomy.
First, deep learning techniques have considerably lowered the bar for the level of technical and artistic skills required to manipulate or synthesize audiovisual media.
Virtually no artistic skills are required to make a photograph look like a line drawing or a Van Gogh painting with style transfer, to produce a rock song in the style and voice of Elvis Presley using a variational autoencoder,5 to change the eye color, hairstyle, age, or gender of a person in a photograph,6 or to generate an abstract artwork with a GAN.7
When the original “deepfakes” came onto the scene in 2017, for example, it was far from trivial to generate them without prior expertise in programming and deep learning.
Popular social media apps like Instagram, Snapchat, and Tiktok offer a broad range of DL-based filters that automatically apply complex transformations to images and videos, including so-called “beauty filters” made to enhance the appearance of users.
Likewise, third-party standalone apps such as FaceApp and Facetune are entirely dedicated to the manipulation of image and video “selfies”, using DL to modify specific physical features of users ranging from age and gender to the shape, texture, and tone of body parts.
Using these apps requires no special competence beyond basic computer literacy.
More polished and professional results can also be achieved with user-friendly computer software.
For example, DeepFaceLab is a new software package designed by prominent deepfake creators that provides “an easy-to-use way to conduct high-quality face-swapping” (Perov et al [2021]).
DeepFaceLab has an accessible user interface and allows anyone to generate high-resolution deepfakes without any coding skills.
NVIDIA recently released Canvas, a GAN-powered application that synthesizes photorealistic images given a simple semantic layout painted in very broad strokes by the user (Park et al [2019]).8
Even Photoshop, the most popular software for traditional image manipulation, now includes “neural filters” powered by deep learning (Clark [2020]).
These pieces of software are not merely used for recreational purposes, but also by creative professionals who can benefit from the efficiency and convenience of DL-based techniques.
The recent progress of text-to-image generation discussed in the previous section is a further step towards making sophisticated audiovisual media manipulation and synthesis completely trivial.
Instead of requiring users to fiddle with multiple parameters to modify or generate images and videos, it allows them to merely describe the desired output in natural language.
As we have seen, GAN-based algorithms like StyleCLIP enable users to change various attributes of a subject in a photograph with simple captions (fig. 5, Patashnik et al [2021]), and can even be used through an easy user interface without programming skills.9
Other previously mentioned methods based on multimodal Transformer models allow the synthesis of entirely novel images from text input (Ramesh et al [2021]).
A similar procedure has been successfully applied to the manipulation of videos through simple text prompts (Skorokhodov et al [2021]).
It is plausible that further progress in this area will eventually enable anyone to modify or generate any kind of audiovisual media in very fine detail, simply from natural language descriptions.
The computational resources required to produce DLSAM have also dropped. For example, the aforementioned smartphone apps offload much of the DL processing to the parent companies’ servers, thereby considerably reducing the computational requirements on the users’ devices.
Almost any modern smartphone can run these apps, and produce close to state-of-the-art DLSAM in a variety of domains, at no additional computational cost.
Given these developments, it is no surprise that consumer-facing companies like Facebook, Snap, ByteDance, Lightricks, Adobe, and NVIDIA – the developers of Instagram, Snapchat, Tiktok, Facetune, Photoshop, and Canvas respectively – are at the forefront of fundamental and applied research on DL-based computer vision and image/video synthesis (e.g., He et al [2017], Tian et al [2021], Yu et al [2019], Halperin et al [2021], Patashnik et al [2021], Karras et al [2021]).
Taken together, the lowered requirements on artistic skill, technical competence, computational power, and production time afforded by deep learning algorithms have dramatically changed the landscape of synthetic audiovisual media.
Using traditional audio, image, and video editing software to obtain comparable results would require, in most cases, considerably more time and skill.
By itself, this impressive gap does not seem to challenge traditional taxonomical distinctions: DL algorithms make synthetic media easier and faster to produce, but this does not entail that their outputs do not fit squarely within existing media categories.
One might be tempted to compare this evolution to the advent of digital image editing software in the 2000s, which made retouching photographs easier, faster, and more effective than previous analog techniques based on painting over negatives.
However, this comparison would be selling the novelty of DLSAM short.
The range of generative possibilities opened up by deep learning in the realm of synthetic media far outweighs the impact of traditional editing software.
I suggested earlier that some DLSAM could be imitated to some degree with traditional techniques given enough time and means, including professional tools and expertise.
While this has been true for a long time, the gap between what can be achieved with and without DL algorithms is widening rapidly, and an increasing number of DLSAM simply seem impossible to produce with traditional methods, no matter the resources available.
8 The method can also be tried in this online demo: http://nvidia-research-mingyuliu.com/gaugan.
9 See https://youtu.be/5icI0NgALnQ for some examples.
The progress of VFX has led the film industry to experiment with this kind of manipulation over the past few years, to recreate the faces of actors and actresses who cannot be cast in a movie, or digitally rejuvenate cast members.
Thus, Rogue One (2016) features scenes with Peter Cushing, or rather a posthumous computer-generated duplicate of Cushing painstakingly recreated from individual frames – the actor having passed away in 1994.
Likewise, Denis Villeneuve’s Blade Runner 2049 (2017) includes a scene in which the likeness of Sean Young’s character in the original Blade Runner (1982) is digitally added.
For The Irishman (2019), Martin Scorsese and Netflix worked extensively with VFX company Industrial Light & Magic (ILM) to digitally “de-age” Robert De Niro, Al Pacino, and Joe Pesci for many scenes of the movie.
The VFX team spent a considerable amount of time studying older movies featuring these actors to see how they should look at various ages.
They shot the relevant scenes with a three-camera rig, and used a special software to detect subtle differences in light and shadows on the actors’ skin as reference points to carefully replace their faces with younger-looking computer-generated versions frame by frame.
These examples of digital manipulation are extraordinarily costly and time-consuming, but certainly hold their own against early DL-based approaches rendered at low resolutions.
State-of-the-art deepfakes, however, have become very competitive with cutting-edge VFX used in the industry.
In fact, several deepfake creators have claimed to produce better results at home in just a few hours of work on consumer-level hardware than entire VFX teams of blockbuster movies with a virtually unlimited budget and months of hard work.10
These results are so impressive that one of the most prominent deepfake creators, who goes by the name “Shamook” on YouTube, was hired by ILM in 2021.
The paper introducing DeepFaceLab is also explicitly targeted at VFX professionals in addition to casual creators, praising the software’s ability to “achieve cinema-quality results with high fidelity” and emphasizing its potential “high economic value in the post-production industry [for] replacing the stunt actor with pop stars” (Perov et al [2021]).
Other forms of partial audiovisual synthesis afforded by DL algorithms are leaving traditional techniques behind.
Text-to-speech synthesis sounds much more natural and expressive with DL-based approaches, and can be used to convincingly clone someone’s voice (Shen et al [2018]).
These methods, sometimes referred to as “audio deepfakes”, were recently used alongside real archival recordings in a 2021 documentary about Anthony Bourdain, to make him posthumously read aloud an email sent to a friend.
The fake audio is not presented as such, and is seamless enough that it was not detected by critics until the director confessed to the trick in an interview (Rosner [2021]).
Voice cloning can also be combined with face swapping or motion transfer to produce convincing fake audiovisual media of a subject saying anything (Thies et al [2020]).
Traditional methods simply cannot achieve such results at the same level of quality, let alone in real time.
Total synthesis is another area where DL has overtaken other approaches by a wide margin.
It is extremely difficult to produce a completely photorealistic portrait of an arbitrary human face or object in high-resolution from scratch without deep generative models like GANs.
Dynamic visual synthesis is even more of a challenge, and even professional 3D animators still struggle to achieve results that could be mistaken for actual video footage, especially when humans are involved (hence the stylized aesthetic of most animated movies).
The progress of DL opens up heretofore unimaginable creative possibilities.
This remark goes beyond DL’s ability to generate synthetic audiovisual media of the same kind as those produced with traditional methods, simply with incremental improvements in quality – e.g., resolution, detail, photorealism, or spatiotemporal consistency.
4.3 Blurred lines
I have previously discussed how DLSAM would fit in a taxonomy of audiovisual media that encompasses all traditional approaches (see fig. 1).
While this is helpful to distinguish different categories of DLSAM at a high level of abstraction, instead of conflating them under the generic label “deepfake”, the boundaries between these categories can be challenged under closer scrutiny.
Indeed, DL-based approaches to media synthesis have started blurring the line between partially and totally synthetic audiovisual media in novel and interesting ways.
First, DLSAM arguably straddle the line between partial and total synthesis insofar as they are never generated ex vacuo, but inherit properties of the data on which the models that produced them were trained.
In order to produce convincing outputs in a given domain (e.g., photorealistic images of human faces), artificial neural networks must be trained on a vast number of samples from that domain (e.g., actual photographs of human faces).
Figure 6: A photograph of Russell and its inversion in the latent space of StyleGAN.
In that sense, even DLSAM generated from scratch are arguably not totally synthetic, insofar as they leverage properties of preexisting images present in their training data, and seamlessly recombine them in coherent ways. In some cases, a generated output may even come close to duplicating a specific sample from the training data.
This is likely to happen if the model suffers from “overfitting”, namely if it simply memorizes samples from its dataset during training, such that the trained model outputs a near copy of one of the training samples, rather than generating a genuinely novel sample that looks like it came from the same probability distribution.
Since there is no robust method to completely rule out overfitting with generative models, one cannot determine a priori how different a synthesized human face, for example, will really be from a preexisting photograph of a human face present in the training data.
More generally, DL-based audiovisual synthesis can vary widely in how strongly it is conditioned on various parameters, including samples of synthesized or real media.
For example, video generation is often conditioned on a preexisting image (Liu et al [2020]).
This image can be a real photograph, in which case it seems closer to a form of partial synthesis; but it can also be a DL-generated image (e.g., the output of a GAN).
Thus, the link between the outputs of deep generative models and preexisting media can be more or less distant depending on the presence of conditioning, its nature, and the overall complexity of the generation pipeline.
The distinction between partially and totally synthetic media is also challenged by DLSAM in the other direction, to an even greater extent, when considering examples that I have previously characterized as instances of partially synthetic media.
Many examples of state-of-the-art DL-based visual manipulation rely on reconstructing an image with a deep generative model, in order to modify some of its features by adjusting the model’s encoding of the reconstructed image before generating a new version of it.
The initial step is called “inversion”: it consists in projecting a real image into the latent embedding space of a generative model (Xia et al [2021], Abdal et al [2019], Richardson et al [2021]).
Every image that can be generated by a generative model corresponds to a vector in the model’s latent space.
In broad terms, inverting a real image i into the latent space Z of a generative model consists in finding the vector z that matches i most closely in Z, by minimizing the difference between i and the image generated from z (see fig. 2).
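A bare-bones version of this inversion-by-optimization procedure is sketched below; the generator is a toy placeholder standing in for a pretrained GAN, and real pipelines typically add perceptual losses and encoder-based initialization rather than relying on plain pixel distance:

```python
import torch
import torch.nn as nn

# Toy placeholder generator standing in for a pretrained GAN generator.
latent_dim, image_dim = 64, 3 * 64 * 64
G = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                  nn.Linear(512, image_dim), nn.Sigmoid())
for p in G.parameters():
    p.requires_grad_(False)                      # the generator stays frozen

target = torch.rand(1, image_dim)                # stand-in for the real photograph i

# Inversion: find the latent vector z whose generated image best matches the target.
z = torch.randn(1, latent_dim, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(500):
    loss = torch.mean((G(z) - target) ** 2)      # plain pixel distance for illustration
    opt.zero_grad(); loss.backward(); opt.step()

reconstruction = G(z)   # the "inverted" image: a lookalike generated by the model,
                        # not the original photograph itself
```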
In fig. 5, for example, the “original” image on the left is not actually a photograph of Bertrand Russell, but the reconstruction of such a photograph produced by inverting it into the latent space of a GAN trained on human faces.
The original photograph and its GAN inversion are presented side-by-side in fig. 6. The two images clearly differ in various respects: the pipe that Russell holds in his hand in the original photograph is removed, and various details of the lighting, hair, eyes, nose, ears, lower face, and neck are slightly altered.
While the likeness of Russell is rather well captured, the imitation is noticeably imperfect.
One could not accurately describe the image on the right as a photograph of Bertrand Russell.
Characterizing images produced by manipulating features through GAN inversion as only partially synthetic is somewhat misleading, to the extent that the images were entirely generated by the model, and do not embed any actual part of the real photograph that inspired them.
There is a significant difference between this process and the form of local partial synthesis at play, for example, in similar manipulations using traditional software like Photoshop.
In the latter case, one genuinely starts from a real photograph to modify or overlay some features (e.g., add some glasses); the final results include a significant proportion of the original image, down to the level of individual pixels (unless the compression or resolution was changed).
With GAN-based image manipulation, by contrast, what is really modified is not the original image, but its “inverted” counterpart which merely resembles it.
The resemblance can be near perfect, but there are often noticeable differences between an image and its GAN inversion, as in fig. 6.
DL algorithms are also blurring the line between archival and synthetic media.
Many smartphones now include an automatic DL-based post-processing pipeline that artificially enhances images and videos captured with the camera sensor, to reduce noise, brighten the image in low-light condition, fake a shallow depth of field by emulating background blur (so-called “portrait mode”), and/or increase the media’s resolution.
While there is always a layer of post-processing reflecting technical and aesthetic preferences in going from raw sensor data to an audio recording, image, or video, these DL-based pipelines go further in augmenting the sensor data with nonexistent details synthesized by generative models.
Super-resolution is a good example of that process (fig. 3).
In principle, the goal of super-resolution is simply to increase the resolution of an image or video.
In practice, however, even when there is a ground truth about what an image/video would look like at higher resolution (i.e., if it has been downscaled), it is virtually impossible for a super-resolution algorithm to generate a pixel-perfect copy of the higher-resolution version.
While the image reconstructed from the very pixelated input in fig. 3 is undoubtedly impressive, and strikingly similar to the ground truth, it is far from identical to it.
If all images produced by a camera were going through such a super-resolution pipeline, it would be misleading to characterize them as archival visual media in the sense defined earlier.
In the age of “AI-enhanced” audiovisual media, archival and synthetic media increasingly appear to fit on a continuum rather than in discrete categories.
This remark is reminiscent of Walton’s suggestion that depictions created through mechanical means may exhibit various degrees of transparency depending on their production pipeline (Walton [1984]).
On Walton’s view, photographs created by combining two negatives are only partially transparent, because they don’t have the right kind of mechanical contact with the constructed scene, but they do with the scenes originally depicted by each negative.
Likewise, he argues that an overexposed photograph displays a lower degree of transparency than a well-exposed one, or a grayscale photograph than one in color, etc.
On this view, one can see a scene through a photograph – be in “contact” with it – to various degrees, but most photographic manipulations do not make the output entirely opaque.
DLSAM seem to put further pressure on a sharp divide between transparency and opacity.
A photograph taken by a smartphone, and automatically processed through DL-based denoising and super-resolution algorithms, does appear to put us in contact with reality to some degree.
In fact, the image enhancement pipeline might recover real details in the photographed scene that would not be visible on an image obtained from raw sensor data.
These details are certainly not recovered through a “humanly mediated” contact with the scene, as would be the case if someone were digitally painting over the photograph.
But they also stretch the definition of what Walton calls a “mechanical connection” with reality.
Indeed, details are added because the enhancement algorithm has learned the approximate probability distribution of a very large number of other photographs contained in its training data.
The complete process – from the training to the deployment of a DL algorithm on a particular photograph – could be described as mechanistic, but the “contact” between the final output and the depicted scene is mediated, in some way, by the “contact” between millions of training samples and the scenes they depict.
Walton also highlights that manipulated photographs may appear to be transparent in respects in which they are not (Walton [1984], p. 44).
This is all the more true for DLSAM given the degree of photorealism they can achieve.
A convincing face-swapping deepfake may be indistinguishable from a genuine video, hence their potential misuse for slandering, identity theft, and disinformation.
The output does faithfully reflect the features of Tom Cruise’s face, based on the summary statistics captured by the autoencoder architecture during training on the basis of genuine photographs of that face.
Furthermore, this process is not mediated by the intentional attitudes of the deepfake’s creator, in the way in which making a painting or a 3D model of Tom Cruise would be.
A totally synthetic GAN-generated face is a more extreme case; yet it, too, is mechanically generated from the learned probability distribution of real photographs of human faces.
Neither of these examples appear to be completely opaque in Walton’s sense, although they can certainly be very deceiving in giving the illusion of complete transparency.
While traditional media do include a few edge cases, these are exceptions rather than the rule.
The remarkable capacity of DL models to capture statistically meaningful properties of their training data and generate convincing samples that have similar properties challenges the divide between reality and synthesis, as well as more fine-grained distinctions between kinds of media synthesis.
The way in which deep generative models learn from data has deeper implications for the Continuity Question.
According to the manifold hypothesis, real-world high-dimensional data tend to be concentrated in the vicinity of low-dimensional manifolds embedded in a high-dimensional space (Tenenbaum et al [2000], Carlsson [2009], Fefferman et al [2016]).
Mathematically, a manifold is a topological space that locally resembles Euclidean space; that is, any given point on the manifold has a neighborhood within which it appears to be Euclidean.
A sphere is an example of a manifold in three-dimensional space: from any given point, it locally appears to be a two-dimensional plane, which is why it has taken humans so long to figure out that the earth is spherical rather than flat.
In the context of research on deep learning, a manifold refers more loosely to a set of points that can be approximated reasonably well by considering only a small number of dimensions embedded in a high-dimensional space (Goodfellow et al [2016]).
DL algorithms would have little chance of learning successfully from n-dimensional data if they had to fit a function with interesting variations across every dimension in R^n.
If the manifold hypothesis is correct, then DL algorithms can learn much more effectively by fitting low-dimensional nonlinear manifolds to sampled data points in high-dimensional spaces – a process known as manifold learning.
There are theoretical and empirical reasons to believe that the manifold hypothesis is at least approximately correct when it comes to many kinds of data fed to DL algorithms, including audiovisual media.
Suppose that we generate a 512x512 color image by choosing random pixel values; the chance of obtaining anything that looks remotely different from uniform noise is absurdly small.
This is because the probability distribution of real-world 512x512 images (i.e., images that mean something to us) is concentrated in a small region of R^786432, on a low-dimensional manifold.11
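A quick back-of-the-envelope illustration of this point, assuming a 512x512 RGB image (so 512 × 512 × 3 = 786,432 dimensions): sampling every pixel value at random almost surely produces static, because meaningful images occupy a vanishingly small region of that space.

```python
import torch

# A 512x512 RGB image lives in a 786,432-dimensional space (512 * 512 * 3).
height, width, channels = 512, 512, 3
print(height * width * channels)   # 786432 dimensions

# Sampling each pixel value uniformly at random almost surely yields noise:
random_image = torch.randint(0, 256, (height, width, channels), dtype=torch.uint8)
# Virtually every such sample looks like static; images that depict anything
# occupy a vanishingly small, low-dimensional region of this space.
```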
We can also intuitively think of transforming audiovisual media within a constrained region of their input space.
For example, variations across real-world images can be boiled down to changes along a constrained set of parameters such as brightness, contrast, orientation, color, etc.
These transformations trace out a manifold in the space of possible images whose dimensionality is much lower than the number of pixels, which lends further credence to the manifold hypothesis.
Consequently, we can expect an efficient DL algorithm trained on images to represent visual data in terms of coordinates on the low-dimensional manifold, rather than in terms of coordinates in R^n (where n is three times the number of pixels for a color image).
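For illustration, the following sketch fits a low-dimensional linear subspace to stand-in image data with PCA and re-expresses each image as a handful of coordinates. This is only a crude linear analogue of the nonlinear manifold learning performed by deep generative models, and all data here are random placeholders.

```python
import torch

# Crude illustration: fit a low-dimensional linear subspace to image data with
# PCA, a linear stand-in for the nonlinear manifolds learned by DL models.
images = torch.rand(500, 3 * 64 * 64)              # 500 stand-in images, 12,288 dims each
mean = images.mean(dim=0)
U, S, V = torch.pca_lowrank(images - mean, q=32, center=False)  # keep 32 directions

coords = (images - mean) @ V                       # 32 "manifold coordinates" per image
reconstructed = coords @ V.T + mean                # map back to pixel space
print(coords.shape, reconstructed.shape)           # (500, 32), (500, 12288)
```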
Deep generative models used for audiovisual media synthesis are good examples of manifold learning algorithms.
For example, the success of GANs in generating images that share statistically relevant properties with training samples (e g , photorealistic images of human faces) can be explained by the fact that they effectively discover the distribution of the dataset on a low-dimensional manifold embedded in the high-dimensional input space.
This is apparent when interpolating between two points in the latent space of a GAN, namely traversing the space from one point to the other along the learned manifold: if the image corresponding to each point is visualized as a video frame, the resulting video shows a smooth – spatially and semantically coherent – transformation from one output to another (e.g., from one human face to another).12
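A minimal sketch of such an interpolation, again with a toy stand-in for a pretrained generator: intermediate latent codes are obtained by linear interpolation and decoded one by one, which is all that is needed to produce the smooth morphing videos described above.

```python
import torch
import torch.nn as nn

# Stand-in generator (in practice a pretrained GAN such as StyleGAN).
latent_dim = 128
G = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())

z_start, z_end = torch.randn(latent_dim), torch.randn(latent_dim)

# Walking along the learned manifold: each intermediate latent decodes to a
# coherent image, so the sequence plays as a smooth morph between two outputs.
frames = []
for t in torch.linspace(0.0, 1.0, steps=30):
    z = torch.lerp(z_start, z_end, t)              # linear interpolation in latent space
    frames.append(G(z).view(3, 64, 64).detach())
```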
The autoencoder architecture used for face swapping provides another example: over the course of training, it extracts latent features of faces by modeling the natural distribution of these features along a low-dimensional manifold (Bengio et al [2013]).
The autoencoder’s success in mapping one human face to another, in a way that is congruent with head position and expression, is explained by the decoder learning a mapping from the low-dimensional latent space to a manifold embedded in high-dimensional space (e.g., pixel space for images) (Shao et al [2017]).
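A minimal sketch of the shared-encoder, two-decoder setup commonly used for face swapping may help: both identities are encoded into a common latent space, and swapping simply amounts to decoding one person's latent code with the other person's decoder. The modules below are toy stand-ins with illustrative shapes, and the training losses are omitted.

```python
import torch
import torch.nn as nn

# Shared encoder, one decoder per identity: the classic face-swapping setup.
img_dim, latent_dim = 3 * 64 * 64, 256
encoder   = nn.Sequential(nn.Linear(img_dim, latent_dim), nn.ReLU())
decoder_a = nn.Linear(latent_dim, img_dim)   # trained to reconstruct person A
decoder_b = nn.Linear(latent_dim, img_dim)   # trained to reconstruct person B

# After training (reconstruction losses omitted here), swapping a face means
# encoding a frame of person A and decoding it with person B's decoder:
frame_of_a = torch.rand(1, img_dim)
swapped = decoder_b(encoder(frame_of_a)).view(1, 3, 64, 64)
```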
Thus, deep generative models used for audiovisual media synthesis effectively learn the distribution of data along nonlinear manifolds embedded in high-dimensional input space.
This highlights a crucial difference between DLSAM and traditional synthetic media.
11 This toy example ignores the fact that natural images may lie on a union of disjoint manifolds rather than one globally connected manifold. For example, the manifold of images of human faces may not be connected to the manifold of images of tropical beaches.
12 See https://youtu.be/6E1_dgYlifc for an example of video interpolation in the latent space of StyleGAN trained on photographs of human faces.
Traditional media manipulation and synthesis is mostly ad hoc: it consists in transforming or creating sounds, images, and videos in a specific way, with a specific result in mind.
Many of the steps involved in this process are discrete manipulations, such as removing a portion of a photograph in image editing software, or adding a laugh track to a video.
These manipulations are tailored to a particular desired output.
By contrast, DLSAM are sampled from a continuous latent space that has not been shaped by the desiderata of a single specific output, but by manifold learning.
Accordingly, synthetic features of DLSAM do not originate in discrete manipulations, but from a mapping between two continuous spaces – a low-dimensional manifold and a high-dimensional input space.
This means that in principle, synthetic features of DLSAM can be altered as continuous variables.
This is also why one can smoothly interpolate between two images within the latent space of a generative model, whereas it is impossible to go from one image to another through a continuous transformation with image editing software.
Beyond manifold learning, recent generative models have been specifically trained to learn disentangled representations (Shen et al [2020], Collins et al [2020], Härkönen et al [2020], Wu et al [2021]).
As a general rule, the dimensions of a model’s latent space do not match neatly onto interpretable features of the data.
For example, shifting the vector corresponding to a GAN-generated image along a particular dimension of the generator’s latent space need not result in a specific visual change that clearly corresponds to some particular property of the depiction, such as a change in the orientation of the subject’s face for a model trained on human faces.
Instead, it might result in a more radical visual change in the output, where few features of the original output are preserved.
Disentanglement loosely refers to a specific form of manifold learning in which each latent dimension controls a single visual attribute that is interpretable.
Intuitively, disentangled dimensions capture distinct generative factors – interpretable factors that characterize every sample from the training data, such as the size, color, or orientation of a depicted object (Bengio et al [2013], Higgins et al [2018]).
The advent of disentangled generative models has profound implications for the production of DLSAM.
Disentanglement opens up new possibilities for manipulating any human-interpretable attribute within latent space.
For example, one could generate a photorealistic image of a car, then manipulate specific attributes such as color, size, type, orientation, background, etc.
Each of these disentangled parameters can be manipulated as a continuous variable within a disentangled latent space, such that one can smoothly interpolate between two outputs along a single factor – for example, going from an image of a red car seen from the left-hand side to an image of an identical red car seen from the right-hand side, with a smooth rotation of the vehicle, keeping all other attributes fixed.
Disentangled representations can even be continuously manipulated with an easy user interface, such as sliders corresponding to each factor (Härkönen et al [2020], Abdal et al [2020]).13
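The following sketch illustrates this kind of slider-style editing under the assumption that a given latent direction controls a single interpretable attribute; in GANSpace-like methods such directions are discovered, for example, by PCA over sampled latents, whereas here both the direction and the generator are placeholders.

```python
import torch
import torch.nn as nn

latent_dim = 128
G = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())  # stand-in generator

z = torch.randn(latent_dim)                 # latent code of some generated image
direction = torch.zeros(latent_dim)
direction[0] = 1.0                          # assume this axis controls, e.g., orientation

# A "slider": varying the coefficient edits one attribute as a continuous variable,
# while all other latent coordinates (hence other attributes) stay fixed.
edits = [G(z + strength * direction).view(3, 64, 64).detach()
         for strength in torch.linspace(-3.0, 3.0, steps=7)]
```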
Combined with the aforementioned methods to “invert” a real image within the latent space of a generative model, disentanglement is becoming a novel and powerful way to manipulate pre-existing visual media, including photographs, with impressive precision.
Thus, the manipulation of simple and complex visual features in fig. 5 is made not only possible but trivial with a well-trained disentangled GAN.
With domain-general generative models such as BigGAN (Brock et al [2019]), the combination of inversion and disentanglement will soon allow nontechnical users to modify virtually any photograph along meaningful continuous dimensions.
Multimodal Transformer models trained on text-image pairs like CLIP make this process even easier by allowing users to simply describe in natural language the change they would like to see effected in the output, while specifying the magnitude of the desired manipulation (fig. 5, Patashnik et al [2021]).
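A hedged sketch of text-guided editing in the spirit of StyleCLIP: the latent code is optimized so that the CLIP embedding of the generated image moves toward the embedding of a textual prompt. It assumes OpenAI's CLIP package is installed and uses a toy stand-in generator; the prompt, learning rate, and step count are arbitrary choices, not values taken from the cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # assumes OpenAI's CLIP package is installed

latent_dim = 128
G = nn.Sequential(nn.Linear(latent_dim, 3 * 224 * 224), nn.Tanh())  # stand-in generator

clip_model, _ = clip.load("ViT-B/32", device="cpu")
text = clip.tokenize(["a smiling face with glasses"])      # the desired edit, in words
text_features = clip_model.encode_text(text).detach()

z = torch.randn(1, latent_dim, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.02)
for step in range(100):
    opt.zero_grad()
    image = G(z).view(1, 3, 224, 224)
    image_features = clip_model.encode_image(image)
    # Push the image's CLIP embedding toward the text embedding.
    loss = 1 - F.cosine_similarity(image_features, text_features).mean()
    loss.backward()
    opt.step()
```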
Beyond static visual manipulation, “steering” the latent space of a disentangled generative model has the potential to allow any image to be animated in semantically coherent ways (Jahanian et al [2020]).
There is, in principle, no difference between dynamically steering the latent space of a “static” generative model, and generating a photorealistic video.
For example, one could invert the photograph of a real human face into latent space, then animate it by steering the space along disentangled dimensions – moving the mouth, eyes, and entire head in a natural way.
Thus, the task of video synthesis can now be reduced to discovering a trajectory in the latent space of a fixed image generator, in which content and motion are properly disentangled (Tian et al [2021]).
State-of-the-art methods to remove texture inconsistencies during interpolation demonstrate that latent space steering can be virtually indistinguishable from real videos (Karras et al [2021]).
In the near future, it is likely that any image or photograph can be seamlessly animated through this process, with congruent stylistic attributes – from photorealistic to artistic and cartoonish styles.
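The following sketch makes the trajectory picture concrete: a "video" is generated by decoding successive points along an assumed latent direction. The generator and the direction are placeholders; in practice the direction would be a learned, disentangled motion factor, and the starting latent could be the inversion of a real photograph.

```python
import torch
import torch.nn as nn

latent_dim = 128
G = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())  # stand-in generator

z0 = torch.randn(latent_dim)              # e.g., the inverted latent of a photograph
motion = torch.randn(latent_dim) * 0.05   # assumed direction controlling, say, head pose

# A "video" is nothing more than a trajectory through latent space:
# decoding successive points along the trajectory yields successive frames.
video = torch.stack([G(z0 + t * motion).view(3, 64, 64).detach()
                     for t in range(60)])  # 60 frames
print(video.shape)  # (60, 3, 64, 64)
```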
The capacity to steer generative models along interpretable dimensions in real time also paves the way for a new kind of synthetic medium: controllable videos that we can interact with in the same robust way that we interact with video games (Menapace et al [2021], Kim et al [2021]).
For example, one can use a keyboard to move a tennis player forward, backward, leftward, and rightward on a field in a synthetic video.
13 See […].html for interactive examples of disentangled GAN interpolation with images of human faces.
One could also invert a real photograph within the latent space of a generative model, then animate elements of the photograph in real time in a synthetic video output.
This kind of synthetic media pipeline has no equivalent with more traditional methods.
The interactive nature of the resulting DLSAM is only matched by traditional video games, but these still fall short of the photorealism achieved by deep generative models, lack the versatility of what can be generated from the latent space of a domain-general model, and cannot be generated from pre-existing media such as photographs without significant processing and manual labor (e.g., with photogrammetry).
They further blur the line between the archival and the synthetic domain, since well-trained generative models capture the dense statistical distribution of their training data, and can seamlessly produce new samples or reconstitute existing media from that learned distribution.
Disentanglement allows fine-grained control over the output of such models along specific interpretable dimensions, creating unforeseen possibilities for media manipulation and real-time synthesis, with many more degrees of freedom than what was possible with previous techniques.
This change is more significant than the shift from analog to digital production tools.
DL-based synthetic audiovisual media, including the original deepfakes, require far less time, artistic skill, and – increasingly – technical expertise and computational resources to produce.
They also greatly surpass traditional techniques in many domains, particularly for the creation and manipulation of realistic sounds, images, and videos.
Beyond these incremental improvements, however, DLSAM represent a genuine departure from previous approaches that opens up new avenues for media synthesis.
Manifold learning allows deep generative models to learn the probability distribution of millions of samples in a given domain, and generate new samples that fall within the same distribution.
Disentanglement allows them to navigate the learned distribution along human-interpretable generative factors, and thus to manipulate and generate high-quality media with fine-grained control over their discernible features.
Unlike traditional methods, the generative factors that drive the production of DLSAM exist on a continuum as dimensions of the model’s latent space, such that any feature of the output can in principle be altered as a continuous variable.
These innovations blur the boundary between familiar categories of audiovisual media, particularly between archival and synthetic media; but they also pave the way for entirely novel forms of audiovisual media, such as controllable images and videos that can be navigated in real-time like video games, or multimodal generative artworks (e.g., images and text jointly sampled from the latent space of a multimodal model).
While this shift raises serious ethical concerns about the potentially harmful uses of DLSAM, it also opens up exciting possibilities for their beneficial use in art and entertainment.14
References

Stochastic backpropagation and approximate inference in deep generative models. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286, Beijing, China, June 2014. PMLR.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, and Lei Xie.

DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8183–8192, 2018.

Ziyu Wan, Bo Zhang, Dongdong Chen, Pan Zhang, Dong Chen, Jing Liao, and Fang Wen. Old Photo Restoration via Deep Latent Space Translation. September 2020.

Manoj Kumar, Dirk Weissenborn, and Nal Kalchbrenner. In International Conference on Learning Representations, September 2020.

Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A Neural Algorithm of Artistic Style. August 2015.

Kelvin C. K. Chan, Xintao Wang, Xiangyu Xu, Jinwei Gu, and Chen Change Loy. GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14245–14254, 2021.

Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. Attention is All You Need in Speech Separation. arXiv:2010.13154 [cs, eess], March 2021.

Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussallam. Spleeter: A fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software, 5(50):2154, 2020. doi:10.21105/joss.02154.

Ivan Perov, Daiheng Gao, Nikolay Chervoniy, Kunlin Liu, Sugasa Marangonda, Chris Umé, Mr Dpfks, Carl Shift Facenheim, Luis RP, Jian Jiang, Sheng Zhang, Pingyu Wu, Bo Zhou, and Weiming Zhang. DeepFaceLab: Integrated, flexible and extensible face-swapping framework. arXiv:2005.05535 [cs, eess], June 2021.

Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky.

Progressive and Aligned Pose Attention Transfer for Person Image Generation. arXiv:2103.11622 [cs], March 2021.

Moritz Kappel, Vladislav Golyanik, Mohamed Elgharib, Jann-Ole Henningson, Hans-Peter Seidel, Susana Castillo, Christian Theobalt, and Marcus Magnor. High-Fidelity Neural Human Motion Transfer From Monocular Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1541–1550, 2021.

Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Free-Form Image Inpainting With Gated Convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4471–4480, 2019.

Rui Xu, Xiaoxiao Li, Bolei Zhou, and Chen Change Loy. Deep Flow-Guided Video Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2019.

Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. MaskGAN: Towards Diverse and Interactive Facial Image Manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5549–5558, 2020.

Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the Latent Space of GANs for Semantic Face Editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9243–9252, 2020.

Yuri Viazovetskyi, Vladimir Ivashkin, and Evgeny Kashin.

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. arXiv:1712.05884 [cs], February 2018.

Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. arXiv:2102.12092 [cs], February 2021.

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and Improving the Image Quality of StyleGAN. arXiv:1912.04958 [cs, eess, stat], March 2020.

Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233 [cs, stat], June 2021.

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever.

Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4432–4441, 2019.

Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in Style: A StyleGAN Encoder for Image-to-Image Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2287–2296, 2021.

Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319–2323, December 2000. ISSN 0036-8075, 1095-9203. doi:10.1126/science.290.5500.2319.

Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009. ISSN 0273-0979, 1088-9485. doi:10.1090/S0273-0979-09-01249-X.

Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, October 2016. ISSN 0894-0347, 1088-6834. doi:10.1090/jams/852.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, August 2013. ISSN 1939-3539. doi:10.1109/TPAMI.2013.50.

Hang Shao, Abhishek Kumar, and P. Thomas Fletcher. The Riemannian Geometry of Deep Generative Models. arXiv:1711.08014 [cs, stat], November 2017.

Edo Collins, Raja Bala, Bob Price, and Sabine Susstrunk. Editing in Style: Uncovering the Local Semantics of GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5771–5780, 2020.

Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. GANSpace: Discovering Interpretable GAN Controls. arXiv:2004.02546 [cs], December 2020.

Zongze Wu, Dani Lischinski, and Eli Shechtman. StyleSpace Analysis: Disentangled Controls for StyleGAN Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12863–12872, 2021.

Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a Definition of Disentangled Representations. arXiv:1812.02230 [cs, stat], December 2018.

Rameen Abdal, Peihao Zhu, Niloy Mitra, and Peter Wonka. StyleFlow: Attribute-conditioned Exploration of StyleGAN-Generated Images using Conditional Continuous Normalizing Flows. arXiv:2008.02401 [cs], September 2020.

Ali Jahanian, Lucy Chai, and Phillip Isola. On the "steerability" of generative adversarial networks. arXiv:1907.07171 [cs], February 2020.

Willi Menapace, Stephane Lathuiliere, Sergey Tulyakov, Aliaksandr Siarohin, and Elisa Ricci. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10061–10070, 2021.

Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler.