Say goodbye to complex GAN and transformer architectures for image generation. This new method by Chenlin Meng et al. from Stanford University and Carnegie Mellon University can generate new images from any user-provided input.
Even people like me, with zero artistic skills, can now generate beautiful images or modifications out of quick sketches. It may sound weird at first, but just by adding noise to the input, they can smooth out undesirable artifacts, like the user edits, while preserving the overall structure of the image.
So the image now looks like this: complete noise, but we can still see some shapes of the image, strokes, and specific colors. This new noisy input is then sent to the model to reverse the process and generate a new version of the image following this overall structure.
That means it will follow the overall shapes and colors of the image, but loosely enough to create new features, like replacing the sketch with a real-looking beard. Learn more in the video and watch the amazing results!
►Read the full article:
►My Newsletter (A new AI application explained weekly to your emails!):
►SDEdit, Chenlin Meng et al., 2021,
►Project link:
►Code:
►Demo:
00:00
Say goodbye to complex GAN and transformer architectures for image generation. This new method by Chenlin Meng et al. from Stanford University and Carnegie Mellon University can generate new images from any user-provided input. Even people like me, with zero artistic skills, can now generate beautiful images or modifications out of quick sketches.
00:21
It may sound weird at first, but just by adding noise to the input, they can smooth out undesirable artifacts, like the user edits, while preserving the overall structure of the image. So the image now looks like this: complete noise, but we can still see some shapes of the image, strokes, and specific colors. This new noisy input is then sent to the model to reverse the process and generate a new version of the image following this overall structure, meaning that it will follow the overall shapes and colors of the image, but loosely enough that it can create new features, like replacing the sketch with a real-looking beard.
00:56
In the same way, you can send a complete draft of an image like this, add noise to it, and it will remove the noise by simulating the reverse steps. This way, it will gradually improve the quality of the generated image, following a specific dataset's style, from any input. This is why you don't need any drawing skills anymore: since it generates the image from noise, it has no idea of, and doesn't need to know, the initial input before the noise was applied.
01:21
This is a big difference and a huge advantage compared to other generative networks, like conditional GANs, where you train a model to go from one style to another with image pairs coming from two different but related datasets. By the way, if you find this interesting, don't forget to subscribe, like the video, and share it with your friends or colleagues. It helps a lot, thank you!
01:42
This model, called SDEdit, uses stochastic differential equations, or SDEs, which means that by injecting Gaussian noise, they transform any complex data distribution into a known prior distribution. This known prior distribution is seen during training, and it is what the model is trained on to reconstruct the image. So the model learns how to transform this Gaussian noisy input into a less noisy image, and it repeats this step until we have an image following the desired style.
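To make this concrete, here is a minimal sketch in PyTorch (not the authors' code) of the forward perturbation: adding Gaussian noise at an increasing level pushes any image toward the same known Gaussian prior. The linear noise schedule and the `sigma_max` value are simplifying assumptions, not values from the paper.

```python
# A minimal sketch of the forward perturbation, assuming a simple linear
# noise schedule (the paper uses a proper SDE; this is only an illustration).
import torch

def perturb(image: torch.Tensor, t: float, sigma_max: float = 50.0) -> torch.Tensor:
    """Perturb `image` to noise level t in [0, 1].

    Near t = 1 the output is dominated by Gaussian noise, i.e. it has
    effectively joined the known prior the model is trained to reverse.
    """
    sigma = t * sigma_max  # assumed linear schedule
    return image + sigma * torch.randn_like(image)
```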
02:12
This method works with whatever type of input because, if you add enough noise to it, the image will become so noisy that it joins the known distribution. Then the model can take this known distribution and do the reverse steps, denoising the image based on what it was trained on.
02:28
Indeed, just like GANs, we need a target dataset, which is the kind of data or images we want to generate. For example, to generate realistic faces, we need a dataset full of realistic faces. Then we add noise to these face images and teach the model to denoise them iteratively. And this is the beauty of this model: once it has learned how to denoise an image, we can pretty much do anything to the image before adding noise to it, like adding strokes, since they are blended into the expected image distribution by the noise we are adding.
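As a rough illustration of that training recipe, the sketch below adds noise at a random level to clean images from the target dataset and trains a network to recover them. The `denoiser` network, its `(images, noise_levels)` signature, and the loss are all hypothetical stand-ins, not the authors' architecture or objective.

```python
import torch
import torch.nn.functional as F

def train_step(denoiser: torch.nn.Module,
               images: torch.Tensor,
               optimizer: torch.optim.Optimizer,
               sigma_max: float = 50.0) -> float:
    """One denoising training step on a batch of clean target-dataset images."""
    t = torch.rand(images.size(0), 1, 1, 1)               # random noise level per image
    noisy = images + t * sigma_max * torch.randn_like(images)
    pred = denoiser(noisy, t.flatten())                   # network predicts the clean image
    loss = F.mse_loss(pred, images)                       # simple denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```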
03:02
Typically, editing an image based on such strokes is a challenging task for a GAN architecture, since these strokes are extremely different from the image and from what the model has seen during training. A GAN architecture would need two datasets to fix this: the target dataset, which is the one we try to imitate, and a source dataset, which contains the images with strokes that we are trying to edit. These are called paired datasets, because we need each image to come in pairs in both datasets to train our model on. We also need to define a proper loss function to train it, making the image synthesis process very expensive and time-consuming.
03:40
In our case, with SDEdit, we do not need any paired datasets, since the stroke and image styles are merged by this noise. This makes the new noisy image part of the known data for the model, which uses it to generate a new image very similar to the training dataset but taking the new structure into account. In other words, it can easily take an edited image as input, blur it just enough, but not too much, to keep the global semantics and structural details, and denoise it to produce a new image that magically takes your edits into account. And the model wasn't even trained with stroke or edit examples, only with the original images.
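Here is a simplified sketch of that editing procedure, reusing the hypothetical `denoiser` from the training sketch: the edited image is noised only part of the way toward the prior (t0 < 1 preserves its overall structure), then iteratively denoised. The predict-then-renoise stepping is a crude stand-in for the paper's reverse-SDE solver.

```python
import torch

@torch.no_grad()
def sdedit(denoiser: torch.nn.Module,
           edited_image: torch.Tensor,
           t0: float = 0.5,
           steps: int = 100,
           sigma_max: float = 50.0) -> torch.Tensor:
    """Noise an edited image up to level t0, then denoise it back step by step."""
    sigmas = torch.linspace(t0 * sigma_max, 0.0, steps + 1)
    x = edited_image + sigmas[0] * torch.randn_like(edited_image)  # partial noising
    for i in range(steps):
        t = sigmas[i] / sigma_max                         # current noise level in [0, 1]
        x0_hat = denoiser(x, t.expand(x.size(0)))         # current clean estimate
        x = x0_hat + sigmas[i + 1] * torch.randn_like(x)  # re-noise to the next, lower level
    return x
```

Setting t0 close to 1 would start from pure noise and give unconditional generation, while a lower t0 stays more faithful to the user's edit.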
04:20
Of course, in the case of a simple user edit, they carefully designed the architecture to only generate the edited part and not recreate the whole picture. This is super cool because it enables applications such as conditional image generation, stroke-based image synthesis and editing, image inpainting, colorization, and other inverse problems to be solved using a single unconditional model, without retraining it. Of course, this will still work for only one generation style, which will be the dataset it was trained on. However, it's still a big advantage, as you only need one dataset instead of multiple related datasets, as with a GAN-based image inpainting network, as we discussed. The only downside may be the time needed to generate the new image, as this iterative process takes much more time than a single pass through a more traditional GAN-based generative model.
05:13
Still, I'd rather wait a couple of seconds to have great results for an image than have a blurry face in real time. You can try it yourself with the code they made publicly available, or use the demo on their website; both are linked in the description. Let me know what you think of this model. I'm excited to see what will happen with this SDE-based method in a couple of months, or even less. Thank you for watching!
[Music]