Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it's time for its big brother, DALL·E 2, which generates photorealistic images from text at four times the resolution. The recent model also learned a new skill: image inpainting. It can edit those images and make them look even better, or simply add a feature you want, like some flamingos in the background. Learn more in the video!
Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it's time for its big brother, DALL·E 2. And you won't believe the progress in a single year! DALL·E 2 is not only better at generating photorealistic images from text; the results are four times the resolution! As if that wasn't already impressive enough, the recent model learned a new skill: image inpainting. DALL·E could generate images from text inputs. DALL·E 2 can do it better, but it doesn't stop there. It can also edit those images and make them look even better, or simply add a feature you want, like some flamingos in the background. Sounds interesting? Learn more in the video!
References
►Read the full article:
►A. Ramesh et al., 2022, DALL·E 2 paper:
►OpenAI's blog post:
►Risks and limitations:
►OpenAI's DALL·E Instagram page:
►My newsletter (a new AI application explained weekly, straight to your inbox!):
Video Transcript
Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from a text input with incredible results. Now it's time for its big brother, DALL·E 2, and you won't believe the progress in a single year. DALL·E 2 is not only better at generating photorealistic images from text; the results are four times the resolution! As if that wasn't already impressive enough, the recent model learned a new skill: image inpainting.

DALL·E could generate images from text inputs. DALL·E 2 can do it better, but it doesn't stop there. It can also edit those images and make them look even better, or simply add a feature you want, like some flamingos in the background. This is what image inpainting is: we take a part of an image and replace it with something else, following the style and reflections in the image, keeping realism. Of course, it doesn't just replace a part of the image at random; that would be too easy for OpenAI. This inpainting process is also text-guided, which means you can tell it to add a flamingo here, there, or even there.

Before diving into the nitty-gritty of this newest DALL·E model, let me talk a little about this episode's sponsor, Weights & Biases. If you are not familiar with Weights & Biases, you are most certainly new here and should definitely subscribe to the channel. Weights & Biases allows you to keep track of all your experiments with only a handful of lines added to your code. One feature I love is how you can quickly create and share amazing-looking interactive reports like this one, clearly showing your team or future self your runs' metrics, hyperparameters, and data configurations, alongside any notes you or your team had at the time. It's a powerful feature to either add quick comments on an experiment or create polished pieces of analysis. Reports can also be used as dashboards for reporting a smaller subset of metrics than the main workspace. You can even create public view-only links to share with anyone. Easily capturing and sharing your work is essential if you want to grow as an ML practitioner, which is why I recommend using tools that improve your work, like Weights & Biases. Just try it with the first link below and start sharing your work like a pro.
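If you're wondering what "a handful of lines" looks like in practice, here's a minimal sketch of experiment tracking with wandb. The project name, config values, and the logged metric are placeholders of my own, not something from the video:

```python
import wandb

# Hypothetical training run: project name, config, and metric are
# illustrative placeholders, not from the video.
wandb.init(project="my-experiments", config={"lr": 1e-4, "batch_size": 32})

for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    wandb.log({"train/loss": loss, "step": step})

wandb.finish()
```

Everything logged this way shows up in the web dashboard, where the reports mentioned above are built.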
Now let's dive into how DALL·E 2 can not only generate images from text but is also capable of editing them. Indeed, this new inpainting skill the network has learned comes from its better understanding of concepts and of the images themselves, locally and globally. What I mean by locally and globally is that DALL·E 2 has a deeper understanding of why the pixels next to each other have these colors, as it understands the objects in the scene and their interrelations. This way, it is able to understand that this water has reflections and that the object on the right should also be reflected there.

It also understands the global scene, which is what is happening, just as if you were to describe what was going on when the person took the photo. Here, you'd say that this photo does not exist, obviously, or else I'm definitely down to try that. If we forget that this is impossible, you'd say that the astronaut is riding a horse in space. So if I were to ask you to draw the same scene but on a planet rather than in free space, you'd be able to picture something like that, since you understand that the horse and the astronaut are the objects of interest to keep in the picture. This seems obvious, but it's extremely complex for a machine that only sees pixels of colors, which is why DALL·E 2 is so impressive to me.

But how exactly does the model understand the text we send it and generate an image out of it? Well, it's pretty similar to the first model I covered on the channel. It starts by using the CLIP model by OpenAI to encode both a text and an image into the same domain: a condensed representation called a latent code. Then it takes this encoding and uses a generator, also called a decoder, to generate a new image that means the same thing as the text, since it comes from the same latent code. So DALL·E 2 has two steps: CLIP to encode the information, and the new decoder model to take this encoded information and generate an image out of it.

These two separate steps are also why we can generate variations of the images. We can simply change the encoded information randomly, just a little, making it move a tiny bit in the latent space, so that it still represents the same sentence while having all different values, creating a different image representing the same text.

As we see here, it initially takes a text input and encodes it. What we see above is the first step of the training process, where we also feed it an image and encode it using CLIP so that images and text are encoded similarly, following the CLIP objective. Then, for generating a new image, we switch to the section below, where we use the text encoding, guided by CLIP, and transform it into an image-ready encoding. This transformation is done using a diffusion prior, which we will cover shortly, as it is very similar to the diffusion model used for the final step. Finally, we take our newly created image encoding and decode it into a new image using the diffusion decoder.

A diffusion decoder, or diffusion model, is a kind of model that starts with random noise and learns how to iteratively change this noise to get back to an image. It learns to do that by doing the opposite during training: we feed it images and apply random Gaussian noise to the image iteratively until we can't see anything other than noise. Then we simply reverse the model to generate images from noise. If you'd like more detail about this kind of network, which is really cool, I invite you to watch the video I made about them. And voilà, this is how DALL·E 2 generates such high-quality images following text.
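To make the two-step pipeline more concrete, here's a rough sketch in Python. The CLIP calls use OpenAI's open-source CLIP library, which is publicly available; the `prior` and `decoder` functions are placeholder stand-ins of my own, since DALL·E 2's diffusion prior and diffusion decoder are not public:

```python
import torch
import clip  # OpenAI's open-source CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Step 1: encode the prompt into CLIP's shared text/image latent space.
tokens = clip.tokenize(["an astronaut riding a horse"]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(tokens)  # shape: (1, 512)

# Step 2 (hypothetical stand-ins): DALL·E 2's diffusion prior maps the text
# embedding to an image embedding, and its diffusion decoder renders pixels.
# Neither model is public, so these are placeholders, not the real thing.
def prior(text_emb):       # would be the trained diffusion prior
    return text_emb        # identity stand-in

def decoder(image_emb):    # would be the trained diffusion decoder
    return torch.zeros(1, 3, 256, 256)  # blank stand-in image

image_embedding = prior(text_embedding)
image = decoder(image_embedding)

# Variations: nudge the image embedding slightly in latent space. A real
# decoder would produce a different image that still matches the prompt.
variation_embedding = image_embedding + 0.05 * torch.randn_like(image_embedding)
variation = decoder(variation_embedding)
```

The key design idea this illustrates is the separation of concerns: because the prompt and the image meet in one shared latent space, swapping or perturbing the latent code changes the picture without retraining anything.

And here is a minimal sketch of the forward noising process the diffusion decoder is trained to reverse, assuming the standard DDPM formulation with a linear noise schedule (the schedule values and the image are illustrative):

```python
import torch

# Forward (noising) process used during diffusion training: blend an image
# with Gaussian noise over T steps until only noise remains.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal kept at step t

def noisy_image_at(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps

x0 = torch.rand(3, 64, 64)           # stand-in image with values in [0, 1]
x_half = noisy_image_at(x0, T // 2)  # partially noised
x_last = noisy_image_at(x0, T - 1)   # essentially pure noise
```

The decoder learns to predict the noise added at each step, so at generation time it can start from pure noise and walk this process backwards to an image.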
It's super impressive and tells us that the model does understand the text. But does it deeply understand what it created? Well, it sure looks like it. It's the capability of inpainting images that makes us believe it understands the pictures pretty well. But why is that? How can it link a text input to an image and understand the image well enough to replace only some parts of it without affecting the realism? This is all because of CLIP, as it links a text input to an image. If we encode our newly generated image back and use a different text input to guide another generation, we can generate a second version of the image that replaces only the wanted region of our first generation, and you end up with this picture.

Unfortunately, the code isn't publicly available and isn't in their API yet. The reason for that, as per OpenAI, is to study the risks and limitations of such a powerful model. They actually discuss these potential risks and the reasons for keeping it private in their paper and in a great repository I linked in the description below, if you are interested. They also opened an Instagram account to share more results, if you'd like to see that; it's also linked below. I loved DALL·E, and this one is even cooler.

Of course, this was just an overview of how DALL·E 2 works, and I strongly invite you to read their great paper, linked below, for more detail on their implementation of the model. I hope you enjoyed this video as much as I enjoyed making it, and I will see you next week with another amazing paper. Thank you for watching!