We’ve seen models able to take a sentence and generate images. We’ve also seen other approaches that manipulate the generated images by learning specific concepts, like an object or a particular style. Last week, Meta published the Make-A-Video model that I covered, which allows you to generate a short video, also from a text sentence. The results aren’t perfect yet, but the progress we’ve made in the field in the past year is just incredible. This week we take another step forward. Here’s DreamFusion, a new Google Research model that can understand a sentence enough to generate a 3D model of it. You can see this as a DALL·E or Stable Diffusion, but in 3D. How cool is that?! We can’t really make it much cooler. But what’s even more fascinating is how it works. Let’s dive into it...
References
►Read the full article:
►Poole, B., Jain, A., Barron, J.T. and Mildenhall, B., 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv preprint arXiv:2209.14988.
►Project website:
►My Newsletter (a new AI application explained weekly, straight to your inbox!):
Video Transcript
We've seen models able to take a sentence and generate images, then other approaches to manipulate the generated images by learning specific concepts, like an object or a particular style. Last week, Meta published the Make-A-Video model that I covered, which allows you to generate a short video, also from a text sentence. The results aren't perfect yet, but the progress we've made in the field since last year is just incredible. This week we take another step forward. Here's DreamFusion, a new Google Research model that can understand a sentence enough to generate a 3D model out of it. You can see this as a DALL·E or Stable Diffusion, but in 3D. How cool is that? We can't make it much cooler. But what's even more fascinating is how it works. Let's dive into it. But first, give me a few seconds to talk about a related subject, computer vision. You'll want to hear this if you are in this field as well.

For this video, I'm partnering with Encord, the online learning platform for computer vision. Data is one of the most important parts of creating an innovative computer vision model. That's why the Encord platform has been built from the ground up to make the creation of training data and the testing of machine learning models quicker than it's ever been. Encord does this in two ways. First, it makes it easier to manage, annotate, and evaluate training data through a range of collaborative annotation tools and automation features. Secondly, Encord offers access to its QA workflows, APIs, and SDK, so you can create your own active learning pipelines, speeding up model development. And by using Encord, you don't need to waste time building your own annotation tools, letting you focus on getting the right data into your models. If that sounds interesting, please click the first link below to get a free 28-day trial of Encord, exclusive to our community.

If you've been following my work, DreamFusion is quite simple: it basically uses two models I already covered, NeRFs and one of the text-to-image models. In their case it's the Imagen model, but any would do, like Stable Diffusion or DALL·E. As you know if you've been a good student and watched the previous videos, NeRFs are a kind of model used to render 3D scenes by generating a neural radiance field out of one or more images of an object. But then, how can you generate a 3D render from text if the NeRF model only works with images? Well, we use Imagen, the other AI, to generate image variations from the ones it takes. And why do we do that instead of directly generating 3D models from text? Because that would require huge datasets of 3D data, along with their associated captions, for our model to be trained on, which would be very difficult to get. Instead, we use a pre-trained text-to-image model, with much less complex data to gather, and we adapt it to 3D. So it doesn't require any 3D data to be trained on, only a pre-existing AI for generating images. It's really cool how we can reuse powerful technologies for new tasks like this by interpreting the problem differently.

So if we start from the beginning, we have a NeRF model. As I explained in previous videos, this type of model takes images to predict the pixels in each novel view, creating a 3D model by learning from image pairs of the same object with different viewpoints. In our case, we do not start with images directly; we start with the text and sample a random view orientation we want to generate an image for. Basically, we are trying to create a 3D model by generating images of all possible angles a camera could cover looking around the object, and guessing the pixel colors, densities, light reflections, etc., everything needed to make it look realistic. Thus, we start with a caption and add a small tweak to it depending on the random camera viewpoint we want to generate. For example, we may want to generate a front view, so we would append "front view" to the caption. On the other side, we use the same angle and camera parameters for our initial, untrained NeRF model to predict the first rendering. Then we generate an image version guided by our caption and the initial rendering with added noise, using Imagen, our pre-trained text-to-image model, which I further explained in my Imagen video if you are curious to see how it does that. So our Imagen model will be guided by the text input as well as the current rendering of the object with added noise. Here, we add noise because this is what the Imagen module can take as input; it needs to be part of a noise distribution it understands. We use the model to generate a higher-quality image, take the image used to generate it, and remove the noise we manually added, using this result to guide and improve our NeRF model for the next step. We do all that to better understand where in the image the NeRF model should focus its attention to produce better results at the next step, and we repeat that until the 3D model is satisfying enough. You can then export this model to a mesh and use it in a scene of your choice. And before some of you ask: no, you don't have to retrain the image generator model. As they say so well in the paper, it just acts as "a frozen critic that predicts image-space edits." And voilà! This is how DreamFusion generates 3D renderings from text inputs.

If you'd like to have a deeper understanding of the approach, have a look at my videos covering NeRFs and Imagen. I also invite you to read their paper for more details on this specific method. Thank you for watching the whole video, and I will see you next week with another amazing paper!
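The loop described above (sample a random camera, tweak the caption for that viewpoint, render with the NeRF, add noise, let the frozen image model suggest an improved image, and use that edit to update the NeRF) can be sketched in toy Python. To be clear, this is a heavily simplified illustration, not the authors' implementation: `render_nerf` and `frozen_diffusion_denoise` are made-up stand-ins for the real NeRF renderer and the frozen Imagen model (here the "denoiser" just nudges toward a hypothetical `target` image, whereas the real model is guided only by the text prompt), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

VIEWS = ["front view", "side view", "back view", "overhead view"]

def augment_caption(caption, view):
    # Tweak the prompt to match the sampled camera viewpoint,
    # e.g. "a small castle, front view".
    return f"{caption}, {view}"

def render_nerf(nerf_params, view):
    # Stand-in for a NeRF render; a real NeRF would produce a
    # view-dependent image from learned density/color fields.
    return nerf_params

def frozen_diffusion_denoise(noisy_image, prompt, target):
    # Stand-in for the frozen text-to-image model ("frozen critic"):
    # it nudges the noisy render toward an image matching the prompt
    # (faked here with a known `target` array).
    return noisy_image + 0.5 * (target - noisy_image)

def dreamfusion_loop(caption, target, steps=200, lr=0.5, noise_scale=0.1):
    nerf_params = np.zeros_like(target)          # untrained NeRF
    for _ in range(steps):
        view = rng.choice(VIEWS)                 # random camera orientation
        prompt = augment_caption(caption, view)  # view-dependent caption
        render = render_nerf(nerf_params, view)
        noise = noise_scale * rng.standard_normal(render.shape)
        noisy = render + noise                   # noised render fed to the critic
        denoised = frozen_diffusion_denoise(noisy, prompt, target)
        # (denoised - noisy) is the "image-space edit" the frozen critic
        # suggests; push the NeRF parameters in that direction. Only the
        # NeRF is updated; the image model is never retrained.
        nerf_params = nerf_params + lr * (denoised - noisy)
    return nerf_params
```

Even in this toy form, the structure mirrors the video's description: the text-to-image model stays frozen and only supplies per-step edits, while the NeRF parameters are the only thing being optimized.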