We’ve heard of deepfakes, we’ve heard of NeRFs, and we’ve seen these kinds of applications allowing you to recreate someone’s face and pretty much make him or her say whatever you want.
What you might not know is how inefficient those methods are and how much computing and time they require. Plus, we only see the best results. Keep in mind that what we see online are the results associated with the faces we could find the most examples of, so basically internet personalities, and the models producing those results are trained using lots of computing, meaning expensive resources like many graphics cards. Still, the results are really impressive and only getting better.

Fortunately, some people like Jiaxiang Tang and colleagues are working on making those methods more available and effective with a new model called RAD-NeRF. From a single video, they can synthesize the person talking for pretty much any word or sentence in real time, with better quality. You can animate a talking head following any audio track in real time. This is both so cool and so scary at the same time...
Learn more in the video below.
References
►Tang, J., Wang, K., Zhou, H., Chen, X., He, D., Hu, T., Liu, J., Zeng, G. and Wang, J., 2022. Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition. arXiv preprint arXiv:2211.12368.
►Results/project page:
Video Transcript
[Music]

We've heard of deepfakes, we've heard of NeRFs, and we've seen these kinds of applications allowing you to recreate someone's face and pretty much make him say whatever you want. What you might not know is how inefficient those methods are and how much computing and time they require. Plus, we only see the best results. Keep in mind that what we see online are the results associated with the faces we could find the most examples of, so basically internet personalities, and the models producing those results are trained using lots of computing, meaning expensive resources like many graphics cards. Still, the results are really impressive and only getting better.

Fortunately, some people like Jiaxiang Tang and colleagues are working on making those methods more available and effective with a new model called RAD-NeRF. But let's hear that from their own model: "Hello, thanks for watching the supplementary video for our paper, Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition. Our method is person-specific and only needs a three-to-five-minute monocular video for training. After training, the model can synthesize realistic talking heads driven by arbitrary audio in real time while keeping comparable or better rendering quality compared to previous methods."

So you heard that right: from a single video, they can synthesize the person talking for pretty much any word or sentence in real time, with better quality. You can animate a talking head following any audio track in real time. This is both so cool and so scary at the same time. Just imagine what could be done if we could make you say anything. At least they still need access to a video of you speaking in front of the camera for five minutes, so it's hard to achieve that without you knowing. Still, as soon as you appear online, anyone will be able to use such a model and create infinite videos of you talking about anything they want. They can even host live streams with this method, which is even more dangerous and makes it even harder to tell whether it's real or not.

Anyway, even though this is interesting, and I'd love to hear your thoughts in the comments to keep the discussion going, here I wanted to cover something that is only positive and exciting: the science. More precisely, how did they manage to animate talking heads in real time from any audio using only a video of the face? As they state, their RAD-NeRF model can run 500 times faster than previous works, with better rendering quality and more control. You may ask how that is possible; we usually trade quality for efficiency, yet they managed to improve both. These immense improvements are possible thanks to three main points. The first two are related to the architecture of the model, more specifically how they adapted the NeRF approach to make it more efficient and with improved motions of the torso and head.

The first step is to make NeRFs more efficient. I won't dive into how NeRFs work since we covered it numerous times. Basically, it's an approach based on neural networks to reconstruct 3D volumetric scenes from a bunch of 2D images, which means regular images. This is why they take a video as input, as it basically gives you a lot of images of a person from many different angles. A NeRF usually uses a network to predict all the pixels' colors and densities from the camera viewpoint you are visualizing and does that for all viewpoints you want to show when rotating around the subject, which is extremely computation-hungry, as you are predicting multiple parameters for each coordinate in the image every time, and you are learning to predict all of them.
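To make that cost concrete, here is a minimal, illustrative sketch of the vanilla NeRF idea just described. This is not the authors' code, and every name and size in it is made up: a small MLP is queried at every sample point along every camera ray, for every frame, which is what makes the naive approach so expensive.

```python
# Minimal sketch of a vanilla NeRF-style renderer (illustrative only, not RAD-NeRF).
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),          # RGB color + density for each 3D point
        )

    def forward(self, points):             # points: (num_rays, samples_per_ray, 3)
        rgb_sigma = self.mlp(points)
        rgb = torch.sigmoid(rgb_sigma[..., :3])
        sigma = torch.relu(rgb_sigma[..., 3])
        return rgb, sigma

def volume_render(rgb, sigma, deltas):
    """Standard volume rendering: accumulate color along each ray."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                       # (rays, samples)
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)                # (rays, 3)

# Even a tiny 50x50 image with 64 samples per ray means 160,000 MLP queries per frame.
with torch.no_grad():
    model = TinyNeRF()
    points = torch.rand(50 * 50, 64, 3)          # sample points along each camera ray
    deltas = torch.full((50 * 50, 64), 0.01)     # distances between samples
    rgb, sigma = model(points)
    image = volume_render(rgb, sigma, deltas).reshape(50, 50, 3)
```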
Plus, in their case, it isn't only a NeRF producing a 3D scene; it also has to match an audio input and fit the lips, mouth, eyes, and movements with what the person says. Instead of predicting all the pixel densities and colors matching the audio for a specific frame, they work with two separate, new, and condensed spaces called grid spaces, or grid-based NeRF. They translate their coordinates into a smaller 3D grid space, translate the audio into a smaller 2D grid space, and then send them to render the head. This means they never merge the audio data with the spatial data, which would increase the size exponentially by adding two-dimensional inputs to each coordinate. Reducing the size of the audio features, along with keeping the audio and spatial features separate, is what makes the approach so much more efficient.

But how can the results be better if they use condensed spaces that have less information? By adding a few controllable features, like an eye-blinking control, to the grid-based NeRF, the model learns more realistic behaviors for the eyes compared to previous approaches, something really important for realism.

The second improvement they've made is to model the torso with another NeRF using the same approach, instead of trying to model it with the same NeRF used for the head. This requires far fewer parameters and addresses different needs, as the goal here is to animate moving heads and not whole bodies. Since the torso is pretty much static in these cases, they use a much simpler and more efficient NeRF-based module that only works in 2D, working in the image space directly, instead of using camera rays as we usually do with NeRFs to generate many different angles, which aren't needed for the torso. So it is basically much more efficient because they modified the approach for this very specific use case of rigid-torso, moving-head videos. They then recompose the head with the torso to produce the final video, and voilà! This is how you produce talking-head videos over any audio input, super efficiently.
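Here is a rough sketch of that decomposition idea, under my own assumptions: simple learnable grids stand in for the real encoders, and all module names, feature sizes, the eye-blink input, and the final head-torso compositing step are illustrative, not taken from the paper's implementation.

```python
# Illustrative sketch of audio-spatial decomposition: compact spatial and audio
# features are looked up in separate grids and only concatenated at the very end,
# so the audio never inflates the spatial input dimension. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridEncoder3D(nn.Module):
    """Looks up a compact spatial feature from a small 3D grid."""
    def __init__(self, resolution=64, features=8):
        super().__init__()
        self.grid = nn.Parameter(
            torch.randn(1, features, resolution, resolution, resolution) * 0.01)

    def forward(self, xyz):                        # xyz in [-1, 1], shape (N, 3)
        coords = xyz.view(1, -1, 1, 1, 3)
        feat = F.grid_sample(self.grid, coords, align_corners=True)
        return feat.view(self.grid.shape[1], -1).t()      # (N, features)

class GridEncoder2D(nn.Module):
    """Looks up a compact audio feature from a small 2D grid."""
    def __init__(self, resolution=64, features=4):
        super().__init__()
        self.grid = nn.Parameter(torch.randn(1, features, resolution, resolution) * 0.01)

    def forward(self, ab):                         # 2D audio coordinates in [-1, 1], shape (N, 2)
        coords = ab.view(1, -1, 1, 2)
        feat = F.grid_sample(self.grid, coords, align_corners=True)
        return feat.view(self.grid.shape[1], -1).t()      # (N, features)

class DecomposedHead(nn.Module):
    """Spatial and audio features stay separate until a small MLP fuses them."""
    def __init__(self):
        super().__init__()
        self.spatial = GridEncoder3D()
        self.audio = GridEncoder2D()
        self.mlp = nn.Sequential(nn.Linear(8 + 4 + 1, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, xyz, audio_coord, eye_blink):
        feat = torch.cat([self.spatial(xyz), self.audio(audio_coord), eye_blink], dim=-1)
        rgb_sigma = self.mlp(feat)
        return torch.sigmoid(rgb_sigma[..., :3]), torch.relu(rgb_sigma[..., 3])

def compose_frame(head_rgb, head_alpha, torso_rgb):
    """Head and torso are rendered separately, then composited into the final frame."""
    return head_alpha * head_rgb + (1.0 - head_alpha) * torso_rgb

# Hypothetical usage: a batch of ray samples, a per-frame audio coordinate, and a blink value.
xyz = torch.rand(1024, 3) * 2 - 1             # sampled 3D points along the head rays
audio = torch.rand(1024, 2) * 2 - 1           # per-frame audio feature mapped to 2D
blink = torch.full((1024, 1), 0.2)            # eye-blink control in [0, 1]
rgb, sigma = DecomposedHead()(xyz, audio, blink)
```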
Of course, this was just an overview of this new and exciting research publication, and they make other modifications during the training of their algorithm to make it more efficient, which is the third point I mentioned at the beginning of the video, if you were wondering. I invite you to read their paper for more information; the link is in the description below.

Before you leave, I just wanted to thank the people who recently supported this channel through Patreon. This is not necessary and is strictly to support the work I do here. Huge thanks to Artem Vladiken, Leopoldo Altamurano, J Cole, Michael Carichao, Daniel Gimness, and a few anonymous generous donors. It will be greatly appreciated if you also want to and can afford to support my work financially; the link to my Patreon page is in the description below as well. But no worries if not: a sincere comment below this video is all I need to be happier. I hope you've enjoyed this video, and I will see you next week with another amazing paper!

[Music]