In a new paper titled Total Relighting, a research team at Google presents a novel per-pixel lighting representation within a deep learning framework. It explicitly models the diffuse and specular components of appearance, producing relit portraits with convincingly rendered effects like specular highlights. This would be super cool to use in your next Zoom meeting! See how it works and what it can do below.
References
Full reference: Pandey et al., 2021, Total Relighting: Learning to Relight Portraits for Background Replacement, doi: 10.1145/3450626.3459872
Video Transcript
Have you ever wanted to change the background of a picture and have it look realistic? If you've already tried, you know it isn't simple. You can't just take a picture of yourself at home and swap the background for a beach; it just looks bad and unrealistic, and anyone will say "that's photoshopped" in a second. For movies and professional videos, you need perfect lighting and artists to reproduce a high-quality image, and that's super expensive. There's no way you can do that with your own pictures. Or can you?

Well, this is what Google Research is trying to achieve with this new paper called Total Relighting. The goal is to properly relight any portrait based on the lighting of the new background you add. This task is called "portrait relighting and background replacement", which, as its name says, involves two very complicated sub-tasks: background replacement, meaning you need to accurately remove the current image's background to keep only your portrait, and portrait relighting, where you adapt your portrait to the lighting of the new background's scene. As you may expect, both tasks are extremely challenging, as the algorithm needs to understand the first image well enough to cut you out of it, and then understand the second image well enough to change the lighting of your portrait so that it fits the new scene. The most impressive thing about this paper is that both tasks are done without any priors, meaning it needs no information beyond two pictures, your portrait and the new background, to create this new realistic image.

Let's get back to how they attacked these two tasks in detail. The first task, removing the background of your portrait, is called image matting, or in this case human matting, where we want to accurately identify a human in a picture. The "accurate" part is what makes it complex, because of fine-grained details like the loose, floating hair humans have. You can't just crop out the face without the hair; it will just look wrong. To achieve this, they train a model that first finds the human, then predicts an approximate result specifying which pixels are surely part of the person, which are part of the background, and which are uncertain. This is called a trimap, and it is produced by a classic segmentation network trained to do exactly that: segment people in images. This trimap is then refined using an encoder-decoder architecture, which I already explained in a previous video if you are interested. It basically takes the initial trimap, downscales it into condensed information, and uses this condensed information to upscale it into a better trimap. This may seem like magic, but it works because the network transforming the trimap into a code, and the code into a better trimap, was trained on thousands of examples and learned how to do it. Then, this second trimap is refined once more into the final predicted human shape, called an alpha matte; this step also uses a neural network. So we basically have three networks involved here: one that takes the image and generates a trimap, a second that takes the image and the trimap to improve the trimap, and a last one that takes all of these as inputs to generate the final alpha matte.
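To make this three-network matting stage more concrete, here is a minimal PyTorch-style sketch of the data flow. It is only an illustration under my own assumptions: the class names (TinyUNet, HumanMattingPipeline), layer sizes, and channel counts are made up for readability and are far smaller than the paper's actual networks.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A deliberately tiny encoder-decoder block.

    The real networks are much deeper U-Nets; this only illustrates the
    encode -> latent code -> decode idea discussed in the transcript.
    """
    def __init__(self, in_ch, out_ch, hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # downscale: condense the input into a smaller code
            nn.Conv2d(hidden, hidden * 2, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(hidden * 2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, out_ch, 3, padding=1),  # upscale back into an image
        )

    def forward(self, x):
        latent = self.encoder(x)     # condensed "latent code"
        return self.decoder(latent)  # new image predicted from that code


class HumanMattingPipeline(nn.Module):
    """Hypothetical three-network matting stage: segment -> refine trimap -> alpha matte."""
    def __init__(self):
        super().__init__()
        self.segmenter = TinyUNet(in_ch=3, out_ch=3)            # image -> coarse trimap (fg / bg / unknown)
        self.trimap_refiner = TinyUNet(in_ch=3 + 3, out_ch=3)   # image + coarse trimap -> better trimap
        self.matting_net = TinyUNet(in_ch=3 + 3, out_ch=1)      # image + refined trimap -> alpha matte

    def forward(self, image):
        coarse_trimap = torch.softmax(self.segmenter(image), dim=1)
        refined_trimap = torch.softmax(
            self.trimap_refiner(torch.cat([image, coarse_trimap], dim=1)), dim=1)
        alpha = torch.sigmoid(
            self.matting_net(torch.cat([image, refined_trimap], dim=1)))
        return alpha  # per-pixel opacity of the person, including fine hair detail


# Usage: a single 256x256 RGB portrait (batch of 1).
if __name__ == "__main__":
    portrait = torch.rand(1, 3, 256, 256)
    alpha = HumanMattingPipeline()(portrait)
    print(alpha.shape)  # torch.Size([1, 1, 256, 256])
```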
All these sub-steps are learned during training, where we show the networks, working together, many examples of what we want so they can iteratively improve the final result. Again, this is very similar to what I covered in my previous video about MODNet, a network doing precisely that, if you want more information about human matting. Here, all these networks make up only the first step of the algorithm: human matting.

What's new with this paper is the second step, which they refer to as the relighting module. Now that we have an accurate prediction of where the person is in the image, we need to make the result look realistic. To do so, it is very important that the lighting on the person matches the background, so they need to relight either the person or the background scene. Here, as most would agree, the simplest option is to relight the person, so that is what they aimed for. This relighting is definitely the more complex of the two tasks, as they needed to understand how the human body reacts to light.

As you can see, there are multiple networks here again: a geometry net, an albedo net, and a shading net. The geometry net takes the foreground we produced in the previous step and produces surface normals. These are a model of the person's surface, so the network can understand depth and light interactions. Then, the surface normals are coupled with the same foreground image and sent into an albedo net that produces the albedo image. The albedo is simply a measure of the proportion of light reflected by our object of interest, in this case a person, under light from different sources. It tells us how the clothing and skin of the person react to the light they receive, which helps with the next step.

This next step has to do with the lighting of the new background. We try to understand how the new background's lighting affects our portrait using learned specular-reflectance and diffuse-light representations of the portrait, here called light maps. These light maps are computed from a panoramic view of the desired background and, just as the name says, they basically show how the light interacts with the subject in many situations. They allow us to make the skin and clothing appear shinier or more matte depending on the background's lighting. Then, these light maps, the albedo image, and the foreground are merged in the final, third network: the shading network. The shading network first produces a final version of the specular light map using the albedo information coupled with all the specular light-map candidates we computed previously. Using this final light map, our diffuse light map, and the albedo, we can finally render the relit person, ready to be inserted into our new background.
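As a rough illustration of how these pieces fit together, here is a hypothetical PyTorch sketch of the relighting module's data flow. The sub-network names follow the transcript (geometry, albedo, shading), but each one is reduced to a single convolution, and the split of the shading step into two placeholder layers, the channel counts, and the number of specular light-map candidates are my own assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RelightingModule(nn.Module):
    """Hypothetical sketch of the relighting stage: normals -> albedo -> shading."""
    def __init__(self, num_specular_candidates=4):
        super().__init__()
        c = num_specular_candidates
        # Each of these would be a full U-Net / encoder-decoder in practice;
        # single convolutions keep the sketch short and runnable.
        self.geometry_net = nn.Conv2d(3, 3, 3, padding=1)          # foreground -> surface normals
        self.albedo_net = nn.Conv2d(6, 3, 3, padding=1)            # foreground + normals -> albedo
        self.specular_net = nn.Conv2d(3 + 3 * c, 3, 3, padding=1)  # albedo + candidates -> final specular map
        self.render_net = nn.Conv2d(12, 3, 3, padding=1)           # albedo + diffuse + specular + foreground -> relit

    def forward(self, foreground, diffuse_map, specular_candidates):
        # 1) Geometry: per-pixel surface normals so the model can reason about depth and light.
        normals = self.geometry_net(foreground)
        # 2) Albedo: how much light the skin and clothing reflect, independent of the lighting.
        albedo = self.albedo_net(torch.cat([foreground, normals], dim=1))
        # 3) Shading, step 1: produce the final specular light map from the candidates
        #    computed from the target background's (panoramic) lighting.
        specular = self.specular_net(torch.cat([albedo, specular_candidates], dim=1))
        # 4) Shading, step 2: render the relit person from albedo, diffuse, and specular maps.
        return self.render_net(torch.cat([albedo, diffuse_map, specular, foreground], dim=1))


# Usage with random tensors standing in for real inputs (1 image, 256x256).
fg = torch.rand(1, 3, 256, 256)          # matted foreground from the first stage
diffuse = torch.rand(1, 3, 256, 256)     # diffuse light map from the target lighting
speculars = torch.rand(1, 12, 256, 256)  # 4 specular light-map candidates stacked along channels
relit = RelightingModule()(fg, diffuse, speculars)
print(relit.shape)  # torch.Size([1, 3, 256, 256])
```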
As you saw in the video, all of these networks look the same, and this shape is called a U-Net, or encoder-decoder architecture. As I already said, it takes an input, condenses it into a code representing that input, and upscales it into a new image. As I explained in previous videos, these encoder-decoders feed the image into the first part of the network, the encoder, which transforms it into condensed information called a latent code. This latent code basically contains the information needed to reconstruct the image in whatever style we want it to have. Using what it learned during training, the decoder then does the reverse step, using this code to produce a new image with the new style. That style can be a new lighting orientation, but also a completely different kind of image, like a surface-normal map or even an alpha matte, just as in our first step.

This technique is extremely powerful mainly because of the training they did: they used 58 cameras with multiple lights and 70 different individuals performing various poses and expressions. But don't worry, this setup is only needed to train the algorithm; the only things needed at inference time are your picture and your new background. Also, you may recall that I mentioned a panoramic view is needed to produce the relit image, but it can also be accurately approximated by another neural network from only the background picture you want your portrait to be placed on.

And that's it! Merging these two techniques means you only have to give two images to the algorithm, and it will do everything for you, producing a realistically relit portrait of yourself on a different background! This paper by Pandey et al. applies the technique to humans, but you can imagine how useful it could be for objects as well: you could simply take pictures of objects and place them in a new scene with the correct lighting to make them look real.

Thank you for watching!
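To recap the whole pipeline in one place, here is a short, hypothetical sketch of how the two stages described above could be chained at inference time. The function name and arguments are illustrative only; matting_pipeline, relighting_module, and light_maps_fn stand in for the trained components, and the final step is ordinary alpha compositing.

```python
import torch

def total_relight(portrait, background, matting_pipeline, relighting_module, light_maps_fn):
    """Hypothetical end-to-end inference flow: matting, relighting, then compositing."""
    # Stage 1: human matting -> per-pixel alpha of the person.
    alpha = matting_pipeline(portrait)          # (1, 1, H, W), values in [0, 1]
    foreground = portrait * alpha               # keep only the person

    # Lighting of the new scene: diffuse and specular light maps, estimated either
    # from a panorama or directly from the background image by another network.
    diffuse_map, specular_candidates = light_maps_fn(background)

    # Stage 2: relight the person to match the new background's lighting.
    relit_person = relighting_module(foreground, diffuse_map, specular_candidates)

    # Classic alpha compositing onto the new background.
    return alpha * relit_person + (1.0 - alpha) * background
```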