Authors:
(1) Han Jiang, HKUST and Equal contribution ([email protected]);
(2) Haosen Sun, HKUST and Equal contribution ([email protected]);
(3) Ruoxuan Li, HKUST and Equal contribution ([email protected]);
(4) Chi-Keung Tang, HKUST ([email protected]);
(5) Yu-Wing Tai, Dartmouth College ([email protected]).
Warmup Training. Our training-image pre-processing stage provides a good initialization for coarse convergence. Before fine-tuning on these images to achieve fine convergence, note that the 3D object in the NeRF is still the original object, intact, and thus differs considerably in both geometry and appearance from what is depicted in the pre-processed training images. Therefore, in the first stage of NeRF training, we directly train the NeRF on the pre-processed images, without any fine-tuning, until coarse convergence. Note that when we later perform fine-tuning, the original object has no effect, because warmup training has already erased all of its appearance information.
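To make the two-stage schedule concrete, the following is a minimal sketch, not the paper's implementation: `nerf_train_step` (one optimization step on a set of training images), `idu_update` (the per-view dataset refresh, sketched under Iterative Dataset Update below), the `render_rgb`/`render_depth` accessors, and all iteration counts are hypothetical stand-ins.

```python
def train(nerf, preprocessed_images, prompt,
          warmup_steps=5000, idu_rounds=10, steps_per_round=500):
    # Stage 1 (warmup): train directly on the pre-processed images until
    # coarse convergence, erasing the original object's appearance.
    for _ in range(warmup_steps):
        nerf_train_step(nerf, preprocessed_images)

    # Stage 2 (fine-tuning): iteratively refresh the training images with
    # the diffusion model, then continue NeRF optimization on them.
    dataset = dict(preprocessed_images)
    for _ in range(idu_rounds):
        for view, image in preprocessed_images.items():
            dataset[view] = idu_update(image, nerf.render_rgb(view),
                                       nerf.render_depth(view), prompt)
        for _ in range(steps_per_round):
            nerf_train_step(nerf, dataset)
    return nerf
```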
Iterative Dataset Update. Given a NeRF in which the editing target has a fixed overall geometry, Iterative Dataset Update (IDU), first proposed and shown effective in Instruct-NeRF2NeRF [4], is a training-image fine-tuning strategy that can edit the appearance and fine geometry of the target object in the NeRF. In our task, warmup training has already provided a converged coarse geometry, leaving the fine geometry and appearance to be determined. Instruct-NeRF2NeRF uses InstructPix2Pix [1] as a diffusion model strongly conditioned on the original training image; we use Stable Diffusion in a similar way to achieve a similar objective. The fine-tuned image should be conditioned on both the pre-processed image and the depth map. In detail, for pre-processed image conditioning, we adopt the same approach as in the projected image correction introduced earlier: the pre-processed image and the current NeRF rendering are blended, either in image space or in latent space, then a small amount of noise is injected into the blended image, followed by Stable Diffusion denoising from an intermediate timestep. For depth-map conditioning, we feed the depth map of the current NeRF into ControlNet to guide the editing process.
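The sketch below shows one such per-view update with image-space blending and a depth-conditioned ControlNet via Hugging Face diffusers. The model checkpoints, the blend weight `alpha`, and the denoising `strength` are illustrative assumptions, not values from the paper.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

def idu_update(preprocessed, nerf_render, nerf_depth, prompt,
               alpha=0.5, strength=0.4):
    """One IDU step for a single view: blend, partially noise, denoise."""
    # Image-space blending of the pre-processed image with the current
    # NeRF rendering (the paper also allows blending in latent space).
    blended = Image.blend(nerf_render, preprocessed, alpha)
    # strength < 1 injects only partial noise, so denoising starts from an
    # intermediate timestep and stays close to the blended image, while the
    # NeRF depth map conditions the edit through ControlNet.
    return pipe(prompt=prompt, image=blended, control_image=nerf_depth,
                strength=strength, num_inference_steps=30).images[0]
```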
Regularizers. To supervise NeRF training, we use the L1 photometric loss between the NeRF rendering and the training images. However, RGB supervision alone cannot prevent inaccurate and noisy geometry. To achieve clean converged geometry, we add the following two regularizers. First, we also render a depth value along each ray and compute the depth loss defined in [3] between the rendered depth and our pre-processed 2-layer depth, i.e., the initial inpainted geometry proxies. Supervision with our planar 2-layer depth is not exact, but it helps avoid incorrect converged geometry, such as the inpainted geometry merging with the background. Second, to suppress noisy RGB and depth renderings and “floaters” (artifacts hovering in the underlying NeRF volume), we use the LPIPS loss [40] on RGB as a regularizing term; in practice, we found LPIPS effective in reducing both noise and floaters. Since LPIPS operates on patches, the rays selected during NeRF training cannot be random but must form rectangular patches. We therefore render small image patches, corresponding to rectangular areas in the ground-truth images, during NeRF optimization. The effect of these regularizers has also been demonstrated in [31, 34].
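A minimal sketch of the combined patch loss, assuming the NeRF renders RGB and depth for one rectangular patch of rays. The loss weights are illustrative, and the plain L1 depth term is a stand-in for the depth loss of [3].

```python
import torch
import lpips

# LPIPS network; expects NCHW tensors scaled to [-1, 1].
lpips_fn = lpips.LPIPS(net="vgg").cuda()

def patch_loss(rgb, depth, gt_rgb, proxy_depth, w_depth=0.1, w_lpips=0.1):
    """Combined loss on one rendered patch.

    rgb, gt_rgb:        (H, W, 3) rendered / ground-truth RGB patch
    depth, proxy_depth: (H, W)    rendered depth / 2-layer depth proxy
    """
    # L1 photometric loss.
    loss = (rgb - gt_rgb).abs().mean()
    # Depth regularizer toward the planar 2-layer proxies, discouraging
    # geometry from merging into the background.
    loss = loss + w_depth * (depth - proxy_depth).abs().mean()
    # Patch-based LPIPS regularizer against noise and floaters; this is
    # why rays must form rectangular patches rather than random sets.
    to_nchw = lambda x: x.permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0
    loss = loss + w_lpips * lpips_fn(to_nchw(rgb), to_nchw(gt_rgb)).mean()
    return loss
```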