
3D Articulated Shape Reconstruction from Videos

by Louis Bouchard, May 20th, 2021

Too Long; Didn't Read

3D Articulated Shape Reconstruction from Videos is a new method for generating 3D models of humans or animals moving from only a short video as input. Google Research, along with Carnegie Mellon University, just published a paper called LASR. Learn more about the project below. Watch the video with Louis Bouchard at the bottom of the page. Read the full article: //www.louisbouchard.ai/3d-reconstruction-from-videos.

With LASR, you can generate 3D models of humans or animals moving from only a short video as input. This task is called 3D reconstruction, and Google Research, along with Carnegie Mellon University, just published a paper called LASR: Learning Articulated Shape Reconstruction from a Monocular Video. Learn more about the project below.

Watch the video

References

►Read the full article: //www.louisbouchard.ai/3d-reconstruction-from-videos
►Gengshan Yang et al. (2021), LASR: Learning Articulated Shape Reconstruction from a Monocular Video, CVPR

Video Transcript

How hard is it for a machine to understand an image? Researchers have made a lot of progress in image classification, image detection, and image segmentation. These three tasks iteratively deepen our understanding of what's going on in an image. In the same order, classification tells us what's in the image, detection tells us approximately where it is, and segmentation tells us precisely where it is.

Now, an even more complex step would be to represent this image in the real world. In other words, it would be to represent an object taken from an image or video as a 3D surface, just like GANverse3D can do for inanimate objects, as I showed in a recent video. This demonstrates a deep understanding of the image or video by the model, which has to represent the complete shape of an object, and that is why it is such a complex task. Even more challenging is to do the same thing on non-rigid shapes, or rather on humans and animals: objects that can be weirdly shaped and even deformed to a certain extent.

This task of generating a 3D model based on a video or images is called 3D reconstruction, and Google Research, along with Carnegie Mellon University, just published a paper called LASR: Learning Articulated Shape Reconstruction from a Monocular Video. As the name says, this is a new method for generating 3D models of humans or animals moving from only a short video as input. Indeed, it actually understands that this is an odd shape that can move, but that it still needs to stay attached, as this is still one "object" and not just many objects put together.

Typically, 3D modeling techniques needed a data prior. In this case, the data prior was an approximate shape of the complex object, which looks like this... As you can see, it had to be quite similar to the actual human or animal, which is not very intelligent. With LASR, you can produce even better results with no prior at all: it starts with just a plain sphere, whatever the object to reconstruct. You can imagine what this means for generalizability and how powerful this can be when you don't have to explicitly tell the network both what the object is and how it "typically" looks. This is a significant step forward!

But how does it work? As I said, it only needs a video, but there are still some pre-processing steps to do. Don't worry, these steps are quite well understood in computer vision. As you may recall, I mentioned image segmentation at the beginning of the video. We need this segmentation of the object, which can be obtained easily using a trained neural network. Then, we need the optical flow for each frame, which is the motion of objects between consecutive frames of the video. This is also easily found using computer vision techniques and improved with neural networks, as I covered not even a year ago on my channel.

They start the rendering process with a sphere, assuming it is a rigid object, so an object that does not have articulations. With this assumption, they iteratively optimize their model's understanding of the shape and the camera viewpoint for 20 epochs. This rigid assumption is shown here with the number of bones equal to zero, meaning that nothing can move separately. Then, we get back to real life, where the human is not rigid. Now, the goal is to have an accurate 3D model that can move realistically. This is achieved by increasing the number of bones and vertices to make the model more and more precise, as the sketch below illustrates.
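To picture this staged, coarse-to-fine optimization, here is a minimal toy sketch in PyTorch. It is not the authors' code: apart from the 20 epochs of rigid initialization mentioned above, the stage sizes, the random-points-on-a-sphere initialization, the jittered point duplication standing in for mesh subdivision, and the placeholder loss are all made up for illustration.

import torch

def sphere_points(n):
    # The only "prior" is a plain sphere: random points on the unit sphere.
    p = torch.randn(n, 3)
    return p / p.norm(dim=1, keepdim=True)

# Stage schedule: (num_bones, num_vertices, epochs). Stage 0 is fully rigid.
STAGES = [(0, 160, 20), (4, 320, 20), (8, 640, 20), (16, 1280, 20)]

def placeholder_loss(verts, camera):
    # Stand-in for the real silhouette and optical-flow rendering losses.
    return ((verts.norm(dim=1) - 1.0) ** 2).mean() + (camera ** 2).mean()

verts = sphere_points(STAGES[0][1])
camera = torch.zeros(6)  # toy camera: 3 rotation + 3 translation parameters

for num_bones, num_verts, epochs in STAGES:
    if verts.shape[0] < num_verts:
        # Toy stand-in for mesh subdivision: duplicate points with jitter.
        reps = num_verts // verts.shape[0]
        verts = verts.repeat(reps, 1) + 0.01 * torch.randn(num_verts, 3)
    verts = verts.detach().requires_grad_(True)
    camera = camera.detach().requires_grad_(True)
    # Per-bone rigid transforms would also be optimized once num_bones > 0.
    opt = torch.optim.Adam([verts, camera], lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss = placeholder_loss(verts, camera)
        loss.backward()
        opt.step()
    print(f"stage: bones={num_bones} verts={verts.shape[0]} loss={loss.item():.4f}")

The point the schedule captures is that each stage inherits the previous stage's solution, so the model never has to fit a fine articulated shape from scratch.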
Here, the vertices are 3-dimensional points where the lines and volumes of the rendered object connect, and the bones are, well, basically bones: all the parts of the object that move during the video, with either translations or rotations. Both the bones and vertices are incrementally augmented until we reach stage 3, where the model has learned to generate a pretty accurate render of the current object.

They also need a model to render this object, which is called a differentiable renderer. I won't dive into how it works as I already covered it in previous videos, but basically, it is a model able to produce an image from a 3-dimensional representation of an object. It has the particularity of being differentiable, meaning that you can train this model in a similar way to a typical neural network, with back-propagation.

Here, everything is trained together, optimizing the results following the four stages we just saw and improving the rendered result at each stage. The model then learns just like any other machine learning model, using gradient descent and updating the model's parameters based on the difference between the rendered output and the ground-truth video measurements. So it doesn't even need to see a ground-truth version of the rendered object. It only needs the video, segmentation, and optical-flow results to learn, by transforming the rendered object back into a segmented image and its optical flow and comparing them to the input (a rough sketch of this objective follows after the transcript).

What is even better is that all this is done in a self-supervised learning process, meaning that you give the model the videos to train on with their corresponding segmentation and optical-flow results, and it iteratively learns to render the objects during training. No annotations are needed at all! And voilà, you have your complex 3D renderer without any special training or ground truth needed!

If gradient descent, epoch, parameters, or self-supervised learning are still unclear concepts to you, I invite you to watch the series of short videos I made explaining the basics of machine learning. As always, the full article is available on my website louisbouchard.ai, with many other great papers explained and more information. Thank you for watching.
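To make the self-supervised objective above concrete, here is a minimal sketch of what such a loss could look like in PyTorch. The function name, shapes, and weights are illustrative stand-ins, not LASR's actual implementation; in the paper's pipeline the predicted silhouette and flow would come from the differentiable renderer, and the observed ones from the pre-trained segmentation and optical-flow networks.

import torch
import torch.nn.functional as F

def self_supervised_loss(pred_sil, pred_flow, obs_sil, obs_flow,
                         w_sil=1.0, w_flow=1.0):
    # Silhouette term: rendered mask vs. the frame's segmentation mask.
    sil_loss = F.mse_loss(pred_sil, obs_sil)
    # Flow term: motion of the rendered surface vs. measured optical flow,
    # penalized only where the object is actually observed.
    mask = obs_sil.unsqueeze(0)  # (1, H, W), broadcasts over the x/y channels
    flow_loss = ((pred_flow - obs_flow) ** 2 * mask).mean()
    return w_sil * sil_loss + w_flow * flow_loss

# Toy usage: gradients flow back to whatever produced the rendered outputs.
H, W = 64, 64
pred_sil = torch.rand(H, W, requires_grad=True)      # stand-in renderer output
pred_flow = torch.randn(2, H, W, requires_grad=True)
loss = self_supervised_loss(pred_sil, pred_flow,
                            torch.rand(H, W), torch.randn(2, H, W))
loss.backward()  # in LASR, this step would update shape, bones, and camera

Note that nothing here is a 3D label: both terms compare 2D quantities derived from the render against 2D quantities measured on the input video, which is exactly why no annotations are needed.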


