Skeleton-based model consists of a set of joints (keypoints) like ankles, knees, shoulders, elbows, wrists, and limb orientations comprising the skeletal structure of a human body. This model is used both in 2D and 3D human pose estimation techniques because of its flexibility.
Contour-based model consists of the contour and rough width of the body torso and limbs, where body parts are presented with boundaries and rectangles of a person’s silhouette.
Volume-based model consists of 3D human body shapes and poses represented by volume-based models with geometric meshes and shapes, normally captured with 3D scans.
Here, I am talking about skeleton-based models, which may be detected from a 2D or 3D perspective.
2D pose estimation is based on the detection and analysis of X, Y coordinates of human body joints from an RGB image.
3D pose estimation is based on the detection and analysis of X, Y, Z coordinates of human body joints from an RGB image.
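To make the difference between the two perspectives concrete, here is a minimal sketch that computes the angle at a joint from keypoint coordinates. The function name `joint_angle` and the sample hip, knee, and ankle positions are made up for illustration; they are not taken from any particular pose estimation library.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by the points a-b-c (works in 2D or 3D)."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip to guard against rounding slightly outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical keypoints for one leg: the knee is bent towards the camera (Z axis)
hip_3d, knee_3d, ankle_3d = (0.0, 1.0, 0.0), (0.0, 0.5, 0.3), (0.0, 0.0, 0.0)

# The 2D view keeps only X and Y, so the depth offset of the knee is lost
hip_2d, knee_2d, ankle_2d = hip_3d[:2], knee_3d[:2], ankle_3d[:2]

angle_2d = joint_angle(hip_2d, knee_2d, ankle_2d)  # leg looks straight in the image plane
angle_3d = joint_angle(hip_3d, knee_3d, ankle_3d)  # the Z coordinate exposes the bend
```

In this made-up case the 2D angle reports a straight leg, while the 3D angle shows a clear bend at the knee.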
When speaking about fitness applications involving human pose estimation, it’s better to use 3D estimation, since it analyzes human poses during physical activities more accurately. Talking about AI fitness coach apps, the common flow is described step by step below.
We analyzed existing models to find the most suitable choice for fitness app purposes. As input, the chosen model takes a set of detected 2D keypoints, produced by a 2D detector pre-trained on the COCO 2017 dataset. To predict the current position of a joint accurately, it processes visual data from several frames captured at different moments in time.
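The idea of feeding several frames into the prediction for one frame can be sketched as a sliding window over the sequence of 2D keypoints. This is only an illustration of the input layout, not the model's actual code: the function name `temporal_windows`, the window size of 9, and the edge padding are assumptions. The 17 joints match the COCO keypoint format.

```python
import numpy as np

def temporal_windows(keypoints_2d, window=9):
    """Build one window of neighbouring frames around every frame.

    keypoints_2d: array of shape (T, J, 2) with X, Y per joint and frame.
    Returns an array of shape (T, window, J, 2).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centred on a frame")
    half = window // 2
    # Repeat the first and last frame so border frames also get a full window
    padded = np.pad(keypoints_2d, ((half, half), (0, 0), (0, 0)), mode="edge")
    return np.stack([padded[t:t + window] for t in range(keypoints_2d.shape[0])])

# Synthetic stand-in for detector output: 30 frames, 17 COCO joints, X and Y
seq = np.random.default_rng(0).random((30, 17, 2))
windows = temporal_windows(seq, window=9)
```

Each `windows[t]` holds the frames around frame `t`, which is the kind of input a temporal model could use to estimate the 3D pose at that frame.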
One more way is to ask the user to indicate the start and the end of the exercise performance manually.
2. Detecting 2D and 3D keypoints on the user’s body
3. Decomposing the exercise into phases
Once the positions of keypoints (joints) are extracted, they should be compared with the positions in the reference video. However, we cannot make a direct comparison, because the exercise speed and the total number of repetitions in the input and reference videos may differ.
These discrepancies can be resolved by decomposing an exercise into phases. This is illustrated in the image below, where the squatting exercise is decomposed into two primary phases: squatting down and squatting up.
Photo source: stronglifts.com
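As a rough sketch of the squat example, the two phases can be separated by looking at whether the hips are moving down or up. The code assumes a per-frame hip height where larger values mean higher (in image coordinates the Y axis usually points down, so the sign would have to be flipped). The function names and the `min_change` threshold are hypothetical.

```python
import numpy as np

def squat_phases(hip_height, min_change=1e-3):
    """Label each frame 'down', 'up', or 'hold' from the per-frame hip height."""
    h = np.asarray(hip_height, dtype=float)
    velocity = np.gradient(h)  # change of hip height per frame
    labels = np.where(velocity < -min_change, "down",
                      np.where(velocity > min_change, "up", "hold"))
    return labels.tolist()

def split_segments(labels):
    """Group consecutive equal labels into (label, start, end) with end exclusive."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments

# Made-up hip heights for one squat: going down, then coming back up
hip = [1.0, 0.8, 0.6, 0.4, 0.6, 0.8, 1.0]
labels = squat_phases(hip)
segments = split_segments(labels)
```

Real keypoints are noisy, so a practical version would likely smooth the signal before taking the gradient.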
The decomposition can be done by analyzing the keypoints detected in the input video frame by frame and comparing them by certain criteria with the keypoints from the reference video.
4. Searching for common mistakes
When 3D keypoints and the phases of an exercise are detected, it’s time to detect common mistakes in the exercise technique in the input video. For example, in squatting, we can detect moments when the legs are bent (not straight) and the knees are closer to the center of the torso than the feet.
5. Comparing the input video frames with the reference ones
Here we should take a reference video, where the exercise is performed correctly, split it into phases, and detect keypoints in each frame. When the keypoints are detected and the exercise phases are defined in both the input and reference videos, we can compare each phase of the exercise as performed by the user and by a professional athlete.
The step-by-step flow looks as follows:
a. Slow down or accelerate the reference video to match the speed of the input one.
b. Align the skeleton models of the user and the professional athlete so that their rotation angles and origins match.
c. Normalize the size of both skeletons, since the reference and input videos can be captured from different distances.
d. Compare keypoints frame by frame and detect motion inconsistencies.
e. Repeat the flow separately for different groups of joints (e.g. feet position, knee position, hands and elbows position, etc.).
6. Displaying results and generating recommendations for the user
When the whole analysis cycle is completed, the user gets results displayed in different formats. For example, the output may include interactive 3D reconstructions with mistake hints, so that the user can zoom in and out, go back and forward, or pause at a specific moment.
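Steps a to d of the comparison flow from step 5 can be sketched as follows. This is a simplified illustration under several assumptions, not the production pipeline: sequences are arrays of shape (frames, joints, 3), the Y axis is vertical, rotation is aligned only around that vertical axis using the hip line in the first frame, and the joint indices (`root`, `left_hip`, `right_hip`) are hypothetical.

```python
import numpy as np

def resample(seq, n_frames):
    """Step a: linearly resample a (T, J, 3) sequence to n_frames frames."""
    t_old = np.linspace(0.0, 1.0, seq.shape[0])
    t_new = np.linspace(0.0, 1.0, n_frames)
    flat = seq.reshape(seq.shape[0], -1)
    cols = [np.interp(t_new, t_old, flat[:, k]) for k in range(flat.shape[1])]
    return np.stack(cols, axis=1).reshape(n_frames, *seq.shape[1:])

def center_on_root(seq, root):
    """Step b (origin): move the root joint to the origin in every frame."""
    return seq - seq[:, root:root + 1, :]

def align_yaw(seq, left_hip, right_hip):
    """Step b (rotation): rotate around the vertical Y axis so that the hip
    line in the first frame points along +X."""
    v = seq[0, right_hip] - seq[0, left_hip]
    theta = np.arctan2(v[2], v[0])
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return seq @ rot.T

def scale_to_unit(seq, a, b):
    """Step c: scale so the mean distance between joints a and b equals 1."""
    length = np.linalg.norm(seq[:, a] - seq[:, b], axis=-1).mean()
    return seq / length

def compare(user, reference, root=0, left_hip=1, right_hip=2):
    """Steps a-d: returns the per-frame, per-joint distance, shape (T_user, J)."""
    reference = resample(reference, user.shape[0])
    prepared = []
    for seq in (user, reference):
        seq = center_on_root(seq, root)
        seq = align_yaw(seq, left_hip, right_hip)
        prepared.append(scale_to_unit(seq, left_hip, right_hip))
    return np.linalg.norm(prepared[0] - prepared[1], axis=-1)
```

Step e then amounts to looking at selected columns of the returned array, for example only the knee joints. Linear resampling assumes both videos cover the same phase from start to end, which is why the phase decomposition comes first.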
It is also possible to collect and display movement statistics such as the number of repetitions, the average speed and duration of one repetition, and others.
Visually, the 3D human pose estimation system based on videos looks as follows:
Photo sources: stronglifts.com,
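The movement statistics mentioned in step 6 can be derived from the same keypoints. Below is a minimal sketch that counts squat repetitions and measures their duration from a per-frame hip height, using two thresholds so that small dips are not counted. The function name, the threshold values, and the sample data are assumptions made for illustration.

```python
def repetition_stats(hip_height, fps, low, high):
    """Count repetitions in a hip-height signal using two thresholds.

    A repetition starts when the height drops below `high`, must reach
    `low` or lower, and ends when the height is back at `high` or above.
    """
    reps = []              # (start_frame, end_frame) of each repetition
    start = None           # frame where the current descent began
    reached_bottom = False
    for i, h in enumerate(hip_height):
        if start is None and h < high:
            start = i
        if start is not None:
            if h <= low:
                reached_bottom = True
            elif h >= high:
                if reached_bottom:
                    reps.append((start, i))
                # A dip that never reached `low` is discarded here
                start, reached_bottom = None, False
    durations = [(e - s) / fps for s, e in reps]
    mean = sum(durations) / len(durations) if durations else 0.0
    return {"count": len(reps), "durations_s": durations, "mean_duration_s": mean}

# Made-up signal: one full squat, one shallow dip, one more full squat
heights = [1.0, 0.7, 0.4, 0.7, 1.0, 0.8, 1.0, 0.7, 0.4, 0.4, 0.7, 1.0]
stats = repetition_stats(heights, fps=10, low=0.5, high=0.9)
```

With these sample values the shallow dip in the middle is ignored and two repetitions are counted.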
In this article, I described how a 3D human pose estimation system works from the perspective of AI fitness coach app development, because this example illustrates the process well. But please note that the flow might change depending on business requirements or other factors.
Written by Maksym Tatariants, Data Science Engineer. This article is based on our technology research and experience providing software development services.