Much of the field has focused on estimating 2D joints, 3D joints, or the skeleton of the body. We focus on estimating the full 3D shape and pose. This is crucial for reasoning about interactions. Having the ability to do so from RGB images enables markerless motion capture and provides the foundation for human behavior analysis. We explore two strategies: classical top-down model fitting and feed-forward regression.
SMPLify [ ] combines bottom-up 2D feature detection with top-down 3D model fitting. The shape and 3D pose of a person are estimated by minimizing the error between the projected 3D joints of the SMPL model and 2D detected landmarks. Unite the People [ ] adds a new loss term, creates a pseudo ground-truth dataset and trains discriminative models for detailed 2D landmark detection and 3D pose estimation. The whole process is repeated multiple times to refine the results and increase the quantity and quality of available data. In [ ] the optimization pipeline of SMPLify is extended to handle multi-view imagery and video. Using temporal information helps resolve left/right ambiguities while giving better estimates of global orientation and body shape.
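To make the fitting objective concrete, the following is a minimal sketch of a reprojection error under stated assumptions, not the implementation of any of the cited methods. It assumes a pinhole camera with a single focal length and known principal point, 3D joints already expressed in camera coordinates, and a plain confidence-weighted squared distance. The names `project` and `reprojection_error` are illustrative. In a fitting method the 3D joints would be produced by the SMPL model from the shape and pose parameters, and an optimizer would minimize this term together with additional priors, which are omitted here.

```python
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(joint: Point3D, focal: float, center: Point2D) -> Point2D:
    """Pinhole projection of a camera-space 3D point (z > 0) to pixel coordinates."""
    x, y, z = joint
    return (focal * x / z + center[0], focal * y / z + center[1])


def reprojection_error(
    joints_3d: Sequence[Point3D],
    landmarks_2d: Sequence[Point2D],
    confidences: Sequence[float],
    focal: float,
    center: Point2D,
) -> float:
    """Confidence-weighted sum of squared distances between projected joints
    and detected 2D landmarks (assumed form; no robust penalty, no priors)."""
    total = 0.0
    for joint, target, weight in zip(joints_3d, landmarks_2d, confidences):
        u, v = project(joint, focal, center)
        total += weight * ((u - target[0]) ** 2 + (v - target[1]) ** 2)
    return total
```

Weighting each term by the detector confidence lets unreliable or occluded landmarks contribute less to the fit.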
Human Mesh Recovery [ ] learns to regress the shape and the 3D pose directly from a single RGB image by minimizing the reprojection error of 3D SMPL keypoints during training. This alone is not sufficient, since many implausible 3D bodies can explain the same 2D keypoints, so an adversarial loss is added that forces the model to produce SMPL parameters that a discriminator cannot distinguish from real ones drawn from a database of 3D human meshes. An advantage of this approach is that it can be trained without any expensive paired 2D-to-3D data.
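The interplay between the two training signals can be sketched as follows. This is an illustrative example under assumptions, not the cited method's code: it assumes a least-squares adversarial formulation in which the discriminator outputs a scalar score per sample, with 1 meaning "real", and a single hypothetical weight `reproj_weight` balancing the two terms. The reprojection loss is passed in as a precomputed number, and all function names are made up for the illustration.

```python
from typing import Sequence


def generator_adversarial_loss(disc_scores: Sequence[float]) -> float:
    """Regressor-side loss: push discriminator scores on predicted
    SMPL parameters toward 1 ("real"). Assumed least-squares form."""
    return sum((s - 1.0) ** 2 for s in disc_scores) / len(disc_scores)


def discriminator_loss(
    real_scores: Sequence[float], fake_scores: Sequence[float]
) -> float:
    """Discriminator-side loss: score real parameters as 1 and
    regressed (fake) parameters as 0."""
    real = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real + fake


def total_regressor_loss(
    reproj_loss: float, disc_scores: Sequence[float], reproj_weight: float = 1.0
) -> float:
    """Weighted reprojection term plus adversarial term (weight is hypothetical)."""
    return reproj_weight * reproj_loss + generator_adversarial_loss(disc_scores)
```

Because the discriminator only ever sees SMPL parameters, the real samples can come from an unpaired collection of 3D meshes, which is what removes the need for paired 2D-to-3D supervision.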
In Neural Body Fitting [ ] the shape and 3D pose parameters of SMPL are regressed from body part segmentations given by an intermediate network. Since the whole pipeline is differentiable, different types of supervision can be used, depending on the available information. Extensive experiments show that body part segmentation is a good intermediate representation for lifting to 3D, and that competitive performance can be achieved with limited paired 2D-to-3D data.