Estimating the 3D pose and shape of a person is a crucial step in reasoning about interactions. The ability to do so from RGB images avoids the need for expensive MoCap setups or specialized instrumentation.
In SMPLify [ ], the shape and 3D pose of a person are estimated by minimizing the error between the projected 3D joints of the SMPL model and 2D detected landmarks. Unite the People [ ] adds a new loss term, creates a pseudo ground-truth dataset, and trains discriminative models for detailed 2D landmark detection and 3D pose estimation. The whole process is repeated multiple times to refine the results and increase the quantity and quality of the available data. In [ ], the optimization pipeline of SMPLify is extended to handle multi-view videos; using temporal information helps resolve left/right ambiguities and yields better estimates of the global orientation and shape of the body.
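As a rough illustration of this optimization-based fitting, the sketch below minimizes a confidence-weighted 2D reprojection objective over a parameter vector. The linear joint regressor, focal length, and detections are toy placeholders standing in for the SMPL model, the camera, and a real landmark detector; they are not part of any cited method.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N_JOINTS, N_PARAMS = 24, 82           # e.g. 72 pose + 10 shape parameters
# Hypothetical linear stand-in for the SMPL joint regressor.
W = rng.normal(size=(N_JOINTS * 3, N_PARAMS)) * 0.01

def joints_3d(params):
    return (W @ params).reshape(N_JOINTS, 3)

def project(joints, f=1000.0):
    # Simple perspective projection with focal length f; joints are
    # pushed in front of the camera for this toy example.
    z = joints[:, 2:3] + 5.0
    return f * joints[:, :2] / z

def reprojection_loss(params, joints_2d, conf):
    # Confidence-weighted squared error between projected model joints
    # and detected 2D landmarks (the data term of a SMPLify-style fit).
    residual = project(joints_3d(params)) - joints_2d
    return np.sum(conf[:, None] * residual ** 2)

# Fake detections and confidences, for illustration only.
joints_2d = rng.normal(size=(N_JOINTS, 2))
conf = rng.uniform(0.5, 1.0, size=N_JOINTS)

result = minimize(reprojection_loss, x0=np.zeros(N_PARAMS),
                  args=(joints_2d, conf), method="L-BFGS-B")
print("final loss:", result.fun)
```

A full system would add the priors on pose and shape that keep such a fit from collapsing to implausible bodies; only the data term is sketched here.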
Human Mesh Recovery [ ] regresses the shape and 3D pose directly from a single RGB image by minimizing the reprojection error of the 3D SMPL keypoints. To further constrain the problem, an adversarial loss is employed, which forces the model to produce SMPL parameters that a discriminator cannot distinguish from real ones drawn from a database of 3D human meshes. An advantage of this approach is that it can be trained without any expensive paired 2D-to-3D data.
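The following sketch shows how such a regression objective can be combined with a least-squares adversarial term. The networks, feature dimensions, and the linear "projection" are illustrative stand-ins, not the architecture of [ ], which uses an iterative regressor on CNN features and a factorized discriminator over SMPL parameters.

```python
import torch
import torch.nn as nn

N_PARAMS = 85                         # e.g. 72 pose + 10 shape + 3 camera
N_KPTS = 24

regressor = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
                          nn.Linear(512, N_PARAMS))
discriminator = nn.Sequential(nn.Linear(N_PARAMS, 256), nn.ReLU(),
                              nn.Linear(256, 1))
project = nn.Linear(N_PARAMS, 2 * N_KPTS)   # placeholder for SMPL + camera

features = torch.randn(8, 2048)             # stand-in for encoder features
keypoints_2d = torch.randn(8, 2 * N_KPTS)   # stand-in for 2D detections
real_params = torch.randn(8, N_PARAMS)      # stand-in for database samples

pred = regressor(features)

# Data term: 2D reprojection error of the predicted model keypoints.
loss_reproj = ((project(pred) - keypoints_2d) ** 2).mean()

# Adversarial term (least-squares GAN): the regressor is rewarded when
# the discriminator scores its outputs as "real" parameter vectors.
loss_adv = ((discriminator(pred) - 1) ** 2).mean()
loss_regressor = loss_reproj + loss_adv

# Discriminator: push real database parameters toward 1, predictions toward 0.
loss_disc = ((discriminator(real_params) - 1) ** 2).mean() + \
            (discriminator(pred.detach()) ** 2).mean()
```

Because the adversarial term only requires unpaired samples of plausible parameters, no image needs a 3D annotation for this prior to take effect.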
In Neural Body Fitting [ ], the shape and 3D pose parameters of SMPL are regressed from body-part segmentations produced by an intermediate network. Since the whole pipeline is differentiable, different types of supervision can be used, depending on the available information. Extensive experiments show that the body-part segmentation is a better representation for lifting to 3D, and that competitive performance can be achieved with limited paired 2D-to-3D data.
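A minimal sketch of this two-stage idea, assuming placeholder architectures, resolutions, and random targets: an image is mapped to a soft body-part segmentation, which a second network lifts to parameters, so losses can be attached at either stage depending on what annotations exist.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_PARTS, N_PARAMS = 12, 82            # toy part count and parameter size

seg_net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, N_PARTS, 1))
param_net = nn.Sequential(nn.Flatten(),
                          nn.Linear(N_PARTS * 64 * 64, 256), nn.ReLU(),
                          nn.Linear(256, N_PARAMS))

image = torch.randn(4, 3, 64, 64)
part_logits = seg_net(image)                  # intermediate representation
parts = torch.softmax(part_logits, dim=1)     # soft part segmentation
params = param_net(parts)                     # regressed parameters

# Both stages are differentiable, so supervision can be applied at the
# segmentation level, the parameter level, or both (random targets here).
seg_target = torch.randint(0, N_PARTS, (4, 64, 64))
param_target = torch.randn(4, N_PARAMS)
loss = F.cross_entropy(part_logits, seg_target) \
     + F.mse_loss(params, param_target)
loss.backward()
```

Detaching the segmentation from appearance details is what makes the intermediate representation attractive: the lifting network never sees raw pixels, only part labels.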