An expressive model of human motion is essential for action classification, motion prediction and synthesis. To that end, we are exploring several deep network architectures to predict human movement.
Current methods for motion prediction typically do not generalize across a wide range of actions and suffer from "regression to the mean". We show that, surprisingly, state-of-the-art performance can be achieved by a simple baseline that does not model motion at all. Investigating why, we propose three changes to the standard RNN models used for human motion, resulting in a simple and scalable RNN architecture that achieves state-of-the-art performance on human motion prediction [ ].
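One ingredient in this line of work is a residual connection that makes the recurrent decoder predict velocities rather than absolute poses, so that a zero output falls back to the motion-free baseline above. The sketch below illustrates that idea in numpy; the layer sizes, random weights, and single-layer cell are illustrative assumptions, not the published model.

```python
import numpy as np

rng = np.random.default_rng(0)
POSE_DIM, HIDDEN = 54, 64  # e.g. 18 joints x 3 parameters (illustrative sizes)

# Hypothetical, untrained parameters of a single-layer recurrent cell.
W_ih = rng.normal(0, 0.1, (HIDDEN, POSE_DIM))
W_hh = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
W_out = rng.normal(0, 0.1, (POSE_DIM, HIDDEN))

def predict(past_poses, horizon):
    """Encode observed frames, then feed predictions back in to decode.

    The residual connection (pose + delta) means the network models
    velocities: a zero output reproduces the last observed pose.
    """
    h = np.zeros(HIDDEN)
    pose = past_poses[-1]
    for p in past_poses:                 # encode the observed motion
        h = np.tanh(W_ih @ p + W_hh @ h)
    preds = []
    for _ in range(horizon):             # decode future frames autoregressively
        delta = W_out @ h                # predicted velocity
        pose = pose + delta              # residual connection
        h = np.tanh(W_ih @ pose + W_hh @ h)
        preds.append(pose)
    return np.stack(preds)

past = rng.normal(size=(10, POSE_DIM))
future = predict(past, horizon=25)
print(future.shape)  # (25, 54)
```

Because the decoder consumes its own predictions during training as well, the model sees the same distribution of inputs at training and test time.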
We have also shown that a simple encoder/decoder architecture that takes a set of past poses and predicts a set of future poses works well and is simpler than RNN models. By forcing the encoding through a bottleneck, the approach learns features of human movement that are useful for action recognition. Our feed-forward networks outperform recurrent approaches for short- and long-term predictions and generalize to novel subjects and actions [ ].
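The feed-forward alternative can be sketched even more compactly: a window of past poses is flattened, squeezed through a low-dimensional bottleneck, and decoded into all future frames in one shot. All sizes and weights below are illustrative assumptions; a trained model would learn them from motion data.

```python
import numpy as np

rng = np.random.default_rng(1)
POSE_DIM, PAST, FUTURE, BOTTLENECK = 54, 10, 25, 32  # illustrative sizes

# Hypothetical, untrained weights; training would minimize a pose loss.
W_enc = rng.normal(0, 0.05, (BOTTLENECK, PAST * POSE_DIM))
W_dec = rng.normal(0, 0.05, (FUTURE * POSE_DIM, BOTTLENECK))

def predict(past_poses):
    """Map a window of past poses to a window of future poses in one shot."""
    z = np.tanh(W_enc @ past_poses.reshape(-1))   # bottleneck code: a compact
                                                  # summary of the motion
    return (W_dec @ z).reshape(FUTURE, POSE_DIM)  # decode all frames at once

past = rng.normal(size=(PAST, POSE_DIM))
future = predict(past)
print(future.shape)  # (25, 54)
```

The bottleneck code `z` is what makes the learned features reusable: it must summarize the movement compactly, which is why it also serves action recognition.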
We have worked on several methods to estimate 3D pose from 2D joint locations. We show that this can be solved with a very simple network that outperforms previous, more complex, methods by a substantial margin. This suggests that "lifting" from 2D to 3D is not the truly hard problem; rather, extracting the relevant information from the 2D image is the key [ ].
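A minimal sketch of such a lifting network, assuming a small fully-connected architecture with one residual block; the joint count, width, and random weights are illustrative, and the real network would be trained on pairs of 2D detections and 3D poses.

```python
import numpy as np

rng = np.random.default_rng(2)
N_JOINTS = 16                       # illustrative joint count
IN, OUT, WIDTH = 2 * N_JOINTS, 3 * N_JOINTS, 128

# Hypothetical, untrained weights of a small fully-connected network.
W_in = rng.normal(0, 0.05, (WIDTH, IN))
W1 = rng.normal(0, 0.05, (WIDTH, WIDTH))
W2 = rng.normal(0, 0.05, (WIDTH, WIDTH))
W_out = rng.normal(0, 0.05, (OUT, WIDTH))

def relu(x):
    return np.maximum(x, 0.0)

def lift(joints_2d):
    """'Lift' 2D joint positions (N_JOINTS x 2) to a 3D pose (N_JOINTS x 3)."""
    h = relu(W_in @ joints_2d.reshape(-1))
    h = h + relu(W2 @ relu(W1 @ h))      # residual block
    return (W_out @ h).reshape(N_JOINTS, 3)

pose_3d = lift(rng.normal(size=(N_JOINTS, 2)))
print(pose_3d.shape)  # (16, 3)
```

The point of the exercise is the input/output interface: given good 2D detections, a network this small already suffices, which localizes the difficulty in the image-to-2D stage.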
Neural networks, however, may not generalize to scenarios they have never seen -- imagine someone floating in zero gravity. Hence we also explore physics-based controllers of human movement [ ]. We envision a future that combines the best of both approaches: learned models of behavior constrained by the physics of environmental interaction.
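At their core, such controllers compute joint torques rather than poses. A toy example, assuming a single 1-DoF joint driven by a PD (spring-damper) law toward a target angle; the gains, inertia, and target are illustrative, and real humanoid controllers apply this per joint under contact and balance constraints.

```python
# A toy PD controller driving one 1-DoF joint toward a target angle.
# All constants are illustrative assumptions, not tuned values.
KP, KD, INERTIA, DT = 50.0, 10.0, 1.0, 0.01

def step(theta, omega, target):
    torque = KP * (target - theta) - KD * omega   # spring-damper toward target
    omega += (torque / INERTIA) * DT              # integrate the dynamics
    theta += omega * DT                           # (semi-implicit Euler)
    return theta, omega

theta, omega = 0.0, 0.0
for _ in range(2000):                             # simulate 20 seconds
    theta, omega = step(theta, omega, target=1.0)
print(round(theta, 3))  # converges to the target angle, 1.0
```

Unlike a purely learned predictor, the controller's behavior in unseen conditions (different gravity, added loads) follows from the simulated physics rather than from training data.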