The amount of digital video content available on sites such as YouTube is growing daily. Recent statistics on the YouTube website show that around 48 hours of video are uploaded every minute. This massive data production calls for automatic analysis.
In this talk we present some recent results for action recognition in videos. Bag-of-features representations have shown very good performance on this task. We briefly review the underlying principles and introduce trajectory-based video features, which have been shown to outperform the state of the art. These trajectory features are obtained by dense point sampling and by tracking the sampled points using displacement information from a dense optical flow field. The trajectories are described with motion boundary histograms, which are robust to camera motion. We then show how to integrate temporal structure into a bag-of-features representation with an actom sequence model. Actom sequence models localize actions based on sequences of atomic actions (actoms), i.e., they represent the temporal structure by sequences of histograms of actom-anchored visual features. This representation is flexible, sparse and discriminative. The resulting actom sequence model is shown to significantly improve performance over existing methods for temporal action localization.
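
The trajectory pipeline described above can be illustrated with a minimal NumPy sketch. It is not the implementation presented in the talk: the function names, the grid step, the nearest-pixel flow lookup, the normalization of displacements and the number of orientation bins are all assumptions made for illustration, and the flow fields are taken as given arrays rather than computed from video.

```python
import numpy as np

def sample_grid(height, width, step):
    """Dense point sampling: one (x, y) point every `step` pixels (assumed layout)."""
    ys, xs = np.mgrid[step // 2:height:step, step // 2:width:step]
    return np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)

def track_points(points, flows):
    """Track points through a list of dense flow fields.

    Each flow has shape (H, W, 2) with flow[y, x] = (dx, dy).
    Returns trajectories of shape (N, T + 1, 2).
    """
    h, w = flows[0].shape[:2]
    current = points.copy()
    trajectory = [current.copy()]
    for flow in flows:
        # Nearest-pixel lookup of the displacement (a simplification).
        xi = np.clip(np.rint(current[:, 0]).astype(int), 0, w - 1)
        yi = np.clip(np.rint(current[:, 1]).astype(int), 0, h - 1)
        current = current + flow[yi, xi]
        trajectory.append(current.copy())
    return np.stack(trajectory, axis=1)

def trajectory_shape_descriptor(trajectories):
    """Displacement vectors, normalized by total path length (assumed normalization)."""
    disp = np.diff(trajectories, axis=1)
    length = np.maximum(np.linalg.norm(disp, axis=2).sum(axis=1), 1e-8)
    return (disp / length[:, None, None]).reshape(len(trajectories), -1)

def motion_boundary_histogram(flow_component, n_bins=8):
    """Orientation histogram of the spatial gradient of one flow component."""
    gy, gx = np.gradient(flow_component)
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    bins = np.minimum((angle / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

# Toy input: three frames of constant flow, i.e. a pure camera translation.
points = sample_grid(8, 8, 4)
flows = [np.tile(np.array([1.0, 0.0]), (8, 8, 1)) for _ in range(3)]
trajectories = track_points(points, flows)
descriptor = trajectory_shape_descriptor(trajectories)
mbh_x = motion_boundary_histogram(flows[0][..., 0])
```

Because the histogram is built from spatial gradients of the flow, a constant flow field such as the toy translation above yields an all-zero histogram, which illustrates why such a descriptor is insensitive to uniform camera motion.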
Finally, we show how to move towards more structured representations by explicitly modeling human-object interactions. We learn how to represent human actions as interactions between persons and objects. We localize in space and track over time both the object and the person, and represent an action as the trajectory of the object relative to the position of the person. This is joint work with A. Gaidon, V. Ferrari, Z. Harchaoui, A. Klaeser, A. Prest, H. Wang.
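
As a rough illustration of the relative-trajectory idea for human-object interactions, the sketch below computes the position of an object with respect to a person from per-frame bounding boxes. The box format (x1, y1, x2, y2), the use of box centers and the normalization by person height are assumptions made for this example, not details taken from the talk.

```python
import numpy as np

def box_centers(boxes):
    """Centers of boxes given as rows of (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=np.float64)
    return np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                     (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)

def relative_trajectory(person_boxes, object_boxes):
    """Per-frame offset of the object center from the person center.

    Offsets are divided by the person height so that the result does not
    depend on how large the person appears (assumed normalization).
    """
    person_boxes = np.asarray(person_boxes, dtype=np.float64)
    heights = person_boxes[:, 3] - person_boxes[:, 1]
    offsets = box_centers(object_boxes) - box_centers(person_boxes)
    return offsets / heights[:, None]

# Toy tracks over two frames: the person moves right, the object moves less.
person_track = [[0, 0, 10, 20], [10, 0, 20, 20]]
object_track = [[10, 5, 14, 9], [18, 5, 22, 9]]
relative = relative_trajectory(person_track, object_track)
```

In this toy example the object drifts towards the person horizontally while keeping the same relative height, which is the kind of signal a relative trajectory is meant to expose independently of where the pair appears in the frame.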