We, first, address the problems of large scale image classification. We present and evaluate different ways of aggregating local image descriptors into a vector and show that the Fisher kernel achieves better performance than the reference bag-of-visual words approach for any given vector dimension. We show and interpret the importance of an appropriate vector normalization.
Furthermore, we discuss how to learn given a large number of classes and images with stochastic gradient descent and show results on ImageNet10k. We, then, present a weakly supervised approach for learning human actions modeled as interactions between humans and objects.
Our approach is human-centric: we first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated (only) with the action label.
Finally, we present work on learning object detectors from realworld web videos known only to contain objects of a target class. We propose a fully automatic pipeline that localizes objects in a set of videos of the class and learns a detector for it. The approach extracts candidate spatio-temporal tubes based on motion segmentation and then selects one tube per video jointly over all videos.