I am a PhD student at the University of Bonn and the Max Planck Institute for Intelligent Systems in Tübingen, working with Dr. Jürgen Gall on object detection.
Detecting objects in images is a crucial aspect of computer vision, and my work focuses on modeling household objects during their interaction with humans. This problem is challenging due to low resolution, occlusion, and articulation. However, the presence of a human is a useful cue that can be exploited to solve the detection problem efficiently.
International Journal of Computer Vision (IJCV), 118(2):172-193, June 2016 (article)
Hand motion capture is a popular research field that has recently gained more attention due to the ubiquity of RGB-D sensors. However, even the most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error, and with collision detection and physics simulation to achieve physically plausible estimates even in the case of occlusions and missing visual data. Since all components are unified in a single objective function that is almost everywhere differentiable, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.
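The key property of the framework above is that the data term and the collision penalty are combined in one objective that is almost everywhere differentiable, so off-the-shelf optimizers apply. The following toy sketch illustrates that idea on a single hypothetical pose parameter; the specific terms, values, and the use of plain gradient descent are illustrative assumptions, not the paper's actual formulation.

```python
def objective(theta, observation, collision_limit):
    """Toy unified objective: a data term pulling the pose parameter
    toward the visual observation, plus a soft collision penalty that
    pushes it out of a forbidden region (theta < collision_limit).
    Both terms are differentiable almost everywhere."""
    data = (theta - observation) ** 2
    collision = max(0.0, collision_limit - theta) ** 2
    return data + collision

def gradient(theta, observation, collision_limit):
    # Analytic gradient of the objective above.
    g = 2.0 * (theta - observation)
    if theta < collision_limit:
        g += -2.0 * (collision_limit - theta)
    return g

# The observation lies inside the forbidden zone, so the minimizer
# is a compromise between the data term and the collision penalty.
theta, obs, limit = 0.0, -1.0, 0.5
for _ in range(500):
    theta -= 0.1 * gradient(theta, obs, limit)
# theta converges to -0.25, the minimum of (theta+1)^2 + (0.5-theta)^2
```

Because every term is smooth away from the penalty boundary, any standard gradient-based optimizer (not just the gradient descent shown here) can minimize the combined energy.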
British Machine Vision Conference, September 2015 (conference)
Detecting small objects in images is a challenging problem, particularly when they are occluded by hands or other body parts.
Recently, joint modelling of human pose and objects has been proposed to improve both pose estimation as well as object detection.
These approaches, however, focus on explicit interaction with an object and lack the flexibility to combine both modalities when interaction is not obvious.
We therefore propose to use human pose as additional contextual information for object detection.
To this end, we represent an object category by a tree model and train regression forests that localize parts of an object for each modality separately.
Predictions of the two modalities are then combined to detect the bounding box of the object.
We evaluate our approach on three challenging datasets which vary in the amount of object interactions and the quality of automatically extracted human poses.
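The fusion step described above, where the appearance-based and pose-based forests localize object parts independently and their predictions are then combined, can be sketched as follows. This is a minimal illustration assuming each modality outputs a per-pixel score map and the fused maximum gives the part location; the weighting scheme and names are hypothetical, not the paper's exact procedure.

```python
import numpy as np

def combine_modalities(appearance_votes, pose_votes, alpha=0.5):
    """Fuse two Hough-style vote maps for an object part.

    appearance_votes, pose_votes: 2D arrays of per-pixel scores
    predicted independently by the appearance- and pose-based
    regression forests; alpha weights the appearance modality.
    """
    fused = alpha * appearance_votes + (1.0 - alpha) * pose_votes
    # The part detection is placed at the maximum of the fused map.
    y, x = np.unravel_index(np.argmax(fused), fused.shape)
    return (x, y), fused

# Toy example: both modalities vote on a 5x5 grid; the pose modality
# has a spurious second peak that the fusion suppresses.
app = np.zeros((5, 5)); app[2, 3] = 1.0
pose = np.zeros((5, 5)); pose[2, 3] = 0.8; pose[0, 0] = 0.5
(cx, cy), fused = combine_modalities(app, pose)
# (cx, cy) == (3, 2): the peak agreed upon by both modalities wins
```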
International Conference on Image Processing, pages: 1653-1657, Paris, France, October 2014 (conference)
Hough-based voting approaches have been successfully applied to object detection. While these methods can be efficiently implemented by random forests, they estimate the probability for an object hypothesis for each feature independently. In this work, we address this problem by grouping features in a local neighborhood to obtain a better estimate of the probability. To this end, we propose oblique classification-regression forests that combine features of different trees. We further investigate the benefit of combining independent and grouped features and evaluate the approach on RGB and RGB-D datasets.
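The baseline that this paper improves on, Hough voting where each feature independently casts a weighted vote for the object center, can be sketched in a few lines. The offsets here would come from regression-forest leaves in practice; the data and function names below are illustrative assumptions.

```python
import numpy as np

def hough_votes(features, shape):
    """Accumulate object-centre votes cast by individual features.

    features: list of (x, y, dx, dy, weight) tuples, where (dx, dy)
    is the offset to the object centre predicted for the feature at
    (x, y), e.g. by a regression-forest leaf.
    """
    acc = np.zeros(shape)
    for x, y, dx, dy, w in features:
        cx, cy = x + dx, y + dy
        if 0 <= cy < shape[0] and 0 <= cx < shape[1]:
            acc[cy, cx] += w  # each feature votes independently
    return acc

# Three features whose predicted offsets all point at centre (3, 3).
feats = [(1, 1, 2, 2, 1.0), (4, 4, -1, -1, 1.0), (0, 0, 3, 3, 0.5)]
acc = hough_votes(feats, (6, 6))
peak = np.unravel_index(np.argmax(acc), acc.shape)
# peak == (3, 3) with accumulated score 2.5
```

The paper's contribution is to replace the independent per-feature votes above with votes computed jointly over local groups of features, via oblique classification-regression forests that combine features across trees.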
In German Conference on Pattern Recognition (GCPR), pages: 1-13, Lecture Notes in Computer Science, Springer, September 2014 (inproceedings)
Hand motion capture has been an active research topic in recent years, following the success of full-body pose tracking. Despite similarities, hand tracking proves to be more challenging, characterized by a higher dimensionality, severe occlusions and self-similarity between fingers.
For this reason, most approaches rely on strong assumptions, such as hands in isolation or expensive multi-camera systems, that limit their practical use. In this work, we propose a framework for hand tracking that can capture the motion of two interacting hands using only a single, inexpensive RGB-D camera. Our approach combines a generative model with collision detection and discriminatively learned salient points. We quantitatively evaluate our approach on 14 new sequences with challenging interactions.
In European Conference on Computer Vision, 8694, pages: 415-430, Lecture Notes in Computer Science, (Editors: D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars), Springer International Publishing, September 2014 (inproceedings)
In order to avoid an expensive manual labeling process, or to learn object classes autonomously without human intervention, object discovery techniques have been proposed that extract visually similar objects from weakly labelled videos. However, the problem of discovering small or medium-sized objects is largely unexplored. We observe that videos of activities involving human-object interactions can serve as weakly labelled data for such cases. Since neither object appearance nor motion is distinct enough to discover objects in these videos, we propose a framework that samples from a space of algorithms and their parameters to extract sequences of object proposals. Furthermore, we model the similarity of objects based on appearance and functionality, which is derived from human and object motion. We show that functionality is an important cue for discovering objects from activities and demonstrate the generality of the model on three challenging RGB-D and RGB datasets.
Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments, and to use this understanding to design future systems.