International Journal of Computer Vision (IJCV), 118(2):172-193, June 2016 (article)
Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors. However, even the most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error, and with collision detection and physics simulation to achieve physically plausible estimates even in the case of occlusions and missing visual data. Since all components are unified in a single objective function that is differentiable almost everywhere, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.
European Conference on Computer Vision Workshops 2016 (ECCVW’16) - Workshop on Recovering 6D Object Pose (R6D’16), pages: 620-633, Springer International Publishing, 2016 (proceedings)
Although commercial and open-source software exist to reconstruct a static object from a sequence recorded with an RGB-D sensor, there is a lack of tools that build rigged models of articulated objects that deform realistically and can be used for tracking or animation.
In this work, we fill this gap and propose a method that creates a fully rigged model of an articulated object from depth data of a single sensor.
To this end, we combine deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow.
The fully rigged model then consists of a watertight mesh, embedded skeleton, and skinning weights.
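To illustrate how a skeleton and skinning weights can drive a mesh, the following is a minimal 2D sketch that assumes linear blend skinning. The abstract does not state which skinning model is used, so this is an assumption, and the helper names `rigid_2d`, `apply_transform` and `blend_skin` are hypothetical.

```python
import math

def rigid_2d(angle, tx, ty):
    """Return a 2D rigid transform (rotation by `angle`, then translation)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c, -s, tx, s, c, ty)

def apply_transform(transform, point):
    """Apply a transform produced by rigid_2d to a 2D point."""
    a, b, tx, c, d, ty = transform
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

def blend_skin(vertex, weights, bone_transforms):
    """Linear blend skinning (assumed model): the deformed vertex is the
    weight-averaged result of applying every bone transform to it."""
    if len(weights) != len(bone_transforms):
        raise ValueError("one weight per bone is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("skinning weights must sum to 1")
    x = y = 0.0
    for w, transform in zip(weights, bone_transforms):
        px, py = apply_transform(transform, vertex)
        x += w * px
        y += w * py
    return (x, y)

# Two bones: one stays in place, the other is shifted by 2 units along x.
bones = [rigid_2d(0.0, 0.0, 0.0), rigid_2d(0.0, 2.0, 0.0)]
# A vertex influenced equally by both bones moves half of the way.
deformed = blend_skin((1.0, 0.0), [0.5, 0.5], bones)
```

A vertex bound entirely to one bone (weights `[1.0, 0.0]`) follows that bone rigidly, while mixed weights give a smooth transition near joints.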
In International Conference on Computer Vision (ICCV), pages: 729-737, December 2015 (inproceedings)
Recent advances have enabled 3d object reconstruction approaches using a single off-the-shelf RGB-D camera. Although these approaches are successful for a wide range of object classes, they rely on stable and distinctive geometric or texture features. Many objects like mechanical parts, toys, household or decorative articles, however, are textureless and characterized by minimalistic shapes that are simple and symmetric. Existing in-hand scanning systems and 3d reconstruction techniques fail for such symmetric objects in the absence of highly distinctive features. In this work, we show that extracting 3d hand motion for in-hand scanning effectively facilitates the reconstruction of even featureless and highly symmetric objects and we present an approach that fuses the rich additional information of hands into a 3d reconstruction pipeline, significantly contributing to the state-of-the-art of in-hand scanning.
British Machine Vision Conference, September 2015 (conference)
Detecting small objects in images is a challenging problem, particularly when they are often occluded by hands or other body parts.
Recently, joint modelling of human pose and objects has been proposed to improve both pose estimation and object detection.
These approaches, however, focus on explicit interaction with an object and lack the flexibility to combine both modalities when interaction is not obvious.
We therefore propose to use human pose as additional context information for object detection.
To this end, we represent an object category by a tree model and train regression forests that localize parts of an object for each modality separately.
Predictions of the two modalities are then combined to detect the bounding box of the object.
We evaluate our approach on three challenging datasets which vary in the amount of object interactions and the quality of automatically extracted human poses.
International Conference on Image Processing, pages: 1653-1657, Paris, France, October 2014 (conference)
Hough-based voting approaches have been successfully applied to object detection. While these methods can be efficiently implemented by random forests, they estimate the probability for an object hypothesis for each feature independently. In this work, we address this problem by grouping features in a local neighborhood to obtain a better estimate of the probability. To this end, we propose oblique classification-regression forests that combine features of different trees. We further investigate the benefit of combining independent and grouped features and evaluate the approach on RGB and RGB-D datasets.
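For context, the independent voting that the abstract sets out to improve can be sketched as follows. This is only a minimal illustration of plain Hough voting, not the oblique classification-regression forests proposed in the paper. The function name `hough_vote` and the feature layout `(x, y, dx, dy, weight)` are assumptions made for this example.

```python
from collections import defaultdict

def hough_vote(features, cell=1):
    """Accumulate independent votes for an object centre.

    Each feature is (x, y, dx, dy, weight): a feature at integer pixel
    (x, y) votes for the centre (x + dx, y + dy) with the given weight.
    Votes are binned into square cells of side `cell` pixels.
    Returns the winning cell index and its accumulated weight.
    """
    accumulator = defaultdict(float)
    for x, y, dx, dy, weight in features:
        cell_index = ((x + dx) // cell, (y + dy) // cell)
        accumulator[cell_index] += weight
    if not accumulator:
        raise ValueError("at least one feature is required")
    best = max(accumulator, key=accumulator.get)
    return best, accumulator[best]

# Three features agree on the centre (5, 5); one outlier votes for (1, 1).
features = [
    (2, 3, 3, 2, 0.5),
    (8, 5, -3, 0, 0.4),
    (4, 9, 1, -4, 0.6),
    (0, 0, 1, 1, 0.9),
]
centre, score = hough_vote(features)
```

Although the outlier carries the largest single weight, the three consistent votes win because their weights add up. Every vote is added without regard to the others, which is the independence assumption the abstract refers to.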
In German Conference on Pattern Recognition (GCPR), pages: 1-13, Lecture Notes in Computer Science, Springer, September 2014 (inproceedings)
Hand motion capture has been an active research topic in recent years, following the success of full-body pose tracking. Despite similarities, hand tracking proves to be more challenging, characterized by a higher dimensionality, severe occlusions and self-similarity between fingers.
For this reason, most approaches rely on strong assumptions, like hands in isolation or expensive multi-camera systems, that limit the practical use. In this work, we propose a framework for hand tracking that can capture the motion of two interacting hands using only a single, inexpensive RGB-D camera. Our approach combines a generative model with collision detection and discriminatively learned salient points. We quantitatively evaluate our approach on 14 new sequences with challenging interactions.
In European Conference on Computer Vision, 8694, pages: 415-430, Lecture Notes in Computer Science, (Editors: D. Fleet and T. Pajdla and B. Schiele and T. Tuytelaars), Springer International Publishing, September 2014 (inproceedings)
In order to avoid an expensive manual labeling process or to learn object classes autonomously without human intervention, object discovery techniques have been proposed that extract visually similar objects from weakly labelled videos. However, the problem of discovering small- or medium-sized objects is largely unexplored. We observe that videos with activities involving human-object interactions can serve as weakly labelled data for such cases. Since neither object appearance nor motion is distinct enough to discover objects in these videos, we propose a framework that samples from a space of algorithms and their parameters to extract sequences of object proposals. Furthermore, we model similarity of objects based on appearance and functionality, which is derived from human and object motion. We show that functionality is an important cue for discovering objects from activities and demonstrate the generality of the model on three challenging RGB-D and RGB datasets.
In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 3041-3048, IEEE, Portland, OR, USA, June 2013 (inproceedings)
In this work, we address the problem of estimating 2d human pose from still images. Recent methods that rely on discriminatively trained deformable parts organized in a tree model have been shown to be very successful in solving this task. Within such a pictorial structure framework, we address the problem of obtaining good part templates by proposing novel, non-linear joint regressors. In particular, we employ two-layered random forests as joint regressors. The first layer acts as a discriminative, independent body part classifier. The second layer takes the estimated class distributions of the first one into account and is thereby able to predict joint locations by modeling the interdependence and co-occurrence of the parts. This results in a pose estimation framework that already takes dependencies between body parts into account during joint localization and is thus able to circumvent typical ambiguities of tree structures, such as for legs and arms. In the experiments, we demonstrate that our body-part-dependent joint regressors achieve a higher joint localization accuracy than tree-based state-of-the-art methods.
In German Conference on Pattern Recognition (GCPR), 8142, pages: 131-141, Lecture Notes in Computer Science, (Editors: Weickert, Joachim and Hein, Matthias and Schiele, Bernt), Springer, 2013 (inproceedings)
Benchmarking methods for 3d hand tracking is still an open problem due to the difficulty of acquiring ground truth data.
We introduce a new dataset and benchmarking protocol that is insensitive to the accumulative error of other protocols.
To this end, we create testing frame pairs of increasing difficulty and measure the pose estimation error separately for each of them.
This approach gives new insights and makes it possible to accurately study the performance of each feature or method without employing a full tracking pipeline.
Following this protocol, we evaluate various directional distances in the context of silhouette-based 3d hand tracking, expressed as special cases of a generalized Chamfer distance form.
An appropriate parameter setup is proposed for each of them, and a comparative study reveals the best performing method in this context.
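As a point of reference for the directional distances mentioned above, the following minimal sketch computes a basic directed Chamfer distance between two 2D point sets. It is not the generalized form studied in the paper. The function names and the toy point sets are assumptions made for illustration.

```python
import math

def directed_chamfer(source, target):
    """Mean Euclidean distance from each source point to its nearest
    target point. The measure is directional: swapping the arguments
    generally gives a different value."""
    if not source or not target:
        raise ValueError("both point sets must be non-empty")
    total = 0.0
    for sx, sy in source:
        total += min(math.hypot(sx - tx, sy - ty) for tx, ty in target)
    return total / len(source)

def symmetric_chamfer(a, b):
    """Sum of the two directed distances."""
    return directed_chamfer(a, b) + directed_chamfer(b, a)

# Toy silhouettes: the observation contains one point the model lacks.
model = [(0.0, 0.0), (1.0, 0.0)]
observed = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]

model_to_observed = directed_chamfer(model, observed)
observed_to_model = directed_chamfer(observed, model)
```

Here the model-to-observation distance is zero because every model point is explained by the observation, while the reverse direction penalizes the unexplained observed point. This asymmetry is why the choice of direction matters when comparing a model silhouette with an image silhouette.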
Fanelli, G., Dantone, M., Gall, J., Fossati, A., van Gool, L.
International Journal of Computer Vision, 101(3):437-458, Springer, 2013 (article)
We present a random forest-based framework for real time head pose estimation from depth images and extend it to localize a set of facial features in 3D. Our algorithm takes a voting approach, where each patch extracted from the depth image can directly cast a vote for the head pose or each of the facial features. Our system proves capable of handling large rotations, partial occlusions, and the noisy depth data acquired using commercial sensors. Moreover, the algorithm works on each frame independently and achieves real time performance without resorting to parallel computations on a GPU. We present extensive experiments on publicly available, challenging datasets and present a new annotated head pose database recorded using a Microsoft Kinect.
Transactions on Pattern Analysis and Machine Intelligence, 35(11):2720-2735, 2013 (article)
Capturing the skeleton motion and detailed time-varying surface geometry of multiple, closely interacting people is a very challenging task, even in a multicamera setup, due to frequent occlusions and ambiguities in feature-to-person assignments. To address this task, we propose a framework that exploits multiview image segmentation. To this end, a probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Given the articulated template models of each person and the labeled pixels, a combined optimization scheme, which splits the skeleton pose optimization problem into a local one and a lower dimensional global one, is applied one by one to each individual, followed by surface estimation to capture detailed nonrigid deformations. We show on various sequences that our approach can capture the 3D motion of humans accurately even if they move rapidly, if they wear wide apparel, and if they are engaged in challenging multiperson motions, including dancing, wrestling, and hugging.
In Outdoor and Large-Scale Real-World Scene Analysis, 7474, pages: 305-328, LNCS, (Editors: Dellaert, Frank and Frahm, Jan-Michael and Pollefeys, Marc and Rosenhahn, Bodo and Leal-Taixé, Laura), Springer, 2012 (incollection)
Pellegrini, S., Gall, J., Sigal, L., van Gool, L.
Destination Flow for Crowd Simulation
In Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Streams, 7585, pages: 162-171, LNCS, Springer, 2012 (inproceedings)
Yao, A., Gall, J., Leistner, C., van Gool, L.
Interactive Object Detection
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages: 3242-3249, IEEE, Providence, RI, USA, 2012 (inproceedings)
In Outdoor and Large-Scale Real-World Scene Analysis, 7474, pages: 243-263, LNCS, (Editors: Dellaert, Frank and Frahm, Jan-Michael and Pollefeys, Marc and Rosenhahn, Bodo and Leal-Taixé, Laura), Springer, 2012 (incollection)