

2019


Resolving 3D Human Pose Ambiguities with 3D Scene Constraints

Hassan, M., Choutas, V., Tzionas, D., Black, M. J.

In International Conference on Computer Vision, October 2019 (inproceedings)

Abstract
To understand and analyze human behavior, we need to capture humans moving in, and interacting with, the world. Most existing methods perform 3D human pose estimation without explicitly considering the scene. We observe however that the world constrains the body and vice-versa. To motivate this, we show that current 3D human pose estimation methods produce results that are not consistent with the 3D scene. Our key contribution is to exploit static 3D scene structure to better estimate human pose from monocular images. The method enforces Proximal Relationships with Object eXclusion and is called PROX. To test this, we collect a new dataset composed of 12 different 3D scenes and RGB sequences of 20 subjects moving in and interacting with the scenes. We represent human pose using the 3D human body model SMPL-X and extend SMPLify-X to estimate body pose using scene constraints. We make use of the 3D scene information by formulating two main constraints. The interpenetration constraint penalizes intersection between the body model and the surrounding 3D scene. The contact constraint encourages specific parts of the body to be in contact with scene surfaces if they are close enough in distance and orientation. For quantitative evaluation we capture a separate dataset with 180 RGB frames in which the ground-truth body pose is estimated using a motion-capture system. We show quantitatively that introducing scene constraints significantly reduces 3D joint error and vertex error. Our code and data are available for research at https://prox.is.tue.mpg.de.
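
For intuition, the two scene terms can be sketched in a few lines of code. The following is only an illustration, not the released implementation: the signed-distance helper scene_sdf, the set of candidate contact vertices, the threshold, and the weights are assumptions, and the real contact term additionally checks surface-normal alignment.

```python
import torch

def scene_terms(body_verts, scene_sdf, contact_idx, rho=0.05):
    # body_verts:  (V, 3) posed SMPL-X vertices
    # scene_sdf:   assumed helper returning the signed distance of each point
    #              to the scene surface (negative = inside scene geometry)
    # contact_idx: indices of vertices on body parts that often touch the scene
    d = scene_sdf(body_verts)                            # (V,) signed distances

    # Interpenetration term: penalize vertices pushed inside the scene.
    e_penetration = (torch.clamp(-d, min=0.0) ** 2).sum()

    # Contact term: pull candidate contact vertices onto nearby surfaces.
    dc = d[contact_idx].abs()
    e_contact = torch.where(dc < rho, dc, torch.zeros_like(dc)).sum()

    return e_penetration, e_contact
```

Both terms would then be added, with weights, to the SMPLify-X image-fitting energy.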

pdf link (url) [BibTex]



End-to-end Learning for Graph Decomposition

Song, J., Andres, B., Black, M., Hilliges, O., Tang, S.

In International Conference on Computer Vision, October 2019 (inproceedings)

Abstract
Deep neural networks provide powerful tools for pattern recognition, while classical graph algorithms are widely used to solve combinatorial problems. In computer vision, many tasks combine elements of both pattern recognition and graph reasoning. In this paper, we study how to connect deep networks with graph decomposition into an end-to-end trainable framework. More specifically, the minimum cost multicut problem is first converted to an unconstrained binary cubic formulation where cycle consistency constraints are incorporated into the objective function. The new optimization problem can be viewed as a Conditional Random Field (CRF) in which the random variables are associated with the binary edge labels. Cycle constraints are introduced into the CRF as high-order potentials. A standard Convolutional Neural Network (CNN) provides the front-end features for the fully differentiable CRF. The parameters of both parts are optimized in an end-to-end manner. The efficacy of the proposed learning algorithm is demonstrated via experiments on clustering MNIST images and on the challenging task of real-world multi-people pose estimation.
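
For context, the minimum cost multicut problem that the paper reformulates can be written in its standard constrained form (stated here as background; the paper converts the cycle constraints into high-order CRF potentials):

```latex
\min_{y \in \{0,1\}^{E}} \; \sum_{e \in E} c_e \, y_e
\qquad \text{s.t.} \qquad
y_e \le \sum_{e' \in C \setminus \{e\}} y_{e'}
\quad \text{for every cycle } C \text{ and every } e \in C,
```

where y_e = 1 marks edge e as cut and the costs c_e are derived from the learned affinities. Relaxing the cycle inequalities into penalty terms yields an unconstrained binary formulation of the kind described above.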

PDF [BibTex]



Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture from Images "In the Wild"

Zuffi, S., Kanazawa, A., Berger-Wolf, T., Black, M. J.

In International Conference on Computer Vision, October 2019 (inproceedings)

Abstract
We present the first method to perform automatic 3D pose, shape and texture capture of animals from images acquired in-the-wild. In particular, we focus on the problem of capturing 3D information about Grevy's zebras from a collection of images. The Grevy's zebra is one of the most endangered species in Africa, with only a few thousand individuals left. Capturing the shape and pose of these animals can provide biologists and conservationists with information about animal health and behavior. In contrast to research on human pose, shape and texture estimation, training data for endangered species is limited, the animals are in complex natural scenes with occlusion, they are naturally camouflaged, travel in herds, and look similar to each other. To overcome these challenges, we integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation. Going beyond state-of-the-art methods for human shape and pose estimation, our method learns a shape space for zebras during training. Learning such a shape space from images using only a photometric loss is novel, and the approach can be used to learn shape in other settings with limited 3D supervision. Moreover, we couple 3D pose and shape prediction with the task of texture synthesis, obtaining a full texture map of the animal from a single image. We show that the predicted texture map allows a novel per-instance unsupervised optimization over the network features. This method, SMALST (SMAL with learned Shape and Texture) goes beyond previous work, which assumed manual keypoints and/or segmentation, to regress directly from pixels to 3D animal shape, pose and texture. Code and data are available at https://github.com/silviazuffi/smalst

pdf supmat [BibTex]



Markerless Outdoor Human Motion Capture Using Multiple Autonomous Micro Aerial Vehicles

Saini, N., Price, E., Tallamraju, R., Enficiaud, R., Ludwig, R., Martinović, I., Ahmad, A., Black, M.

In International Conference on Computer Vision, October 2019 (inproceedings) Accepted

Abstract
Capturing human motion in natural scenarios means moving motion capture out of the lab and into the wild. Typical approaches rely on fixed, calibrated cameras and reflective markers on the body, significantly limiting the motions that can be captured. To make motion capture truly unconstrained, we describe the first fully autonomous outdoor capture system based on flying vehicles. We use multiple micro aerial vehicles (MAVs), each equipped with a monocular RGB camera, an IMU, and a GPS receiver module. These detect the person, optimize their position, and localize themselves approximately. We then develop a markerless motion capture method that is suitable for this challenging scenario with a distant subject, viewed from above, with approximately calibrated and moving cameras. We combine multiple state-of-the-art 2D joint detectors with a 3D human body model and a powerful prior on human pose. We jointly optimize for 3D body pose and camera pose to robustly fit the 2D measurements. To our knowledge, this is the first successful demonstration of outdoor, full-body, markerless motion capture from autonomous flying vehicles.

Project Page [BibTex]


AMASS: Archive of Motion Capture as Surface Shapes

Mahmood, N., Ghorbani, N., Troje, N. F., Pons-Moll, G., Black, M. J.

In International Conference on Computer Vision, October 2019 (inproceedings)

Abstract
Large datasets are the cornerstone of recent advances in computer vision using deep learning. In contrast, existing human motion capture (mocap) datasets are small and the motions limited, hampering progress on learning models of human motion. While there are many different datasets available, they each use a different parameterization of the body, making it difficult to integrate them into a single meta dataset. To address this, we introduce AMASS, a large and varied database of human motion that unifies 15 different optical marker-based mocap datasets by representing them within a common framework and parameterization. We achieve this using a new method, MoSh++, that converts mocap data into realistic 3D human meshes represented by a rigged body model. Here we use SMPL [26], which is widely used and provides a standard skeletal representation as well as a fully rigged surface mesh. The method works for arbitrary marker-sets, while recovering soft-tissue dynamics and realistic hand motion. We evaluate MoSh++ and tune its hyper-parameters using a new dataset of 4D body scans that are jointly recorded with marker-based mocap. The consistent representation of AMASS makes it readily useful for animation, visualization, and generating training data for deep learning. Our dataset is significantly richer than previous human motion collections, having more than 40 hours of motion data, spanning over 300 subjects, more than 11000 motions, and is available for research at https://amass.is.tue.mpg.de/.
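
As a practical note, each AMASS sequence is distributed as a compressed NumPy archive holding the fitted body-model parameters. A minimal loading sketch follows; the file name is hypothetical and the exact keys and dimensions may differ between releases.

```python
import numpy as np

data = np.load("CMU/01/01_01_poses.npz")   # hypothetical sequence file

poses = data["poses"]             # (T, D) per-frame pose parameters (axis-angle)
betas = data["betas"]             # shape coefficients of the subject
trans = data["trans"]             # (T, 3) global translation per frame
fps   = data["mocap_framerate"]   # capture frame rate

# Feeding these parameters to a SMPL/SMPL-H/SMPL-X body model yields a posed
# mesh for every frame of the sequence.
print(poses.shape, betas.shape, float(fps))
```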

arXiv pdf supmat [BibTex]


The Influence of Visual Perspective on Body Size Estimation in Immersive Virtual Reality

Thaler, A., Pujades, S., Stefanucci, J. K., Creem-Regehr, S. H., Tesch, J., Black, M. J., Mohler, B. J.

In ACM Symposium on Applied Perception, September 2019 (inproceedings)

Abstract
The creation of realistic self-avatars that users identify with is important for many virtual reality applications. However, current approaches for creating biometrically plausible avatars that represent a particular individual require expertise and are time-consuming. We investigated the visual perception of an avatar’s body dimensions by asking males and females to estimate their own body weight and shape on a virtual body using a virtual reality avatar creation tool. In a method of adjustment task, the virtual body was presented in an HTC Vive head-mounted display either co-located with (first-person perspective) or facing (third-person perspective) the participants. Participants adjusted the body weight and dimensions of various body parts to match their own body shape and size. Both males and females underestimated their weight by 10-20% in the virtual body, but the estimates of the other body dimensions were relatively accurate and within a range of ±6%. There was a stronger influence of visual perspective on the estimates for males, but this effect was dependent on the amount of control over the shape of the virtual body, indicating that the results might be caused by where in the body the weight changes expressed themselves. These results suggest that this avatar creation tool could be used to allow participants to make a relatively accurate self-avatar in terms of adjusting body part dimensions, but not weight, and that the influence of visual perspective and amount of control needed over the body shape are likely gender-specific.

pdf [BibTex]



Learning to Train with Synthetic Humans

Hoffmann, D. T., Tzionas, D., Black, M. J., Tang, S.

In German Conference on Pattern Recognition (GCPR), September 2019 (inproceedings)

Abstract
Neural networks need big annotated datasets for training. However, manual annotation can be too expensive or even infeasible for certain tasks, like multi-person 2D pose estimation with severe occlusions. A remedy for this is synthetic data with perfect ground truth. Here we explore two variations of synthetic data for this challenging problem: a dataset with purely synthetic humans and a real dataset augmented with synthetic humans. We then study which approach better generalizes to real data, as well as the influence of virtual humans in the training loss. We observe that not all synthetic samples are equally informative for training, and that the informative samples differ between training stages. To exploit this observation, we employ an adversarial student-teacher framework; the teacher improves the student by providing the hardest samples for its current state as a challenge. Experiments show that this student-teacher framework outperforms all our baselines.

pdf suppl poster link (url) [BibTex]



Active Perception based Formation Control for Multiple Aerial Vehicles

Tallamraju, R., Price, E., Ludwig, R., Karlapalem, K., Bülthoff, H. H., Black, M. J., Ahmad, A.

IEEE Robotics and Automation Letters, IEEE, August 2019 (article) Accepted

Abstract
We present a novel robotic front-end for autonomous aerial motion-capture (mocap) in outdoor environments. In previous work, we presented an approach for cooperative detection and tracking (CDT) of a subject using multiple micro-aerial vehicles (MAVs). However, it did not ensure optimal view-point configurations of the MAVs to minimize the uncertainty in the person's cooperatively tracked 3D position estimate. In this article, we introduce an active approach for CDT. In contrast to cooperatively tracking only the 3D positions of the person, the MAVs can actively compute optimal local motion plans, resulting in optimal view-point configurations, which minimize the uncertainty in the tracked estimate. We achieve this by decoupling the goal of active tracking into a quadratic objective and non-convex constraints corresponding to angular configurations of the MAVs w.r.t. the person. We derive this decoupling using Gaussian observation model assumptions within the CDT algorithm. We preserve convexity in optimization by embedding all the non-convex constraints, including those for dynamic obstacle avoidance, as external control inputs in the MPC dynamics. Multiple real robot experiments and comparisons involving 3 MAVs in several challenging scenarios are presented.

pdf Project Page [BibTex]



Motion Planning for Multi-Mobile-Manipulator Payload Transport Systems

Tallamraju, R., Salunkhe, D., Rajappa, S., Ahmad, A., Karlapalem, K., Shah, S. V.

In 15th IEEE International Conference on Automation Science and Engineering, IEEE, August 2019 (inproceedings) Accepted

[BibTex]



Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation

Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., Black, M. J.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2019 (inproceedings)

Abstract
We address the unsupervised learning of several interconnected problems in low-level vision: single view depth prediction, camera motion estimation, optical flow, and segmentation of a video into the static scene and moving regions. Our key insight is that these four fundamental vision problems are coupled through geometric constraints. Consequently, learning to solve them together simplifies the problem because the solutions can reinforce each other. We go beyond previous work by exploiting geometry more explicitly and segmenting the scene into static and moving regions. To that end, we introduce Competitive Collaboration, a framework that facilitates the coordinated training of multiple specialized neural networks to solve complex problems. Competitive Collaboration works much like expectation-maximization, but with neural networks that act as both competitors to explain pixels that correspond to static or moving regions, and as collaborators through a moderator that assigns pixels to be either static or independently moving. Our novel method integrates all these problems in a common framework and simultaneously reasons about the segmentation of the scene into moving objects and the static background, the camera motion, depth of the static scene structure, and the optical flow of moving objects. Our model is trained without any supervision and achieves state-of-the-art performance among joint unsupervised methods on all sub-problems.
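
The training schedule alternates between a competition phase and a collaboration phase. The sketch below shows one such cycle; all module names, loss functions, and the exact moderator update are illustrative simplifications of the published procedure.

```python
import torch

def cc_training_cycle(batch, depth_ego_net, flow_net, mask_net,
                      photo_loss, opt_competitors, opt_moderator):
    # Competition: freeze the moderator; each competitor explains the pixels
    # currently assigned to it (static scene vs. independently moving regions).
    mask_static = mask_net(batch).detach()
    loss_comp = (mask_static * photo_loss(depth_ego_net(batch), batch)
                 + (1 - mask_static) * photo_loss(flow_net(batch), batch)).mean()
    opt_competitors.zero_grad(); loss_comp.backward(); opt_competitors.step()

    # Collaboration: freeze the competitors; train the moderator to assign
    # each pixel to whichever competitor reconstructs it better.
    with torch.no_grad():
        err_static = photo_loss(depth_ego_net(batch), batch)
        err_moving = photo_loss(flow_net(batch), batch)
    mask_static = mask_net(batch)
    loss_mod = (mask_static * err_static + (1 - mask_static) * err_moving).mean()
    opt_moderator.zero_grad(); loss_mod.backward(); opt_moderator.step()
```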

Paper link (url) Project Page Project Page [BibTex]


Local Temporal Bilinear Pooling for Fine-grained Action Parsing

Zhang, Y., Tang, S., Muandet, K., Jarvers, C., Neumann, H.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2019 (inproceedings)

Abstract
Fine-grained temporal action parsing is important in many applications, such as daily activity understanding, human motion analysis, surgical robotics and others requiring subtle and precise operations in a long-term period. In this paper we propose a novel bilinear pooling operation, which is used in intermediate layers of a temporal convolutional encoder-decoder net. In contrast to other work, our proposed bilinear pooling is learnable and hence can capture more complex local statistics than the conventional counterpart. In addition, we introduce exact lower-dimension representations of our bilinear forms, so that the dimensionality is reduced with neither information loss nor extra computation. We perform intensive experiments to quantitatively analyze our model and show the superior performances to other state-of-the-art work on various datasets.
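
As background, conventional bilinear pooling over a local temporal window averages the outer products of frame features; the paper replaces this fixed operation with a learnable, exactly dimension-reduced variant. A sketch of the conventional baseline:

```python
import numpy as np

def temporal_bilinear_pool(X):
    # X: (T, D) frame-level features inside one local temporal window.
    B = (X[:, :, None] * X[:, None, :]).mean(axis=0)   # (D, D) second-order statistics
    b = B.reshape(-1)
    b = np.sign(b) * np.sqrt(np.abs(b))                 # signed square root
    return b / (np.linalg.norm(b) + 1e-8)               # L2 normalization
```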

Code video demo pdf link (url) [BibTex]



Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision

Sanyal, S., Bolkart, T., Feng, H., Black, M. J.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2019 (inproceedings)

Abstract
The estimation of 3D face shape from a single image must be robust to variations in lighting, head pose, expression, facial hair, makeup, and occlusions. Robustness requires a large training set of in-the-wild images, which by construction, lack ground truth 3D shape. To train a network without any 2D-to-3D supervision, we present RingNet, which learns to compute 3D face shape from a single image. Our key observation is that an individual’s face shape is constant across images, regardless of expression, pose, lighting, etc. RingNet leverages multiple images of a person and automatically detected 2D face features. It uses a novel loss that encourages the face shape to be similar when the identity is the same and different for different people. We achieve invariance to expression by representing the face using the FLAME model. Once trained, our method takes a single image and outputs the parameters of FLAME, which can be readily animated. Additionally we create a new database of faces “not quite in-the-wild” (NoW) with 3D head scans and high-resolution images of the subjects in a wide variety of conditions. We evaluate publicly available methods and find that RingNet is more accurate than methods that use 3D supervision. The dataset, model, and results are available for research purposes.
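
The shape-consistency idea can be illustrated with a triplet-style loss on predicted FLAME shape codes: codes for two images of the same person are pulled together and pushed away from a different person's code. This is a simplified stand-in for the paper's ring loss, which operates over a ring of R images with a margin.

```python
import torch

def shape_consistency_loss(shape_a, shape_b, shape_other, margin=0.5):
    # shape_a, shape_b: predicted shape codes for two images of the same subject
    # shape_other:      predicted shape code for a different subject
    d_pos = ((shape_a - shape_b) ** 2).sum(dim=-1)
    d_neg = ((shape_a - shape_other) ** 2).sum(dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```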

code pdf preprint link (url) Project Page [BibTex]


Learning Joint Reconstruction of Hands and Manipulated Objects

Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black, M. J., Laptev, I., Schmid, C.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2019 (inproceedings)

Abstract
Estimating hand-object manipulations is essential for interpreting and imitating human actions. Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation. Yet, reconstructing hands and objects during manipulation is a more challenging task due to significant occlusions of both the hand and object. While presenting challenges, manipulations may also simplify the problem since the physics of contact restricts the space of valid hand-object configurations. For example, during manipulation, the hand and object should be in contact but not interpenetrate. In this work, we regularize the joint reconstruction of hands and objects with manipulation constraints. We present an end-to-end learnable model that exploits a novel contact loss that favors physically plausible hand-object constellations. Our approach improves grasp quality metrics over baselines, using RGB images as input. To train and evaluate the model, we also propose a new large-scale synthetic dataset, ObMan, with hand-object manipulations. We demonstrate the transferability of ObMan-trained models to real data.

pdf suppl poster link (url) Project Page Project Page [BibTex]



Expressive Body Capture: 3D Hands, Face, and Body from a Single Image

Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A. A. A., Tzionas, D., Black, M. J.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2019 (inproceedings)

Abstract
To facilitate the analysis of human actions, interactions and emotions, we compute a 3D model of human body pose, hand pose, and facial expression from a single monocular image. To achieve this, we use thousands of 3D scans to train a new, unified, 3D model of the human body, SMPL-X, that extends SMPL with fully articulated hands and an expressive face. Learning to regress the parameters of SMPL-X directly from images is challenging without paired images and 3D ground truth. Consequently, we follow the approach of SMPLify, which estimates 2D features and then optimizes model parameters to fit the features. We improve on SMPLify in several significant ways: (1) we detect 2D features corresponding to the face, hands, and feet and fit the full SMPL-X model to these; (2) we train a new neural network pose prior using a large MoCap dataset; (3) we define a new interpenetration penalty that is both fast and accurate; (4) we automatically detect gender and the appropriate body models (male, female, or neutral); (5) our PyTorch implementation achieves a speedup of more than 8x over Chumpy. We use the new method, SMPLify-X, to fit SMPL-X to both controlled images and images in the wild. We evaluate 3D accuracy on a new curated dataset comprising 100 images with pseudo ground-truth. This is a step towards automatic expressive human capture from monocular RGB data. The models, code, and data are available for research purposes at https://smpl-x.is.tue.mpg.de.

video code pdf suppl poster link (url) Project Page [BibTex]


Capture, Learning, and Synthesis of 3D Speaking Styles

Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A., Black, M. J.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2019 (inproceedings)

Abstract
Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation) takes any speech signal as input—even speech in languages other than English—and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de.

code Project Page video paper [BibTex]



Learning and Tracking the 3D Body Shape of Freely Moving Infants from RGB-D sequences

Hesse, N., Pujades, S., Black, M., Arens, M., Hofmann, U., Schroeder, S.

Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2019 (article)

Abstract
Statistical models of the human body surface are generally learned from thousands of high-quality 3D scans in predefined poses to cover the wide variety of human body shapes and articulations. Acquisition of such data requires expensive equipment, calibration procedures, and is limited to cooperative subjects who can understand and follow instructions, such as adults. We present a method for learning a statistical 3D Skinned Multi-Infant Linear body model (SMIL) from incomplete, low-quality RGB-D sequences of freely moving infants. Quantitative experiments show that SMIL faithfully represents the RGB-D data and properly factorizes the shape and pose of the infants. To demonstrate the applicability of SMIL, we fit the model to RGB-D sequences of freely moving infants and show, with a case study, that our method captures enough motion detail for General Movements Assessment (GMA), a method used in clinical practice for early detection of neurodevelopmental disorders in infants. SMIL provides a new tool for analyzing infant shape and movement and is a step towards an automated system for GMA.

pdf Journal DOI [BibTex]



Perceptual Effects of Inconsistency in Human Animations

Kenny, S., Mahmood, N., Honda, C., Black, M. J., Troje, N. F.

ACM Trans. Appl. Percept., 16(1):2:1-2:18, February 2019 (article)

Abstract
The individual shape of the human body, including the geometry of its articulated structure and the distribution of weight over that structure, influences the kinematics of a person’s movements. How sensitive is the visual system to inconsistencies between shape and motion introduced by retargeting motion from one person onto the shape of another? We used optical motion capture to record five pairs of male performers with large differences in body weight, while they pushed, lifted, and threw objects. From these data, we estimated both the kinematics of the actions as well as the performer’s individual body shape. To obtain consistent and inconsistent stimuli, we created animated avatars by combining the shape and motion estimates from either a single performer or from different performers. Using these stimuli we conducted three experiments in an immersive virtual reality environment. First, a group of participants detected which of two stimuli was inconsistent. Performance was very low, and results were only marginally significant. Next, a second group of participants rated perceived attractiveness, eeriness, and humanness of consistent and inconsistent stimuli, but these judgements of animation characteristics were not affected by consistency of the stimuli. Finally, a third group of participants rated properties of the objects rather than of the performers. Here, we found strong influences of shape-motion inconsistency on perceived weight and thrown distance of objects. This suggests that the visual system relies on its knowledge of shape and motion and that these components are assimilated into an altered perception of the action outcome. We propose that the visual system attempts to resist inconsistent interpretations of human animations. Actions involving object manipulations present an opportunity for the visual system to reinterpret the introduced inconsistencies as a change in the dynamics of an object rather than as an unexpected combination of body shape and body motion.

publisher pdf DOI [BibTex]



Perceiving Systems (2016-2018)
Scientific Advisory Board Report, 2019 (misc)

pdf [BibTex]



The Virtual Caliper: Rapid Creation of Metrically Accurate Avatars from 3D Measurements

Pujades, S., Mohler, B., Thaler, A., Tesch, J., Mahmood, N., Hesse, N., Bülthoff, H. H., Black, M. J.

IEEE Transactions on Visualization and Computer Graphics, 25, pages: 1887-1897, IEEE, 2019 (article)

Abstract
Creating metrically accurate avatars is important for many applications such as virtual clothing try-on, ergonomics, medicine, immersive social media, telepresence, and gaming. Creating avatars that precisely represent a particular individual is challenging however, due to the need for expensive 3D scanners, privacy issues with photographs or videos, and difficulty in making accurate tailoring measurements. We overcome these challenges by creating “The Virtual Caliper”, which uses VR game controllers to make simple measurements. First, we establish what body measurements users can reliably make on their own body. We find several distance measurements to be good candidates and then verify that these are linearly related to 3D body shape as represented by the SMPL body model. The Virtual Caliper enables novice users to accurately measure themselves and create an avatar with their own body shape. We evaluate the metric accuracy relative to ground truth 3D body scan data, compare the method quantitatively to other avatar creation tools, and perform extensive perceptual studies. We also provide a software application to the community that enables novices to rapidly create avatars in fewer than five minutes. Not only is our approach more rapid than existing methods, it exports a metrically accurate 3D avatar model that is rigged and skinned.
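
Because the chosen measurements are reported to be linearly related to SMPL shape coefficients, the mapping can be calibrated with ordinary least squares. A minimal sketch under that assumption; the data files and the number of shape coefficients are hypothetical.

```python
import numpy as np

M = np.load("train_measurements.npy")   # (N, K) measurements per training subject (hypothetical)
B = np.load("train_betas.npy")          # (N, 10) SMPL shape coefficients from their scans (hypothetical)

# Fit an affine map [M 1] W ~ B.
A = np.hstack([M, np.ones((M.shape[0], 1))])
W, *_ = np.linalg.lstsq(A, B, rcond=None)

def measurements_to_betas(m):
    """Predict SMPL shape coefficients from one user's measurements."""
    return np.append(m, 1.0) @ W
```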

Project Page IEEE Open Access IEEE Open Access PDF DOI [BibTex]



Resisting Adversarial Attacks using Gaussian Mixture Variational Autoencoders

Ghosh, P., Losalka, A., Black, M. J.

In Proc. AAAI, 2019 (inproceedings)

Abstract
Susceptibility of deep neural networks to adversarial attacks poses a major theoretical and practical challenge. All efforts to harden classifiers against such attacks have seen limited success till now. Two distinct categories of samples against which deep neural networks are vulnerable, "adversarial samples" and "fooling samples", have been tackled separately so far due to the difficulty posed when considered together. In this work, we show how one can defend against them both under a unified framework. Our model has the form of a variational autoencoder with a Gaussian mixture prior on the latent variable, such that each mixture component corresponds to a single class. We show how selective classification can be performed using this model, thereby causing the adversarial objective to entail a conflict. The proposed method leads to the rejection of adversarial samples instead of misclassification, while maintaining high precision and recall on test data. It also inherently provides a way of learning a selective classifier in a semi-supervised scenario, which can similarly resist adversarial attacks. We further show how one can reclassify the detected adversarial samples by iterative optimization.

link (url) Project Page [BibTex]


From Variational to Deterministic Autoencoders

Ghosh*, P., Sajjadi*, M. S. M., Vergari, A., Black, M. J., Schölkopf, B.

2019, *equal contribution (conference) Submitted

Abstract
Variational Autoencoders (VAEs) provide a theoretically-backed framework for deep generative models. However, they often produce “blurry” images, which is linked to their training objective. Sampling in the most popular implementation, the Gaussian VAE, can be interpreted as simply injecting noise to the input of a deterministic decoder. In practice, this simply enforces a smooth latent space structure. We challenge the adoption of the full VAE framework on this specific point in favor of a simpler, deterministic one. Specifically, we investigate how substituting stochasticity with other explicit and implicit regularization schemes can lead to a meaningful latent space without having to force it to conform to an arbitrarily chosen prior. To retrieve a generative mechanism for sampling new data points, we propose to employ an efficient ex-post density estimation step that can be readily adopted both for the proposed deterministic autoencoders as well as to improve sample quality of existing VAEs. We show in a rigorous empirical study that regularized deterministic autoencoding achieves state-of-the-art sample quality on the common MNIST, CIFAR-10 and CelebA datasets.
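
The ex-post density estimation step is simple to illustrate: after training the deterministic autoencoder, fit a density model (for example a Gaussian mixture) to the training latent codes and sample from it to generate. A sketch using scikit-learn; the encoder and decoder objects are assumed to exist, and the number of mixture components is arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

z_train = encoder.predict(x_train)          # (N, d) latent codes of the training set

gmm = GaussianMixture(n_components=10, covariance_type="full")
gmm.fit(z_train)                            # ex-post density over the latent space

z_new, _ = gmm.sample(64)                   # draw latents from the fitted density
x_new = decoder.predict(z_new)              # decode them into new samples
```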

arXiv [BibTex]

2016


Skinned multi-person linear model

Black, M.J., Loper, M., Mahmood, N., Pons-Moll, G., Romero, J.

December 2016, Application PCT/EP2016/064610 (misc)

Abstract
The invention comprises a learned model of human body shape and pose dependent shape variation that is more accurate than previous models and is compatible with existing graphics pipelines. Our Skinned Multi-Person Linear model (SMPL) is a skinned vertex based model that accurately represents a wide variety of body shapes in natural human poses. The parameters of the model are learned from data including the rest pose template, blend weights, pose-dependent blend shapes, identity-dependent blend shapes, and a regressor from vertices to joint locations. Unlike previous models, the pose-dependent blend shapes are a linear function of the elements of the pose rotation matrices. This simple formulation enables training the entire model from a relatively large number of aligned 3D meshes of different people in different poses. The invention quantitatively evaluates variants of SMPL using linear or dual-quaternion blend skinning and shows that both are more accurate than a Blend SCAPE model trained on the same data. In a further embodiment, the invention realistically models dynamic soft-tissue deformations. Because it is based on blend skinning, SMPL is compatible with existing rendering engines and we make it available for research purposes.

Google Patents [BibTex]



Creating body shapes from verbal descriptions by linking similarity spaces

Hill, M. Q., Streuber, S., Hahn, C. A., Black, M. J., O’Toole, A. J.

Psychological Science, 27(11):1486-1497, November 2016 (article)

Abstract
Brief verbal descriptions of bodies (e.g. curvy, long-legged) can elicit vivid mental images. The ease with which we create these mental images belies the complexity of three-dimensional body shapes. We explored the relationship between body shapes and body descriptions and show that a small number of words can be used to generate categorically accurate representations of three-dimensional bodies. The dimensions of body shape variation that emerged in a language-based similarity space were related to major dimensions of variation computed directly from three-dimensional laser scans of 2094 bodies. This allowed us to generate three-dimensional models of people in the shape space using only their coordinates on analogous dimensions in the language-based description space. Human descriptions of photographed bodies and their corresponding models matched closely. The natural mapping between the spaces illustrates the role of language as a concise code for body shape, capturing perceptually salient global and local body features.

pdf [BibTex]



Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image

Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M. J.

In Computer Vision – ECCV 2016, pages: 561-578, Lecture Notes in Computer Science, Springer International Publishing, October 2016 (inproceedings)

Abstract
We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D. To solve this, we first use a recently published CNN-based method, DeepCut, to predict (bottom-up) the 2D body joint locations. We then fit (top-down) a recently published statistical body shape model, called SMPL, to the 2D joints. We do so by minimizing an objective function that penalizes the error between the projected 3D model joints and detected 2D joints. Because SMPL captures correlations in human shape across the population, we are able to robustly fit it to very little data. We further leverage the 3D model to prevent solutions that cause interpenetration. We evaluate our method, SMPLify, on the Leeds Sports, HumanEva, and Human3.6M datasets, showing superior pose accuracy with respect to the state of the art.
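
The fitting objective has roughly the following form (paraphrased from the paper; the weights λ are annealed over the optimization stages):

```latex
E(\beta, \theta) \;=\; E_J(\beta, \theta; K, J_{\mathrm{est}})
 \;+\; \lambda_{\theta}\, E_{\theta}(\theta)
 \;+\; \lambda_{a}\, E_{a}(\theta)
 \;+\; \lambda_{sp}\, E_{sp}(\theta; \beta)
 \;+\; \lambda_{\beta}\, E_{\beta}(\beta),
```

where E_J is a robust penalty on the distance between projected model joints and the detected 2D joints, E_theta and E_beta are pose and shape priors, E_a penalizes unnatural elbow and knee bending, and E_sp is the interpenetration term.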

pdf Video Sup Mat video Code Project Project Page [BibTex]



Superpixel Convolutional Networks using Bilateral Inceptions

Gadde, R., Jampani, V., Kiefel, M., Kappler, D., Gehler, P.

In European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science, Springer, October 2016 (inproceedings)

Abstract
In this paper we propose a CNN architecture for semantic image segmentation. We introduce a new “bilateral inception” module that can be inserted in existing CNN architectures and performs bilateral filtering, at multiple feature-scales, between superpixels in an image. The feature spaces for bilateral filtering and other parameters of the module are learned end-to-end using standard backpropagation techniques. The bilateral inception module addresses two issues that arise with general CNN segmentation architectures. First, this module propagates information between (super) pixels while respecting image edges, thus using the structured information of the problem for improved results. Second, the layer recovers a full resolution segmentation result from the lower resolution solution of a CNN. In the experiments, we modify several existing CNN architectures by inserting our inception modules between the last CNN (1 × 1 convolution) layers. Empirical results on three different datasets show reliable improvements not only in comparison to the baseline networks, but also in comparison to several dense-pixel prediction techniques such as CRFs, while being competitive in time.

pdf supplementary poster Project Page Project Page [BibTex]



Barrista - Caffe Well-Served

Lassner, C., Kappler, D., Kiefel, M., Gehler, P.

In ACM Multimedia Open Source Software Competition, October 2016 (inproceedings)

Abstract
The caffe framework is one of the leading deep learning toolboxes in the machine learning and computer vision community. While it offers efficiency and configurability, it falls short of a full interface to Python. With increasingly involved procedures for training deep networks and reaching depths of hundreds of layers, creating configuration files and keeping them consistent becomes an error prone process. We introduce the barrista framework, offering full, pythonic control over caffe. It separates responsibilities and offers code to solve frequently occurring tasks for pre-processing, training and model inspection. It is compatible to all caffe versions since mid 2015 and can import and export .prototxt files. Examples are included, e.g., a deep residual network implemented in only 172 lines (for arbitrary depths), comparing to 2320 lines in the official implementation for the equivalent model.

pdf link (url) DOI Project Page [BibTex]



Non-parametric Models for Structured Data and Applications to Human Bodies and Natural Scenes

Lehrmann, A.

ETH Zurich, July 2016 (phdthesis)

Abstract
The purpose of this thesis is the study of non-parametric models for structured data and their fields of application in computer vision. We aim at the development of context-sensitive architectures which are both expressive and efficient. Our focus is on directed graphical models, in particular Bayesian networks, where we combine the flexibility of non-parametric local distributions with the efficiency of a global topology with bounded treewidth. A bound on the treewidth is obtained by either constraining the maximum indegree of the underlying graph structure or by introducing determinism. The non-parametric distributions in the nodes of the graph are given by decision trees or kernel density estimators. The information flow implied by specific network topologies, especially the resultant (conditional) independencies, allows for a natural integration and control of contextual information. We distinguish between three different types of context: static, dynamic, and semantic. In four different approaches we propose models which exhibit varying combinations of these contextual properties and allow modeling of structured data in space, time, and hierarchies derived thereof. The generative character of the presented models enables a direct synthesis of plausible hypotheses. Extensive experiments validate the developed models in two application scenarios which are of particular interest in computer vision: human bodies and natural scenes. In the practical sections of this work we discuss both areas from different angles and show applications of our models to human pose, motion, and segmentation as well as object categorization and localization. Here, we benefit from the availability of modern datasets of unprecedented size and diversity. Comparisons to traditional approaches and state-of-the-art research on the basis of well-established evaluation criteria allows the objective assessment of our contributions.

pdf [BibTex]


Dynamic baseline stereo vision-based cooperative target tracking

Ahmad, A., Ruff, E., Bülthoff, H.

19th International Conference on Information Fusion, pages: 1728-1734, July 2016 (conference)

Abstract
In this article we present a new method for multi-robot cooperative target tracking based on dynamic baseline stereo vision. The core novelty of our approach includes a computationally light-weight scheme to compute the 3D stereo measurements that exactly satisfy the epipolar constraints and a covariance intersection (CI)-based method to fuse the 3D measurements obtained by each individual robot. Using CI we are able to systematically integrate the robot localization uncertainties as well as the uncertainties in the measurements generated by the monocular camera images from each individual robot into the resulting stereo measurements. Through an extensive set of simulation and real robot results we show the robustness and accuracy of our approach with respect to ground truth. The source code related to this article is publicly accessible on our website and the datasets are available on request.
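
Covariance intersection, the fusion rule used to combine the per-robot measurements, is a standard formula and can be sketched directly; the choice of the weight omega (often by minimizing the trace or determinant of the fused covariance) is left open here, and the paper's exact weighting strategy may differ.

```python
import numpy as np

def covariance_intersection(xa, Pa, xb, Pb, omega):
    # Fuse two estimates (xa, Pa) and (xb, Pb) with unknown cross-correlation.
    Pa_inv, Pb_inv = np.linalg.inv(Pa), np.linalg.inv(Pb)
    P = np.linalg.inv(omega * Pa_inv + (1.0 - omega) * Pb_inv)
    x = P @ (omega * Pa_inv @ xa + (1.0 - omega) * Pb_inv @ xb)
    return x, P
```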

DOI [BibTex]



Body Talk: Crowdshaping Realistic 3D Avatars with Words

Streuber, S., Quiros-Ramirez, M. A., Hill, M. Q., Hahn, C. A., Zuffi, S., O’Toole, A., Black, M. J.

ACM Trans. Graph. (Proc. SIGGRAPH), 35(4):54:1-54:14, July 2016 (article)

Abstract
Realistic, metrically accurate, 3D human avatars are useful for games, shopping, virtual reality, and health applications. Such avatars are not in wide use because solutions for creating them from high-end scanners, low-cost range cameras, and tailoring measurements all have limitations. Here we propose a simple solution and show that it is surprisingly accurate. We use crowdsourcing to generate attribute ratings of 3D body shapes corresponding to standard linguistic descriptions of 3D shape. We then learn a linear function relating these ratings to 3D human shape parameters. Given an image of a new body, we again turn to the crowd for ratings of the body shape. The collection of linguistic ratings of a photograph provides remarkably strong constraints on the metric 3D shape. We call the process crowdshaping and show that our Body Talk system produces shapes that are perceptually indistinguishable from bodies created from high-resolution scans and that the metric accuracy is sufficient for many tasks. This makes body “scanning” practical without a scanner, opening up new applications including database search, visualization, and extracting avatars from books.

pdf web tool video talk (ppt) [BibTex]



DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation

Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P., Schiele, B.

In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages: 4929-4937, IEEE, June 2016 (inproceedings)

Abstract
This paper considers the task of articulated human pose estimation of multiple people in real-world images. We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity of each other. This joint formulation is in contrast to previous strategies, that address the problem by first detecting people and subsequently estimating their body pose. We propose a partitioning and labeling formulation of a set of body-part hypotheses generated with CNN-based part detectors. Our formulation, an instance of an integer linear program, implicitly performs non-maximum suppression on the set of part candidates and groups them to form configurations of body parts respecting geometric and appearance constraints. Experiments on four different datasets demonstrate state-of-the-art results for both single person and multi person pose estimation.

code pdf supplementary DOI Project Page [BibTex]



Video segmentation via object flow

Tsai, Y., Yang, M., Black, M. J.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2016 (inproceedings)

Abstract
Video object segmentation is challenging due to fast moving objects, deforming shapes, and cluttered backgrounds. Optical flow can be used to propagate an object segmentation over time but, unfortunately, flow is often inaccurate, particularly around object boundaries. Such boundaries are precisely where we want our segmentation to be accurate. To obtain accurate segmentation across time, we propose an efficient algorithm that considers video segmentation and optical flow estimation simultaneously. For video segmentation, we formulate a principled, multiscale, spatio-temporal objective function that uses optical flow to propagate information between frames. For optical flow estimation, particularly at object boundaries, we compute the flow independently in the segmented regions and recompose the results. We call the process object flow and demonstrate the effectiveness of jointly optimizing optical flow and video segmentation using an iterative scheme. Experiments on the SegTrack v2 and Youtube-Objects datasets show that the proposed algorithm performs favorably against the other state-of-the-art methods.

pdf [BibTex]



Patches, Planes and Probabilities: A Non-local Prior for Volumetric 3D Reconstruction

Ulusoy, A. O., Black, M. J., Geiger, A.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2016 (inproceedings)

Abstract
In this paper, we propose a non-local structured prior for volumetric multi-view 3D reconstruction. Towards this goal, we present a novel Markov random field model based on ray potentials in which assumptions about large 3D surface patches such as planarity or Manhattan world constraints can be efficiently encoded as probabilistic priors. We further derive an inference algorithm that reasons jointly about voxels, pixels and image segments, and estimates marginal distributions of appearance, occupancy, depth, normals and planarity. Key to tractable inference is a novel hybrid representation that spans both voxel and pixel space and that integrates non-local information from 2D image segmentations in a principled way. We compare our non-local prior to commonly employed local smoothness assumptions and a variety of state-of-the-art volumetric reconstruction baselines on challenging outdoor scenes with textureless and reflective surfaces. Our experiments indicate that regularizing over larger distances has the potential to resolve ambiguities where local regularizers fail.

YouTube pdf poster suppmat Project Page [BibTex]


Capturing Hands in Action using Discriminative Salient Points and Physics Simulation

Tzionas, D., Ballan, L., Srikantha, A., Aponte, P., Pollefeys, M., Gall, J.

International Journal of Computer Vision (IJCV), 118(2):172-193, June 2016 (article)

Abstract
Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors. However, even most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error and with collision detection and physics simulation to achieve physically plausible estimates even in case of occlusions and missing visual data. Since all components are unified in a single objective function which is almost everywhere differentiable, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.

Website pdf link (url) DOI Project Page [BibTex]



Optical Flow with Semantic Segmentation and Localized Layers

Sevilla-Lara, L., Sun, D., Jampani, V., Black, M. J.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 3889-3898, June 2016 (inproceedings)

Abstract
Existing optical flow methods make generic, spatially homogeneous, assumptions about the spatial structure of the flow. In reality, optical flow varies across an image depending on object class. Simply put, different objects move differently. Here we exploit recent advances in static semantic scene segmentation to segment the image into objects of different types. We define different models of image motion in these regions depending on the type of object. For example, we model the motion on roads with homographies, vegetation with spatially smooth flow, and independently moving objects like cars and planes with affine motion plus deviations. We then pose the flow estimation problem using a novel formulation of localized layers, which addresses limitations of traditional layered models for dealing with complex scene motion. Our semantic flow method achieves the lowest error of any published monocular method in the KITTI-2015 flow benchmark and produces qualitatively better flow and segmentation than recent top methods on a wide range of natural videos.

video Kitti Precomputed Data (1.6GB) pdf YouTube Sequences Code Project Page Project Page [BibTex]


Learning Sparse High Dimensional Filters: Image Filtering, Dense CRFs and Bilateral Neural Networks

Jampani, V., Kiefel, M., Gehler, P. V.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 4452-4461, June 2016 (inproceedings)

Abstract
Bilateral filters have wide spread use due to their edge-preserving properties. The common use case is to manually choose a parametric filter type, usually a Gaussian filter. In this paper, we will generalize the parametrization and in particular derive a gradient descent algorithm so the filter parameters can be learned from data. This derivation allows to learn high dimensional linear filters that operate in sparsely populated feature spaces. We build on the permutohedral lattice construction for efficient filtering. The ability to learn more general forms of high-dimensional filters can be used in several diverse applications. First, we demonstrate the use in applications where single filter applications are desired for runtime reasons. Further, we show how this algorithm can be used to learn the pairwise potentials in densely connected conditional random fields and apply these to different image segmentation tasks. Finally, we introduce layers of bilateral filters in CNNs and propose bilateral neural networks for the use of high-dimensional sparse data. This view provides new ways to encode model structure into network architectures. A diverse set of experiments empirically validates the usage of general forms of filters.

project page code CVF open-access pdf supplementary poster Project Page Project Page [BibTex]


Occlusion boundary detection via deep exploration of context

Fu, H., Wang, C., Tao, D., Black, M. J.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2016 (inproceedings)

Abstract
Occlusion boundaries contain rich perceptual information about the underlying scene structure. They also provide important cues in many visual perception tasks such as scene understanding, object recognition, and segmentation. In this paper, we improve occlusion boundary detection via enhanced exploration of contextual information (e.g., local structural boundary patterns, observations from surrounding regions, and temporal context), and in doing so develop a novel approach based on convolutional neural networks (CNNs) and conditional random fields (CRFs). Experimental results demonstrate that our detector significantly outperforms the state-of-the-art (e.g., improving the F-measure from 0.62 to 0.71 on the commonly used CMU benchmark). Last but not least, we empirically assess the roles of several important components of the proposed detector, so as to validate the rationale behind this approach.

pdf [BibTex]



Semantic Instance Annotation of Street Scenes by 3D to 2D Label Transfer

Xie, J., Kiefel, M., Sun, M., Geiger, A.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2016 (inproceedings)

Abstract
Semantic annotations are vital for training models for object recognition, semantic segmentation or scene understanding. Unfortunately, pixelwise annotation of images at very large scale is labor-intensive and only little labeled data is available, particularly at instance level and for street scenes. In this paper, we propose to tackle this problem by lifting the semantic instance labeling task from 2D into 3D. Given reconstructions from stereo or laser data, we annotate static 3D scene elements with rough bounding primitives and develop a probabilistic model which transfers this information into the image domain. We leverage our method to obtain 2D labels for a novel suburban video dataset which we have collected, resulting in 400k semantic and instance image annotations. A comparison of our method to state-of-the-art label transfer baselines reveals that 3D information enables more efficient annotation while at the same time resulting in improved accuracy and time-coherent labels.

pdf suppmat Project Page Project Page [BibTex]



Appealing female avatars from 3D body scans: Perceptual effects of stylization

Fleming, R., Mohler, B., Romero, J., Black, M. J., Breidt, M.

In 11th Int. Conf. on Computer Graphics Theory and Applications (GRAPP), February 2016 (inproceedings)

Abstract
Advances in 3D scanning technology allow us to create realistic virtual avatars from full body 3D scan data. However, negative reactions to some realistic computer generated humans suggest that this approach might not always provide the most appealing results. Using styles derived from existing popular character designs, we present a novel automatic stylization technique for body shape and colour information based on a statistical 3D model of human bodies. We investigate whether such stylized body shapes result in increased perceived appeal with two different experiments: One focuses on body shape alone, the other investigates the additional role of surface colour and lighting. Our results consistently show that the most appealing avatar is a partially stylized one. Importantly, avatars with high stylization or no stylization at all were rated to have the least appeal. The inclusion of colour information and improvements to render quality had no significant effect on the overall perceived appeal of the avatars, and we observe that the body shape primarily drives the change in appeal ratings. For body scans with colour information, we found that a partially stylized avatar was most effective, increasing average appeal ratings by approximately 34%.

pdf Project Page [BibTex]



Human Pose Estimation from Video and IMUs

Marcard, T. V., Pons-Moll, G., Rosenhahn, B.

Transactions on Pattern Analysis and Machine Intelligence (PAMI), 38(8):1533-1547, January 2016 (article)

data pdf dataset_documentation [BibTex]



Deep Discrete Flow

Güney, F., Geiger, A.

Asian Conference on Computer Vision (ACCV), 2016 (conference) Accepted

pdf suppmat Project Page [BibTex]



Moving-horizon Nonlinear Least Squares-based Multirobot Cooperative Perception

Ahmad, A., Bülthoff, H.

Robotics and Autonomous Systems, 83, pages: 275-286, 2016 (article)

Abstract
In this article we present an online estimator for multirobot cooperative localization and target tracking based on nonlinear least squares minimization. Our method not only makes the rigorous optimization-based approach applicable online but also allows the estimator to be stable and convergent. We do so by employing a moving horizon technique to nonlinear least squares minimization and a novel design of the arrival cost function that ensures stability and convergence of the estimator. Through an extensive set of real robot experiments, we demonstrate the robustness of our method as well as the optimality of the arrival cost function. The experiments include comparisons of our method with i) an extended Kalman filter-based online-estimator and ii) an offline-estimator based on full-trajectory nonlinear least squares.

DOI Project Page [BibTex]



Perceiving Systems (2011-2015)
Scientific Advisory Board Report, 2016 (misc)

pdf [BibTex]



Shape estimation of subcutaneous adipose tissue using an articulated statistical shape model

Yeo, S. Y., Romero, J., Loper, M., Machann, J., Black, M.

Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 0(0):1-8, 2016 (article)

publisher website preprint pdf link (url) DOI Project Page [BibTex]



Multi-Person Tracking by Multicuts and Deep Matching

(Winner of the Multi-Object Tracking Challenge ECCV 2016)

Tang, S., Andres, B., Andriluka, M., Schiele, B.

ECCV Workshop on Benchmarking Mutliple Object Tracking, 2016 (conference)

PDF [BibTex]



Reconstructing Articulated Rigged Models from RGB-D Videos

Tzionas, D., Gall, J.

In European Conference on Computer Vision Workshops 2016 (ECCVW’16) - Workshop on Recovering 6D Object Pose (R6D’16), pages: 620-633, Springer International Publishing, 2016 (inproceedings)

Abstract
Although commercial and open-source software exist to reconstruct a static object from a sequence recorded with an RGB-D sensor, there is a lack of tools that build rigged models of articulated objects that deform realistically and can be used for tracking or animation. In this work, we fill this gap and propose a method that creates a fully rigged model of an articulated object from depth data of a single sensor. To this end, we combine deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow. The fully rigged model then consists of a watertight mesh, embedded skeleton, and skinning weights.

pdf suppl Project's Website YouTube link (url) DOI [BibTex]



A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects

Keuper, M., Tang, S., Yu, Z., Andres, B., Brox, T., Schiele, B.

In arXiv:1607.06317, 2016 (inproceedings)

PDF [BibTex]



The GRASP Taxonomy of Human Grasp Types

Feix, T., Romero, J., Schmiedmayer, H., Dollar, A., Kragic, D.

IEEE Transactions on Human-Machine Systems, 46(1):66-77, 2016 (article)

publisher website pdf DOI Project Page [BibTex]



Map-Based Probabilistic Visual Self-Localization

Brubaker, M. A., Geiger, A., Urtasun, R.

IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2016 (article)

Abstract
Accurate and efficient self-localization is a critical problem for autonomous systems. This paper describes an affordable solution to vehicle self-localization which uses odometry computed from two video cameras and road maps as the sole inputs. The core of the method is a probabilistic model for which an efficient approximate inference algorithm is derived. The inference algorithm is able to utilize distributed computation in order to meet the real-time requirements of autonomous systems in some instances. Because of the probabilistic nature of the model the method is capable of coping with various sources of uncertainty including noise in the visual odometry and inherent ambiguities in the map (e.g., in a Manhattan world). By exploiting freely available, community developed maps and visual odometry measurements, the proposed method is able to localize a vehicle to 4m on average after 52 seconds of driving on maps which contain more than 2,150km of drivable roads.

pdf Project Page [BibTex]


1991


Dynamic motion estimation and feature extraction over long image sequences

Black, M. J., Anandan, P.

In Proc. IJCAI Workshop on Dynamic Scene Understanding, Sydney, Australia, August 1991 (inproceedings)

[BibTex]



Robust dynamic motion estimation over time

(IEEE Computer Society Outstanding Paper Award)

Black, M. J., Anandan, P.

In Proc. Computer Vision and Pattern Recognition, CVPR-91, pages: 296-302, Maui, Hawaii, June 1991 (inproceedings)

Abstract
This paper presents a novel approach to incrementally estimating visual motion over a sequence of images. We start by formulating constraints on image motion to account for the possibility of multiple motions. This is achieved by exploiting the notions of weak continuity and robust statistics in the formulation of the minimization problem. The resulting objective function is non-convex. Traditional stochastic relaxation techniques for minimizing such functions prove inappropriate for the task. We present a highly parallel incremental stochastic minimization algorithm which has a number of advantages over previous approaches. The incremental nature of the scheme makes it truly dynamic and permits the detection of occlusion and disocclusion boundaries.
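
The robust formulation referred to above replaces quadratic penalties in the standard optical-flow objective with a robust error function rho on both the data (brightness-constancy) term and the spatial smoothness term, roughly as follows; the incremental temporal component of the estimator is omitted here.

```latex
E(u, v) \;=\; \sum_{\mathbf{x}} \rho_D\!\big(I_x u + I_y v + I_t\big)
 \;+\; \lambda \sum_{\mathbf{x}} \sum_{\mathbf{x}' \in \mathcal{N}(\mathbf{x})}
 \Big[\rho_S\!\big(u(\mathbf{x}) - u(\mathbf{x}')\big) + \rho_S\!\big(v(\mathbf{x}) - v(\mathbf{x}')\big)\Big],
```

so that pixels violating a single motion model, for example across occlusion boundaries, are treated as outliers rather than dragging the estimate.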

pdf video abstract [BibTex]
