197 results (BibTeX)

2017


Thumb md manoteaser
Embodied Hands: Modeling and Capturing Hands and Bodies Together

Romero, J., Tzionas, D., J. Black, M.

ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), 245:1–245:17, November 2017, (*) Two first authors contributed equally (article)

Abstract
Humans move their hands and bodies together to communicate and solve tasks. Capturing and replicating such coordinated activity is critical for virtual characters that behave realistically. Surprisingly, most methods treat the 3D modeling and tracking of bodies and hands separately. Here we formulate a model of hands and bodies interacting together and fit it to full-body 4D sequences. When scanning or capturing the full body in 3D, hands are small and often partially occluded, making their shape and pose hard to recover. To cope with low-resolution, occlusion, and noise, we develop a new model called MANO (hand Model with Articulated and Non-rigid defOrmations). MANO is learned from around 1000 high-resolution 3D scans of hands of 31 subjects in a wide variety of hand poses. The model is realistic, low-dimensional, captures non-rigid shape changes with pose, is compatible with standard graphics packages, and can fit any human hand. MANO provides a compact mapping from hand poses to pose blend shape corrections and a linear manifold of pose synergies. We attach MANO to a standard parameterized 3D body shape model (SMPL), resulting in a fully articulated body and hand model (SMPL+H). We illustrate SMPL+H by fitting complex, natural, activities of subjects captured with a 4D scanner. The fitting is fully automatic and results in full body models that move naturally with detailed hand motions and a realism not seen before in full body performance capture. The models and data are freely available for research purposes at http://mano.is.tue.mpg.de.

website youtube paper suppl video DOI Project Page [BibTex]

2017

website youtube paper suppl video DOI Project Page [BibTex]


Thumb md flamewebteaserwide
Learning a model of facial shape and expression from 4D scans

Li, T., Bolkart, T., J. Black, M., Li, H., Romero, J.

ACM Transactions on Graphics, 36(6):194:1-194:17, November 2017, Two first authors contributed equally (article)

Abstract
The field of 3D face modeling has a large gap between high-end and low-end methods. At the high end, the best facial animation is indistinguishable from real humans, but this comes at the cost of extensive manual labor. At the low end, face capture from consumer depth sensors relies on 3D face models that are not expressive enough to capture the variability in natural facial shape and expression. We seek a middle ground by learning a facial model from thousands of accurately aligned 3D scans. Our FLAME model (Faces Learned with an Articulated Model and Expressions) is designed to work with existing graphics software and be easy to fit to data. FLAME uses a linear shape space trained from 3800 scans of human heads. FLAME combines this linear shape space with an articulated jaw, neck, and eyeballs, pose-dependent corrective blendshapes, and additional global expression from 4D face sequences in the D3DFACS dataset along with additional 4D sequences.We accurately register a template mesh to the scan sequences and make the D3DFACS registrations available for research purposes. In total the model is trained from over 33, 000 scans. FLAME is low-dimensional but more expressive than the FaceWarehouse model and the Basel Face Model. We compare FLAME to these models by fitting them to static 3D scans and 4D sequences using the same optimization method. FLAME is significantly more accurate and is available for research purposes (http://flame.is.tue.mpg.de).

data/model video paper supplemental [BibTex]

data/model video paper supplemental [BibTex]


Thumb md website teaser
Semantic Video CNNs through Representation Warping

Gadde, R., Jampani, V., Gehler, P.

In IEEE International Conference on Computer Vision (ICCV), 2017 (inproceedings) Accepted

Abstract
In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping method that can be used to augment existing architectures with very lit- tle extra computational cost. This module is called Net- Warp and we demonstrate its use for a range of network architectures. The main design principle is to use optical flow of adjacent frames for warping internal network repre- sentations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to- end training. Experiments validate that the proposed ap- proach incurs only little extra computational cost, while im- proving performance, when video streams are available. We achieve new state-of-the-art results on the standard CamVid and Cityscapes benchmark datasets and show reliable im- provements over different baseline networks. Our code and models are available at http://segmentation.is. tue.mpg.de

pdf Supplementary [BibTex]

pdf Supplementary [BibTex]


Thumb md animage2mask3
Assessing body image in anorexia nervosa using biometric self-avatars in virtual reality: Attitudinal components rather than visual body size estimation are distorted

Claire Mölbert, S., Thaler, A., J. Mohler, B., Streuber, S., Romero, J., J. Black, M., Zipfel, S., Karnath, H., Elisabeth Giel, K.

Psychological Medicine, 2017 (article)

Abstract
Background: Body image disturbance (BID) is a core symptom of anorexia nervosa (AN), but as yet distinctive features of BID are unknown. The present study aimed at disentangling perceptual and attitudinal components of BID in AN. Methods: We investigated n=24 women with AN and n=24 controls. Based on a 3D body scan, we created realistic virtual 3D bodies (avatars) for each participant that were varied through a range of ±20% of the participants' weights. Avatars were presented in a virtual reality mirror scenario. Using different psychophysical tasks, participants identified and adjusted their actual and their desired body weight. To test for general perceptual biases in estimating body weight, a second experiment investigated perception of weight and shape matched avatars with another identity. Results: Women with AN and controls underestimated their weight, with a trend that women with AN underestimated more. The average desired body of controls had normal weight while the average desired weight of women with AN corresponded to extreme AN (DSM-5). Correlation analyses revealed that desired body weight, but not accuracy of weight estimation, was associated with eating disorder symptoms. In the second experiment, both groups estimated accurately while the most attractive body was similar to Experiment 1. Conclusions: Our results contradict the widespread assumption that patients with AN overestimate their body weight due to visual distortions. Rather, they illustrate that BID might be driven by distorted attitudes with regard to the desired body. Clinical interventions should aim at helping patients with AN to change their desired weight.

doi pdf [BibTex]


Thumb md image
Human Shape Estimation using Statistical Body Models

Maverick Loper, M.

University of Tübingen, May 2017 (thesis)

Abstract
Human body estimation methods transform real-world observations into predictions about human body state. These estimation methods benefit a variety of health, entertainment, clothing, and ergonomics applications. State may include pose, overall body shape, and appearance. Body state estimation is underconstrained by observations; ambiguity presents itself both in the form of missing data within observations, and also in the form of unknown correspondences between observations. We address this challenge with the use of a statistical body model: a data-driven virtual human. This helps resolve ambiguity in two ways. First, it fills in missing data, meaning that incomplete observations still result in complete shape estimates. Second, the model provides a statistically-motivated penalty for unlikely states, which enables more plausible body shape estimates. Body state inference requires more than a body model; we therefore build obser- vation models whose output is compared with real observations. In this thesis, body state is estimated from three types of observations: 3D motion capture markers, depth and color images, and high-resolution 3D scans. In each case, a forward process is proposed which simulates observations. By comparing observations to the results of the forward process, state can be adjusted to minimize the difference between simulated and observed data. We use gradient-based methods because they are critical to the precise estimation of state with a large number of parameters. The contributions of this work include three parts. First, we propose a method for the estimation of body shape, nonrigid deformation, and pose from 3D markers. Second, we present a concise approach to differentiating through the rendering process, with application to body shape estimation. And finally, we present a statistical body model trained from human body scans, with state-of-the-art fidelity, good runtime performance, and compatibility with existing animation packages.

Official Version [BibTex]


Thumb md cover tro paper
An Online Scalable Approach to Unified Multirobot Cooperative Localization and Object Tracking

Ahmad, A., Lawless, G., Lima, P.

IEEE Transactions on Robotics (T-RO), 2017, Accepted, May 2017 (article)

Abstract
In this article we present a unified approach for multi-robot cooperative simultaneous localization and object tracking based on particle filters. Our approach is scalable with respect to the number of robots in the team. We introduce a method that reduces, from an exponential to a linear growth, the space and computation time requirements with respect to the number of robots in order to maintain a given level of accuracy in the full state estimation. Our method requires no increase in the number of particles with respect to the number of robots. However, in our method each particle represents a full state hypothesis, leading to the linear dependency on the number of robots of both space and time complexity. The derivation of the algorithm implementing our approach from a standard particle filter algorithm and its complexity analysis are presented. Through an extensive set of simulation experiments on a large number of randomized datasets, we demonstrate the correctness and efficacy of our approach. Through real robot experiments on a standardized open dataset of a team of four soccer playing robots tracking a ball, we evaluate our method's estimation accuracy with respect to the ground truth values. Through comparisons with other methods based on i) nonlinear least squares minimization and ii) joint extended Kalman filter, we further highlight our method's advantages. Finally, we also present a robustness test for our approach by evaluating it under scenarios of communication and vision failure in teammate robots.

accepted pre-print version [BibTex]


Thumb md phd thesis teaser
Learning Inference Models for Computer Vision

Jampani, V.

MPI for Intelligent Systems and University of Tübingen, 2017 (phdthesis)

Abstract
Computer vision can be understood as the ability to perform 'inference' on image data. Breakthroughs in computer vision technology are often marked by advances in inference techniques, as even the model design is often dictated by the complexity of inference in them. This thesis proposes learning based inference schemes and demonstrates applications in computer vision. We propose techniques for inference in both generative and discriminative computer vision models. Despite their intuitive appeal, the use of generative models in vision is hampered by the difficulty of posterior inference, which is often too complex or too slow to be practical. We propose techniques for improving inference in two widely used techniques: Markov Chain Monte Carlo (MCMC) sampling and message-passing inference. Our inference strategy is to learn separate discriminative models that assist Bayesian inference in a generative model. Experiments on a range of generative vision models show that the proposed techniques accelerate the inference process and/or converge to better solutions. A main complication in the design of discriminative models is the inclusion of prior knowledge in a principled way. For better inference in discriminative models, we propose techniques that modify the original model itself, as inference is simple evaluation of the model. We concentrate on convolutional neural network (CNN) models and propose a generalization of standard spatial convolutions, which are the basic building blocks of CNN architectures, to bilateral convolutions. First, we generalize the existing use of bilateral filters and then propose new neural network architectures with learnable bilateral filters, which we call `Bilateral Neural Networks'. We show how the bilateral filtering modules can be used for modifying existing CNN architectures for better image segmentation and propose a neural network approach for temporal information propagation in videos. Experiments demonstrate the potential of the proposed bilateral networks on a wide range of vision tasks and datasets. In summary, we propose learning based techniques for better inference in several computer vision models ranging from inverse graphics to freely parameterized neural networks. In generative vision models, our inference techniques alleviate some of the crucial hurdles in Bayesian posterior inference, paving new ways for the use of model based machine learning in vision. In discriminative CNN models, the proposed filter generalizations aid in the design of new neural network architectures that can handle sparse high-dimensional data as well as provide a way for incorporating prior knowledge into CNNs.

pdf [BibTex]

pdf [BibTex]


Thumb md mrflow
Optical Flow in Mostly Rigid Scenes

Wulff, J., Sevilla-Lara, L., J. Black, M.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
The optical flow of natural scenes is a combination of the motion of the observer and the independent motion of objects. Existing algorithms typically focus on either recovering motion and structure under the assumption of a purely static world or optical flow for general unconstrained scenes. We combine these approaches in an optical flow algorithm that estimates an explicit segmentation of moving objects from appearance and physical constraints. In static regions we take advantage of strong constraints to jointly estimate the camera motion and the 3D structure of the scene over multiple frames. This allows us to also regularize the structure instead of the motion. Our formulation uses a Plane+Parallax framework, which works even under small baselines, and reduces the motion estimation to a one-dimensional search problem, resulting in more accurate estimation. In moving regions the flow is treated as unconstrained, and computed with an existing optical flow method. The resulting Mostly-Rigid Flow (MR-Flow) method achieves state-of-the-art results on both the MPISintel and KITTI-2015 benchmarks.

pdf SupMat video code Project Page [BibTex]

pdf SupMat video code Project Page [BibTex]


Thumb md martinez
On human motion prediction using recurrent neural networks

Martinez, J., J. Black, M., Romero, J.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
Human motion modelling is a classical problem at the intersection of graphics and computer vision, with applications spanning human-computer interaction, motion synthesis, and motion prediction for virtual and augmented reality. Following the success of deep learning methods in several computer vision tasks, recent work has focused on using deep recurrent neural networks (RNNs) to model human motion, with the goal of learning time-dependent representations that perform tasks such as short-term motion prediction and long-term human motion synthesis. We examine recent work, with a focus on the evaluation methodologies commonly used in the literature, and show that, surprisingly, state-of-the-art performance can be achieved by a simple baseline that does not attempt to model motion at all. We investigate this result, and analyze recent RNN methods by looking at the architectures, loss functions, and training procedures used in state-of-the-art approaches. We propose three changes to the standard RNN models typically used for human motion, which result in a simple and scalable RNN architecture that obtains state-of-the-art performance on human motion prediction.

arXiv [BibTex]

arXiv [BibTex]


Thumb md pyramid
Optical Flow Estimation using a Spatial Pyramid Network

Ranjan, A., Black, M. J.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
We learn to compute optical flow by combining a classical spatial-pyramid formulation with deep learning. This estimates large motions in a coarse-to-fine approach by warping one image of a pair at each pyramid level by the current flow estimate and computing an update to the flow. Instead of the standard minimization of an objective function at each pyramid level, we train one deep network per level to compute the flow update. Unlike the recent FlowNet approach, the networks do not need to deal with large motions; these are dealt with by the pyramid. This has several advantages. First, our Spatial Pyramid Network (SPyNet) is much simpler and 96% smaller than FlowNet in terms of model parameters. This makes it more efficient and appropriate for embedded applications. Second, since the flow at each pyramid level is small (< 1 pixel), a convolutional approach applied to pairs of warped images is appropriate. Third, unlike FlowNet, the learned convolution filters appear similar to classical spatio-temporal filters, giving insight into the method and how to improve it. Our results are more accurate than FlowNet on most standard benchmarks, suggesting a new direction of combining classical flow methods with deep learning.

pdf SupMat project/code [BibTex]

pdf SupMat project/code [BibTex]


Thumb md teasercrop
A Generative Model of People in Clothing

Lassner, C., Pons-Moll, G., V. Gehler, P.

In International Conference on Computer Vision (ICCV), 2017 (inproceedings)

Abstract
We present the first image-based generative model of people in clothing in a full-body setting. We sidestep the commonly used complex graphics rendering pipeline and the need for high-quality 3D scans of dressed people. Instead, we learn generative models from a large image database. The main challenge is to cope with the high variance in human pose, shape and appearance. For this reason, pure image-based approaches have not been considered so far. We show that this challenge can be overcome by splitting the generating process in two parts. First, we learn to generate a semantic segmentation of the body and clothing. Second, we learn a conditional model on the resulting segments that creates realistic images. The full model is differentiable and can be conditioned on pose, shape or color. The result are samples of people in different clothing items and styles. The proposed model can generate entirely new people with realistic clothing. In several experiments we present encouraging results that suggest an entirely data-driven approach to people generation is possible.

link (url) [BibTex]

link (url) [BibTex]


Thumb md early stopping teaser
Early Stopping Without a Validation Set

Mahsereci, M., Balles, L., Lassner, C., Hennig, P.

arXiv preprint arXiv:1703.09580, 2017 (article)

Abstract
Early stopping is a widely used technique to prevent poor generalization performance when training an over-expressive model by means of gradient-based optimization. To find a good point to halt the optimizer, a common practice is to split the dataset into a training and a smaller validation set to obtain an ongoing estimate of the generalization performance. In this paper we propose a novel early stopping criterion which is based on fast-to-compute, local statistics of the computed gradients and entirely removes the need for a held-out validation set. Our experiments show that this is a viable approach in the setting of least-squares and logistic regression as well as neural networks.

link (url) Project Page [BibTex]


Thumb md teasercrop
Unite the People: Closing the Loop Between 3D and 2D Human Representations

Lassner, C., Romero, J., Kiefel, M., Bogo, F., J. Black, M., V. Gehler, P.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
3D models provide a common ground for different representations of human bodies. In turn, robust 2D estimation has proven to be a powerful tool to obtain 3D fits “in-the-wild”. However, depending on the level of detail, it can be hard to impossible to acquire labeled data for training 2D estimators on large scale. We propose a hybrid approach to this problem: with an extended version of the recently introduced SMPLify method, we obtain high quality 3D body model fits for multiple human pose datasets. Human annotators solely sort good and bad fits. This procedure leads to an initial dataset, UP-3D, with rich annotations. With a comprehensive set of experiments, we show how this data can be used to train discriminative models that produce results with an unprecedented level of detail: our models predict 31 segments and 91 landmark locations on the body. Using the 91 landmark pose estimator, we present state-of-the art results for 3D human pose and shape estimation using an order of magnitude less training data and without assumptions about gender or pose in the fitting procedure. We show that UP-3D can be enhanced with these improved fits to grow in quantity and quality, which makes the system deployable on large scale. The data, code and models are available for research purposes.

arXiv project/code/data [BibTex]

arXiv project/code/data [BibTex]


Thumb md surrealin
Learning from Synthetic Humans

Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I., Schmid, C.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
Estimating human pose, shape, and motion from images and videos are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.

arXiv project data [BibTex]

arXiv project data [BibTex]


Thumb md judith
Deep representation learning for human motion prediction and classification

Bütepage, J., Black, M. J., Kragic, D., Kjellström, H.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
Generative models of 3D human motion are often restricted to a small number of activities and can therefore not generalize well to novel movements or applications. In this work we propose a deep learning framework for human motion capture data that learns a generic representation from a large corpus of motion capture data and generalizes well to new, unseen, motions. Using an encoding-decoding network that learns to predict future 3D poses from the most recent past, we extract a feature representation of human motion. Most work on deep learning for sequence prediction focuses on video and speech. Since skeletal data has a different structure, we present and evaluate different network architectures that make different assumptions about time dependencies and limb correlations. To quantify the learned features, we use the output of different layers for action classification and visualize the receptive fields of the network units. Our method outperforms the recent state of the art in skeletal motion prediction even though these use action specific training data. Our results show that deep feedforward networks, trained from a generic mocap database, can successfully be used for feature extraction from human motion data and that this representation can be used as a foundation for classification and prediction.

arXiv [BibTex]

arXiv [BibTex]


Thumb md joel slow flow crop
Slow Flow: Exploiting High-Speed Cameras for Accurate and Diverse Optical Flow Reference Data

Janai, J., Güney, F., Wulff, J., Black, M. J., Geiger, A.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
Existing optical flow datasets are limited in size and variability due to the difficulty of capturing dense ground truth. In this paper, we tackle this problem by tracking pixels through densely sampled space-time volumes recorded with a high-speed video camera. Our model exploits the linearity of small motions and reasons about occlusions from multiple frames. Using our technique, we are able to establish accurate reference flow fields outside the laboratory in natural environments. Besides, we show how our predictions can be used to augment the input images with realistic motion blur. We demonstrate the quality of the produced flow fields on synthetic and real-world datasets. Finally, we collect a novel challenging optical flow dataset by applying our technique on data from a high-speed camera and analyze the performance of the state-of-the-art in optical flow under various levels of motion blur.

pdf suppmat Project page Video [BibTex]


Thumb md teaser
Dynamic FAUST: Registering Human Bodies in Motion

Bogo, F., Romero, J., Pons-Moll, G., J. Black, M.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
While the ready availability of 3D scan data has influenced research throughout computer vision, less attention has focused on 4D data; that is 3D scans of moving nonrigid objects, captured over time. To be useful for vision research, such 4D scans need to be registered, or aligned, to a common topology. Consequently, extending mesh registration methods to 4D is important. Unfortunately, no ground-truth datasets are available for quantitative evaluation and comparison of 4D registration methods. To address this we create a novel dataset of high-resolution 4D scans of human subjects in motion, captured at 60 fps. We propose a new mesh registration method that uses both 3D geometry and texture information to register all scans in a sequence to a common reference topology. The approach exploits consistency in texture over both short and long time intervals and deals with temporal offsets between shape and texture capture. We show how using geometry alone results in significant errors in alignment when the motions are fast and non-rigid. We evaluate the accuracy of our registration and provide a dataset of 40,000 raw and aligned meshes. Dynamic FAUST extends the popular FAUST dataset to dynamic 4D data, and is available for research purposes at http://dfaust.is.tue.mpg.de.

pdf video Project Page [BibTex]

pdf video Project Page [BibTex]


Thumb md cvpr2017 landpsace
Semantic Multi-view Stereo: Jointly Estimating Objects and Voxels

Osman Ulusoy, A., J. Black, M., Geiger, A.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
Dense 3D reconstruction from RGB images is a highly ill-posed problem due to occlusions, textureless or reflective surfaces, as well as other challenges. We propose object-level shape priors to address these ambiguities. Towards this goal, we formulate a probabilistic model that integrates multi-view image evidence with 3D shape information from multiple objects. Inference in this model yields a dense 3D reconstruction of the scene as well as the existence and precise 3D pose of the objects in it. Our approach is able to recover fine details not captured in the input shapes while defaulting to the input models in occluded regions where image evidence is weak. Due to its probabilistic nature, the approach is able to cope with the approximate geometry of the 3D models as well as input shapes that are not present in the scene. We evaluate the approach quantitatively on several challenging indoor and outdoor datasets.

YouTube pdf suppmat [BibTex]

YouTube pdf suppmat [BibTex]


Thumb md slide1
3D Menagerie: Modeling the 3D Shape and Pose of Animals

Zuffi, S., Kanazawa, A., Jacobs, D., J. Black, M.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
There has been significant work on learning realistic, articulated, 3D models of the human body. In contrast, there are few such models of animals, despite many applications. The main challenge is that animals are much less cooperative than humans. The best human body models are learned from thousands of 3D scans of people in specific poses, which is infeasible with live animals. Consequently, we learn our model from a small set of 3D scans of toy figurines in arbitrary poses. We employ a novel part-based shape model to compute an initial registration to the scans. We then normalize their pose, learn a statistical shape model, and refine the registrations and the model together. In this way, we accurately align animal scans from different quadruped families with very different shapes and poses. With the registration to a common template we learn a shape space representing animals including lions, cats, dogs, horses, cows and hippos. Animal shapes can be sampled from the model, posed, animated, and fit to data. We demonstrate generalization by fitting it to images of real animals including species not seen in training.

pdf video [BibTex]

pdf video [BibTex]


Thumb md 1611.04399 image
Joint Graph Decomposition and Node Labeling by Local Search

Levinkov, E., Uhrig, J., Tang, S., Omran, M., Insafutdinov, E., Kirillov, A., Rother, C., Brox, T., Schiele, B., Andres, B.

CVPR, 2017 (conference)

PDF Supplementary [BibTex]

PDF Supplementary [BibTex]


Thumb md untitled
Articulated Multi-person Tracking in the Wild

Insafutdinov, E., Andriluka, M., Pishchulin, L., Tang, S., Levinkov, E., Andres, B., Schiele, B.

In CVPR, 2017, Oral (inproceedings)

[BibTex]

[BibTex]


Thumb md anja
Generating Descriptions with Grounded and Co-Referenced People

Rohrbach, A., Rohrbach, M., Tang, S., Oh, S., Schiele, B.

In CVPR, 2017 (inproceedings)

PDF [BibTex]

PDF [BibTex]


Thumb md web image
ClothCap: Seamless 4D Clothing Capture and Retargeting

Pons-Moll, G., Pujades, S., Hu, S., Black, M. J.

ACM Transactions on Graphics, (Proc. SIGGRAPH), 36(4), 2017, Two first authors contributed equally (article)

Abstract
Designing and simulating realistic clothing is challenging and, while several methods have addressed the capture of clothing from 3D scans, previous methods have been limited to single garments and simple motions, lack detail, or require specialized texture patterns. Here we address the problem of capturing regular clothing on fully dressed people in motion. People typically wear multiple pieces of clothing at a time. To estimate the shape of such clothing, track it over time, and render it believably, each garment must be segmented from the others and the body. Our ClothCap approach uses a new multi-part 3D model of clothed bodies, automatically segments each piece of clothing, estimates the naked body shape and pose under the clothing, and tracks the 3D deformations of the clothing over time. We estimate the garments and their motion from 4D scans; that is, high-resolution 3D scans of the subject in motion at 60 fps. The model allows us to capture a clothed person in motion, extract their clothing, and retarget the clothing to new body shapes. ClothCap provides a step towards virtual try-on with a technology for capturing, modeling, and analyzing clothing in motion.

video project_page paper link (url) Project Page [BibTex]

video project_page paper link (url) Project Page [BibTex]


Thumb md web image
Data-Driven Physics for Human Soft Tissue Animation

Kim, M., Pons-Moll, G., Pujades, S., Bang, S., Kim, J., Black, M. J., Lee, S.

ACM Transactions on Graphics, (Proc. SIGGRAPH), 36(4), 2017 (article)

Abstract
Data driven models of human poses and soft-tissue deformations can produce very realistic results, but they only model the visible surface of the human body and cannot create skin deformation due to interactions with the environment. Physical simulations can generalize to external forces, but their parameters are difficult to control. In this paper, we present a layered volumetric human body model learned from data. Our model is composed of a data-driven inner layer and a physics-based external layer. The inner layer is driven with a volumetric statistical body model (VSMPL). The soft tissue layer consists of a tetrahedral mesh that is driven using the finite element method (FEM). Model parameters, namely the segmentation of the body into layers and the soft tissue elasticity, are learned directly from 4D registrations of humans exhibiting soft tissue deformations. The learned two layer model is a realistic full-body avatar that generalizes to novel motions and external forces. Experiments show that the resulting avatars produce realistic results on held out sequences and react to external forces. Moreover, the model supports the retargeting of physical properties from one avatar when they share the same topology.

video paper link (url) [BibTex]

video paper link (url) [BibTex]


Thumb md pami 2017 teaser
Efficient 2D and 3D Facade Segmentation using Auto-Context

Gadde, R., Jampani, V., Marlet, R., Gehler, P.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017 (article)

Abstract
This paper introduces a fast and efficient segmentation technique for 2D images and 3D point clouds of building facades. Facades of buildings are highly structured and consequently most methods that have been proposed for this problem aim to make use of this strong prior information. Contrary to most prior work, we are describing a system that is almost domain independent and consists of standard segmentation methods. We train a sequence of boosted decision trees using auto-context features. This is learned using stacked generalization. We find that this technique performs better, or comparable with all previous published methods and present empirical results on all available 2D and 3D facade benchmark datasets. The proposed method is simple to implement, easy to extend, and very efficient at test-time inference.

arXiv Project Page [BibTex]

arXiv Project Page [BibTex]


Thumb md 71341 r guided
Reflectance Adaptive Filtering Improves Intrinsic Image Estimation

Nestmeyer, T., Gehler, P.

In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

pre-print Project Page [BibTex]

pre-print Project Page [BibTex]


Thumb md vpn teaser
Video Propagation Networks

Jampani, V., Gadde, R., V. Gehler, P.

In IEEE Conference on Computer Vision and Patter Recognition (CVPR), 2017 (inproceedings)

pdf supplementary arXiv project page code [BibTex]

pdf supplementary arXiv project page code [BibTex]


Thumb md web teaser
Detailed, accurate, human shape estimation from clothed 3D scan sequences

Zhang, C., Pujades, S., Black, M. J., Pons-Moll, G.

In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, Spotlight (inproceedings)

Abstract
We address the problem of estimating human body shape from 3D scans over time. Reliable estimation of 3D body shape is necessary for many applications including virtual try-on, health monitoring, and avatar creation for virtual reality. Scanning bodies in minimal clothing, however, presents a practical barrier to these applications. We address this problem by estimating body shape under clothing from a sequence of 3D scans. Previous methods that have exploited statistical models of body shape produce overly smooth shapes lacking personalized details. In this paper we contribute a new approach to recover not only an approximate shape of the person, but also their detailed shape. Our approach allows the estimated shape to deviate from a parametric model to fit the 3D scans. We demonstrate the method using high quality 4D data as well as sequences of visual hulls extracted from multi-view images. We also make available a new high quality 4D dataset that enables quantitative evaluation. Our method outperforms the previous state of the art, both qualitatively and quantitatively.

arxiv_preprint video dataset pdf supplemental [BibTex]

arxiv_preprint video dataset pdf supplemental [BibTex]


Thumb md web teaser eg
Sparse Inertial Poser: Automatic 3D Human Pose Estimation from Sparse IMUs

(Best Paper, Eurographics 2017)

von Marcard, T., Rosenhahn, B., Black, M. J., Pons-Moll, G.

Computer Graphics Forum 36(2), Proceedings of the 38th Annual Conference of the European Association for Computer Graphics (Eurographics), 2017 (article)

Abstract
We address the problem of making human motion capture in the wild more practical by using a small set of inertial sensors attached to the body. Since the problem is heavily under-constrained, previous methods either use a large number of sensors, which is intrusive, or they require additional video input. We take a different approach and constrain the problem by: (i) making use of a realistic statistical body model that includes anthropometric constraints and (ii) using a joint optimization framework to fit the model to orientation and acceleration measurements over multiple frames. The resulting tracker Sparse Inertial Poser (SIP) enables motion capture using only 6 sensors (attached to the wrists, lower legs, back and head) and works for arbitrary human motions. Experiments on the recently released TNT15 dataset show that, using the same number of sensors, SIP achieves higher accuracy than the dataset baseline without using any video data. We further demonstrate the effectiveness of SIP on newly recorded challenging motions in outdoor scenarios such as climbing or jumping over a wall

video pdf [BibTex]

video pdf [BibTex]


Thumb md coverhand wilson
Capturing Hand-Object Interaction and Reconstruction of Manipulated Objects

Tzionas, D.

University of Bonn, 2017 (phdthesis)

Abstract
Hand motion capture with an RGB-D sensor gained recently a lot of research attention, however, even most recent approaches focus on the case of a single isolated hand. We focus instead on hands that interact with other hands or with a rigid or articulated object. Our framework successfully captures motion in such scenarios by combining a generative model with discriminatively trained salient points, collision detection and physics simulation to achieve a low tracking error with physically plausible poses. All components are unified in a single objective function that can be optimized with standard optimization techniques. We initially assume a-priori knowledge of the object's shape and skeleton. In case of unknown object shape there are existing 3d reconstruction methods that capitalize on distinctive geometric or texture features. These methods though fail for textureless and highly symmetric objects like household articles, mechanical parts or toys. We show that extracting 3d hand motion for in-hand scanning effectively facilitates the reconstruction of such objects and we fuse the rich additional information of hands into a 3d reconstruction pipeline. Finally, although shape reconstruction is enough for rigid objects, there is a lack of tools that build rigged models of articulated objects that deform realistically using RGB-D data. We propose a method that creates a fully rigged model consisting of a watertight mesh, embedded skeleton and skinning weights by employing a combination of deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow.

Thesis link (url) Project Page [BibTex]


Thumb md img03
OctNet: Learning Deep 3D Representations at High Resolutions

Riegler, G., Ulusoy, O., Geiger, A.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
We present OctNet, a representation for deep learning with sparse 3D data. In contrast to existing models, our representation enables 3D convolutional networks which are both deep and high resolution. Towards this goal, we exploit the sparsity in the input data to hierarchically partition the space using a set of unbalanced octrees where each leaf node stores a pooled feature representation. This allows to focus memory allocation and computation to the relevant dense regions and enables deeper networks without compromising resolution. We demonstrate the utility of our OctNet representation by analyzing the impact of resolution on several 3D tasks including 3D object classification, orientation estimation and point cloud labeling.

pdf suppmat Project Page Video [BibTex]

pdf suppmat Project Page Video [BibTex]


Thumb md teaser
Coupling Adaptive Batch Sizes with Learning Rates

Balles, L., Romero, J., Hennig, P.

In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI), pages: 410-419, (Editors: Gal Elidan and Kristian Kersting), 2017 (inproceedings)

Abstract
Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization like the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, stability and convergence is thus often enforced by means of a (manually tuned) decreasing learning rate schedule. We propose a practical method for dynamic batch size adaptation. It estimates the variance of the stochastic gradients and adapts the batch size to decrease the variance proportionally to the value of the objective function, removing the need for the aforementioned learning rate decrease. In contrast to recent related work, our algorithm couples the batch size to the learning rate, directly reflecting the known relationship between the two. On three image classification benchmarks, our batch size adaptation yields faster optimization convergence, while simultaneously simplifying learning rate tuning. A TensorFlow implementation is available.

Code link (url) Project Page [BibTex]

Code link (url) Project Page [BibTex]


Thumb md mosh heroes icon
Method for providing a three dimensional body model

Loper, M., Mahmood, N., Black, M. J.

U.S. Patent 9,710,964 B2., July 2017 (patent)

Abstract
A method for providing a three-dimensional body model which may be applied for an animation, based on a moving body, wherein the method comprises providing a parametric three-dimensional body model, which allows shape and pose variations; applying a standard set of body markers; optimizing the set of body markers by generating an additional set of body markers and applying the same for providing 3D coordinate marker signals for capturing shape and pose of the body and dynamics of soft tissue; and automatically providing an animation by processing the 3D coordinate marker signals in order to provide a personalized three-dimensional body model, based on estimated shape and an estimated pose of the body by means of predicted marker locations.

Google Patents MoSh Project [BibTex]

2016


Thumb md thumb
Barrista - Caffe Well-Served

Lassner, C., Kappler, D., Kiefel, M., Gehler, P.

ACM Multimedia Open Source Software Competition, ACM OSSC16, October 2016 (proceedings) Accepted

Abstract
The caffe framework is one of the leading deep learning toolboxes in the machine learning and computer vision community. While it offers efficiency and configurability, it falls short of a full interface to Python. With increasingly involved procedures for training deep networks and reaching depths of hundreds of layers, creating configuration files and keeping them consistent becomes an error prone process. We introduce the barrista framework, offering full, pythonic control over caffe. It separates responsibilities and offers code to solve frequently occurring tasks for pre-processing, training and model inspection. It is compatible to all caffe versions since mid 2015 and can import and export .prototxt files. Examples are included, e.g., a deep residual network implemented in only 172 lines (for arbitrary depths), comparing to 2320 lines in the official implementation for the equivalent model.

pdf link (url) DOI [BibTex]

2016

pdf link (url) DOI [BibTex]


Thumb md teaser
Deep Discrete Flow

Güney, F., Geiger, A.

Asian Conference on Computer Vision (ACCV), 2016 (conference) Accepted

pdf suppmat [BibTex]

pdf suppmat [BibTex]


Thumb md website thumbnail
Reconstructing Articulated Rigged Models from RGB-D Videos

Tzionas, D., Gall, J.

European Conference on Computer Vision Workshops 2016 (ECCVW’16) - Workshop on Recovering 6D Object Pose (R6D’16), 2016 (proceedings)

Abstract
Although commercial and open-source software exist to reconstruct a static object from a sequence recorded with an RGB-D sensor, there is a lack of tools that build rigged models of articulated objects that deform realistically and can be used for tracking or animation. In this work, we fill this gap and propose a method that creates a fully rigged model of an articulated object from depth data of a single sensor. To this end, we combine deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow. The fully rigged model then consists of a watertight mesh, embedded skeleton, and skinning weights.

pdf suppl Project's Website YouTube link (url) DOI [BibTex]

pdf suppl Project's Website YouTube link (url) DOI [BibTex]


Thumb md gadde
Superpixel Convolutional Networks using Bilateral Inceptions

Gadde, R., Jampani, V., Kiefel, M., Kappler, D., Gehler, P.

In European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science, Springer, October 2016 (inproceedings)

Abstract
In this paper we propose a CNN architecture for semantic image segmentation. We introduce a new “bilateral inception” module that can be inserted in existing CNN architectures and performs bilateral filtering, at multiple feature-scales, between superpixels in an image. The feature spaces for bilateral filtering and other parameters of the module are learned end-to-end using standard backpropagation techniques. The bilateral inception module addresses two issues that arise with general CNN segmentation architectures. First, this module propagates information between (super) pixels while respecting image edges, thus using the structured information of the problem for improved results. Second, the layer recovers a full resolution segmentation result from the lower resolution solution of a CNN. In the experiments, we modify several existing CNN architectures by inserting our inception modules between the last CNN (1 × 1 convolution) layers. Empirical results on three different datasets show reliable improvements not only in comparison to the baseline networks, but also in comparison to several dense-pixel prediction techniques such as CRFs, while being competitive in time.

pdf supplementary poster [BibTex]

pdf supplementary poster [BibTex]


Thumb md smplify
Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image

Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., J. Black, M.

In Computer Vision – ECCV 2016, Lecture Notes in Computer Science, Springer International Publishing, October 2016 (inproceedings)

Abstract
We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D. To solve this, we fi rst use a recently published CNN-based method, DeepCut, to predict (bottom-up) the 2D body joint locations. We then fit (top-down) a recently published statistical body shape model, called SMPL, to the 2D joints. We do so by minimizing an objective function that penalizes the error between the projected 3D model joints and detected 2D joints. Because SMPL captures correlations in human shape across the population, we are able to robustly fi t it to very little data. We further leverage the 3D model to prevent solutions that cause interpenetration. We evaluate our method, SMPLify, on the Leeds Sports, HumanEva, and Human3.6M datasets, showing superior pose accuracy with respect to the state of the art.

pdf Video Sup Mat video Code Project [BibTex]

pdf Video Sup Mat video Code Project [BibTex]


Thumb md screen shot 2016 07 25 at 13.52.05
Non-parametric Models for Structured Data and Applications to Human Bodies and Natural Scenes

Lehrmann, A.

ETH Zurich, July 2016 (phdthesis)

Abstract
The purpose of this thesis is the study of non-parametric models for structured data and their fields of application in computer vision. We aim at the development of context-sensitive architectures which are both expressive and efficient. Our focus is on directed graphical models, in particular Bayesian networks, where we combine the flexibility of non-parametric local distributions with the efficiency of a global topology with bounded treewidth. A bound on the treewidth is obtained by either constraining the maximum indegree of the underlying graph structure or by introducing determinism. The non-parametric distributions in the nodes of the graph are given by decision trees or kernel density estimators. The information flow implied by specific network topologies, especially the resultant (conditional) independencies, allows for a natural integration and control of contextual information. We distinguish between three different types of context: static, dynamic, and semantic. In four different approaches we propose models which exhibit varying combinations of these contextual properties and allow modeling of structured data in space, time, and hierarchies derived thereof. The generative character of the presented models enables a direct synthesis of plausible hypotheses. Extensive experiments validate the developed models in two application scenarios which are of particular interest in computer vision: human bodies and natural scenes. In the practical sections of this work we discuss both areas from different angles and show applications of our models to human pose, motion, and segmentation as well as object categorization and localization. Here, we benefit from the availability of modern datasets of unprecedented size and diversity. Comparisons to traditional approaches and state-of-the-art research on the basis of well-established evaluation criteria allows the objective assessment of our contributions.

pdf [BibTex]


Thumb md psychscience
Creating body shapes from verbal descriptions by linking similarity spaces

Q. Hill, M., Streuber, S., A. Hahn, C., J. Black, M., J. O’Toole, A.

Psychological Science, 27(11):1486-1497, November 2016, (article)

Abstract
Brief verbal descriptions of bodies (e.g. curvy, long-legged) can elicit vivid mental images. The ease with which we create these mental images belies the complexity of three-dimensional body shapes. We explored the relationship between body shapes and body descriptions and show that a small number of words can be used to generate categorically accurate representations of three-dimensional bodies. The dimensions of body shape variation that emerged in a language-based similarity space were related to major dimensions of variation computed directly from three-dimensional laser scans of 2094 bodies. This allowed us to generate three-dimensional models of people in the shape space using only their coordinates on analogous dimensions in the language-based description space. Human descriptions of photographed bodies and their corresponding models matched closely. The natural mapping between the spaces illustrates the role of language as a concise code for body shape, capturing perceptually salient global and local body features.

pdf Project Page [BibTex]

pdf Project Page [BibTex]


Thumb md siyong
Shape estimation of subcutaneous adipose tissue using an articulated statistical shape model

Yeo, S., Romero, J., Loper, M., Machann, J., Black, M. J.

Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 0(0):1-8, 2016 (article)

publisher website preprint pdf link (url) DOI [BibTex]