Header logo is ps


2017


Thumb xl teasercrop
A Generative Model of People in Clothing

Lassner, C., Pons-Moll, G., Gehler, P. V.

In Proceedings IEEE International Conference on Computer Vision (ICCV), IEEE, Piscataway, NJ, USA, October 2017 (inproceedings)

Abstract
We present the first image-based generative model of people in clothing in a full-body setting. We sidestep the commonly used complex graphics rendering pipeline and the need for high-quality 3D scans of dressed people. Instead, we learn generative models from a large image database. The main challenge is to cope with the high variance in human pose, shape and appearance. For this reason, pure image-based approaches have not been considered so far. We show that this challenge can be overcome by splitting the generating process in two parts. First, we learn to generate a semantic segmentation of the body and clothing. Second, we learn a conditional model on the resulting segments that creates realistic images. The full model is differentiable and can be conditioned on pose, shape or color. The result are samples of people in different clothing items and styles. The proposed model can generate entirely new people with realistic clothing. In several experiments we present encouraging results that suggest an entirely data-driven approach to people generation is possible.

link (url) Project Page [BibTex]

2017

link (url) Project Page [BibTex]


Thumb xl website teaser
Semantic Video CNNs through Representation Warping

Gadde, R., Jampani, V., Gehler, P. V.

In Proceedings IEEE International Conference on Computer Vision (ICCV), IEEE, Piscataway, NJ, USA, October 2017 (inproceedings) Accepted

Abstract
In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping method that can be used to augment existing architectures with very lit- tle extra computational cost. This module is called Net- Warp and we demonstrate its use for a range of network architectures. The main design principle is to use optical flow of adjacent frames for warping internal network repre- sentations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to- end training. Experiments validate that the proposed ap- proach incurs only little extra computational cost, while im- proving performance, when video streams are available. We achieve new state-of-the-art results on the standard CamVid and Cityscapes benchmark datasets and show reliable im- provements over different baseline networks. Our code and models are available at http://segmentation.is. tue.mpg.de

pdf Supplementary Project Page [BibTex]

pdf Supplementary Project Page [BibTex]


Thumb xl screen shot 2017 08 09 at 12.54.00
A simple yet effective baseline for 3d human pose estimation

Martinez, J., Hossain, R., Romero, J., Little, J. J.

In Proceedings IEEE International Conference on Computer Vision (ICCV), IEEE, Piscataway, NJ, USA, October 2017 (inproceedings)

Abstract
Following the success of deep convolutional networks, state-of-the-art methods for 3d human pose estimation have focused on deep end-to-end systems that predict 3d joint locations given raw image pixels. Despite their excellent performance, it is often not easy to understand whether their remaining error stems from a limited 2d pose (visual) understanding, or from a failure to map 2d poses into 3-dimensional positions. With the goal of understanding these sources of error, we set out to build a system that given 2d joint locations predicts 3d positions. Much to our surprise, we have found that, with current technology, "lifting" ground truth 2d joint locations to 3d space is a task that can be solved with a remarkably low error rate: a relatively simple deep feed-forward network outperforms the best reported result by about 30\% on Human3.6M, the largest publicly available 3d pose estimation benchmark. Furthermore, training our system on the output of an off-the-shelf state-of-the-art 2d detector (\ie, using images as input) yields state of the art results -- this includes an array of systems that have been trained end-to-end specifically for this task. Our results indicate that a large portion of the error of modern deep 3d pose estimation systems stems from their visual analysis, and suggests directions to further advance the state of the art in 3d human pose estimation.

video code arxiv pdf preprint Project Page [BibTex]

video code arxiv pdf preprint Project Page [BibTex]


Thumb xl kenny
Effects of animation retargeting on perceived action outcomes

Kenny, S., Mahmood, N., Honda, C., Black, M. J., Troje, N. F.

Proceedings of the ACM Symposium on Applied Perception (SAP’17), pages: 2:1-2:7, September 2017 (conference)

Abstract
The individual shape of the human body, including the geometry of its articulated structure and the distribution of weight over that structure, influences the kinematics of a person's movements. How sensitive is the visual system to inconsistencies between shape and motion introduced by retargeting motion from one person onto the shape of another? We used optical motion capture to record five pairs of male performers with large differences in body weight, while they pushed, lifted, and threw objects. Based on a set of 67 markers, we estimated both the kinematics of the actions as well as the performer's individual body shape. To obtain consistent and inconsistent stimuli, we created animated avatars by combining the shape and motion estimates from either a single performer or from different performers. In a virtual reality environment, observers rated the perceived weight or thrown distance of the objects. They were also asked to explicitly discriminate between consistent and hybrid stimuli. Observers were unable to accomplish the latter, but hybridization of shape and motion influenced their judgements of action outcome in systematic ways. Inconsistencies between shape and motion were assimilated into an altered perception of the action outcome.

pdf DOI [BibTex]

pdf DOI [BibTex]


Thumb xl teaser
Coupling Adaptive Batch Sizes with Learning Rates

Balles, L., Romero, J., Hennig, P.

In Proceedings Conference on Uncertainty in Artificial Intelligence (UAI) 2017, pages: 410-419, (Editors: Gal Elidan and Kristian Kersting), Association for Uncertainty in Artificial Intelligence (AUAI), August 2017 (inproceedings)

Abstract
Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization like the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, stability and convergence is thus often enforced by means of a (manually tuned) decreasing learning rate schedule. We propose a practical method for dynamic batch size adaptation. It estimates the variance of the stochastic gradients and adapts the batch size to decrease the variance proportionally to the value of the objective function, removing the need for the aforementioned learning rate decrease. In contrast to recent related work, our algorithm couples the batch size to the learning rate, directly reflecting the known relationship between the two. On three image classification benchmarks, our batch size adaptation yields faster optimization convergence, while simultaneously simplifying learning rate tuning. A TensorFlow implementation is available.

Code link (url) Project Page [BibTex]

Code link (url) Project Page [BibTex]


Thumb xl 1611.04399 image
Joint Graph Decomposition and Node Labeling by Local Search

Levinkov, E., Uhrig, J., Tang, S., Omran, M., Insafutdinov, E., Kirillov, A., Rother, C., Brox, T., Schiele, B., Andres, B.

In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages: 1904-1912, IEEE, July 2017 (inproceedings)

PDF Supplementary DOI Project Page [BibTex]

PDF Supplementary DOI Project Page [BibTex]


Thumb xl teaser
Dynamic FAUST: Registering Human Bodies in Motion

Bogo, F., Romero, J., Pons-Moll, G., Black, M. J.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

Abstract
While the ready availability of 3D scan data has influenced research throughout computer vision, less attention has focused on 4D data; that is 3D scans of moving nonrigid objects, captured over time. To be useful for vision research, such 4D scans need to be registered, or aligned, to a common topology. Consequently, extending mesh registration methods to 4D is important. Unfortunately, no ground-truth datasets are available for quantitative evaluation and comparison of 4D registration methods. To address this we create a novel dataset of high-resolution 4D scans of human subjects in motion, captured at 60 fps. We propose a new mesh registration method that uses both 3D geometry and texture information to register all scans in a sequence to a common reference topology. The approach exploits consistency in texture over both short and long time intervals and deals with temporal offsets between shape and texture capture. We show how using geometry alone results in significant errors in alignment when the motions are fast and non-rigid. We evaluate the accuracy of our registration and provide a dataset of 40,000 raw and aligned meshes. Dynamic FAUST extends the popular FAUST dataset to dynamic 4D data, and is available for research purposes at http://dfaust.is.tue.mpg.de.

pdf video Project Page Project Page Project Page [BibTex]

pdf video Project Page Project Page Project Page [BibTex]


Thumb xl surrealin
Learning from Synthetic Humans

Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I., Schmid, C.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

Abstract
Estimating human pose, shape, and motion from images and videos are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.

arXiv project data Project Page Project Page [BibTex]

arXiv project data Project Page Project Page [BibTex]


Thumb xl martinez
On human motion prediction using recurrent neural networks

Martinez, J., Black, M. J., Romero, J.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

Abstract
Human motion modelling is a classical problem at the intersection of graphics and computer vision, with applications spanning human-computer interaction, motion synthesis, and motion prediction for virtual and augmented reality. Following the success of deep learning methods in several computer vision tasks, recent work has focused on using deep recurrent neural networks (RNNs) to model human motion, with the goal of learning time-dependent representations that perform tasks such as short-term motion prediction and long-term human motion synthesis. We examine recent work, with a focus on the evaluation methodologies commonly used in the literature, and show that, surprisingly, state-of-the-art performance can be achieved by a simple baseline that does not attempt to model motion at all. We investigate this result, and analyze recent RNN methods by looking at the architectures, loss functions, and training procedures used in state-of-the-art approaches. We propose three changes to the standard RNN models typically used for human motion, which result in a simple and scalable RNN architecture that obtains state-of-the-art performance on human motion prediction.

arXiv Project Page [BibTex]

arXiv Project Page [BibTex]


Thumb xl untitled
Articulated Multi-person Tracking in the Wild

Insafutdinov, E., Andriluka, M., Pishchulin, L., Tang, S., Levinkov, E., Andres, B., Schiele, B.

In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages: 1293-1301, IEEE, July 2017, Oral (inproceedings)

DOI [BibTex]

DOI [BibTex]


Thumb xl joel slow flow crop
Slow Flow: Exploiting High-Speed Cameras for Accurate and Diverse Optical Flow Reference Data

Janai, J., Güney, F., Wulff, J., Black, M., Geiger, A.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pages: 1406-1416, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

Abstract
Existing optical flow datasets are limited in size and variability due to the difficulty of capturing dense ground truth. In this paper, we tackle this problem by tracking pixels through densely sampled space-time volumes recorded with a high-speed video camera. Our model exploits the linearity of small motions and reasons about occlusions from multiple frames. Using our technique, we are able to establish accurate reference flow fields outside the laboratory in natural environments. Besides, we show how our predictions can be used to augment the input images with realistic motion blur. We demonstrate the quality of the produced flow fields on synthetic and real-world datasets. Finally, we collect a novel challenging optical flow dataset by applying our technique on data from a high-speed camera and analyze the performance of the state-of-the-art in optical flow under various levels of motion blur.

pdf suppmat Project page Video DOI Project Page [BibTex]

pdf suppmat Project page Video DOI Project Page [BibTex]


Thumb xl mrflow
Optical Flow in Mostly Rigid Scenes

Wulff, J., Sevilla-Lara, L., Black, M. J.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pages: 6911-6920, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

Abstract
The optical flow of natural scenes is a combination of the motion of the observer and the independent motion of objects. Existing algorithms typically focus on either recovering motion and structure under the assumption of a purely static world or optical flow for general unconstrained scenes. We combine these approaches in an optical flow algorithm that estimates an explicit segmentation of moving objects from appearance and physical constraints. In static regions we take advantage of strong constraints to jointly estimate the camera motion and the 3D structure of the scene over multiple frames. This allows us to also regularize the structure instead of the motion. Our formulation uses a Plane+Parallax framework, which works even under small baselines, and reduces the motion estimation to a one-dimensional search problem, resulting in more accurate estimation. In moving regions the flow is treated as unconstrained, and computed with an existing optical flow method. The resulting Mostly-Rigid Flow (MR-Flow) method achieves state-of-the-art results on both the MPISintel and KITTI-2015 benchmarks.

pdf SupMat video code Project Page [BibTex]

pdf SupMat video code Project Page [BibTex]


Thumb xl img03
OctNet: Learning Deep 3D Representations at High Resolutions

Riegler, G., Ulusoy, O., Geiger, A.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

Abstract
We present OctNet, a representation for deep learning with sparse 3D data. In contrast to existing models, our representation enables 3D convolutional networks which are both deep and high resolution. Towards this goal, we exploit the sparsity in the input data to hierarchically partition the space using a set of unbalanced octrees where each leaf node stores a pooled feature representation. This allows to focus memory allocation and computation to the relevant dense regions and enables deeper networks without compromising resolution. We demonstrate the utility of our OctNet representation by analyzing the impact of resolution on several 3D tasks including 3D object classification, orientation estimation and point cloud labeling.

pdf suppmat Project Page Video Project Page [BibTex]

pdf suppmat Project Page Video Project Page [BibTex]


Thumb xl 71341 r guided
Reflectance Adaptive Filtering Improves Intrinsic Image Estimation

Nestmeyer, T., Gehler, P. V.

In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages: 1771-1780, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

pre-print DOI Project Page Project Page [BibTex]

pre-print DOI Project Page Project Page [BibTex]


Thumb xl web teaser
Detailed, accurate, human shape estimation from clothed 3D scan sequences

Zhang, C., Pujades, S., Black, M., Pons-Moll, G.

In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Washington, DC, USA, July 2017, Spotlight (inproceedings)

Abstract
We address the problem of estimating human body shape from 3D scans over time. Reliable estimation of 3D body shape is necessary for many applications including virtual try-on, health monitoring, and avatar creation for virtual reality. Scanning bodies in minimal clothing, however, presents a practical barrier to these applications. We address this problem by estimating body shape under clothing from a sequence of 3D scans. Previous methods that have exploited statistical models of body shape produce overly smooth shapes lacking personalized details. In this paper we contribute a new approach to recover not only an approximate shape of the person, but also their detailed shape. Our approach allows the estimated shape to deviate from a parametric model to fit the 3D scans. We demonstrate the method using high quality 4D data as well as sequences of visual hulls extracted from multi-view images. We also make available a new high quality 4D dataset that enables quantitative evaluation. Our method outperforms the previous state of the art, both qualitatively and quantitatively.

arxiv_preprint video dataset pdf supplemental DOI Project Page [BibTex]

arxiv_preprint video dataset pdf supplemental DOI Project Page [BibTex]


Thumb xl slide1
3D Menagerie: Modeling the 3D Shape and Pose of Animals

Zuffi, S., Kanazawa, A., Jacobs, D., Black, M. J.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pages: 5524-5532, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

Abstract
There has been significant work on learning realistic, articulated, 3D models of the human body. In contrast, there are few such models of animals, despite many applications. The main challenge is that animals are much less cooperative than humans. The best human body models are learned from thousands of 3D scans of people in specific poses, which is infeasible with live animals. Consequently, we learn our model from a small set of 3D scans of toy figurines in arbitrary poses. We employ a novel part-based shape model to compute an initial registration to the scans. We then normalize their pose, learn a statistical shape model, and refine the registrations and the model together. In this way, we accurately align animal scans from different quadruped families with very different shapes and poses. With the registration to a common template we learn a shape space representing animals including lions, cats, dogs, horses, cows and hippos. Animal shapes can be sampled from the model, posed, animated, and fit to data. We demonstrate generalization by fitting it to images of real animals including species not seen in training.

pdf video Project Page [BibTex]

pdf video Project Page [BibTex]


Thumb xl pyramid
Optical Flow Estimation using a Spatial Pyramid Network

Ranjan, A., Black, M.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

Abstract
We learn to compute optical flow by combining a classical spatial-pyramid formulation with deep learning. This estimates large motions in a coarse-to-fine approach by warping one image of a pair at each pyramid level by the current flow estimate and computing an update to the flow. Instead of the standard minimization of an objective function at each pyramid level, we train one deep network per level to compute the flow update. Unlike the recent FlowNet approach, the networks do not need to deal with large motions; these are dealt with by the pyramid. This has several advantages. First, our Spatial Pyramid Network (SPyNet) is much simpler and 96% smaller than FlowNet in terms of model parameters. This makes it more efficient and appropriate for embedded applications. Second, since the flow at each pyramid level is small (< 1 pixel), a convolutional approach applied to pairs of warped images is appropriate. Third, unlike FlowNet, the learned convolution filters appear similar to classical spatio-temporal filters, giving insight into the method and how to improve it. Our results are more accurate than FlowNet on most standard benchmarks, suggesting a new direction of combining classical flow methods with deep learning.

pdf SupMat project/code [BibTex]

pdf SupMat project/code [BibTex]


Thumb xl imgidx 00197
Multiple People Tracking by Lifted Multicut and Person Re-identification

Tang, S., Andriluka, M., Andres, B., Schiele, B.

In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages: 3701-3710, IEEE Computer Society, Washington, DC, USA, July 2017 (inproceedings)

DOI Project Page [BibTex]

DOI Project Page [BibTex]


Thumb xl vpn teaser
Video Propagation Networks

Jampani, V., Gadde, R., Gehler, P. V.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

pdf supplementary arXiv project page code Project Page [BibTex]

pdf supplementary arXiv project page code Project Page [BibTex]


Thumb xl anja
Generating Descriptions with Grounded and Co-Referenced People

Rohrbach, A., Rohrbach, M., Tang, S., Oh, S. J., Schiele, B.

In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages: 4196-4206, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

PDF DOI Project Page [BibTex]

PDF DOI Project Page [BibTex]


Thumb xl cvpr2017 landpsace
Semantic Multi-view Stereo: Jointly Estimating Objects and Voxels

Ulusoy, A. O., Black, M. J., Geiger, A.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

Abstract
Dense 3D reconstruction from RGB images is a highly ill-posed problem due to occlusions, textureless or reflective surfaces, as well as other challenges. We propose object-level shape priors to address these ambiguities. Towards this goal, we formulate a probabilistic model that integrates multi-view image evidence with 3D shape information from multiple objects. Inference in this model yields a dense 3D reconstruction of the scene as well as the existence and precise 3D pose of the objects in it. Our approach is able to recover fine details not captured in the input shapes while defaulting to the input models in occluded regions where image evidence is weak. Due to its probabilistic nature, the approach is able to cope with the approximate geometry of the 3D models as well as input shapes that are not present in the scene. We evaluate the approach quantitatively on several challenging indoor and outdoor datasets.

YouTube pdf suppmat Project Page [BibTex]

YouTube pdf suppmat Project Page [BibTex]


Thumb xl judith
Deep representation learning for human motion prediction and classification

Bütepage, J., Black, M., Kragic, D., Kjellström, H.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

Abstract
Generative models of 3D human motion are often restricted to a small number of activities and can therefore not generalize well to novel movements or applications. In this work we propose a deep learning framework for human motion capture data that learns a generic representation from a large corpus of motion capture data and generalizes well to new, unseen, motions. Using an encoding-decoding network that learns to predict future 3D poses from the most recent past, we extract a feature representation of human motion. Most work on deep learning for sequence prediction focuses on video and speech. Since skeletal data has a different structure, we present and evaluate different network architectures that make different assumptions about time dependencies and limb correlations. To quantify the learned features, we use the output of different layers for action classification and visualize the receptive fields of the network units. Our method outperforms the recent state of the art in skeletal motion prediction even though these use action specific training data. Our results show that deep feedforward networks, trained from a generic mocap database, can successfully be used for feature extraction from human motion data and that this representation can be used as a foundation for classification and prediction.

arXiv Project Page [BibTex]

arXiv Project Page [BibTex]


Thumb xl teasercrop
Unite the People: Closing the Loop Between 3D and 2D Human Representations

Lassner, C., Romero, J., Kiefel, M., Bogo, F., Black, M. J., Gehler, P. V.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)

Abstract
3D models provide a common ground for different representations of human bodies. In turn, robust 2D estimation has proven to be a powerful tool to obtain 3D fits “in-the-wild”. However, depending on the level of detail, it can be hard to impossible to acquire labeled data for training 2D estimators on large scale. We propose a hybrid approach to this problem: with an extended version of the recently introduced SMPLify method, we obtain high quality 3D body model fits for multiple human pose datasets. Human annotators solely sort good and bad fits. This procedure leads to an initial dataset, UP-3D, with rich annotations. With a comprehensive set of experiments, we show how this data can be used to train discriminative models that produce results with an unprecedented level of detail: our models predict 31 segments and 91 landmark locations on the body. Using the 91 landmark pose estimator, we present state-of-the art results for 3D human pose and shape estimation using an order of magnitude less training data and without assumptions about gender or pose in the fitting procedure. We show that UP-3D can be enhanced with these improved fits to grow in quantity and quality, which makes the system deployable on large scale. The data, code and models are available for research purposes.

arXiv project/code/data Project Page [BibTex]

arXiv project/code/data Project Page [BibTex]


Thumb xl image  1
Human Shape Estimation using Statistical Body Models

Loper, M. M.

University of Tübingen, May 2017 (thesis)

Abstract
Human body estimation methods transform real-world observations into predictions about human body state. These estimation methods benefit a variety of health, entertainment, clothing, and ergonomics applications. State may include pose, overall body shape, and appearance. Body state estimation is underconstrained by observations; ambiguity presents itself both in the form of missing data within observations, and also in the form of unknown correspondences between observations. We address this challenge with the use of a statistical body model: a data-driven virtual human. This helps resolve ambiguity in two ways. First, it fills in missing data, meaning that incomplete observations still result in complete shape estimates. Second, the model provides a statistically-motivated penalty for unlikely states, which enables more plausible body shape estimates. Body state inference requires more than a body model; we therefore build obser- vation models whose output is compared with real observations. In this thesis, body state is estimated from three types of observations: 3D motion capture markers, depth and color images, and high-resolution 3D scans. In each case, a forward process is proposed which simulates observations. By comparing observations to the results of the forward process, state can be adjusted to minimize the difference between simulated and observed data. We use gradient-based methods because they are critical to the precise estimation of state with a large number of parameters. The contributions of this work include three parts. First, we propose a method for the estimation of body shape, nonrigid deformation, and pose from 3D markers. Second, we present a concise approach to differentiating through the rendering process, with application to body shape estimation. And finally, we present a statistical body model trained from human body scans, with state-of-the-art fidelity, good runtime performance, and compatibility with existing animation packages.

Official Version [BibTex]


Thumb xl appealingavatars
Appealing Avatars from 3D Body Scans: Perceptual Effects of Stylization

Fleming, R., Mohler, B. J., Romero, J., Black, M. J., Breidt, M.

In Computer Vision, Imaging and Computer Graphics Theory and Applications: 11th International Joint Conference, VISIGRAPP 2016, Rome, Italy, February 27 – 29, 2016, Revised Selected Papers, pages: 175-196, Springer International Publishing, 2017 (inbook)

Abstract
Using styles derived from existing popular character designs, we present a novel automatic stylization technique for body shape and colour information based on a statistical 3D model of human bodies. We investigate whether such stylized body shapes result in increased perceived appeal with two different experiments: One focuses on body shape alone, the other investigates the additional role of surface colour and lighting. Our results consistently show that the most appealing avatar is a partially stylized one. Importantly, avatars with high stylization or no stylization at all were rated to have the least appeal. The inclusion of colour information and improvements to render quality had no significant effect on the overall perceived appeal of the avatars, and we observe that the body shape primarily drives the change in appeal ratings. For body scans with colour information, we found that a partially stylized avatar was perceived as most appealing.

publisher site pdf DOI [BibTex]

publisher site pdf DOI [BibTex]


Thumb xl gcpr2017 nugget
Learning to Filter Object Detections

Prokudin, S., Kappler, D., Nowozin, S., Gehler, P.

In Pattern Recognition: 39th German Conference, GCPR 2017, Basel, Switzerland, September 12–15, 2017, Proceedings, pages: 52-62, Springer International Publishing, Cham, 2017 (inbook)

Abstract
Most object detection systems consist of three stages. First, a set of individual hypotheses for object locations is generated using a proposal generating algorithm. Second, a classifier scores every generated hypothesis independently to obtain a multi-class prediction. Finally, all scored hypotheses are filtered via a non-differentiable and decoupled non-maximum suppression (NMS) post-processing step. In this paper, we propose a filtering network (FNet), a method which replaces NMS with a differentiable neural network that allows joint reasoning and re-scoring of the generated set of hypotheses per image. This formulation enables end-to-end training of the full object detection pipeline. First, we demonstrate that FNet, a feed-forward network architecture, is able to mimic NMS decisions, despite the sequential nature of NMS. We further analyze NMS failures and propose a loss formulation that is better aligned with the mean average precision (mAP) evaluation metric. We evaluate FNet on several standard detection datasets. Results surpass standard NMS on highly occluded settings of a synthetic overlapping MNIST dataset and show competitive behavior on PascalVOC2007 and KITTI detection benchmarks.

Paper link (url) DOI Project Page [BibTex]

Paper link (url) DOI Project Page [BibTex]


Thumb xl phd thesis teaser
Learning Inference Models for Computer Vision

Jampani, V.

MPI for Intelligent Systems and University of Tübingen, 2017 (phdthesis)

Abstract
Computer vision can be understood as the ability to perform 'inference' on image data. Breakthroughs in computer vision technology are often marked by advances in inference techniques, as even the model design is often dictated by the complexity of inference in them. This thesis proposes learning based inference schemes and demonstrates applications in computer vision. We propose techniques for inference in both generative and discriminative computer vision models. Despite their intuitive appeal, the use of generative models in vision is hampered by the difficulty of posterior inference, which is often too complex or too slow to be practical. We propose techniques for improving inference in two widely used techniques: Markov Chain Monte Carlo (MCMC) sampling and message-passing inference. Our inference strategy is to learn separate discriminative models that assist Bayesian inference in a generative model. Experiments on a range of generative vision models show that the proposed techniques accelerate the inference process and/or converge to better solutions. A main complication in the design of discriminative models is the inclusion of prior knowledge in a principled way. For better inference in discriminative models, we propose techniques that modify the original model itself, as inference is simple evaluation of the model. We concentrate on convolutional neural network (CNN) models and propose a generalization of standard spatial convolutions, which are the basic building blocks of CNN architectures, to bilateral convolutions. First, we generalize the existing use of bilateral filters and then propose new neural network architectures with learnable bilateral filters, which we call `Bilateral Neural Networks'. We show how the bilateral filtering modules can be used for modifying existing CNN architectures for better image segmentation and propose a neural network approach for temporal information propagation in videos. Experiments demonstrate the potential of the proposed bilateral networks on a wide range of vision tasks and datasets. In summary, we propose learning based techniques for better inference in several computer vision models ranging from inverse graphics to freely parameterized neural networks. In generative vision models, our inference techniques alleviate some of the crucial hurdles in Bayesian posterior inference, paving new ways for the use of model based machine learning in vision. In discriminative CNN models, the proposed filter generalizations aid in the design of new neural network architectures that can handle sparse high-dimensional data as well as provide a way for incorporating prior knowledge into CNNs.

pdf [BibTex]

pdf [BibTex]


Thumb xl muvs
Towards Accurate Marker-less Human Shape and Pose Estimation over Time

Huang, Y., Bogo, F., Lassner, C., Kanazawa, A., Gehler, P. V., Romero, J., Akhter, I., Black, M. J.

In International Conference on 3D Vision (3DV), pages: 421-430, 2017 (inproceedings)

Abstract
Existing markerless motion capture methods often assume known backgrounds, static cameras, and sequence specific motion priors, limiting their application scenarios. Here we present a fully automatic method that, given multiview videos, estimates 3D human pose and body shape. We take the recently proposed SMPLify method [12] as the base method and extend it in several ways. First we fit a 3D human body model to 2D features detected in multi-view images. Second, we use a CNN method to segment the person in each image and fit the 3D body model to the contours, further improving accuracy. Third we utilize a generic and robust DCT temporal prior to handle the left and right side swapping issue sometimes introduced by the 2D pose estimator. Validation on standard benchmarks shows our results are comparable to the state of the art and also provide a realistic 3D shape avatar. We also demonstrate accurate results on HumanEva and on challenging monocular sequences of dancing from YouTube.

Code pdf DOI Project Page [BibTex]


Thumb xl auroteaser
Decentralized Simultaneous Multi-target Exploration using a Connected Network of Multiple Robots

Nestmeyer, T., Robuffo Giordano, P., Bülthoff, H. H., Franchi, A.

In pages: 989-1011, Autonomous Robots, 2017 (incollection)

[BibTex]

[BibTex]


Thumb xl coverhand wilson
Capturing Hand-Object Interaction and Reconstruction of Manipulated Objects

Tzionas, D.

University of Bonn, 2017 (phdthesis)

Abstract
Hand motion capture with an RGB-D sensor gained recently a lot of research attention, however, even most recent approaches focus on the case of a single isolated hand. We focus instead on hands that interact with other hands or with a rigid or articulated object. Our framework successfully captures motion in such scenarios by combining a generative model with discriminatively trained salient points, collision detection and physics simulation to achieve a low tracking error with physically plausible poses. All components are unified in a single objective function that can be optimized with standard optimization techniques. We initially assume a-priori knowledge of the object's shape and skeleton. In case of unknown object shape there are existing 3d reconstruction methods that capitalize on distinctive geometric or texture features. These methods though fail for textureless and highly symmetric objects like household articles, mechanical parts or toys. We show that extracting 3d hand motion for in-hand scanning effectively facilitates the reconstruction of such objects and we fuse the rich additional information of hands into a 3d reconstruction pipeline. Finally, although shape reconstruction is enough for rigid objects, there is a lack of tools that build rigged models of articulated objects that deform realistically using RGB-D data. We propose a method that creates a fully rigged model consisting of a watertight mesh, embedded skeleton and skinning weights by employing a combination of deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow.

Thesis link (url) Project Page [BibTex]

2009


Thumb xl teaser wacv2010
Ball Joints for Marker-less Human Motion Capture

Pons-Moll, G., Rosenhahn, B.

In IEEE Workshop on Applications of Computer Vision (WACV),, December 2009 (inproceedings)

pdf [BibTex]

2009

pdf [BibTex]


no image
Background Subtraction Based on Rank Constraint for Point Trajectories

Ahmad, A., Del Bue, A., Lima, P.

In pages: 1-3, October 2009 (inproceedings)

Abstract
This work deals with a background subtraction algorithm for a fish-eye lens camera having 3 degrees of freedom, 2 in translation and 1 in rotation. The core assumption in this algorithm is that the background is considered to be composed of a dominant static plane in the world frame. The novelty lies in developing a rank-constraint based background subtraction for equidistant projection model, a property of the fish-eye lens. A detail simulation result is presented to support the hypotheses explained in this paper.

link (url) [BibTex]

link (url) [BibTex]


Thumb xl teaser cinc
Parametric Modeling of the Beating Heart with Respiratory Motion Extracted from Magnetic Resonance Images

Pons-Moll, G., Crosas, C., Tadmor, G., MacLeod, R., Rosenhahn, B., Brooks, D.

In IEEE Computers in Cardiology (CINC), September 2009 (inproceedings)

[BibTex]

[BibTex]


Thumb xl ascc09
Computer cursor control by motor cortical signals in humans with tetraplegia

Kim, S., Simeral, J. D., Hochberg, L. R., Donoghue, J. P., Black, M. J.

In 7th Asian Control Conference, ASCC09, pages: 988-993, Hong Kong, China, August 2009 (inproceedings)

pdf [BibTex]

pdf [BibTex]


no image
Classification of colon polyps in NBI endoscopy using vascularization features

Stehle, T., Auer, R., Gross, S., Behrens, A., Wulff, J., Aach, T., Winograd, R., Trautwein, C., Tischendorf, J.

In Medical Imaging 2009: Computer-Aided Diagnosis, 7260, (Editors: N. Karssemeijer and M. L. Giger), SPIE, February 2009 (inproceedings)

Abstract
The evolution of colon cancer starts with colon polyps. There are two different types of colon polyps, namely hyperplasias and adenomas. Hyperplasias are benign polyps which are known not to evolve into cancer and, therefore, do not need to be removed. By contrast, adenomas have a strong tendency to become malignant. Therefore, they have to be removed immediately via polypectomy. For this reason, a method to differentiate reliably adenomas from hyperplasias during a preventive medical endoscopy of the colon (colonoscopy) is highly desirable. A recent study has shown that it is possible to distinguish both types of polyps visually by means of their vascularization. Adenomas exhibit a large amount of blood vessel capillaries on their surface whereas hyperplasias show only few of them. In this paper, we show the feasibility of computer-based classification of colon polyps using vascularization features. The proposed classification algorithm consists of several steps: For the critical part of vessel segmentation, we implemented and compared two segmentation algorithms. After a skeletonization of the detected blood vessel candidates, we used the results as seed points for the Fast Marching algorithm which is used to segment the whole vessel lumen. Subsequently, features are computed from this segmentation which are then used to classify the polyps. In leave-one-out tests on our polyp database (56 polyps), we achieve a correct classification rate of approximately 90%.

DOI [BibTex]

DOI [BibTex]


Thumb xl 3dim09
One-shot scanning using de bruijn spaced grids

Ulusoy, A., Calakli, F., Taubin, G.

In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages: 1786-1792, IEEE, 2009 (inproceedings)

Abstract
In this paper we present a new one-shot method to reconstruct the shape of dynamic 3D objects and scenes based on active illumination. In common with other related prior-art methods, a static grid pattern is projected onto the scene, a video sequence of the illuminated scene is captured, a shape estimate is produced independently for each video frame, and the one-shot property is realized at the expense of space resolution. The main challenge in grid-based one-shot methods is to engineer the pattern and algorithms so that the correspondence between pattern grid points and their images can be established very fast and without uncertainty. We present an efficient one-shot method which exploits simple geometric constraints to solve the correspondence problem. We also introduce De Bruijn spaced grids, a novel grid pattern, and show with strong empirical data that the resulting scheme is much more robust compared to those based on uniform spaced grids.

pdf link (url) DOI [BibTex]

pdf link (url) DOI [BibTex]


no image
An introduction to Kernel Learning Algorithms

Gehler, P., Schölkopf, B.

In Kernel Methods for Remote Sensing Data Analysis, pages: 25-48, 2, (Editors: Gustavo Camps-Valls and Lorenzo Bruzzone), Wiley, New York, NY, USA, 2009 (inbook)

Abstract
Kernel learning algorithms are currently becoming a standard tool in the area of machine learning and pattern recognition. In this chapter we review the fundamental theory of kernel learning. As the basic building block we introduce the kernel function, which provides an elegant and general way to compare possibly very complex objects. We then review the concept of a reproducing kernel Hilbert space and state the representer theorem. Finally we give an overview of the most prominent algorithms, which are support vector classification and regression, Gaussian Processes and kernel principal analysis. With multiple kernel learning and structured output prediction we also introduce some more recent advancements in the field.

link (url) DOI [BibTex]

link (url) DOI [BibTex]


Thumb xl iccv09
Estimating human shape and pose from a single image

Guan, P., Weiss, A., Balan, A., Black, M. J.

In Int. Conf. on Computer Vision, ICCV, pages: 1381-1388, 2009 (inproceedings)

pdf video - mov 25MB video - mp4 10MB YouTube Project Page [BibTex]

pdf video - mov 25MB video - mp4 10MB YouTube Project Page [BibTex]


Thumb xl screen shot 2012 02 21 at 15.56.00  2
On feature combination for multiclass object classification

Gehler, P., Nowozin, S.

In Proceedings of the Twelfth IEEE International Conference on Computer Vision, pages: 221-228, 2009, oral presentation (inproceedings)

project page, code, data GoogleScholar pdf DOI [BibTex]

project page, code, data GoogleScholar pdf DOI [BibTex]


Thumb xl tracking iccv09
Segmentation, Ordering and Multi-object Tracking Using Graphical Models

Wang, C., Gorce, M. D. L., Paragios, N.

In IEEE International Conference on Computer Vision (ICCV), 2009 (inproceedings)

pdf [BibTex]

pdf [BibTex]


no image
Visual Object Discovery

Sinha, P., Balas, B., Ostrovsky, Y., Wulff, J.

In Object Categorization: Computer and Human Vision Perspectives, pages: 301-323, (Editors: S. J. Dickinson, A. Leonardis, B. Schiele, M.J. Tarr), Cambridge University Press, 2009 (inbook)

link (url) [BibTex]

link (url) [BibTex]


no image
Evaluating the potential of primary motor and premotor cortex for mutltidimensional neuroprosthetic control of complete reaching and grasping actions

Vargas-Irwin, C. E., Yadollahpour, P., Shakhnarovich, G., Black, M. J., Donoghue, J. P.

2009 Abstract Viewer and Itinerary Planner. Society for Neuroscience, Society for Neuroscience, 2009, Online (conference)

[BibTex]

[BibTex]


Thumb xl thumb screen shot 2012 10 06 at 12.02.32 pm
Modeling and Evaluation of Human-to-Robot Mapping of Grasps

Romero, J., Kjellström, H., Kragic, D.

In International Conference on Advanced Robotics (ICAR), pages: 1-6, 2009 (inproceedings)

Pdf [BibTex]

Pdf [BibTex]


Thumb xl nips2009b
An additive latent feature model for transparent object recognition

Fritz, M., Black, M., Bradski, G., Karayev, S., Darrell, T.

In Advances in Neural Information Processing Systems 22, NIPS, pages: 558-566, MIT Press, 2009 (inproceedings)

pdf slides [BibTex]

pdf slides [BibTex]


Thumb xl screen shot 2012 06 06 at 11.24.14 am
Let the kernel figure it out; Principled learning of pre-processing for kernel classifiers

Gehler, P., Nowozin, S.

In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages: 2836-2843, IEEE Computer Society, 2009 (inproceedings)

doi project page pdf [BibTex]

doi project page pdf [BibTex]


Thumb xl thumb screen shot 2012 10 06 at 12.04.52 pm
Monocular Real-Time 3D Articulated Hand Pose Estimation

Romero, J., Kjellström, H., Kragic, D.

In IEEE-RAS International Conference on Humanoid Robots, pages: 87-92, 2009 (inproceedings)

Pdf [BibTex]

Pdf [BibTex]


Thumb xl snap
Grasp Recognition and Mapping on Humanoid Robots

Do, M., Romero, J., Kjellström, H., Azad, P., Asfour, T., Kragic, D., Dillmann, R.

In IEEE-RAS International Conference on Humanoid Robots, pages: 465-471, 2009 (inproceedings)

Pdf Video [BibTex]

Pdf Video [BibTex]


Thumb xl teaser wc
4D Cardiac Segmentation of the Epicardium and Left Ventricle

Pons-Moll, G., Tadmor, G., MacLeod, R. S., Rosenhahn, B., Brooks, D. H.

In World Congress of Medical Physics and Biomedical Engineering (WC), 2009 (inproceedings)

[BibTex]

[BibTex]


Thumb xl bmvc1
Geometric Potential Force for the Deformable Model

Si Yong Yeo, Xianghua Xie, Igor Sazonov, Perumal Nithiarasu

In The 20th British Machine Vision Conference, pages: 1-11, 2009 (inproceedings)

Abstract
We propose a new external force field for deformable models which can be conve- niently generalized to high dimensions. The external force field is based on hypothesized interactions between the relative geometries of the deformable model and image gradi- ents. The evolution of the deformable model is solved using the level set method. The dynamic interaction forces between the geometries can greatly improve the deformable model performance in acquiring complex geometries and highly concave boundaries, and in dealing with weak image edges. The new deformable model can handle arbi- trary cross-boundary initializations. Here, we show that the proposed method achieve significant improvements when compared against existing state-of-the-art techniques.

[BibTex]

[BibTex]


Thumb xl cmbe
Level Set Based Automatic Segmentation of Human Aorta

Si Yong Yeo, Xianghua Xie, Igor Sazonov, Perumal Nithiarasu

In International Conference on Computational & Mathematical Biomedical Engineering, pages: 242-245, 2009 (inproceedings)

[BibTex]

[BibTex]