

2018


Model-based Optical Flow: Layers, Learning, and Geometry

Wulff, J.

Tuebingen University, April 2018 (phdthesis)

Abstract
The estimation of motion in video sequences establishes temporal correspondences between pixels and surfaces and allows reasoning about a scene using multiple frames. Despite being a focus of research for over three decades, computing motion, or optical flow, remains challenging due to a number of difficulties, including the treatment of motion discontinuities and occluded regions, and the integration of information from more than two frames. One reason for these issues is that most optical flow algorithms only reason about the motion of pixels on the image plane, while not taking the image formation pipeline or the 3D structure of the world into account. One approach to address this uses layered models, which represent the occlusion structure of a scene and provide an approximation to the geometry. The goal of this dissertation is to show ways to inject additional knowledge about the scene into layered methods, making them more robust, faster, and more accurate. First, this thesis demonstrates the modeling power of layers using the example of motion blur in videos, which is caused by fast motion relative to the exposure time of the camera. Layers segment the scene into regions that move coherently while preserving their occlusion relationships. The motion of each layer therefore directly determines its motion blur. At the same time, the layered model captures complex blur overlap effects at motion discontinuities. Using layers, we can thus formulate a generative model for blurred video sequences, and use this model to simultaneously deblur a video and compute accurate optical flow for highly dynamic scenes containing motion blur. Next, we consider the representation of the motion within layers. Since, in a layered model, important motion discontinuities are captured by the segmentation into layers, the flow within each layer varies smoothly and can be approximated using a low dimensional subspace. 
We show how this subspace can be learned from training data using principal component analysis (PCA), and that flow estimation using this subspace is computationally efficient. The combination of the layered model and the low-dimensional subspace gives the best of both worlds: sharp motion discontinuities from the layers and computational efficiency from the subspace. Lastly, we show how layered methods can be dramatically improved using simple semantics. Instead of treating all layers equally, a semantic segmentation divides the scene into its static parts and moving objects. Static parts of the scene constitute a large majority of what is shown in typical video sequences; yet, in such regions optical flow is fully constrained by the depth structure of the scene and the camera motion. After segmenting out moving objects, we consider only static regions, and explicitly reason about the structure of the scene and the camera motion, yielding much better optical flow estimates. Furthermore, computing the structure of the scene allows information from multiple frames to be combined more effectively, resulting in high accuracy even in occluded regions. For moving regions, we compute the flow using a generic optical flow method, and combine it with the flow computed for the static regions to obtain a full optical flow field. By combining layered models of the scene with reasoning about the dynamic behavior of the real, three-dimensional world, the methods presented herein push the envelope of optical flow computation in terms of robustness, speed, and accuracy, giving state-of-the-art results on benchmarks and pointing to important future research directions for the estimation of motion in natural scenes.
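As an illustration of the low-dimensional subspace idea in the abstract above, the following is a minimal sketch, not the thesis implementation. It assumes NumPy, a synthetic training set built from two hypothetical smooth flow fields (a translation and an expansion), and a made-up sparse sampling pattern `idx`. It learns a PCA basis from the training flows and then recovers a dense flow field from a handful of observed pixels by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 8, 8
ys, xs = np.mgrid[0:h, 0:w].astype(float)

# Hypothetical smooth flow fields, stored as [u values, v values] vectors.
translation = np.concatenate([np.ones(h * w), np.zeros(h * w)])
expansion = np.concatenate([xs.ravel(), ys.ravel()])
true_fields = np.stack([translation, expansion])      # shape (2, 2*h*w)

# Synthetic training set: random mixtures of the two fields.
train = rng.normal(size=(50, 2)) @ true_fields        # shape (50, 2*h*w)

# PCA via SVD of the mean-centred training flows.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
k = 2
basis = vt[:k]                                        # orthonormal rows

# A new flow field, observed only at a sparse set of entries.
new_flow = 0.5 * translation - 0.2 * expansion
idx = np.arange(0, 2 * h * w, 7)                      # assumed sparse samples

# Solve for k coefficients from the sparse observations only.
coeffs, *_ = np.linalg.lstsq(basis[:, idx].T, (new_flow - mean)[idx], rcond=None)
dense = mean + coeffs @ basis                         # dense reconstruction
max_error = np.abs(dense - new_flow).max()
```

Only `k` coefficients are estimated regardless of image size, which is the source of the efficiency; real flow fields are only approximately spanned by such a basis, so the reconstruction would not be exact as it is in this synthetic case.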

Official link DOI Project Page [BibTex]


Co-Registration – Simultaneous Alignment and Modeling of Articulated 3D Shapes

Black, M., Hirshberg, D., Loper, M., Rachlin, E., Weiss, A.

February 2018, U.S. Patent 9,898,848 (misc)

Abstract
The present application refers to a method, a model generation unit and a computer program (product) for generating trained models (M) of moving persons, based on physically measured person scan data (S). The approach is based on a common template (T) for the respective person and on the measured person scan data (S) in different shapes and different poses. Scan data are measured with a 3D laser scanner. A generic personal model is used for co-registering a set of person scan data (S), aligning the template (T) to the set of person scans (S) while simultaneously training the generic personal model to become a trained person model (M) by constraining the generic person model to be scan-specific, person-specific and pose-specific, and providing the trained model (M), based on the co-registering of the measured object scan data (S).

text [BibTex]

2017


Crowdshaping Realistic 3D Avatars with Words

Streuber, S., Ramirez, M. Q., Black, M., Zuffi, S., O’Toole, A., Hill, M. Q., Hahn, C. A.

August 2017, Application PCT/EP2017/051954 (misc)

Abstract
A method for generating a body shape, comprising the steps:
- receiving one or more linguistic descriptors related to the body shape;
- retrieving an association between the one or more linguistic descriptors and a body shape; and
- generating the body shape, based on the association.

Google Patents [BibTex]



Human Shape Estimation using Statistical Body Models

Loper, M. M.

University of Tübingen, May 2017 (thesis)

Abstract
Human body estimation methods transform real-world observations into predictions about human body state. These estimation methods benefit a variety of health, entertainment, clothing, and ergonomics applications. State may include pose, overall body shape, and appearance. Body state estimation is underconstrained by observations; ambiguity presents itself both in the form of missing data within observations, and also in the form of unknown correspondences between observations. We address this challenge with the use of a statistical body model: a data-driven virtual human. This helps resolve ambiguity in two ways. First, it fills in missing data, meaning that incomplete observations still result in complete shape estimates. Second, the model provides a statistically-motivated penalty for unlikely states, which enables more plausible body shape estimates. Body state inference requires more than a body model; we therefore build observation models whose output is compared with real observations. In this thesis, body state is estimated from three types of observations: 3D motion capture markers, depth and color images, and high-resolution 3D scans. In each case, a forward process is proposed which simulates observations. By comparing observations to the results of the forward process, state can be adjusted to minimize the difference between simulated and observed data. We use gradient-based methods because they are critical to the precise estimation of state with a large number of parameters. The contributions of this work include three parts. First, we propose a method for the estimation of body shape, nonrigid deformation, and pose from 3D markers. Second, we present a concise approach to differentiating through the rendering process, with application to body shape estimation. And finally, we present a statistical body model trained from human body scans, with state-of-the-art fidelity, good runtime performance, and compatibility with existing animation packages.

Official Version [BibTex]


Appealing Avatars from 3D Body Scans: Perceptual Effects of Stylization

Fleming, R., Mohler, B. J., Romero, J., Black, M. J., Breidt, M.

In Computer Vision, Imaging and Computer Graphics Theory and Applications: 11th International Joint Conference, VISIGRAPP 2016, Rome, Italy, February 27 – 29, 2016, Revised Selected Papers, pages: 175-196, Springer International Publishing, 2017 (inbook)

Abstract
Using styles derived from existing popular character designs, we present a novel automatic stylization technique for body shape and colour information based on a statistical 3D model of human bodies. We investigate whether such stylized body shapes result in increased perceived appeal with two different experiments: One focuses on body shape alone, the other investigates the additional role of surface colour and lighting. Our results consistently show that the most appealing avatar is a partially stylized one. Importantly, avatars with high stylization or no stylization at all were rated to have the least appeal. The inclusion of colour information and improvements to render quality had no significant effect on the overall perceived appeal of the avatars, and we observe that the body shape primarily drives the change in appeal ratings. For body scans with colour information, we found that a partially stylized avatar was perceived as most appealing.

publisher site pdf DOI [BibTex]



Learning to Filter Object Detections

Prokudin, S., Kappler, D., Nowozin, S., Gehler, P.

In Pattern Recognition: 39th German Conference, GCPR 2017, Basel, Switzerland, September 12–15, 2017, Proceedings, pages: 52-62, Springer International Publishing, Cham, 2017 (inbook)

Abstract
Most object detection systems consist of three stages. First, a set of individual hypotheses for object locations is generated using a proposal generating algorithm. Second, a classifier scores every generated hypothesis independently to obtain a multi-class prediction. Finally, all scored hypotheses are filtered via a non-differentiable and decoupled non-maximum suppression (NMS) post-processing step. In this paper, we propose a filtering network (FNet), a method which replaces NMS with a differentiable neural network that allows joint reasoning and re-scoring of the generated set of hypotheses per image. This formulation enables end-to-end training of the full object detection pipeline. First, we demonstrate that FNet, a feed-forward network architecture, is able to mimic NMS decisions, despite the sequential nature of NMS. We further analyze NMS failures and propose a loss formulation that is better aligned with the mean average precision (mAP) evaluation metric. We evaluate FNet on several standard detection datasets. Results surpass standard NMS on highly occluded settings of a synthetic overlapping MNIST dataset and show competitive behavior on PascalVOC2007 and KITTI detection benchmarks.
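For reference, the conventional post-processing step that the abstract above proposes to replace can be sketched as follows. This is standard greedy NMS, not FNet; the `(x1, y1, x2, y2)` box format, the 0.5 IoU threshold, and the example boxes are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting boxes from highest score down."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the second box heavily overlaps the first.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

Each decision depends on the boxes kept so far, which is the sequential behaviour the abstract refers to, and the hard keep-or-drop rule provides no gradient to the detector, which is why it is described as non-differentiable and decoupled.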

Paper link (url) DOI Project Page [BibTex]



Learning Inference Models for Computer Vision

Jampani, V.

MPI for Intelligent Systems and University of Tübingen, 2017 (phdthesis)

Abstract
Computer vision can be understood as the ability to perform 'inference' on image data. Breakthroughs in computer vision technology are often marked by advances in inference techniques, as even the model design is often dictated by the complexity of inference in them. This thesis proposes learning-based inference schemes and demonstrates applications in computer vision. We propose techniques for inference in both generative and discriminative computer vision models. Despite their intuitive appeal, the use of generative models in vision is hampered by the difficulty of posterior inference, which is often too complex or too slow to be practical. We propose techniques for improving inference in two widely used techniques: Markov Chain Monte Carlo (MCMC) sampling and message-passing inference. Our inference strategy is to learn separate discriminative models that assist Bayesian inference in a generative model. Experiments on a range of generative vision models show that the proposed techniques accelerate the inference process and/or converge to better solutions. A main complication in the design of discriminative models is the inclusion of prior knowledge in a principled way. For better inference in discriminative models, we propose techniques that modify the original model itself, as inference is simple evaluation of the model. We concentrate on convolutional neural network (CNN) models and propose a generalization of standard spatial convolutions, which are the basic building blocks of CNN architectures, to bilateral convolutions. First, we generalize the existing use of bilateral filters and then propose new neural network architectures with learnable bilateral filters, which we call 'Bilateral Neural Networks'. We show how the bilateral filtering modules can be used for modifying existing CNN architectures for better image segmentation and propose a neural network approach for temporal information propagation in videos.
Experiments demonstrate the potential of the proposed bilateral networks on a wide range of vision tasks and datasets. In summary, we propose learning based techniques for better inference in several computer vision models ranging from inverse graphics to freely parameterized neural networks. In generative vision models, our inference techniques alleviate some of the crucial hurdles in Bayesian posterior inference, paving new ways for the use of model based machine learning in vision. In discriminative CNN models, the proposed filter generalizations aid in the design of new neural network architectures that can handle sparse high-dimensional data as well as provide a way for incorporating prior knowledge into CNNs.

pdf [BibTex]



Decentralized Simultaneous Multi-target Exploration using a Connected Network of Multiple Robots

Nestmeyer, T., Robuffo Giordano, P., Bülthoff, H. H., Franchi, A.

In Autonomous Robots, pages: 989-1011, 2017 (incollection)

[BibTex]



Capturing Hand-Object Interaction and Reconstruction of Manipulated Objects

Tzionas, D.

University of Bonn, 2017 (phdthesis)

Abstract
Hand motion capture with an RGB-D sensor has recently gained a lot of research attention; however, even the most recent approaches focus on the case of a single isolated hand. We focus instead on hands that interact with other hands or with a rigid or articulated object. Our framework successfully captures motion in such scenarios by combining a generative model with discriminatively trained salient points, collision detection and physics simulation to achieve a low tracking error with physically plausible poses. All components are unified in a single objective function that can be optimized with standard optimization techniques. We initially assume a-priori knowledge of the object's shape and skeleton. In the case of unknown object shape, there are existing 3D reconstruction methods that capitalize on distinctive geometric or texture features. These methods, though, fail for textureless and highly symmetric objects like household articles, mechanical parts or toys. We show that extracting 3D hand motion for in-hand scanning effectively facilitates the reconstruction of such objects and we fuse the rich additional information of hands into a 3D reconstruction pipeline. Finally, although shape reconstruction is enough for rigid objects, there is a lack of tools that use RGB-D data to build rigged models of articulated objects that deform realistically. We propose a method that creates a fully rigged model consisting of a watertight mesh, embedded skeleton and skinning weights by employing a combination of deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow.

Thesis link (url) Project Page [BibTex]

2011


Benchmark datasets for pose estimation and tracking

Andriluka, M., Sigal, L., Black, M. J.

In Visual Analysis of Humans: Looking at People, pages: 253-274, (Editors: Moeslund and Hilton and Krüger and Sigal), Springer-Verlag, London, 2011 (incollection)

publisher's site Project Page [BibTex]



Steerable random fields for image restoration and inpainting

Roth, S., Black, M. J.

In Markov Random Fields for Vision and Image Processing, pages: 377-387, (Editors: Blake, A. and Kohli, P. and Rother, C.), MIT Press, 2011 (incollection)

Abstract
This chapter introduces the concept of a Steerable Random Field (SRF). In contrast to traditional Markov random field (MRF) models in low-level vision, the random field potentials of a SRF are defined in terms of filter responses that are steered to the local image structure. This steering uses the structure tensor to obtain derivative responses that are either aligned with, or orthogonal to, the predominant local image structure. Analysis of the statistics of these steered filter responses in natural images leads to the model proposed here. Clique potentials are defined over steered filter responses using a Gaussian scale mixture model and are learned from training data. The SRF model connects random fields with anisotropic regularization and provides a statistical motivation for the latter. Steering the random field to the local image structure improves image denoising and inpainting performance compared with traditional pairwise MRFs.
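The steering step described in the abstract above can be illustrated with a small sketch. It is not the chapter's model: it only shows how a structure tensor built from image gradients in a window yields a dominant orientation, and how a gradient is rotated into a component across the local structure and a component along it. The function names and the 3x3 window of identical gradients are hypothetical.

```python
import math


def dominant_orientation(gradients):
    """Angle of the principal eigenvector of the structure tensor.

    `gradients` is a list of (gx, gy) image derivatives in a local window.
    """
    jxx = sum(gx * gx for gx, gy in gradients)
    jyy = sum(gy * gy for gx, gy in gradients)
    jxy = sum(gx * gy for gx, gy in gradients)
    return 0.5 * math.atan2(2.0 * jxy, jxx - jyy)


def steer(gx, gy, theta):
    """Rotate a gradient into the frame given by the dominant orientation."""
    across = math.cos(theta) * gx + math.sin(theta) * gy   # orthogonal to an edge
    along = -math.sin(theta) * gx + math.cos(theta) * gy   # aligned with an edge
    return across, along


# Hypothetical 3x3 window on a diagonal edge: every gradient is (1, 1).
window = [(1.0, 1.0)] * 9
theta = dominant_orientation(window)
across, along = steer(1.0, 1.0, theta)
```

For this window the dominant orientation is 45 degrees, so all of the gradient energy lands in the `across` component and the `along` component vanishes. In the chapter, potentials are defined over steered responses of this kind using Gaussian scale mixtures learned from data.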

publisher site [BibTex]
