Creating convincing human facial animation is challenging. Face animation is often hand-crafted by artists separately from body motion. Alternatively, if the face animation is derived from motion capture, it is typically performed while the actor is relatively still. Recombining the isolated face animation with body motion is non-trivial and often looks uncanny if the body dynamics are not properly reflected on the face (e.g. cheeks wiggling when running). In this talk, I will discuss the challenges of human soft tissue simulation and control. I will then present our method for adding physical effects to facial blendshape animation. Unlike previous methods that try to add physics to face rigs, our method can combine facial animation and rigid body motion consistently while preserving the original animation as closely as possible. Our novel simulation framework uses the original animation as per-frame rest poses without adding spurious forces. We also propose the concept of blendmaterials to give artists an intuitive means to control the changing material properties due to muscle activation.
Organizers: Timo Bolkart
Sensors acquire an increasing amount of diverse information, posing two challenges. Firstly, how can we efficiently deal with such large amounts of data, and secondly, how can we benefit from this diversity? In this talk I will first present an approach to deal with large graphical models. The presented method distributes and parallelizes the computation and memory requirements while preserving the convergence and optimality guarantees of existing inference and learning algorithms. I will demonstrate the effectiveness of the approach on stereo reconstruction from high-resolution imagery. In the second part I will present a unified framework for structured prediction with latent variables, which includes hidden conditional random fields and latent structured support vector machines as special cases. This framework allows different sources of information to be combined linearly, and I will demonstrate its efficacy on the problem of estimating the 3D room layout from a single image. In the third part I will introduce a globally optimal yet efficient inference algorithm for the latter problem, based on branch-and-bound.
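As a rough illustration of the branch-and-bound idea mentioned above, here is a minimal sketch in Python. It is not the inference algorithm from the talk: the objective is a toy quadratic score over a small integer box, and the names `branch_and_bound`, `score` and `upper_bound` are hypothetical. The search is best-first. It repeatedly splits the box with the highest upper bound, and it keys single-point boxes by their exact score, so the first single point popped from the queue is a global maximizer whenever the bound is valid.

```python
import heapq


def branch_and_bound(box, upper_bound, score):
    """Best-first search for the integer point in `box` that maximizes `score`.

    box: tuple of inclusive integer ranges (lo, hi), one per variable.
    upper_bound(box): must return a value >= score(p) for every point p in box.
    """
    def key(b):
        # Single-point boxes are keyed by their exact score, larger boxes by the bound.
        if all(lo == hi for lo, hi in b):
            return score(tuple(lo for lo, _ in b))
        return upper_bound(b)

    heap = [(-key(box), box)]  # heapq is a min-heap, so keys are negated
    while heap:
        _, current = heapq.heappop(heap)
        widths = [hi - lo for lo, hi in current]
        if max(widths) == 0:
            # Exact score >= every remaining bound, so this point is optimal.
            point = tuple(lo for lo, _ in current)
            return point, score(point)
        axis = widths.index(max(widths))  # split the widest variable
        lo, hi = current[axis]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):
            child = current[:axis] + (part,) + current[axis + 1:]
            heapq.heappush(heap, (-key(child), child))
    raise ValueError("empty search space")


TARGET = (3, -2)  # toy optimum, chosen only for this illustration


def score(point):
    return -sum((c - t) ** 2 for c, t in zip(point, TARGET))


def upper_bound(box):
    # Best achievable score in the box: clamp the target into each interval.
    total = 0
    for (lo, hi), t in zip(box, TARGET):
        nearest = min(max(t, lo), hi)
        total -= (nearest - t) ** 2
    return total


best_point, best_score = branch_and_bound(((-8, 8), (-8, 8)), upper_bound, score)
print(best_point, best_score)  # prints: (3, -2) 0
```

The quality of the bound determines how much of the space is pruned: the tighter `upper_bound` is, the fewer boxes are split before the optimum is reached.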
Consumer-level depth cameras such as Kinect have changed the landscape of 3D computer vision. In this talk we will discuss two approaches that both learn to directly infer correspondences between observed depth image pixels and 3D model points. These correspondences can then be used to drive an optimization of a generative model to explain the data. The first approach, the "Vitruvian Manifold", aims to fit an articulated 3D human model to a depth camera image, and extends our original Body Part Recognition algorithm used in Kinect. It applies a per-pixel regression forest to infer direct correspondences between image pixels and points on a human mesh model. This allows an efficient "one-shot" continuous optimization of the model parameters to recover the human pose. The second approach, "Scene Coordinate Regression", addresses the problem of camera pose relocalization. It uses a similar regression forest, but now aims to predict correspondences between observed image pixels and 3D world coordinates in an arbitrary 3D scene. These correspondences are again used to drive an efficient optimization of the camera pose to a highly accurate result from a single input frame.
Developing autonomous systems that are able to assist humans in everyday tasks is one of the grand challenges in modern computer science. Notable examples are personal robotics for the elderly and people with disabilities, as well as autonomous driving systems which can help decrease fatalities caused by traffic accidents. In order to perform tasks such as navigation, recognition and manipulation of objects, these systems should be able to efficiently extract 3D knowledge of their environment. In this talk, I'll show how Markov random fields provide a great mathematical formalism to extract this knowledge. In particular, I'll focus on a few examples, namely 3D reconstruction, 3D layout estimation, 2D holistic parsing and object detection, and show representations and inference strategies that allow us to achieve state-of-the-art performance as well as speed-ups of several orders of magnitude.
Motion capture and data-driven technologies have come very far over the past few years. In human capture, the high volume of research in this subfield has led to very impressive results. Human motion can now be captured in real time, which, when used in the creative sectors, can lead to blockbuster films such as Avatar. Similarly, in the medical sectors these techniques can be used to diagnose, analyse performance and avoid invasive procedures in tasks such as deformity correction. There is, however, very little research on motion capture of animals. While the technology for capturing animal motion exists, the method used is inefficient, unreliable and limited, as much manual work is required to turn blocked-out motions into acceptable results. The major question, however, is how to move forward with a suitable procedure. Do we extend the life of marker-based capture or do we move towards the holy grail of markerless tracking? In this talk we look at a possible solution suitable for both possibilities through physically based simulation techniques. It is our belief that such techniques could help bridge the uncanny valley for marker-based capture and also be useful for markerless tracking.
Non-blind deblurring is an integral component of blind approaches for removing image blur due to camera shake. Even though learning-based deblurring methods exist, they have been limited to the generative case and are computationally expensive. To date, manually defined models are thus the most widely used, although they limit the attainable restoration quality. We address this gap by proposing a discriminative approach for non-blind deblurring. One key challenge is that the blur kernel in use at test time is not known in advance. To address this, we analyze existing approaches that use half-quadratic regularization. From this analysis, we derive a discriminative model cascade for image deblurring. Our cascade model consists of a Gaussian CRF at each stage, based on the recently introduced regression tree fields. We train our model by loss minimization and use synthetically generated blur kernels to generate training data. Our experiments show that the proposed approach is efficient and yields state-of-the-art restoration quality on images corrupted with synthetic and real blur.
Irregular triangle meshes are a powerful digital shape representation: they are flexible and can represent virtually any complex shape; they are efficiently rendered by graphics hardware; they are the standard output of 3D acquisition and routinely used as input to simulation software. Yet irregular meshes are difficult to model and edit because they lack a higher-level control mechanism. In this talk, I will survey a series of research results on surface modeling with meshes and show how high-quality shapes can be manipulated in a fast and intuitive manner. I will outline the current challenges in intelligent and more user-friendly modeling metaphors and will attempt to suggest possible directions for future work in this area.
3D reconstruction from images has been a tremendous success story of computer vision, with city-scale reconstruction now a reality. However, these successes apply almost exclusively in a static world, where the only motion is that of the camera. Even with the advent of real-time depth cameras, full 3D modelling of dynamic scenes lags behind the rigid-scene case, and for many objects of interest (e.g. animals moving in natural environments), depth sensing remains challenging. In this talk, I will discuss a range of recent work in the modelling of nonrigid real-world 3D shape from 2D images, for example building generic animal models from internet photo collections. While the state of the art depends heavily on dense point tracks from textured surfaces, it is rare to find suitably textured surfaces: most animals are limited in texture (think of dogs, cats, cows, horses, …). I will show how this assumption can be relaxed by incorporating the strong constraints given by the object’s silhouette.
Although significant progress has been made in recent years in estimating people's shape and motion from video, the problem remains unsolved. This is especially true in uncontrolled environments such as streets or offices, where background clutter and occlusions make the problem even more challenging.
The goal of our research is to develop computational methods that enable human pose estimation from video and inertial sensors in indoor and outdoor environments. Specifically, I will focus on one of our past projects in which we introduce a hybrid human motion capture system that combines video input with sparse inertial sensor input. Employing a particle-based optimization scheme, our idea is to use orientation cues derived from the inertial input to sample particles from the manifold of valid poses. Additionally, we introduce a novel sensor noise model, based on the von Mises-Fisher distribution, to account for uncertainties. In this way, orientation constraints are naturally fulfilled and the number of particles needed can be kept very small. More generally, our method can be used to sample poses that fulfill arbitrary orientation or positional kinematic constraints. In the experiments, we show that our system can track even highly dynamic motions in an outdoor environment with changing illumination, background clutter, and shadows.
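To make the von Mises-Fisher noise model more concrete, the following is a minimal sketch of how unit vectors can be drawn from that distribution on the ordinary sphere in 3D, where the cosine of the angle to the mean direction has a closed-form inverse CDF. It only illustrates the distribution and is not the sampling scheme of the system described above. The function name `sample_vmf_s2`, the mean direction `mu` and the concentration `kappa` (assumed to be strictly positive) are all hypothetical.

```python
import math
import random


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def sample_vmf_s2(mu, kappa, rng=random):
    """Draw one unit vector from a von Mises-Fisher distribution on the 2-sphere.

    mu: mean direction (any non-zero 3-vector, normalized internally).
    kappa: concentration, assumed > 0; larger values cluster samples around mu.
    """
    mu = _normalize(mu)
    # w = cos(angle to mu) has density proportional to exp(kappa * w) on [-1, 1];
    # invert its CDF in closed form. u lies in (0, 1], so the log is always defined.
    u = 1.0 - rng.random()
    w = 1.0 + math.log(u + (1.0 - u) * math.exp(-2.0 * kappa)) / kappa
    w = max(-1.0, min(1.0, w))  # guard against rounding slightly outside [-1, 1]
    # Uniform direction in the tangent plane orthogonal to mu.
    phi = 2.0 * math.pi * rng.random()
    helper = (1.0, 0.0, 0.0) if abs(mu[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = _normalize(_cross(mu, helper))
    e2 = _cross(mu, e1)
    s = math.sqrt(1.0 - w * w)
    return tuple(w * mu[i] + s * (math.cos(phi) * e1[i] + math.sin(phi) * e2[i])
                 for i in range(3))
```

For example, `sample_vmf_s2((0.0, 0.0, 1.0), 50.0)` returns a unit vector that is usually close to the z-axis. Lowering `kappa` spreads the samples more widely over the sphere.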
There are an estimated 3.5 trillion photographs in the world, of which 10% have been taken in the past 12 months. Facebook alone reports 6 billion photo uploads per month. Every minute, 72 hours of video are uploaded to YouTube. Cisco estimates that in the next few years, visual data (photos and video) will account for over 85% of total internet traffic. Yet, we currently lack effective computational methods for making sense of all this visual data. Unlike easily indexed content, such as text, visual content is not routinely searched or mined; it's not even hyperlinked. Visual data is the Internet's "digital dark matter" [Perona, 2010] -- it's just sitting there!
In this talk, I will first discuss some of the unique challenges that make Big Visual Data difficult compared to other types of content. In particular, I will argue that the central problem is the lack of a good measure of similarity for visual data. I will then present some of our recent work that aims to address this challenge in the context of visual matching, image retrieval and visual data mining. As an application of the latter, we used Google Street View data for an entire city in an attempt to answer that age-old question which has been vexing poets (and poets-turned-geeks): "What makes Paris look like Paris?"