Deep learning has significantly advanced state-of-the-art for 3D hand pose estimation, of which accuracy can be improved with increased amounts of labelled data. However, acquiring 3D hand pose labels can be extremely difficult. In this talk, I will present our recent two works on leveraging self-supervised learning techniques for hand pose estimation from depth map. In both works, we incorporate differentiable renderer to the network and formulate training loss as model fitting error to update network parameters. In first part of the talk, I will present our earlier work which approximates hand surface with a set of spheres. We then model the pose prior as a variational lower bound with variational auto-encoder(VAE). In second part, I will present our latest work on regressing the vertex coordinates of a hand mesh model with 2D fully convolutional network(FCN) in a single forward pass. In the first stage, the network estimates a dense correspondence field for every pixel on the image grid to the mesh grid. In the second stage, we design a differentiable operator to map features learned from the previous stage and regress a 3D coordinate map on the mesh grid. Finally, we sample from the mesh grid to recover the mesh vertices, and fit it an articulated template mesh in closed form. Without any human annotation, both works can perform competitively with strongly supervised methods. The later work will also be later extended to be compatible with MANO model.
Organizers: Dimitrios Tzionas
Modeling and reconstruction of shape and motion are problems of fundamental importance in computer vision. Inverse Problem theory constitutes a powerful mathematical framework for dealing with ill-posed problems as the ones typically arising in shape and motion modeling. In this talk, I will present methods inspired by Inverse Problem theory, for dealing with four different shape and motion modeling problems. In particular, in the context of shape modeling, I will present a method for component-wise modeling of articulated objects and its application in computing 3D models of animals. Additionally, I will discuss the problem of modeling of specular surfaces via the properties of their material, and I will also present a model for confidence driven depth image fusion based on total variation regularization. Regarding motion, I will discuss a method for the recognition of human actions from motion capture data based on Nonparametric Bayesian models.
Inverse problems are ubiquitous in image processing and applied science in general. Such problems describe the challenge of computing the parameters that characterize a system from the outcomes. While this might seem easy at first for simple systems, many inverse problems share a property that makes them much more intricate: they are ill-posed. This means that either the problem does not have a unique solution or this solution does not depend continuously on the outcomes of the system. Bayesian statistics provides a framework that allows to treat such problems in a systematic way. The missing piece of information is encoded as a prior distribution on the space of possible solutions. In this talk, we will study probabilistic image models as priors for statistical inversion. In particular, we will give a probabilistic interpretation of the classical TV-prior and discuss how this interpretation can be used as a starting point for more complex models. We will see that many important auxiliary quantities such as edges and regions can be incorporated into the model in the form of latent variables. This leads to the conjecture that many image processing tasks, such as denoising and segmentation, should not be considered separately, but instead be treated together.
In general Helga Griffiths is a Multi-Sense-Artist working on the intersection of science and art. She has been working for over 20 years on the integration of various sensory stimuli into her “multi-sense” installations. Typical for her work is to produce a sensory experience to transcend conventional boundaries of perception.
Organizers: Emma-Jayne Holderness
I will describe a series of work that aims to automatically understand images of animals and plants. I will begin by describing recent work that uses Bounded Distortion matching to model pose variation in animals. Using a generic 3D model of an animal and multiple images of different individuals in various poses, we construct a model that captures the way in which the animal articulates. This is done by solving for the pose of the template that matches each image while simultaneously solving for the stiffness of each tetrahedron of the model. We minimize an L1 norm on stiffness, producing a model that bends easily at joints, but that captures the rigidity of other parts of the animal. We show that this model can determine the pose of animals such as cats in a wide range of positions. Bounded distortion forms a core part of the matching between 3D model and 2D images. I will also show that Bounded Distortion can be used for 2D matching. We use it to find corresponding features in images very robustly, optimizing an L0 distance to maximize the number of matched features, while bounding the amount of non-rigid variation between the images. We demonstrate the use of this approach in matching non-rigid objects and in wide-baseline matching of features. I will also give an overview of a method for identifying the parts of animals in images, to produce an automatic correspondence between images of animals. Building on these correspondences we develop methods for recognizing the species of a bird, or the breed of a dog. We use these recognition algorithms to construct electronic field guides. I will describe three field guides that we have published, Birdsnap, Dogsnap, and Leafsnap. Leafsnap identifies the species of trees using shape-based matching to compare images of leaves. Leafsnap has been downloaded by over 1.5 million users, and has been used in schools and in biodiversity studies. This work has been done in collaboration with many University of Maryland students and with groups at Columbia University, the Smithsonian Institution National Museum of Natural History, and the Weizmann Institute.
Organizers: Stephan Streuber
The design of tangent vector fields on discrete surfaces is a basic building block for many geometry processing applications, such as surface remeshing, parameterization and architectural geometric design. Many applications require the design of multiple vector fields (vector sets) coupled in a nontrivial way; for example, sets of more than two vectors are used for meshing of triangular, quadrilateral and hexagonal meshes. In this talk, a new, polynomial-based representation for general unordered vector sets will be presented. Using this representation we can efficiently interpolate user provided vector constraints to design vector set fields. Our interpolation scheme will require neither integer period jumps, nor explicit pairings of vectors between adjacent sets on a manifold, as is common in field design literature. Several extensions to the basic interpolation scheme are possible, which make our representation applicable in various scenarios; in this talk, we will focus on generating vector set fields particularly suited for mesh parameterization and show applications in architectural modeling.
Organizers: Gerard Pons-Moll
The recent amazing success of deep learning has been mainly in discriminative learning, that is, classification and regression. An important factor for this success has been, besides Moore's law, the availability of large labeled datasets. However, it is not clear whether in the future the amount of available labels grows as fast as the amount of unlabeled data, providing one argument to be interested in unsupervised and semi-supervised learning. Besides this there are a number of other reasons why unsupervised learning is still important, such as the fact that data in the life sciences often has many more features than instances (p>>n), the fact that probabilities over feature space are useful for planning and control problems and the fact that complex simulator models are the norm in the sciences. In this talk I will discuss deep generative models that can be jointly trained with discriminative models and that facilitate semi-supervised learning. I will discuss recent progress in learning and Bayesian inference in these "variational auto-encoders". I will then extend the deep generative models to the class of simulators for which no tractable likelihood exists and discuss new Bayesian inference procedures to fit these models to data.
Organizers: Peter Vincent Gehler
Lilla and Bill are two returning artists to Perceiving Systems. Their talk will update us on the exciting projects that they’ve been involved with since their last visit and to present some of their current plans that will unfold during the week (Sept 21st - 25th). They will be joining our department and working with professional dancers in the 4D scanner as part of an art project on mental health. In general, Lilla and Bill have been using 3D captures as an artistic tool to visualize the human body in a contemporary form for some time. They produce marionettes or avatars which can be seen as figures that are anonymous yet universal. Through this medium they portray a prominent theme of human frailty.
Organizers: Emma-Jayne Holderness
In this talk, I will start with describing the pervasiveness of image and video content, and how such content is growing with the ubiquity of cameras. I will use this to motivate the need for better tools for analysis and enhancement of video content. I will start with some of our earlier work on temporal modeling of video, then lead up to some of our current work and describe two main projects. (1) Our approach for a video stabilizer, currently implemented and running on YouTube, and its extensions. (2) A robust and scaleable method for video segmentation. I will describe, in some detail, our Video stabilization method, which generates stabilized videos and is in wide use. Our method allows for video stabilization beyond the conventional filtering that only suppresses high frequency jitter. This method also supports removal of rolling shutter distortions common in modern CMOS cameras that capture the frame one scan-line at a time resulting in non-rigid image distortions such as shear and wobble. Our method does not rely on a-priori knowledge and works on video from any camera or on legacy footage. I will showcase examples of this approach and also discuss how this method is launched and running on YouTube, with Millions of users. Then I will describe an efficient and scalable technique for spatio-temporal segmentation of long video sequences using a hierarchical graph-based algorithm. This hierarchical approach generates high quality segmentations and we demonstrate the use of this segmentation as users interact with the video, enabling efficient annotation of objects within the video. I will also show some recent work on how this segmentation and annotation can be used to do dynamic scene understanding. I will then follow up with some recent work on image and video analysis in the mobile domains. I will also make some observations about ubiquity of imaging and video in general and need for better tools for video analysis.
Organizers: Naejin Kong
Optics with long focal length have been extensively used for shooting 2D cinema and television, either to virtually get closer to the scene or to produce an aesthetical effect through the deformation of the perspective. However, in 3D cinema or television, the use of long focal length either creates a ``cardboard effect'' or causes visual divergence. To overcome this problem, state-of-the-art methods use disparity mapping techniques, which is a generalization of view interpolation, and generate new stereoscopic pairs from the two image sequences. We propose to use more than two cameras to solve for the remaining issues in disparity mapping methods. In the first part of the talk, we briefly review the causes of visual fatigue and visual discomfort when viewing a stereoscopic film. We model the depth perception from stereopsis of a 3D scene shot with two cameras, and projected in a movie theater or on a 3DTV. We mathematically characterize this 3D distortion, and derive the mathematical constraints associated with the causes of visual fatigue and discomfort. We illustrate these 3D distortions with a new interactive software, ``The Virtual Projection Room". In order to generate the desired stereoscopic images, we propose to use image-based rendering. These techniques usually proceed in two stages. First, the input images are warped into the target view, and then the warped images are blended together. The warps are usually computed with the help of a geometric proxy (either implicit or explicit). Image blending has been extensively addressed in the literature and a few heuristics have proven to achieve very good performance. Yet the combination of the heuristics is not straightforward, and requires manual adjustment of many parameters. We present a new Bayesian approach to the problem of novel view synthesis, based on a generative model taking into account the uncertainty of the image warps in the image formation model. The Bayesian formalism allows us to deduce the energy of the generative model and to compute the desired images as the Maximum a Posteriori estimate. The method outperforms state-of-the-art image-based rendering techniques on challenging datasets. Moreover, the energy equations provide a formalization of the heuristics widely used inimage-based rendering techniques. Besides, the proposed generative model also addresses the problem of super-resolution, allowing to render images at a higher resolution than the initial ones. In the last part of the presentation, we apply the new rendering technique to the case of the stereoscopic zoom.
The visual effects and entertainment industries are now a fundamental part of the computer graphics and vision landscapes - as well as impacting across society in general. One of the issues in this area is the creation of realistic characters, creating assets for production, and improving work-flow. Advances in computer graphics, vision and rendering have underlined much of the success of these industries, built on top of academic advances. However, there are still many unsolved problems. In this talk I will outline some of the challenges we have faced in crossing over academic research into the visual effects industry. In particular, I will attempt to distinguish between academic challenges and industrial demands we have experienced - and how this has impacted projects. This draws on experience in several themes involving leading Visual Effects and entertainment companies. Our work has been in several diverse areas, including on-set capture, digital doubles, real-time animation and motion capture retargeting. I will describe how many of these problems led to us step back and focus on first solving more fundamental computer vision research problems - particularly in the area of optical flow, non-rigid tracking and shadow removal - and how these opened up other opportunities. Some of these projects are supported through our Centre for Digital Entertainment (CDE) - which has 60 PhD level student embedded across the creative industries in the UK. Others are more specific to partners at The Imaginarium and Double Negative Visual Effects. Attempting to draw these experiences together, we are now starting a new Centre for the Analysis of Motion, Entertainment Research and Applications (CAMERA), with leading partners across entertainment, elite sport and rehabilitation.
Organizers: Silvia Zuffi