Human pose stability analysis is key to understanding locomotion and control of body equilibrium, with numerous applications in the fields of Kinesiology, Medicine and Robotics. We propose and validate a novel approach to learn dynamics from kinematics of a human body to aid stability analysis. More specifically, we propose an end-to-end deep learning architecture to regress foot pressure from a human pose derived from video. We have collected and utilized a set of long (5+ minute) choreographed Taiji (Tai Chi) sequences of multiple subjects with synchronized motion capture, foot pressure and video data. The derived human pose data and corresponding foot pressure maps are used jointly in training a convolutional neural network with residual architecture, named “PressNET”. Cross validation results show promising performance of PressNET, significantly outperforming the baseline method under reasonable sensor noise ranges.
Organizers: Nadine Rueegg
Understanding objects and their behavior from images and videos is a difficult inverse problem. It requires learning a metric in image space that reflects object relations in the real world. This metric learning problem calls for large volumes of training data. While images and videos are easily available, labels are not, thus motivating self-supervised metric and representation learning. Furthermore, I will present a widely applicable strategy based on deep reinforcement learning to improve the surrogate tasks underlying self-supervision. Thereafter, the talk will cover the learning of disentangled representations that explicitly separate different object characteristics. Our approach is based on an analysis-by-synthesis paradigm and can generate novel object instances with flexible changes to individual characteristics such as their appearance and pose. It addresses diverse applications in human and animal behavior analysis, a topic on which we collaborate intensively with neuroscientists. Time permitting, I will discuss the disentangling of representations from a wider perspective, including novel strategies for image stylization and new strategies for regularization of the latent space of generator networks.
Organizers: Joel Janai
Over the past few years, with the advent of Deep Convolutional Neural Networks (DCNNs) and the availability of visual data, it has been shown that excellent results can be produced in very challenging tasks, such as visual object recognition, detection and tracking. Nevertheless, in certain tasks such as fine-grained object recognition (e.g., face recognition) it is very difficult to collect the amount of data that is needed. In this talk, I will show how, using DCNNs, we can generate highly realistic faces and heads and use them for training algorithms such as face and facial expression recognition. Next, I will reverse the problem and demonstrate how a very powerful, already trained face recognition network can be used to perform very accurate 3D shape and texture reconstruction of faces from a single image. Finally, I will demonstrate how to create very lightweight networks for representing 3D face texture and shape structure by capitalising upon intrinsic mesh convolutions.
Organizers: Dimitris Tzionas
In this talk, I will present my understanding of 3D face reconstruction, modelling and applications from a deep learning perspective. In the first part of my talk, I will discuss the relationship between representations (point clouds, meshes, etc.) and network layers (CNN, GCN, etc.) for the face reconstruction task, then present my ECCV work PRN, which proposed a new representation to help achieve state-of-the-art performance on face reconstruction and dense alignment tasks. I will also introduce my open source project face3d, which provides examples for generating different 3D face representations. In the second part of the talk, I will discuss some publications on integrating 3D techniques into deep networks, then introduce my upcoming work which implements this. In the third part, I will present how related tasks can promote each other in deep learning, including face recognition for the face reconstruction task and face reconstruction for the face anti-spoofing task. Finally, with such understanding of these three parts, I will present my plans on 3D face modelling and applications.
Organizers: Timo Bolkart
Much existing work in reinforcement learning involves environments that are either intentionally neutral, lacking a role for cooperation and competition, or intentionally simple, where agents need imagine nothing more than that they are playing versions of themselves. Richer game-theoretic notions become important as these constraints are relaxed. For humans, this encompasses issues that concern utility, such as envy and guilt, and that concern inference, such as recursive modeling of other players. I will discuss studies treating a paradigmatic game of trust as an interactive partially-observable Markov decision process, and will illustrate the solution concepts with evidence from interactions between various groups of subjects, including those diagnosed with borderline and anti-social personality disorders.
Helga Griffiths is a Multi-Sense-Artist working at the intersection of science and art. She has been working for over 20 years on the integration of various sensory stimuli into her “multi-sense” installations. Her work typically produces a sensory experience that transcends conventional boundaries of perception.
I will describe a body of work that aims to automatically understand images of animals and plants. I will begin by describing recent work that uses Bounded Distortion matching to model pose variation in animals. Using a generic 3D model of an animal and multiple images of different individuals in various poses, we construct a model that captures the way in which the animal articulates. This is done by solving for the pose of the template that matches each image while simultaneously solving for the stiffness of each tetrahedron of the model. We minimize an L1 norm on stiffness, producing a model that bends easily at joints, but that captures the rigidity of other parts of the animal. We show that this model can determine the pose of animals such as cats in a wide range of positions. Bounded distortion forms a core part of the matching between 3D model and 2D images. I will also show that Bounded Distortion can be used for 2D matching. We use it to find corresponding features in images very robustly, optimizing an L0 distance to maximize the number of matched features, while bounding the amount of non-rigid variation between the images. We demonstrate the use of this approach in matching non-rigid objects and in wide-baseline matching of features. I will also give an overview of a method for identifying the parts of animals in images, to produce an automatic correspondence between images of animals. Building on these correspondences we develop methods for recognizing the species of a bird, or the breed of a dog. We use these recognition algorithms to construct electronic field guides. I will describe three field guides that we have published, Birdsnap, Dogsnap, and Leafsnap. Leafsnap identifies the species of trees using shape-based matching to compare images of leaves. Leafsnap has been downloaded by over 1.5 million users, and has been used in schools and in biodiversity studies.
This work has been done in collaboration with many University of Maryland students and with groups at Columbia University, the Smithsonian Institution National Museum of Natural History, and the Weizmann Institute.
Organizers: Stephan Streuber
The design of tangent vector fields on discrete surfaces is a basic building block for many geometry processing applications, such as surface remeshing, parameterization and architectural geometric design. Many applications require the design of multiple vector fields (vector sets) coupled in a nontrivial way; for example, sets of more than two vectors are used for meshing of triangular, quadrilateral and hexagonal meshes. In this talk, a new, polynomial-based representation for general unordered vector sets will be presented. Using this representation we can efficiently interpolate user-provided vector constraints to design vector set fields. Our interpolation scheme requires neither integer period jumps nor explicit pairings of vectors between adjacent sets on a manifold, as is common in the field design literature. Several extensions to the basic interpolation scheme are possible, which make our representation applicable in various scenarios; in this talk, we will focus on generating vector set fields particularly suited for mesh parameterization and show applications in architectural modeling.
Organizers: Gerard Pons-Moll
The recent success of deep learning has been mainly in discriminative learning, that is, classification and regression. An important factor for this success has been, besides Moore's law, the availability of large labeled datasets. However, it is not clear whether in the future the number of available labels will grow as fast as the amount of unlabeled data, which provides one argument for being interested in unsupervised and semi-supervised learning. Besides this, there are a number of other reasons why unsupervised learning is still important, such as the fact that data in the life sciences often has many more features than instances (p>>n), the fact that probabilities over feature space are useful for planning and control problems, and the fact that complex simulator models are the norm in the sciences. In this talk I will discuss deep generative models that can be jointly trained with discriminative models and that facilitate semi-supervised learning. I will discuss recent progress in learning and Bayesian inference in these "variational auto-encoders". I will then extend the deep generative models to the class of simulators for which no tractable likelihood exists and discuss new Bayesian inference procedures to fit these models to data.
Lilla and Bill are two returning artists to Perceiving Systems. Their talk will update us on the exciting projects that they’ve been involved with since their last visit and present some of their current plans that will unfold during the week (Sept 21st - 25th). They will be joining our department and working with professional dancers in the 4D scanner as part of an art project on mental health. Lilla and Bill have for some time been using 3D captures as an artistic tool to visualize the human body in a contemporary form. They produce marionettes or avatars which can be seen as figures that are anonymous yet universal. Through this medium they portray a prominent theme of human frailty.
In this talk, I will start by describing the pervasiveness of image and video content, and how such content is growing with the ubiquity of cameras. I will use this to motivate the need for better tools for analysis and enhancement of video content. I will start with some of our earlier work on temporal modeling of video, then lead up to some of our current work and describe two main projects: (1) our approach for a video stabilizer, currently implemented and running on YouTube, and its extensions; (2) a robust and scalable method for video segmentation. I will describe, in some detail, our video stabilization method, which generates stabilized videos and is in wide use. Our method allows for video stabilization beyond the conventional filtering that only suppresses high-frequency jitter. This method also supports removal of rolling shutter distortions common in modern CMOS cameras, which capture the frame one scan-line at a time, resulting in non-rigid image distortions such as shear and wobble. Our method does not rely on a priori knowledge and works on video from any camera or on legacy footage. I will showcase examples of this approach and also discuss how this method was launched and is running on YouTube, with millions of users. Then I will describe an efficient and scalable technique for spatio-temporal segmentation of long video sequences using a hierarchical graph-based algorithm. This hierarchical approach generates high-quality segmentations, and we demonstrate the use of this segmentation as users interact with the video, enabling efficient annotation of objects within the video. I will also show some recent work on how this segmentation and annotation can be used for dynamic scene understanding. I will then follow up with some recent work on image and video analysis in the mobile domain. I will also make some observations about the ubiquity of imaging and video in general and the need for better tools for video analysis.
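For reference, the "conventional filtering" that the abstract contrasts against can be sketched as a low-pass filter over the estimated camera path. The sketch below is only that baseline, not the stabilizer described in the talk; the function names, the 1D path representation and the window radius are illustrative assumptions.

```python
from typing import List


def smooth_path(path: List[float], radius: int = 2) -> List[float]:
    """Moving-average low-pass filter over a per-frame camera position.

    `path` holds one coordinate of the estimated camera trajectory per frame
    (a hypothetical input; a real system would estimate it from the video).
    """
    smoothed = []
    for i in range(len(path)):
        lo = max(0, i - radius)                # clamp the window at the start
        hi = min(len(path), i + radius + 1)    # clamp the window at the end
        window = path[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def stabilizing_offsets(path: List[float], radius: int = 2) -> List[float]:
    """Per-frame shift that moves each frame from the raw path onto the smoothed one."""
    return [s - p for s, p in zip(smooth_path(path, radius), path)]
```

Applying `stabilizing_offsets` to a jittery path yields the correction to apply to each frame. Such a filter suppresses only high-frequency jitter, which is the limitation the abstract says the presented method goes beyond.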
Organizers: Naejin Kong
Optics with long focal length have been extensively used for shooting 2D cinema and television, either to virtually get closer to the scene or to produce an aesthetic effect through the deformation of the perspective. However, in 3D cinema or television, the use of long focal length either creates a "cardboard effect" or causes visual divergence. To overcome this problem, state-of-the-art methods use disparity mapping techniques, which are a generalization of view interpolation, and generate new stereoscopic pairs from the two image sequences. We propose to use more than two cameras to solve the remaining issues in disparity mapping methods. In the first part of the talk, we briefly review the causes of visual fatigue and visual discomfort when viewing a stereoscopic film. We model the depth perception from stereopsis of a 3D scene shot with two cameras, and projected in a movie theater or on a 3DTV. We mathematically characterize this 3D distortion, and derive the mathematical constraints associated with the causes of visual fatigue and discomfort. We illustrate these 3D distortions with new interactive software, "The Virtual Projection Room". In order to generate the desired stereoscopic images, we propose to use image-based rendering. These techniques usually proceed in two stages. First, the input images are warped into the target view, and then the warped images are blended together. The warps are usually computed with the help of a geometric proxy (either implicit or explicit). Image blending has been extensively addressed in the literature and a few heuristics have proven to achieve very good performance. Yet the combination of the heuristics is not straightforward, and requires manual adjustment of many parameters. We present a new Bayesian approach to the problem of novel view synthesis, based on a generative model taking into account the uncertainty of the image warps in the image formation model.
The Bayesian formalism allows us to deduce the energy of the generative model and to compute the desired images as the Maximum a Posteriori estimate. The method outperforms state-of-the-art image-based rendering techniques on challenging datasets. Moreover, the energy equations provide a formalization of the heuristics widely used in image-based rendering techniques. In addition, the proposed generative model addresses the problem of super-resolution, allowing images to be rendered at a higher resolution than the initial ones. In the last part of the presentation, we apply the new rendering technique to the case of the stereoscopic zoom.
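To illustrate how a probabilistic view can turn blending heuristics into a derived rule, here is a deliberately simplified sketch. It assumes each warped view observes the true pixel value with independent Gaussian noise whose variance stands in for that view's warp uncertainty, and it assumes a flat prior. Under those assumptions the Maximum a Posteriori estimate is the inverse-variance weighted mean. This is not the generative model from the talk; the function names and the per-view variance inputs are hypothetical.

```python
from typing import List, Sequence


def blend_pixel(values: Sequence[float], variances: Sequence[float]) -> float:
    """MAP estimate of one pixel under a flat prior and Gaussian noise.

    values[i] is the pixel value seen in warped view i; variances[i] is the
    (assumed) noise variance of that view. Less certain views get less weight.
    """
    if len(values) != len(variances) or len(values) == 0:
        raise ValueError("need one positive variance per observed value")
    weights = [1.0 / v for v in variances]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def blend_views(views: List[List[float]], variances: List[List[float]]) -> List[float]:
    """Blend several warped views (flattened to 1D pixel lists) pixel by pixel."""
    return [
        blend_pixel([view[i] for view in views], [var[i] for var in variances])
        for i in range(len(views[0]))
    ]
```

With equal variances this reduces to plain averaging, and as one view's variance grows its contribution fades, which mirrors the kind of confidence-based weighting that blending heuristics usually encode by hand.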
The visual effects and entertainment industries are now a fundamental part of the computer graphics and vision landscapes, and have an impact across society in general. Key issues in this area include the creation of realistic characters, creating assets for production, and improving work-flow. Advances in computer graphics, vision and rendering have underpinned much of the success of these industries, built on top of academic advances. However, there are still many unsolved problems. In this talk I will outline some of the challenges we have faced in carrying academic research over into the visual effects industry. In particular, I will attempt to distinguish between academic challenges and industrial demands we have experienced - and how this has impacted projects. This draws on experience in several themes involving leading Visual Effects and entertainment companies. Our work has been in several diverse areas, including on-set capture, digital doubles, real-time animation and motion capture retargeting. I will describe how many of these problems led us to step back and focus on first solving more fundamental computer vision research problems - particularly in the area of optical flow, non-rigid tracking and shadow removal - and how these opened up other opportunities. Some of these projects are supported through our Centre for Digital Entertainment (CDE) - which has 60 PhD-level students embedded across the creative industries in the UK. Others are more specific to partners at The Imaginarium and Double Negative Visual Effects. Attempting to draw these experiences together, we are now starting a new Centre for the Analysis of Motion, Entertainment Research and Applications (CAMERA), with leading partners across entertainment, elite sport and rehabilitation.
Organizers: Silvia Zuffi
Current object class detection methods typically target 2D bounding box localization, encouraged by benchmark data sets, such as Pascal VOC. While this seems suitable for the detection of individual objects, higher-level applications, such as autonomous driving and 3D scene understanding, would benefit from more detailed and richer object hypotheses. In this talk I will present our recent work on building more detailed object class detectors, bridging the gap between higher-level tasks and state-of-the-art object detectors. I will present a 3D object class detection method that can reliably estimate the 3D position, orientation and 3D shape of objects from a single image. Based on state-of-the-art CNN features, the method is a carefully designed 3D detection pipeline where each step is tuned for better performance, resulting in a registered CAD model for every object in the image. In the second part of the talk, I will focus on our work on what is holding back convolutional neural nets for detection. We analyze the R-CNN object detection pipeline in combination with state-of-the-art network architectures (AlexNet, GoogLeNet and VGG16). Focusing on two central questions, what did the convnets learn and what can they learn, we illustrate that the three network architectures suffer from the same weaknesses, and that these downsides cannot be alleviated by simply introducing more data. Therefore we conclude that architectural changes are needed. Furthermore, we show that additional, synthetically generated training data, sampled from the modes of the data distribution, can further increase the overall detection performance, while still suffering from the same weaknesses. Last, we hint at the complementary nature of the features of the three network architectures considered in this work.
Most computer vision systems cannot take advantage of the abundance of Internet videos as training data. This is because current methods typically learn under strong supervision and require expensive manual annotations (e.g. videos need to be temporally trimmed to cover the duration of a specific action, objects need bounding boxes, etc.). In this talk, I will present two techniques that can lead to learning the behavior and the structure of articulated object classes (e.g. animals) from videos, with as little human supervision as possible. First, we discover the characteristic motion patterns of an object class from videos of objects performing natural, unscripted behaviors, such as tigers in the wild. Our method generates temporal intervals that are automatically trimmed to one instance of the discovered behavior, and clusters them by type (e.g. running, turning head, drinking water). Second, we automatically recover thousands of spatiotemporal correspondences within the discovered clusters of behavior, which allow mapping pixels of an instance in one video to those of a different instance in a different video. Both techniques rely on a novel motion descriptor modeling the relative displacement of pairs of trajectories, which is more suitable for articulated objects than state-of-the-art descriptors using single trajectories. We provide extensive quantitative evaluation on our new dataset of tiger videos, which contains more than 100k fully annotated frames.
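The idea of describing motion through pairs of trajectories rather than single ones can be illustrated with a small sketch. The abstract does not specify the descriptor, so the code below is only one plausible reading: it tracks how the vector between two point trajectories changes from frame to frame. The function name and the 2D point representation are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def relative_displacements(traj_a: List[Point], traj_b: List[Point]) -> List[Point]:
    """Frame-to-frame change of the vector pointing from traj_a to traj_b.

    Motion shared by both trajectories (e.g. the whole animal or the camera
    translating) cancels out; only motion of one point relative to the other
    remains.
    """
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must cover the same frames")
    # Vector from point A to point B in every frame.
    rel = [(bx - ax, by - ay) for (ax, ay), (bx, by) in zip(traj_a, traj_b)]
    # How that vector changes between consecutive frames.
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(rel, rel[1:])]
```

Two points on the same rigid part that translate together produce all-zero output, while a point on a swinging limb measured against a point on the torso does not. That cancellation of shared motion is one reason a pairwise measure can suit articulated objects better than single trajectories.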
Organizers: Laura Sevilla