Current solutions to discriminative and generative tasks in computer vision exist separately and often lack interpretability and explainability. Using faces as our application domain, here we present an architecture that is based around two core ideas that address these issues: first, our framework learns an unsupervised, low-dimensional embedding of faces using an adversarial autoencoder that is able to synthesize high-quality face images. Second, a supervised disentanglement splits the low-dimensional embedding vector into four sub-vectors, each of which contains separated information about one of four major face attributes (pose, identity, expression, and style) that can be used both for discriminative tasks and for manipulating all four attributes in an explicit manner. The resulting architecture achieves state-of-the-art image quality, good discrimination and face retrieval results on each of the four attributes, and supports various face editing tasks using a face representation of only 99 dimensions. Finally, we apply the architecture's robust image synthesis capabilities to visually debug label-quality issues in an existing face dataset.
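As a rough illustration of how a disentangled embedding supports explicit attribute editing, the sketch below splits a 99-dimensional vector into four named sub-vectors and copies one attribute from a source face to a target face. The abstract only gives the total of 99 dimensions: the per-attribute sizes, the ordering, and the function names here are assumptions made for the example, and the encoder and decoder networks are left out entirely.

```python
from typing import Dict, List

# Hypothetical sub-vector sizes; only the total of 99 comes from the abstract.
ATTRIBUTE_SIZES: Dict[str, int] = {
    "pose": 3,
    "identity": 32,
    "expression": 32,
    "style": 32,
}


def split_embedding(z: List[float], sizes: Dict[str, int] = ATTRIBUTE_SIZES) -> Dict[str, List[float]]:
    """Cut a flat embedding into named, contiguous sub-vectors."""
    if len(z) != sum(sizes.values()):
        raise ValueError("embedding length does not match the attribute sizes")
    parts: Dict[str, List[float]] = {}
    start = 0
    for name, size in sizes.items():
        parts[name] = z[start:start + size]
        start += size
    return parts


def merge_embedding(parts: Dict[str, List[float]], sizes: Dict[str, int] = ATTRIBUTE_SIZES) -> List[float]:
    """Concatenate the sub-vectors back into one flat embedding."""
    merged: List[float] = []
    for name in sizes:
        merged.extend(parts[name])
    return merged


def transfer_attribute(z_target: List[float], z_source: List[float], attribute: str,
                       sizes: Dict[str, int] = ATTRIBUTE_SIZES) -> List[float]:
    """Replace one attribute of the target embedding with the source's value."""
    target = split_embedding(z_target, sizes)
    source = split_embedding(z_source, sizes)
    target[attribute] = source[attribute]
    return merge_embedding(target, sizes)
```

Calling `transfer_attribute(z_a, z_b, "expression")` would yield an embedding that keeps the pose, identity, and style of face A while taking the expression of face B; decoding it back to an image is the job of the generator described in the abstract.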
Organizers: Timo Bolkart
In my talk I will present my work regarding 3D mapping using lidar scanners. I will give an overview of the SLAM problem and its main challenges: robustness, accuracy and processing speed. Regarding robustness and accuracy, we investigate a better point cloud representation based on resampling and surface reconstruction. Moreover, we demonstrate how it can be incorporated in an ICP-based scan matching technique. Finally, we elaborate on globally consistent mapping using loop closures. Regarding processing speed, we propose the integration of our scan matching in a multi-resolution scheme and a GPU-accelerated implementation using our programming language Quasar.
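For readers unfamiliar with ICP-based scan matching, here is a minimal sketch of the basic point-to-point loop: match each point to its nearest neighbour in the reference scan, solve for the best rigid transform, apply it, and repeat. It is a 2D toy with brute-force nearest-neighbour search, not the method from the talk, which works on 3D lidar data and adds resampling, surface reconstruction, multi-resolution processing, and GPU acceleration. All function names are made up for this example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def nearest(p: Point, cloud: List[Point]) -> Point:
    """Brute-force nearest neighbour of p in cloud."""
    return min(cloud, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)


def fit_rigid(src: List[Point], dst: List[Point]) -> Tuple[float, float, float]:
    """Least-squares rotation angle and translation mapping src[i] onto dst[i]."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay, bx, by = px - sx, py - sy, qx - dx, qy - dy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation moves the rotated source centroid onto the target centroid.
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)


def apply_rigid(points: List[Point], theta: float, tx: float, ty: float) -> List[Point]:
    """Rotate points by theta about the origin, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def icp(src: List[Point], dst: List[Point], iterations: int = 20) -> List[Point]:
    """Iteratively align src to dst and return the aligned copy of src."""
    current = list(src)
    for _ in range(iterations):
        matches = [nearest(p, dst) for p in current]
        theta, tx, ty = fit_rigid(current, matches)
        current = apply_rigid(current, theta, tx, ty)
    return current
```

The nearest-neighbour matching step is what makes plain ICP sensitive to the initial pose and to noise, which is one reason better point cloud representations and loop closures matter for robust mapping.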
Organizers: Simon Donne
In this talk we will address the problem of 3D reconstruction of rigid and deformable objects from a single depth video stream. Traditional 3D registration techniques, such as ICP and its variants, are widespread and effective, but sensitive to initialization and noise due to the underlying correspondence estimation procedure. Therefore, we have developed SDF-2-SDF, a dense, correspondence-free method which aligns a pair of implicit representations of scene geometry, e.g. signed distance fields, by minimizing their direct voxel-wise difference. In its rigid variant, we apply it for static object reconstruction via real-time frame-to-frame camera tracking and posterior multiview pose optimization, achieving higher accuracy and a wider convergence basin than ICP variants. Its extension to scene reconstruction, SDF-TAR, carries out the implicit-to-implicit registration over several limited-extent volumes anchored in the scene and runs simultaneous GPU tracking and CPU refinement, with a lower memory footprint than other SLAM systems. Finally, to handle non-rigidly moving objects, we incorporate the SDF-2-SDF energy in a variational framework, regularized by a damped approximately Killing vector field. The resulting system, KillingFusion, is able to reconstruct objects undergoing topological changes and fast inter-frame motion in near-real time.
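The idea of aligning two signed distance fields by their voxel-wise difference can be shown with a toy example. The sketch below samples the SDF of a circle on a small 2D grid, defines an energy as half the sum of squared per-cell differences, and recovers a horizontal shift by brute-force search. This is only an assumed, simplified stand-in: the grid, the analytic circle, the integer search, and all names are choices made for this example, and the talk's method optimizes a full camera pose on real depth data.

```python
import math
from typing import List

Grid = List[List[float]]


def circle_sdf(cx: float, cy: float, radius: float, size: int) -> Grid:
    """Signed distance to a circle at integer grid positions (negative inside)."""
    return [[math.hypot(x - cx, y - cy) - radius for x in range(size)]
            for y in range(size)]


def sdf_energy(phi_a: Grid, phi_b: Grid) -> float:
    """Half the sum of squared cell-wise differences between two grids."""
    return 0.5 * sum((a - b) ** 2
                     for row_a, row_b in zip(phi_a, phi_b)
                     for a, b in zip(row_a, row_b))


def estimate_shift(phi_ref: Grid, cur_cx: float, cur_cy: float,
                   radius: float, size: int, max_shift: int = 5) -> int:
    """Integer x-shift that best maps the current circle onto the reference.

    Brute-force search over candidate shifts; no point correspondences are used,
    only the difference between the two distance fields.
    """
    return min(range(-max_shift, max_shift + 1),
               key=lambda t: sdf_energy(phi_ref, circle_sdf(cur_cx + t, cur_cy, radius, size)))


reference = circle_sdf(8.0, 8.0, 4.0, 16)
# The "current frame" shows the same circle centred at x = 11.
shift = estimate_shift(reference, 11.0, 8.0, 4.0, 16)
```

Because the energy is defined over every grid cell rather than over matched point pairs, it varies smoothly as the shape moves, which gives an intuition for why such a formulation avoids a separate correspondence estimation step.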
Organizers: Fatma Güney
Visual Question Answering is one of the applications of Deep Learning that is pushing towards real Artificial Intelligence. It turns the typical deep learning process around by only defining the task to be carried out after the training has taken place, which changes the task fundamentally. We have developed a range of strategies for incorporating other information sources into deep learning-based methods, and in the process have taken a step towards developing algorithms which learn how to use other algorithms to solve a problem, rather than solving it directly. This talk thus covers some of the high-level questions about the types of challenges Deep Learning can be applied to, and how we might separate the things it’s good at from those that it’s not.
Organizers: Siyu Tang
Creating convincing human facial animation is challenging. Face animation is often hand-crafted by artists separately from body motion. Alternatively, if the face animation is derived from motion capture, it is typically performed while the actor is relatively still. Recombining the isolated face animation with body motion is non-trivial and often looks uncanny if the body dynamics are not properly reflected on the face (e.g. cheeks wiggling when running). In this talk, I will discuss the challenges of human soft tissue simulation and control. I will then present our method for adding physical effects to facial blendshape animation. Unlike previous methods that try to add physics to face rigs, our method can combine facial animation and rigid body motion consistently while preserving the original animation as closely as possible. Our novel simulation framework uses the original animation as per-frame rest-poses without adding spurious forces. We also propose the concept of blendmaterials to give artists an intuitive means to control the changing material properties due to muscle activation.
Organizers: Timo Bolkart
Recently, deep learning has also proved successful on low-level vision tasks such as stereo matching. Another recent trend in this field is represented by confidence measures, which become increasingly effective when coupled with random forest classifiers or CNNs. Despite their excellent accuracy in outlier detection, few other applications rely on them. In the first part of the talk, we'll take a look at the latest proposal in terms of confidence measures for stereo matching, as well as at some novel methodologies exploiting these very accurate cues. In the second part, we'll talk about GC-net, a deep network currently representing the state-of-the-art on the KITTI datasets, and its extension to motion stereo processing.
Organizers: Yiyi Liao
Growth of the internet and social media has spurred the sharing and dissemination of personal data at large scale. At the same time, recent developments in computer vision have enabled unprecedented effectiveness and efficiency in automated recognition. It is clear that visual data contains private information that can be mined, yet the privacy implications of sharing such data have been less studied in the computer vision community. In the talk, I will present some key results from our study of the implications of the development of computer vision on identifiability in social media, and an analysis of existing and new anonymisation techniques. In particular, we show that adversarial image perturbations (AIPs) add perturbations to the input image that are invisible to humans yet effectively mislead a recogniser. They are far more aesthetic and effective than, e.g., face blurring. The core limitation, however, is that AIPs are usually generated against specific target recogniser(s), and it is hard to guarantee the performance against uncertain, potentially adaptive recognisers. As a first step towards dealing with this uncertainty, we have introduced a game theoretical framework to obtain a privacy guarantee for the user that is independent of the randomly chosen recogniser (within some fixed set).
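One simple way to read a guarantee that holds regardless of which recogniser is chosen is as a worst-case (maximin) value over a fixed set of recognisers. The sketch below assumes a table of privacy scores, one row per perturbation strategy and one column per recogniser, and picks the strategy whose worst-case score is highest. The numbers are invented purely for illustration, the restriction to pure strategies is an assumption, and the abstract does not specify how the actual framework is formulated.

```python
from typing import List, Tuple


def maximin_strategy(privacy: List[List[float]]) -> Tuple[int, float]:
    """Pick the row whose minimum entry is largest.

    privacy[i][j] is the user's privacy score (higher is better) when using
    perturbation strategy i against recogniser j.
    Returns (strategy index, guaranteed score against any recogniser in the set).
    """
    guarantees = [min(row) for row in privacy]
    best = max(range(len(privacy)), key=lambda i: guarantees[i])
    return best, guarantees[best]


# Made-up scores: 3 hypothetical perturbation strategies vs. 3 hypothetical recognisers.
toy_privacy = [
    [0.90, 0.20, 0.80],  # strong against recognisers 0 and 2, weak against 1
    [0.60, 0.70, 0.65],  # moderate against all of them
    [0.30, 0.95, 0.40],  # tuned to recogniser 1 only
]

best_index, guaranteed = maximin_strategy(toy_privacy)
```

In this toy table the strategy tuned to a single recogniser has the highest peak score but the weakest guarantee, which mirrors the limitation of perturbations generated against one specific target.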
Organizers: Siyu Tang
In recent years, commodity 3D sensors have become easily and widely available. These advances in sensing technology have spawned significant interest in using captured 3D data for mapping and semantic understanding of 3D environments. In this talk, I will give an overview of our latest research in the context of 3D reconstruction of indoor environments. I will further talk about the use of 3D data in the context of modern machine learning techniques. Specifically, I will highlight the importance of training data, and how we can efficiently obtain labeled and self-supervised ground truth training datasets from captured 3D content. Finally, I will show a selection of state-of-the-art deep learning approaches, including discriminative semantic labeling of 3D scenes and generative reconstruction techniques.
Organizers: Despoina Paschalidou
Our world is dynamic and three-dimensional. Understanding the 3D layout of scenes and the motion of objects is crucial for successfully operating in such an environment. I will talk about two lines of recent research in this direction. One is on end-to-end learning of motion and 3D structure: optical flow estimation, binocular and monocular stereo, direct generation of large volumes with convolutional networks. The other is on sensorimotor control in immersive three-dimensional environments, learned from experience or from demonstration.
We transfer a monocular motion stereo 3D reconstruction algorithm from a mobile device (Google Project Tango Tablet) to a rigidly mounted external camera of higher image resolution. Reliable camera synchronization is crucial for the usability of the tablet's IMU data, and thus a time synchronization method was developed. It is based on the joint movement of the cameras. In a second project, we move from outdoor video scenes to aerial images and strive to segment them into polygonal shapes. While most existing approaches address the problem of automated generation of online maps as a pixel-wise segmentation task, we instead frame this problem as constructing polygons representing objects. An approach based on Faster R-CNN, a successful object detection algorithm, is presented.
Organizers: Siyu Tang
We propose a new architecture for the learning of predictive spatio-temporal motion models from data alone. Our approach, dubbed the Dropout Autoencoder LSTM, is capable of synthesizing natural looking motion sequences over long time horizons without catastrophic drift or motion degradation. The model consists of two components, a 3-layer recurrent neural network to model temporal aspects and a novel auto-encoder that is trained to implicitly recover the spatial structure of the human skeleton via randomly removing information about joints during training time. This Dropout Autoencoder (D-AE) is then used to filter each predicted pose of the LSTM, reducing accumulation of error and hence drift over time. Furthermore, we propose new evaluation protocols to assess the quality of synthetic motion sequences even when no ground-truth data exists. The proposed protocols can be used to assess generated sequences of arbitrary length. Finally, we evaluate our proposed method on two of the largest motion-capture datasets available to date and show that our model outperforms the state-of-the-art on a variety of actions, including cyclic and acyclic motion, and that it can produce natural looking sequences over longer time horizons than previous methods.
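The two mechanisms described above, corrupting poses by removing joints during training and filtering each predicted pose at synthesis time, can be sketched without any network code. In the example below, zeroing out whole joints is an assumed form of "removing information", and `predict_next` and `denoise` are hypothetical callables standing in for the trained LSTM and the trained autoencoder.

```python
import random
from typing import Callable, List

Pose = List[List[float]]  # one [x, y, z] entry per joint


def drop_joints(pose: Pose, drop_prob: float, rng: random.Random) -> Pose:
    """Return a copy of the pose with each joint zeroed independently with probability drop_prob."""
    return [[0.0] * len(joint) if rng.random() < drop_prob else list(joint)
            for joint in pose]


def make_training_pair(pose: Pose, drop_prob: float, rng: random.Random):
    """Autoencoder training pair: corrupted input, clean reconstruction target."""
    return drop_joints(pose, drop_prob, rng), [list(joint) for joint in pose]


def rollout(seed_pose: Pose,
            predict_next: Callable[[Pose], Pose],
            denoise: Callable[[Pose], Pose],
            steps: int) -> List[Pose]:
    """Generate a sequence, filtering every predicted pose before feeding it back."""
    poses: List[Pose] = []
    current = seed_pose
    for _ in range(steps):
        current = denoise(predict_next(current))
        poses.append(current)
    return poses
```

Training the autoencoder to rebuild the clean pose from a corrupted one forces it to rely on the remaining joints, and applying it inside the rollout loop pulls each prediction back towards a plausible pose before errors can compound.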
Organizers: Gerard Pons-Moll