Applying data-driven approaches to non-rigid 3D reconstruction has been difficult, which we believe can be attributed to the lack of a large-scale training corpus. One recent approach proposes self-supervision based on non-rigid reconstruction. Unfortunately, this method fails for important cases such as highly non-rigid deformations. We first address this lack of data by introducing a novel semi-supervised strategy to obtain dense interframe correspondences from a sparse set of annotations. This way, we obtain a large dataset of 400 scenes, over 390,000 RGB-D frames, and 2,537 densely aligned frame pairs; in addition, we provide a test set along with several metrics for evaluation. Based on this corpus, we introduce a data-driven non-rigid feature matching approach, which we integrate into an optimization-based reconstruction pipeline. Here, we propose a new neural network that operates on RGB-D frames, while maintaining robustness under large non-rigid deformations and producing accurate predictions. Our approach significantly outperforms both existing non-rigid reconstruction methods that do not use learned data terms and learning-based approaches that only use self-supervision.
Organizers: Vassilis Choutas
In recent years, commodity 3D sensors have become widely available, spawning significant interest in both offline and real-time 3D reconstruction. While state-of-the-art reconstruction results from commodity RGB-D sensors are visually appealing, they are far from usable in practical computer graphics applications since they do not match the high quality of artist-modeled 3D graphics content. One of the biggest challenges in this context is that the obtained 3D scans suffer from occlusions, resulting in incomplete 3D models. In this talk, I will present a data-driven approach towards generating high-quality 3D models from commodity scan data, and the use of these geometrically complete 3D models towards semantic and texture understanding of real-world environments.
Organizers: Yinghao Huang
In our recent work, XNect, we propose a real-time solution for the challenging task of multi-person 3D human pose estimation from a single RGB camera. To achieve real-time performance without compromising on accuracy, our approach relies on a new efficient Convolutional Neural Network architecture, and a multi-staged pose formulation. The CNN architecture is approx. 1.3x faster than ResNet-50, while achieving the same accuracy on various tasks, and the benefits extend beyond inference speed to a much smaller training memory footprint and a much higher training throughput. The proposed pose formulation jointly reasons about all the subjects in the scene, ensuring that pose inference can be done in real time even with a large number of subjects in the scene. The key insight behind the accuracy of the formulation is to split the reasoning about human pose into two distinct stages. The first stage, which is fully convolutional, infers 2D and 3D pose of body parts supported by image evidence, and reasons jointly about all subjects. The second stage, which is a small fully connected network, operates on each individual subject, and uses the context of the visible body parts and learned pose priors to infer the 3D pose of the missing body parts. A third stage on top reconciles the 2D and 3D poses per frame and across time, to produce a temporally stable kinematic skeleton. In this talk, we will briefly discuss the proposed Convolutional Neural Network architecture and the possible benefits it might bring to your workflow. The other part of the talk will cover how the pose formulation proposed in this work came to be, what its advantages are, and how it can be extended to other related problems.
Organizers: Yinghao Huang
In this visual feast, Scott recounts results and revelations from four years of experimentation using machine learning as a ‘creative collaborator’ in his artistic process. He makes the case that AI, rather than rendering artists obsolete, will empower us and expand our creative horizons. Scott shares an eclectic range of successes and failures encountered in his efforts to create powerful but artistically controllable neural networks to use as tools to represent and abstract the human figure. Scott also gives a behind-the-scenes look at creating the work for his recent Artist+AI exhibition in London.
Organizers: Ahmed Osman
In this talk, I will introduce the notion of 'canonicalization' and how it can be used to solve 3D computer vision tasks. I will describe Normalized Object Coordinate Space (NOCS), a 3D canonical container that we have developed for 3D estimation, aggregation, and synthesis tasks. I will demonstrate how NOCS allows us to address previously difficult tasks like category-level 6DoF object pose estimation, and correspondence-free multiview 3D shape aggregation. Finally, I will discuss future directions including opportunities to extend NOCS for tasks like articulated and non-rigid shape and pose estimation.
Organizers: Timo Bolkart
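As a rough illustration of the canonicalization idea behind NOCS described in the abstract above, the sketch below maps an object's 3D points into a unit cube. It assumes the object is already in its canonical orientation, and it normalizes by centering the tight bounding box at (0.5, 0.5, 0.5) and scaling its diagonal to length 1. The function name `to_unit_cube` is hypothetical, and this is a minimal sketch under those assumptions, not the speaker's implementation.

```python
from math import sqrt


def to_unit_cube(points):
    """Map 3D points into a canonical unit cube.

    Assumption: the points are already in a canonical orientation.
    The tight axis-aligned bounding box is centered at (0.5, 0.5, 0.5)
    and uniformly scaled so that its diagonal has length 1.
    """
    if not points:
        raise ValueError("need at least one point")
    # Tight axis-aligned bounding box of the object.
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    diagonal = sqrt(sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs)))
    if diagonal == 0.0:
        raise ValueError("all points coincide; scale is undefined")
    # Shift to the origin, scale uniformly, then move to the cube center.
    return [
        tuple((p[i] - center[i]) / diagonal + 0.5 for i in range(3))
        for p in points
    ]


# Two instances of the same shape at different scale and position
# end up with identical canonical coordinates.
small = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 1.0)]
large = [(10.0 * x + 5.0, 10.0 * y - 3.0, 10.0 * z + 7.0) for x, y, z in small]
canonical_small = to_unit_cube(small)
canonical_large = to_unit_cube(large)
```

Because translation and uniform scale are factored out, instances of the same category that differ only in size and position share one coordinate frame, which is what makes category-level reasoning possible in such a container.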
In this talk I will consider the problem of scene-level inverse rendering to recover shape, reflectance and lighting from a single, uncontrolled, outdoor image. This task is highly ill-posed, but we show that multiview self-supervision, a natural lighting prior and implicit lighting estimation allow an image-to-image CNN to solve the task, seemingly learning some general principles of shape-from-shading along the way. Adding a neural renderer and sky generator GAN, our approach allows us to synthesise photorealistic relit images under widely varying illumination. I will finish by briefly describing recent work in which some of these ideas have been combined with deep face model fitting, replacing parameter regression with correspondence prediction and thereby enabling fully unsupervised training.
Organizers: Timo Bolkart
Licklider and Taylor (1968) envisioned computational machinery that could enable better communication between humans than face-to-face interaction. In the last fifty years, we have used computing to develop various means of communication, such as mail, messaging, phone calls, video conversation, and virtual reality. These are, however, a proxy of face-to-face communication that aims at encoding words, expressions, emotions, and body language at the source and decoding them reliably at the destination. The true revolution of personal computing has not begun yet because we have not been able to tap the real potential of computing for social communication. A computational machinery that can understand and create a four-dimensional audio-visual world can enable humans to describe their imagination and share it with others. In this talk, I will introduce the Computational Studio: an environment that allows non-specialists to construct and creatively edit the 4D audio-visual world from sparse audio and video samples. The Computational Studio aims to enable everyone to relive old memories through a form of virtual time travel, to automatically create new experiences, and share them with others using everyday computational devices. There are three essential components of the Computational Studio: (1) how can we capture the 4D audio-visual world?; (2) how can we synthesize the audio-visual world using examples?; and (3) how can we interactively create and edit the audio-visual world? The first part of this talk introduces the work on capturing and browsing the in-the-wild 4D audio-visual world in a self-supervised manner and efforts on building a multi-agent capture system. This work has applications in social communication, in digitizing intangible cultural heritage, in capturing tribal dances and wildlife in the natural environment, and in understanding the social behavior of human beings.
In the second part, I will talk about unsupervised example-based audio-visual synthesis. Example-based audio-visual synthesis allows us to express ourselves easily. Finally, I will talk about interactive visual synthesis, which allows us to manually create and edit visual experiences. Here I will also stress the importance of thinking about a human user and computational devices when designing content creation applications. The Computational Studio is a first step towards unlocking the full degree of creative imagination, which is currently limited to the human mind by the limits of the individual's expressivity and skill. It has the potential to change the way we audio-visually communicate with others.
Accurate 3D human pose estimation has been a longstanding goal in computer vision. However, so far it has had only limited success, mostly in easy scenarios such as studios with little occlusion. In this talk, I will present two of our works that aim to address the occlusion problem in realistic scenarios. In the first work, we present an approach to recover the absolute 3D human pose of a single person from multi-view images by incorporating multi-view geometric priors in our model. It consists of two separate steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D poses from the multi-view 2D poses. First, we introduce a cross-view fusion scheme into a CNN to jointly estimate 2D poses for multiple views. Consequently, the 2D pose estimation for each view already benefits from other views. Second, we present a recursive Pictorial Structure Model to recover the 3D pose from the multi-view 2D poses. It gradually improves the accuracy of the 3D pose at an affordable computational cost. In the second work, we present a 3D pose estimator which allows us to reliably estimate and track people in crowded scenes. In contrast to previous efforts, which require establishing cross-view correspondence based on noisy and incomplete 2D pose estimations, we present an end-to-end solution which operates directly in 3D space and therefore avoids making incorrect hard decisions in 2D space. To achieve this goal, the features in all camera views are warped and aggregated in a common 3D space, and fed to a Cuboid Proposal Network (CPN) to coarsely localize all people. Then we propose a Pose Regression Network (PRN) to estimate a detailed 3D pose for each proposal. The approach is robust to occlusion, which occurs frequently in practice. Without bells and whistles, it significantly outperforms the state of the art on the benchmark datasets.
Organizers: Chun-Hao Paul Huang
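The aggregation step mentioned in the abstract above, in which per-view features are brought into a common 3D space, can be sketched roughly as follows. This is a simplified illustration and not the authors' network: it assumes each view provides a 3x4 projection matrix and a single-channel 2D heatmap, uses nearest-neighbor sampling, and averages over the views that see each voxel. The names `project`, `sample`, and `aggregate_views` are hypothetical.

```python
def project(P, point):
    """Project a 3D point with a 3x4 matrix P; return (u, v), or None if behind the camera."""
    x, y, z = point
    u = P[0][0] * x + P[0][1] * y + P[0][2] * z + P[0][3]
    v = P[1][0] * x + P[1][1] * y + P[1][2] * z + P[1][3]
    w = P[2][0] * x + P[2][1] * y + P[2][2] * z + P[2][3]
    if w <= 0.0:
        return None
    return u / w, v / w


def sample(heatmap, uv):
    """Nearest-neighbor lookup in a row-major heatmap; None if outside the image."""
    col, row = round(uv[0]), round(uv[1])
    if 0 <= row < len(heatmap) and 0 <= col < len(heatmap[0]):
        return heatmap[row][col]
    return None


def aggregate_views(voxel_centers, views):
    """Average the heatmap values of all views in which each voxel center is visible."""
    scores = []
    for center in voxel_centers:
        values = []
        for P, heatmap in views:
            uv = project(P, center)
            if uv is None:
                continue
            value = sample(heatmap, uv)
            if value is not None:
                values.append(value)
        # Voxels seen by no view get a score of 0.
        scores.append(sum(values) / len(values) if values else 0.0)
    return scores


# Toy setup (assumed values): two cameras, 4x4 heatmaps, two voxel centers.
P1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P2 = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
hm1 = [[0.0] * 4 for _ in range(4)]
hm2 = [[0.0] * 4 for _ in range(4)]
hm1[1][2] = 1.0  # view 1 responds at pixel (u=2, v=1)
hm2[1][3] = 0.5  # view 2 responds at pixel (u=3, v=1)
scores = aggregate_views([(2.0, 1.0, 1.0), (0.0, 0.0, 1.0)], [(P1, hm1), (P2, hm2)])
```

Because every voxel pools evidence from all views before any decision is made, no per-view 2D detection has to be matched across cameras, which is the motivation the abstract gives for working directly in 3D.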
In this talk I will present an overview of our recent works that learn deep geometric models for the 3D face from large datasets of scans. Priors for the 3D face are crucial for many applications: to constrain ill-posed problems such as 3D reconstruction from monocular input, for efficient generation and animation of 3D virtual avatars, or even in medical domains such as recognition of craniofacial disorders. Generative models of the face have been widely used for this task, as well as deep learning approaches that have recently emerged as a robust alternative. Barring a few exceptions, most of these data-driven approaches were built either from a relatively limited number of samples (in the case of linear models of the shape) or by synthetic data augmentation (for deep-learning-based approaches), mainly due to the difficulty in obtaining large-scale and accurate 3D scans of the face. Yet, there is a substantial amount of 3D information that can be gathered when considering publicly available datasets that have been captured over the last decade. I will discuss here our works that tackle the challenges of building rich geometric models out of these large and varied datasets, with the goal of modeling the facial shape, expression (i.e. motion) or geometric details. Concretely, I will talk about (1) an efficient and fully automatic approach for registration of large datasets of 3D faces in motion; (2) deep learning methods for modeling the facial geometry that can disentangle the shape and expression aspects of the face; and (3) a multi-modal learning approach for capturing geometric details from images in-the-wild, by simultaneously encoding both facial surface normal and natural image information.
Organizers: Jinlong Yang
Biological motion is fascinating in almost every aspect. Locomotion in particular plays a crucial part in the evolution of life. Structures such as bones connected by joints, soft and connective tissues, and contracting proteins in a muscle-tendon unit enable and prescribe each species' specific locomotion pattern. Most importantly, biological motion is autonomously learned; it is untethered, as there is no external energy supply; and, as is typical for vertebrates, it is muscle-driven. This talk is focused on human motion. Digital models and biologically inspired robots are presented, built for a better understanding of biology’s complexity. Modeling musculoskeletal systems reveals that the mapping from muscle stimulations to movement dynamics is highly nonlinear and complex, which makes it difficult to control those systems with classical techniques. However, experiments on a simulated musculoskeletal model of a human arm and leg and on real biomimetic muscle-driven robots show that it is possible to learn an accurate controller despite high redundancy and nonlinearity, while retaining sample efficiency. More examples of active muscle-driven motion will be given.
Organizers: Ahmed Osman
In this talk, I will present the most recent advances in data-driven character animation and control using neural networks. Creating key-framed animations by hand is typically very time-consuming and requires a lot of artistic expertise and training. Recent work applying deep learning to character animation was the first to compete with or even outperform the quality that could be achieved by professional animators for biped locomotion, and thus caused a lot of excitement in both academia and industry. Shortly after, follow-up research also demonstrated its applicability to quadruped locomotion control, which has been considered one of the unsolved key challenges in character animation due to the highly complex footfall patterns of quadruped characters. Addressing the next challenges beyond character locomotion, this year at SIGGRAPH Asia we presented the Neural State Machine, an improved version of such previous systems that makes human characters interact naturally with objects and the environment, learned from motion capture data. Generally, the difficulty in such tasks is due to the complex planning of periodic and aperiodic movements that react to the scene geometry in order to precisely position and orient the character, and to adapt to variations in the type, size and shape of such objects. We demonstrate the versatility of this framework with various scene interaction tasks, such as sitting on a chair, avoiding obstacles, opening and entering through a door, and picking up and carrying objects, all generated in real time from a single model.
The body is one of the most relevant aspects of our self, and we shape it through our eating behavior and physical activity. As a psychologist and neuroscientist, I seek to disentangle mutual interactions between how we represent our own body, what we eat and how much we exercise. In the talk, I will give a scoping overview of this approach and present the studies I am conducting as a guest scientist at PS.
Organizers: Ahmed Osman