Deep learning has significantly advanced the state of the art in 3D hand pose estimation, and accuracy improves with larger amounts of labelled data. However, acquiring 3D hand pose labels can be extremely difficult. In this talk, I will present two of our recent works on leveraging self-supervised learning techniques for hand pose estimation from depth maps. In both works, we incorporate a differentiable renderer into the network and formulate the training loss as a model fitting error to update the network parameters. In the first part of the talk, I will present our earlier work, which approximates the hand surface with a set of spheres. We then model the pose prior as a variational lower bound with a variational auto-encoder (VAE). In the second part, I will present our latest work on regressing the vertex coordinates of a hand mesh model with a 2D fully convolutional network (FCN) in a single forward pass. In the first stage, the network estimates a dense correspondence field from every pixel on the image grid to the mesh grid. In the second stage, we design a differentiable operator to map features learned from the previous stage and regress a 3D coordinate map on the mesh grid. Finally, we sample from the mesh grid to recover the mesh vertices and fit an articulated template mesh to them in closed form. Without any human annotation, both works perform competitively with strongly supervised methods. The latter work will also be extended to be compatible with the MANO model.
Organizers: Dimitrios Tzionas
Human shape estimation is an important task for video editing, animation and the fashion industry. Predicting 3D human body shape from natural images, however, is highly challenging due to factors such as variation in human bodies, clothing and viewpoint. Prior methods addressing this problem typically attempt to fit parametric body models with certain priors on pose and shape. In this work we argue for an alternative representation and propose BodyNet, a neural network for direct inference of volumetric body shape from a single image. BodyNet is an end-to-end trainable network that benefits from (i) a volumetric 3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose. Each of them results in a performance improvement, as demonstrated by our experiments. To evaluate the method, we fit the SMPL model to our network output and show state-of-the-art results on the SURREAL and Unite the People datasets, outperforming recent approaches. Besides achieving state-of-the-art performance, our method also enables volumetric body-part segmentation.
Human footsteps can provide a unique behavioural pattern for robust biometric systems. Traditionally, security systems have been based on passwords or security access cards. Biometric recognition deals with the design of security systems for automatic identification or verification of a human subject (client) based on physical and behavioural characteristics. In this talk, I will present spatio-temporal raw and processed footstep data representations designed and evaluated on deep machine learning models based on a two-stream ResNet architecture, using the SFootBD database, the largest footstep database to date, with more than 120 people and almost 20,000 footstep signals. Our models deliver an artificial intelligence capable of effectively differentiating the fine-grained variability of footsteps between legitimate users (clients) and impostor users of the biometric system. We provide experimental results in 3 critical data-driven security scenarios, according to the amount of footstep data available for model training: airport security checkpoints (smallest training set), workspace environments (medium training set) and home environments (largest training set). In these scenarios we report state-of-the-art footstep recognition rates.
Organizers: Dimitrios Tzionas
Animals are widespread in nature and the analysis of their shape and motion is of importance in many fields and industries. Modeling 3D animal shape, however, is difficult because the 3D scanning methods used to capture human shape are not applicable to wild animals or natural settings. In our previous SMAL model, we learn animal shape from toy figurines, but toys are limited in number and realism, and not every animal is sufficiently popular for there to be realistic toys depicting it. What is available in large quantities are images and videos of animals from nature photographs, animal documentaries, and webcams. In this talk I will present our recent work on capturing the detailed 3D shape of animals from images alone. Our method extracts significantly more 3D shape detail than previous work and is able to model new species using only a few video frames. Additionally, we extract a realistic texture map from images, capturing both animal shape and appearance.
Active vision has long put forward the idea that visual sensation and our actions are inseparable, especially when considering naturalistic extended behavior. Further support for this idea comes from theoretical work in optimal control, which demonstrates that sensing, planning, and acting in sequential tasks can only be separated under very restricted circumstances. The talk will present experimental evidence together with computational explanations of human visuomotor behavior in tasks ranging from classic psychophysical detection tasks to ball catching and visuomotor navigation. Along the way it will touch on topics such as the heuristics hypothesis and learning of visual representations. The connecting theme will be that, from the switching of visuomotor behavior in response to changing task constraints down to cortical visual representations in V1, action and perception are inseparably intertwined in an ambiguous and uncertain world.
Organizers: Betty Mohler
The tongue plays a vital part in everyday life, where we use it extensively during speech production. Because of this importance, we want to derive a parametric shape model of the tongue. This model enables us to reconstruct the full tongue shape from a sparse set of points, such as motion capture data. Moreover, we can use such a model in simulations of the vocal tract to perform articulatory speech synthesis or to create animated virtual avatars. In my talk, I describe a framework for deriving such a model from MRI scans of the vocal tract. In particular, this framework uses image denoising and segmentation methods to produce a point cloud approximating the vocal tract surface. In this context, I will also discuss how palatal contacts of the tongue can be handled, i.e., situations where the tongue touches the palate and thus no tongue boundary is visible. Afterwards, template matching is used to derive a mesh representation of the tongue from this cloud. The acquired meshes are finally used to construct a multilinear model.
Organizers: Timo Bolkart
The emergence of multi-view capture systems has yielded a tremendous amount of video sequences. The task of capturing spatio-temporal models from real-world imagery (4D modeling) should arguably benefit from this enormous visual information. In order to achieve highly realistic representations, both geometry and appearance need to be modeled with high precision. Yet, even with the great progress in geometric modeling, the appearance aspect has not been fully explored and visual quality can still be improved. I will explain how we can optimally exploit the redundant visual information of the captured video sequences and provide a temporally coherent, super-resolved, view-independent appearance representation. I will further discuss how to exploit the interdependency of both geometry and appearance as separate modalities to enhance visual perception and finally how to decompose appearance representations into intrinsic components (shading & albedo) and super-resolve them jointly to allow for more realistic renderings.
Organizers: Despoina Paschalidou
For man-machine interaction it is crucial to develop models of humans that look and move indistinguishably from real humans. Such virtual humans will be key for application areas such as computer vision, medicine and psychology, virtual and augmented reality and special effects in movies. Currently, digital models typically lack realistic soft tissue and clothing or require time-consuming manual editing of physical simulation parameters. Our hypothesis is that better and more realistic models of humans and clothing can be learned directly from real measurements coming from 4D scans, images and depth and inertial sensors. We combine statistical machine learning techniques and physics-based simulation to create realistic models from data. We then use such models to extract information from incomplete and noisy sensor data from monocular video, depth or IMUs. I will give an overview of a selection of projects conducted in Perceiving Systems in which we build realistic models of human pose, shape, soft tissue and clothing. I will also present some of our recent work on 3D reconstruction of people models from monocular video, real-time fusion and online human body shape estimation from depth data and recovery of human pose in the wild from video and IMUs. I will conclude the talk by outlining the next challenges in building digital humans and perceiving them from sensory data.
Organizers: Melanie Feldhofer
Variational image processing translates image processing tasks into optimisation problems. The practical success of this approach depends on the type of optimisation problem and on the properties of the ensuing algorithm. A recent breakthrough was to realise that old first-order optimisation algorithms based on operator splitting are particularly suited for modern data analysis problems. Operator splitting techniques decouple complex optimisation problems into many smaller and simpler sub-problems. In this talk I will review the variational segmentation problem and a common family of algorithms to solve such optimisation problems. I will show that operator splitting leads to a divide-and-conquer strategy that allows us to derive simple and massively parallel updates suitable for GPU implementations. The technique decouples the likelihood from the prior term and allows the use of a data-driven model that estimates the likelihood from data, for example with deep learning. Using a different decoupling strategy together with general consensus optimisation leads to fully distributed algorithms especially suitable for large-scale segmentation problems. Motivating applications are 3D yeast-cell reconstruction and segmentation of histology data.
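As a rough illustration of how operator splitting decouples a data term from a prior term, the following is a minimal sketch of ADMM on a toy sparse denoising problem. It is not the segmentation model from the talk: the objective, the function names and the parameters `lam`, `rho` and `iters` are all assumptions chosen for illustration. Both sub-steps act on each element independently, which is the property that makes such updates easy to parallelise.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v towards zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_l1_denoise(b, lam, rho=1.0, iters=100):
    """Minimise 0.5 * ||x - b||^2 + lam * ||z||_1 subject to x = z.

    Toy problem (an assumption for illustration): the quadratic term plays
    the role of the data term, the L1 term plays the role of the prior.
    """
    n = len(b)
    x = [0.0] * n
    z = [0.0] * n
    u = [0.0] * n  # scaled dual variable that enforces x = z
    for _ in range(iters):
        # Data-term step: closed-form quadratic solve, independent per element.
        x = [(bi + rho * (zi - ui)) / (1.0 + rho) for bi, zi, ui in zip(b, z, u)]
        # Prior step: element-wise shrinkage, again independent per element.
        z = [soft_threshold(xi + ui, lam / rho) for xi, ui in zip(x, u)]
        # Dual update accumulates the remaining disagreement between x and z.
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return z
```

For this toy objective the exact minimiser is the soft-thresholded input, so `admm_l1_denoise([3.0, 0.5, -2.0], lam=1.0)` should approach `[2.0, 0.0, -1.0]`.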
Organizers: Benjamin Coors
In my talk I will present my work regarding 3D mapping using lidar scanners. I will give an overview of the SLAM problem and its main challenges: robustness, accuracy and processing speed. Regarding robustness and accuracy, we investigate a better point cloud representation based on resampling and surface reconstruction. Moreover, we demonstrate how it can be incorporated in an ICP-based scan matching technique. Finally, we elaborate on globally consistent mapping using loop closures. Regarding processing speed, we propose the integration of our scan matching in a multi-resolution scheme and a GPU-accelerated implementation using our programming language Quasar.
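For readers unfamiliar with ICP-based scan matching, the basic loop can be sketched as below. This is a generic point-to-point ICP in 2D using brute-force nearest neighbours, written only for illustration. It does not include the resampling, surface reconstruction, multi-resolution scheme or Quasar implementation mentioned in the abstract, and all function names are hypothetical.

```python
import math


def nearest(p, cloud):
    """Brute-force nearest neighbour of point p in cloud."""
    return min(cloud, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping paired src onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    # Sums of cross and dot products of the centred pairs give the optimal angle.
    s_cross = sum((p[0] - sx) * (q[1] - dy) - (p[1] - sy) * (q[0] - dx)
                  for p, q in zip(src, dst))
    s_dot = sum((p[0] - sx) * (q[0] - dx) + (p[1] - sy) * (q[1] - dy)
                for p, q in zip(src, dst))
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation maps the rotated source centroid onto the target centroid.
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)


def transform(theta, tx, ty, pts):
    """Rotate pts by theta about the origin, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in pts]


def icp_2d(src, dst, iters=20):
    """Alternate nearest-neighbour matching and closed-form rigid alignment."""
    cur = list(src)
    for _ in range(iters):
        matches = [nearest(p, dst) for p in cur]
        theta, tx, ty = fit_rigid_2d(cur, matches)
        cur = transform(theta, tx, ty, cur)
    return cur
```

Because the correspondences come from nearest neighbours, the loop only converges to the correct alignment when the initial misalignment is small relative to the point spacing, which is one reason robustness is listed as a main challenge.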
Organizers: Simon Donne
In this talk we will address the problem of 3D reconstruction of rigid and deformable objects from a single depth video stream. Traditional 3D registration techniques, such as ICP and its variants, are widespread and effective, but sensitive to initialization and noise due to the underlying correspondence estimation procedure. Therefore, we have developed SDF-2-SDF, a dense, correspondence-free method which aligns a pair of implicit representations of scene geometry, e.g. signed distance fields, by minimizing their direct voxel-wise difference. In its rigid variant, we apply it to static object reconstruction via real-time frame-to-frame camera tracking and posterior multiview pose optimization, achieving higher accuracy and a wider convergence basin than ICP variants. Its extension to scene reconstruction, SDF-TAR, carries out the implicit-to-implicit registration over several limited-extent volumes anchored in the scene and runs simultaneous GPU tracking and CPU refinement, with a lower memory footprint than other SLAM systems. Finally, to handle non-rigidly moving objects, we incorporate the SDF-2-SDF energy in a variational framework, regularized by a damped approximately Killing vector field. The resulting system, KillingFusion, is able to reconstruct objects undergoing topological changes and fast inter-frame motion in near-real time.
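The idea of aligning two signed distance fields by their direct voxel-wise difference, without point correspondences, can be sketched on a toy example. The code below is an illustrative assumption, not the authors' SDF-2-SDF implementation: it works in 2D, uses analytic circle SDFs in place of fields computed from depth data, considers translations only, and replaces proper optimisation with a brute-force search. All names are hypothetical.

```python
import math


def circle_sdf(cx, cy, r, n=32, size=2.0):
    """Sample the signed distance to a circle on an n x n grid over a square.

    The grid covers [-size/2, size/2] in both axes; values are negative inside
    the circle and positive outside.
    """
    step = size / (n - 1)
    lo = -size / 2.0
    return [[math.hypot(lo + i * step - cx, lo + j * step - cy) - r
             for i in range(n)]
            for j in range(n)]


def sdf_energy(phi_a, phi_b):
    """Half the sum of squared cell-wise differences between two SDF grids."""
    return 0.5 * sum((a - b) ** 2
                     for row_a, row_b in zip(phi_a, phi_b)
                     for a, b in zip(row_a, row_b))


def best_translation(phi_ref, r, candidates):
    """Pick the candidate centre whose SDF differs least from the reference.

    Brute force stands in for the optimiser here (an assumption): every
    candidate is scored with the same dense, correspondence-free energy.
    """
    return min(candidates,
               key=lambda t: sdf_energy(phi_ref, circle_sdf(t[0], t[1], r)))
```

A small usage example: with `phi_ref = circle_sdf(0.2, -0.1, 0.5)` and candidate centres on a 0.1 grid, `best_translation(phi_ref, 0.5, candidates)` selects `(0.2, -0.1)`, since the energy is zero only when the two fields coincide.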
Organizers: Fatma Güney