Deep learning has significantly advanced the state of the art in 3D hand pose estimation, and its accuracy improves with larger amounts of labelled data. However, acquiring 3D hand pose labels can be extremely difficult. In this talk, I will present our two recent works on leveraging self-supervised learning techniques for hand pose estimation from depth maps. In both works, we incorporate a differentiable renderer into the network and formulate the training loss as a model fitting error used to update the network parameters. In the first part of the talk, I will present our earlier work, which approximates the hand surface with a set of spheres. We then model the pose prior as a variational lower bound with a variational auto-encoder (VAE). In the second part, I will present our latest work on regressing the vertex coordinates of a hand mesh model with a 2D fully convolutional network (FCN) in a single forward pass. In the first stage, the network estimates a dense correspondence field from every pixel on the image grid to the mesh grid. In the second stage, we design a differentiable operator to map features learned in the previous stage and regress a 3D coordinate map on the mesh grid. Finally, we sample from the mesh grid to recover the mesh vertices and fit an articulated template mesh to them in closed form. Without any human annotation, both works perform competitively with strongly supervised methods. The latter work will also be extended to be compatible with the MANO model.
Organizers: Dimitrios Tzionas
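As a rough illustration of the "model fitting error" idea in the hand pose talk above, the sketch below scores how well a set of spheres explains a depth point cloud. It is a minimal sketch under stated assumptions, not the speakers' loss: the function name `sphere_fit_error`, the use of the nearest sphere surface, and the mean squared distance are all illustrative choices. In a self-supervised pipeline a term of this kind would be written in an automatic differentiation framework so that its gradient can update the network parameters.

```python
import numpy as np


def sphere_fit_error(points, centers, radii):
    """Mean squared distance from each 3D point to the nearest sphere surface.

    points:  (N, 3) back-projected depth pixels (hypothetical input)
    centers: (K, 3) sphere centers predicted by the network (hypothetical)
    radii:   (K,)   sphere radii
    """
    # Pairwise point-to-center distances, shape (N, K).
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # Unsigned distance from each point to each sphere surface.
    surface_dist = np.abs(dist - radii[None, :])
    # Each point is explained by its closest sphere.
    return float(np.mean(surface_dist.min(axis=1) ** 2))


# One unit sphere at the origin; one point on its surface, one point 1 unit outside.
points = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0]])
centers = np.array([[0.0, 0.0, 0.0]])
radii = np.array([1.0])
error = sphere_fit_error(points, centers, radii)
```

The error is zero only when every observed point lies on some sphere surface, which is why minimizing it pulls the sphere model toward the observed depth data without requiring pose labels.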
Two talks for the price of one! I will present my recent work on the challenging problem of stereo matching of scenes with little or no surface texture, attacking the problem from two very different angles. First, I will discuss how surface orientation priors can be added to the popular semi-global matching (SGM) algorithm, which significantly reduces errors on slanted weakly-textured surfaces. The orientation priors serve as a soft constraint during matching and can be derived in a variety of ways, including from low-resolution matching results and from monocular analysis and Manhattan-world assumptions. Second, we will examine the pathological case of Mondrian Stereo -- synthetic scenes consisting solely of solid-colored planar regions, resembling paintings by Piet Mondrian. I will discuss assumptions that allow disambiguating such scenes, present a novel stereo algorithm employing symbolic reasoning about matched edge segments, and discuss how similar ideas could be utilized in robust real-world stereo algorithms for untextured environments.
Organizers: Anurag Ranjan
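To make the role of an orientation prior in the stereo talk above more concrete, here is a sketch of SGM-style cost aggregation along a single scanline. The recurrence is the standard SGM one; the way the prior enters (shifting the zero-penalty transition by the disparity change the prior predicts between neighbouring pixels) is an assumption made for illustration and may differ from the speaker's formulation. The names `aggregate_scanline`, `p1`, and `p2` are hypothetical.

```python
import numpy as np


def aggregate_scanline(cost, prior=None, p1=1.0, p2=8.0):
    """Aggregate matching costs left to right along one scanline.

    cost:  (W, D) matching cost per pixel and disparity
    prior: optional (W,) disparity prior; None means fronto-parallel
    """
    width, num_disp = cost.shape
    agg = np.empty((width, num_disp), dtype=float)
    agg[0] = cost[0]
    for x in range(1, width):
        prev = agg[x - 1]
        best_prev = prev.min()
        # Disparity change the prior expects between the two pixels.
        jump = 0
        if prior is not None:
            jump = int(np.rint(prior[x])) - int(np.rint(prior[x - 1]))
        for d in range(num_disp):
            src = d - jump  # previous disparity lying on the prior surface
            candidates = [best_prev + p2]  # arbitrary jump, large penalty
            if 0 <= src < num_disp:
                candidates.append(prev[src])  # follows the prior, no penalty
            for n in (src - 1, src + 1):
                if 0 <= n < num_disp:
                    candidates.append(prev[n] + p1)  # small deviation
            agg[x, d] = cost[x, d] + min(candidates) - best_prev
    return agg


# Textureless scanline: only the first pixel has an informative cost.
cost = np.zeros((6, 8))
cost[0, 1:] = 10.0
flat_disp = aggregate_scanline(cost).argmin(axis=1)
slanted_disp = aggregate_scanline(cost, prior=np.arange(6.0)).argmin(axis=1)
```

Without a prior the aggregation propagates the first pixel's disparity unchanged across the untextured region, which is the fronto-parallel bias; with a prior that rises by one disparity per pixel, the same costs yield a slanted solution.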
Non-planar object deformations result in challenging but informative signal variations. We aim to recover this information in a feedforward manner by employing discriminatively trained convolutional networks. We formulate the task as a regression problem and train our networks by leveraging manually annotated correspondences between images and 3D surfaces. In this talk, the focus will be on our recent work "DensePose", where we form the "COCO-DensePose" dataset by introducing an efficient annotation pipeline to collect correspondences between 50K persons appearing in the COCO dataset and the SMPL 3D deformable human-body model. We use our dataset to train CNN-based systems that deliver dense correspondences 'in the wild', namely in the presence of background, occlusions, multiple objects and scale variations. We experiment with fully-convolutional networks and a region-based DensePose-RCNN model and observe that the latter is superior; we further improve accuracy through cascading, obtaining a system that delivers highly accurate results in real time (http://densepose.org).
Organizers: Georgios Pavlakos
Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image aspects which justify visual predictions. In this talk, I will present my past and current work on Zero-Shot Learning, Vision and Language for Generative Modeling and Explainable Artificial Intelligence. Specifically, I will cover (1) how we can generalize image classification models to cases where no visual training data is available, (2) how to generate images and image features from detailed visual descriptions, and (3) how our models focus on discriminating properties of the visible object, jointly predict a class label, and explain why the predicted label is appropriate for the image whereas another label is not.
Organizers: Andreas Geiger
Complex shapes can be summarized using a coarsely defined structure which is consistent and robust across a variety of observations. However, existing synthesis techniques do not consider structural decomposition during synthesis, which leads to implausible or structurally unrealistic shapes. We explore how structure-aware reasoning can benefit existing generative techniques for complex 2D and 3D shapes. We evaluate our methodology on a 3D dataset of chairs and a 2D dataset of typefaces.
Organizers: Sergi Pujades
Organizers: Ahmed Osman
Visual perception involves a complex interaction between feedforward and feedback processes. A mechanistic understanding of these processes, and of their limitations, is a necessary first step towards elucidating key aspects of perceptual functions and dysfunctions. In this talk, I will review our ongoing effort towards understanding how feedback visual processing operates at the level of the thalamus, a dynamic relay station halfway between the retina and the cortex. I will present experimental evidence from several recent electrophysiology studies performed on subjects engaged in visual detection tasks. The results show that modulatory driving provided by top-down processes (the feedback from primary visual cortex) critically influences the ongoing thalamic activity and shapes the message to be delivered to the cortex. When neuromodulatory techniques (Transcranial Magnetic Stimulation or static magnetic fields) are used to transiently disrupt cortical activity, two very interesting effects show up: (1) stimulus detection is altered and (2) the spatial properties of thalamic receptive fields are dramatically modified. Finally, I will show how sensory information can be a powerful tool to interact with the motor system and re-organize altered patterns of movement in neurological disorders such as Parkinson's disease.
Organizers: Daniel Cudeiro
Disney Research has been actively pushing the state of the art in digitizing humans over the past decade, impacting both academia and industry. In this talk I will give an overview of a few selected projects in this area, from research into production. I will be talking about photogrammetric shape acquisition and dense performance capture for faces, eye and teeth scanning and parameterization, as well as physically based capture and modelling for hair and volumetric tissues.
Organizers: Timo Bolkart
The definition of art has been debated for more than 1000 years, and continues to be a puzzle. While scientific investigations offer hope of resolving this puzzle, machine learning classifiers that discriminate art from non-art images generally do not provide an explicit definition, and brain imaging and psychological theories are at present too coarse to provide a formal characterization. In this work, rather than approaching the problem using a machine learning approach trained on existing artworks, we hypothesize that art can be defined in terms of preexisting properties of the visual cortex. Specifically, we propose that a broad subset of visual art can be defined as patterns that are exciting to a visual brain. Resting on the finding that artificial neural networks trained on visual tasks can provide predictive models of processing in the visual cortex, our definition is operationalized by using a trained deep net as a surrogate “visual brain”, where “exciting” is defined as the activation energy of particular layers of this net. We find that this definition easily discriminates a variety of art from non-art, and further provides a ranking of art genres that is consistent with our subjective notion of ‘visually exciting’. By applying a deep net visualization technique, we can also validate the definition by generating example images that would be classified as art. The images synthesized under our definition resemble visually exciting art such as Op Art and other human-created artistic patterns.
Organizers: Michael Black
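The notion of "activation energy" in the abstract above can be illustrated with a toy computation. This is only a sketch under loud assumptions: `toy_layer` is a hypothetical stand-in (difference filters followed by a ReLU) for a layer of a trained deep net, and taking energy to be the mean squared activation is an illustrative choice, since the abstract does not give the exact formula.

```python
import numpy as np


def toy_layer(image):
    """Hypothetical stand-in for one layer of a trained net.

    Applies horizontal and vertical difference filters followed by a ReLU.
    A real experiment would use the feature maps of a trained deep net.
    """
    dx = image[:, 1:] - image[:, :-1]
    dy = image[1:, :] - image[:-1, :]
    return [np.maximum(dx, 0.0), np.maximum(dy, 0.0)]


def activation_energy(feature_maps):
    """Mean squared activation over all units of the layer (assumed definition)."""
    total = sum(float(np.sum(f ** 2)) for f in feature_maps)
    count = sum(f.size for f in feature_maps)
    return total / count


flat = np.full((8, 8), 0.5)                       # uniform grey image
stripes = np.tile(np.array([0.0, 1.0]), (8, 4))   # high-contrast vertical stripes
energy_flat = activation_energy(toy_layer(flat))
energy_stripes = activation_energy(toy_layer(stripes))
```

Under this toy layer the striped pattern produces strictly higher energy than the uniform image, which mirrors the abstract's idea of ranking images by how strongly they drive particular layers.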
One of the central problems of artificial intelligence is machine perception, i.e., the ability to understand the visual world based on input from sensors such as cameras. In this talk, I will present recent progress with respect to data generation using weak annotations, motion information and synthetic data. I will also discuss our recent results for action recognition, where human tubes and tubelets have been shown to be successful. Our tubelets move away from state-of-the-art frame-based approaches and improve classification and localization by relying on joint information from several frames. I will also show how to extend this type of method to weakly supervised learning of actions, which allows us to scale to large amounts of data with sparse manual annotation. Furthermore, I will discuss several recent extensions, including 3D pose estimation.
Organizers: Ahmed Osman
Quantifying behavior is crucial for many applications in neuroscience. Videography provides easy methods for the observation and recording of animal behavior in diverse settings, yet extracting particular aspects of a behavior for further analysis can be highly time-consuming. In motor control studies, humans or other animals are often marked with reflective markers to assist with computer-based tracking, yet markers are intrusive (especially for smaller animals), and the number and location of the markers must be determined a priori. Here, we present a highly efficient method for markerless tracking based on transfer learning with deep neural networks that achieves excellent results with minimal training data. We demonstrate the versatility of this framework by tracking various body parts in a broad collection of experimental settings: mouse odor trail-tracking, egg-laying behavior in Drosophila, and mouse hand articulation in a skilled forelimb task. For example, during the skilled reaching behavior, individual joints can be automatically tracked (and a confidence score is reported). Remarkably, even when only a small number of frames are labeled (≈200), the algorithm achieves excellent tracking performance on test frames, comparable to human accuracy.
Organizers: Melanie Feldhofer