Current solutions to discriminative and generative tasks in computer vision exist separately and often lack interpretability and explainability. Using faces as our application domain, here we present an architecture that is based around two core ideas that address these issues: first, our framework learns an unsupervised, low-dimensional embedding of faces using an adversarial autoencoder that is able to synthesize high-quality face images. Second, a supervised disentanglement splits the low-dimensional embedding vector into four sub-vectors, each of which contains separated information about one of four major face attributes (pose, identity, expression, and style) that can be used both for discriminative tasks and for manipulating all four attributes in an explicit manner. The resulting architecture achieves state-of-the-art image quality, good discrimination and face retrieval results on each of the four attributes, and supports various face editing tasks using a face representation of only 99 dimensions. Finally, we apply the architecture's robust image synthesis capabilities to visually debug label-quality issues in an existing face dataset.
Organizers: Timo Bolkart
The reconstruction of 3D scenes and their appearance from imagery is one of the longest-standing problems in computer vision. Originally developed to support robotics and artificial intelligence applications, it has found some of its most widespread use in support of interactive 3D scene visualization. One of the keys to this success has been the melding of 3D geometric and photometric reconstruction with a heavy re-use of the original imagery, which produces more realistic rendering than a pure 3D model-driven approach. In this talk, I give a retrospective of two decades of research in this area, touching on topics such as sparse and dense 3D reconstruction, the fundamental concepts in image-based rendering and computational photography, applications to virtual reality, as well as ongoing research in the areas of layered decompositions and 3D-enabled video stabilization.
Organizers: Mohamed Hassan
Humans act upon their environment through motion, the ability to plan their movements is therefore an essential component of their autonomy. In recent decades, motion planning has been widely studied in robotics and computer graphics. Nevertheless robots still fail to achieve human reactivity and coordination. The need for more efficient motion planning algorithms has been present through out my own research on "human-aware" motion planning, which aims to take the surroundings humans explicitly into account. I believe imitation learning is the key to this particular problem as it allows to learn both, new motion skills and predictive models, two capabilities that are at the heart of "human-aware" robots while simultaneously holding the promise of faster and more reactive motion generation. In this talk I will present my work in this direction.
Two talks for the price of one! I will present my recent work on the challenging problem of stereo matching of scenes with little or no surface texture, attacking the problem from two very different angles. First, I will discuss how surface orientation priors can be added to the popular semi-global matching (SGM) algorithm, which significantly reduces errors on slanted weakly-textured surfaces. The orientation priors serve as a soft constraint during matching and can be derived in a variety of ways, including from low-resolution matching results and from monocular analysis and Manhattan-world assumptions. Second, we will examine the pathological case of Mondrian Stereo -- synthetic scenes consisting solely of solid-colored planar regions, resembling paintings by Piet Mondrian. I will discuss assumptions that allow disambiguating such scenes, present a novel stereo algorithm employing symbolic reasoning about matched edge segments, and discuss how similar ideas could be utilized in robust real-world stereo algorithms for untextured environments.
Organizers: Anurag Ranjan
Non-planar object deformations result in challenging but informative signal variations. We aim to recover this information in a feedforward manner by employing discriminatively trained convolutional networks. We formulate the task as a regression problem and train our networks by leveraging upon manually annotated correspondences between images and 3D surfaces. In this talk, the focus will be on our recent work "DensePose", where we form the "COCO-DensePose" dataset by introducing an efficient annotation pipeline to collect correspondences between 50K persons appearing in the COCO dataset and the SMPL 3D deformable human-body model. We use our dataset to train CNN-based systems that deliver dense correspondences 'in the wild', namely in the presence of background, occlusions, multiple objects and scale variations. We experiment with fully-convolutional networks and region-based DensePose-RCNN model and observe a superiority of the latter; we further improve accuracy through cascading, obtaining a system that delivers highly accurate results in real time (http://densepose.org).
Organizers: Georgios Pavlakos
Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image aspects which justify visual predictions. In this talk, I will present my past and current work on Zero-Shot Learning, Vision and Language for Generative Modeling and Explainable Artificial Intelligence in that (1) how we can generalize the image classification models to the cases when no visual training data is available, (2) how to generate images and image features using detailed visual descriptions, and (3) how our models focus on discriminating properties of the visible object, jointly predict a class label,explain why the predicted label is appropriate for the image whereas another label is not.
Organizers: Andreas Geiger
Complex shapes can can be summarized using a coarsely defined structure which is consistent and robust across variety of observations. However, existing synthesis techniques do not consider structural decomposition during synthesis, causing generation of implausible or structurally unrealistic shapes. We explore how structure-aware reasoning can benefit existing generative techniques for complex 2D and 3D shapes. We evaluate our methodology on a 3D dataset of chairs and a 2D dataset of typefaces.
Organizers: Sergi Pujades
Organizers: Ahmed Osman
Visual perception involves a complex interaction between feedforward and feedback processes. A mechanistic understanding of these processing, and its limitations, is a necessary first step towards elucidating key aspects of perceptual functions and dysfunctions. In this talk, I will review our ongoing effort towards the understanding of how feedback visual processing operates at the level of the thalamus, a dynamic relay station halfway between the retina and the cortex. I will present experimental evidence from several recent electrophysiology studies performed on subjects engaged in visual detection tasks. The results show that modulatory driving provided by top-down processes (the feedback from primary visual cortex) critically influences the ongoing thalamic activity and shapes the message to be delivered to the cortex. When neuromodulatory techniques (Transcranial Magnetic Stimulation or static magnetic fields) are used to transiently disrupt cortical activity two very interesting effects show up: (1) alterations in stimulus detection and (2) the spatial properties of thalamic receptive fields are dramatically modified. Finally, I will show how sensory information can be a powerful tool to interact with the motor system and re-organize altered patterns of movement in neurological disorders such as Parkinson's disease.
Organizers: Daniel Cudeiro
Disney Research has been actively pushing the state-of-the-art in digitizing humans over the past decade, impacting both academia and industry. In this talk I will give an overview of a selected few projects in this area, from research into production. I will be talking about photogrammetric shape acquisition and dense performance capture for faces, eye and teeth scanning and parameterization, as well as physically based capture and modelling for hair and volumetric tissues.
Organizers: Timo Bolkart
The definition of art has been debated for more than 1000 years, and continues to be a puzzle. While scientific investigations offer hope of resolving this puzzle, machine learning classifiers that discriminate art from non-art images generally do not provide an explicit definition, and brain imaging and psychological theories are at present too coarse to provide a formal characterization. In this work, rather than approaching the problem using a machine learning approach trained on existing artworks, we hypothesize that art can be defined in terms of preexisting properties of the visual cortex. Specifically, we propose that a broad subset of visual art can be defined as patterns that are exciting to a visual brain. Resting on the finding that artificial neural networks trained on visual tasks can provide predictive models of processing in the visual cortex, our definition is operationalized by using a trained deep net as a surrogate “visual brain”, where “exciting” is defined as the activation energy of particular layers of this net. We find that this definition easily discriminates a variety of art from non-art, and further provides a ranking of art genres that is consistent with our subjective notion of ‘visually exciting’. By applying a deep net visualization technique, we can also validate the definition by generating example images that would be classified as art. The images synthesized under our definition resemble visually exciting art such as Op Art and other human- created artistic patterns.
Organizers: Michael Black