Human pose stability analysis is the key to understanding locomotion and control of body equilibrium, with numerous applications in the fields of Kinesiology, Medicine and Robotics. We propose and validate a novel approach to learn dynamics from kinematics of a human body to aid stability analysis. More specifically, we propose an end-to-end deep learning architecture to regress foot pressure from a human pose derived from video. We have collected and utilized a set of long (5min +) choreographed Taiji (Tai Chi) sequences of multiple subjects with synchronized motion capture, foot pressure and video data. The derived human pose data and corresponding foot pressure maps are used jointly in training a convolutional neural network with residual architecture, named “PressNET”. Cross validation results show promising performance of PressNet, significantly outperforming the baseline method under reasonable sensor noise ranges.
Organizers: Nadine Rueegg
Understanding objects and their behavior from images and videos is a difficult inverse problem. It requires learning a metric in image space that reflects object relations in real world. This metric learning problem calls for large volumes of training data. While images and videos are easily available, labels are not, thus motivating self-supervised metric and representation learning. Furthermore, I will present a widely applicable strategy based on deep reinforcement learning to improve the surrogate tasks underlying self-supervision. Thereafter, the talk will cover the learning of disentangled representations that explicitly separate different object characteristics. Our approach is based on an analysis-by-synthesis paradigm and can generate novel object instances with flexible changes to individual characteristics such as their appearance and pose. It nicely addresses diverse applications in human and animal behavior analysis, a topic we have intensive collaboration on with neuroscientists. Time permitting, I will discuss the disentangling of representations from a wider perspective including novel strategies to image stylization and new strategies for regularization of the latent space of generator networks.
Organizers: Joel Janai
The past few years with the advent of Deep Convolutional Neural Networks (DCNNs), as well as the availability of visual data it was shown that it is possible to produce excellent results in very challenging tasks, such as visual object recognition, detection, tracking etc. Nevertheless, in certain tasks such as fine-grain object recognition (e.g., face recognition) it is very difficult to collect the amount of data that are needed. In this talk, I will show how, using DCNNs, we can generate highly realistic faces and heads and use them for training algorithms such as face and facial expression recognition. Next, I will reverse the problem and demonstrate how by having trained a very powerful face recognition network it can be used to perform very accurate 3D shape and texture reconstruction of faces from a single image. Finally, I will demonstrate how to create very lightweight networks for representing 3D face texture and shape structure by capitalising upon intrinsic mesh convolutions.
Organizers: Dimitris Tzionas
In this talk, I will present my understanding on 3D face reconstruction, modelling and applications from a deep learning perspective. In the first part of my talk, I will discuss the relationship between representations (point clouds, meshes, etc) and network layers (CNN, GCN, etc) on face reconstruction task, then present my ECCV work PRN which proposed a new representation to help achieve state-of-the-art performance on face reconstruction and dense alignment tasks. I will also introduce my open source project face3d that provides examples for generating different 3D face representations. In the second part of the talk, I will talk some publications in integrating 3D techniques into deep networks, then introduce my upcoming work which implements this. In the third part, I will present how related tasks could promote each other in deep learning, including face recognition for face reconstruction task and face reconstruction for face anti-spoofing task. Finally, with such understanding of these three parts, I will present my plans on 3D face modelling and applications.
Organizers: Timo Bolkart
Much existing work in reinforcement learning involves environments that are either intentionally neutral, lacking a role for cooperation and competition, or intentionally simple, when agents need imagine nothing more than that they are playing versions of themselves. Richer game theoretic notions become important as these constraints are relaxed. For humans, this encompasses issues that concern utility, such as envy and guilt, and that concern inference, such as recursive modeling of other players, I will discuss studies treating a paradigmatic game of trust as an interactive partially-observable Markov decision process, and will illustrate the solution concepts with evidence from interactions between various groups of subjects, including those diagnosed with borderline and anti-social personality disorders.
The external world is represented in the brain as spatiotemporal patterns of electrical activity. Sensory signals, such as light, sound, and touch, are transduced at the periphery and subsequently transformed by various stages of neural circuitry, resulting in increasingly abstract representations through the sensory pathways of the brain. It is these representations that ultimately give rise to sensory perception. Deciphering the messages conveyed in the representations is often referred to as “reading the neural code”. True understanding of the neural code requires knowledge of not only the representation of the external world at one particular stage of the neural pathway, but ultimately how sensory information is communicated from the periphery to successive downstream brain structures. Our laboratory has focused on various challenges posed by this problem, some of which I will discuss. In contrast, prosthetic devices designed to augment or replace sensory function rely on the principle of artificially activating neural circuits to induce a desired perception, which we might refer to as “writing the neural code”. This requires not only significant challenges in biomaterials and interfaces, but also in knowing precisely what to tell the brain to do. Our laboratory has begun some preliminary work in this direction that I will discuss. Taken together, an understanding of these complexities and others is critical for understanding how information about the outside world is acquired and communicated to downstream brain structures, in relating spatiotemporal patterns of neural activity to sensory perception, and for the development of engineered devices for replacing or augmenting sensory function lost to trauma or disease.
Organizers: Jonas Wulff
Learning of layered or "deep" representations has provided significant advances in computer vision in recent years, but has traditionally been limited to fully supervised settings with very large amounts of training data. New results show that such methods can also excel when learning in sparse/weakly labeled settings across modalities and domains. I'll present our recent long-term recurrent network model which can learn cross-modal translation and can provide open-domain video to text transcription. I'll also describe state-of-the-art models for fully convolutional pixel-dense segmentation from weakly labeled input, and finally will discuss new methods for adapting deep recognition models to new domains with few or no target labels for categories of interest.
Organizers: Jonas Wulff
I will talk about two types of machine learning problems, which are important but have received little attention. The first are problems naturally formulated as learning a one-to-many mapping, which can handle the inherent ambiguity in tasks such as generating segmentations or captions for images. A second problem involves learning representations that are invariant to certain nuisance or sensitive factors of variation in the data while retaining as much of the remaining information as possible. The primary approach we formulate for both problems is a constrained form of joint embedding in a deep generative model, that can develop informative representations of sentences and images. Applications discussed will include image captioning, question-answering, segmentation, classification without discrimination, and domain adaptation.
Organizers: Gerard Pons-Moll
During the last three decades computer graphics established itself as a core discipline within computer science and information technology. Two decades ago, most digital content was textual. Today it has expanded to include audio, images, video, and a variety of graphical representations. New and emerging technologies such as multimedia, social networks, digital television, digital photography and the rapid development of new sensing devices, telecommunication and telepresence, virtual reality, or 3D-internet further indicate the potential of computer graphics in the years to come. Typical for the field is the coincidence of very large data sets with the demand for fast, and possibly interactive, high quality visual feedback. Furthermore, the user should be able to interact with the environment in a natural and intuitive way. In order to address the challenges mentioned above, a new and more integrated scientific view of computer graphics is required. In contrast to the classical approach to computer graphics which takes as input a scene model -- consisting of a set of light sources, a set of objects (specified by their shape and material properties), and a camera -- and uses simulation to compute an image, we like to take the more integrated view of `3D Image Analysis and Synthesis’ for our research. We consider the whole pipeline from data acquisition, over data processing to rendering in our work. In our opinion, this point of view is necessary in order to exploit the capabilities and perspectives of modern hardware, both on the input (sensors, scanners, digital photography, digital video) and output (graphics hardware, multiple platforms) side. Our vision and long term goal is the development of methods and tools to efficiently handle the huge amount of data during the acquisition process, to extract structure and meaning from the abundance of digital data, and to turn this into graphical representations that facilitate further processing, rendering, and interaction. In this presentation I will highlight some of our ongoing research by means of examples. Topics covered include 3D reconstruction and digital geometry processing, shape analysis and shape design, motion and performance capture, and 3D video processing.
Learnable representations, and deep convolutional neural networks (CNNs) in particular, have become the preferred way of extracting visual features for image understanding tasks, from object recognition to semantic segmentation. In this talk I will discuss several recent advances in deep representations for computer vision. After reviewing modern CNN architectures, I will give an example of a state-of-the-art network in text spotting; in particular, I will show that, by using only synthetic data and a sufficiently large deep model, it is possible directly map image regions to English words, a classification problem with 90K classes, obtaining in this manner state-of-the-art performance in text spotting. I will also briefly touch on other applications of deep learning to object recognition and discuss feature universality and transfer learning. In the last part of the talk I will move to the problem of understanding deep networks, which remain largely black boxes, presenting two possible approaches to their analysis. The first one are visualisation techniques that can investigate the information retained and learned by a visual representation. The second one is a method that allows exploring how representation capture geometric notions such as image transformations, and to find whether different representations are related and how.
Recent progress in computer-based visual recognition heavily relies on machine learning methods trained using large scale annotated datasets. While such data has made advances in model design and evaluation possible, it does not necessarily provide insights or constraints into those intermediate levels of computation, or deep structure, perceived as ultimately necessary in order to design reliable computer vision systems. This is noticeable in the accuracy of state of the art systems trained with such annotations, which still lag behind human performance in similar tasks. Nor does the existing data makes it immediately possible to exploit insights from a working system - the human eye - to derive potentially better features, models or algorithms. In this talk I will present a mix of perceptual and computational insights resulted from the analysis of large-scale human eye movement and 3d body motion capture datasets, collected in the context of visual recognition tasks (Human3.6M available at http://vision.imar.ro/human3.6m/, and Actions in the Eye available at http://vision.imar.ro/eyetracking/). I will show that attention models (fixation detectors, scan-paths estimators, weakly supervised object detector response functions and search strategies) can be learned from human eye movement data, and can produce state of the art results when used in end-to-end automatic visual recognition systems. I will also describe recent work in large-scale human pose estimation, showing the feasibility of pixel-level body part labeling from RGB, and towards promising 2D and 3D human pose estimation results in monocular images.In this context, I will discuss perceptual, perhaps surprising recent quantitative experiments, revealing that humans may not be significantly better than computers at perceiving 3D articulated poses from monocular images. Such findings may challenge established definitions of computer vision `tasks' and their expected levels of performance.
The breast is not just a protruding gland situated on the front of the thorax in female bodies: behind biology lies an intricate symbolism that has taken various and often contradictory meanings. We begin our journey looking at pre-historic artifacts that revered the breast as the ultimate symbol of life; we then transition to the rich iconographical tradition centering on the so-called Virgo Lactans when the breast became a metaphor of nourishment for the entire Christian community. Next, we look at how artists have eroticized the breast in portraits of fifteenth-century French courtesans and how enlightenment philosophers and revolutionary events have transformed it into a symbol of the national community. Lastly, we analyze how contemporary society has medicalized the breast through cosmetic surgery and discourses around breast cancer, and has objectified it by making the breast a constant presence in advertisement and magazine covers. Through twenty-five centuries of representations, I will talk about how the breast has been coded as both "good" and "bad," sacred and erotic, life-giving and life-destroying.
How is it that biological systems can be so imprecise, so ad hoc, and so inefficient, yet accomplish (seemingly) simple tasks that still elude state-of-the-art artificial systems? In this context, I will introduce some of the themes central to CMU's new BrainHub Initiative by discussing: (1) The complexity and challenges of studying the mind and brain; (2) How the study of the mind and brain may benefit from considering contemporary artificial systems; (3) Why studying the mind and brain might be interesting (and possibly useful) to computer scientists.
In this talk I will give an overview of work I have done over the years exploring physically based simulation of contact, deformation, and articulated structures where there are trade-offs between computational speed and physical fidelity that can be made. I will also discuss examples that mix data-driven and physically based approaches in animation and control.
Paul Kry is an associate professor in the School of Computer Science at McGill University. He has a BMath from University of Waterloo, and MSc and PhD from University of British Columbia. His research focuses on physically based simulation, motion capture, and control of character animation.
Everyone in visual psychology seems to know what Biological Motion is. Yet, it is not easy to come up with a definition that is specific enough to justify a distinct label, but is also general enough to include the many different experiments to which the term has been applied in the past. I will present a number of tasks, stimuli, and experiments, including some of my own work, to demonstrate the diversity and the appeal of the field of biological motion perception. In trying to come up with a definition of the term, I will particularly focus on a type of motion that has been considered “non-biological” in some contexts, even though it might contain -- as more recent work shows -- one of the most important visual invariants used by the visual system to distinguish animate from inanimate motion.