Human pose stability analysis is key to understanding locomotion and the control of body equilibrium, with numerous applications in kinesiology, medicine and robotics. We propose and validate a novel approach to learning the dynamics of a human body from its kinematics to aid stability analysis. More specifically, we propose an end-to-end deep learning architecture to regress foot pressure from a human pose derived from video. We have collected and utilized a set of long (5+ minute) choreographed Taiji (Tai Chi) sequences of multiple subjects with synchronized motion capture, foot pressure and video data. The derived human pose data and corresponding foot pressure maps are used jointly to train a convolutional neural network with a residual architecture, named "PressNet". Cross-validation results show promising performance: PressNet significantly outperforms the baseline method within reasonable sensor noise ranges.
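The regression setup described above can be sketched in miniature. The real PressNet is a residual convolutional network; the toy model below is a plain fully-connected regressor, and all shapes (25 joints, a 10x5 pressure grid per foot) are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def regress_pressure(pose, weights):
    """Map a flattened 2D pose to a foot-pressure map with a tiny MLP.

    pose:    (J*2,) array of joint coordinates (hypothetical J=25 joints).
    weights: list of (W, b) tuples. The actual PressNet is a residual CNN;
             this plain MLP only illustrates the pose-to-pressure regression.
    """
    x = pose
    for W, b in weights[:-1]:
        x = np.maximum(0.0, W @ x + b)     # ReLU hidden layers
    W, b = weights[-1]
    return (W @ x + b).reshape(2, 10, 5)   # two feet, 10x5 grid (illustrative)

# Random weights just to show the shapes end to end.
rng = np.random.default_rng(0)
J = 25
weights = [
    (rng.standard_normal((64, J * 2)) * 0.1, np.zeros(64)),
    (rng.standard_normal((64, 64)) * 0.1, np.zeros(64)),
    (rng.standard_normal((100, 64)) * 0.1, np.zeros(100)),
]
pressure = regress_pressure(rng.standard_normal(J * 2), weights)
print(pressure.shape)  # (2, 10, 5)
```

In training, such a network would be fit by minimizing a pixel-wise loss between predicted and measured pressure maps.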
Organizers: Nadine Rueegg
Recognition of pain in horses and other animals is important because pain is a manifestation of disease and decreases animal welfare. Pain diagnostics for humans typically include self-evaluation and localization of the pain with the help of standardized forms, and labeling of the pain by a clinical expert using pain scales. However, animals cannot verbalize their pain as humans can, and the use of standardized pain scales is challenged by the fact that animals such as horses and cattle, being prey animals, display subtle and less obvious pain behavior: it is simply beneficial for a prey animal to appear healthy, in order to lower the interest of predators. We work together with veterinarians to develop methods for automatic video-based recognition of pain in horses. These methods are typically trained with video examples of behavioral traits labeled with pain level and pain characteristics. This automated, user-independent system for recognition of pain behavior in horses will be the first of its kind in the world. A successful system might change how we monitor and care for our animals.
In this talk, I will present an overview of my Ph.D. research towards articulated human pose estimation from unconstrained images and videos. In the first part of the talk, I will present an approach to jointly model multi-person pose estimation and tracking in a single formulation. The approach represents body joint detections in a video by a spatiotemporal graph and solves an integer linear program to partition the graph into sub-graphs that correspond to plausible body pose trajectories for each person. I will also introduce the PoseTrack dataset and benchmark, which is now the de facto standard for multi-person pose estimation and tracking. In the second half of the talk, I will present a new method for 3D pose estimation from a monocular image through a novel 2.5D pose representation. The new 2.5D representation can be reliably estimated from an RGB image. Furthermore, it allows exact reconstruction of the absolute 3D body pose up to a scaling factor, which can additionally be estimated if a prior on body size is given. I will also describe a novel CNN architecture that implicitly learns heatmaps and depth maps for human body key-points from a single RGB image.
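The "up to a scaling factor" property follows from the pinhole camera model: once a depth is assigned to each 2D keypoint, every joint lifts to 3D along its viewing ray, and scaling all depths by a common factor scales the whole pose. A minimal back-projection sketch, with an assumed intrinsic matrix K and depths taken as given (the talk's method estimates relative depths plus the scale):

```python
import numpy as np

def backproject(uv, z, K):
    """Lift 2D keypoints to 3D camera coordinates via the pinhole model.

    uv: (J, 2) pixel coordinates.
    z:  (J,) per-joint depths (assumed known here for illustration).
    K:  3x3 camera intrinsic matrix.
    """
    ones = np.ones((uv.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([uv, ones]).T).T  # (J, 3) normalized rays
    return rays * z[:, None]                               # scale each ray by its depth

# Illustrative intrinsics and two joints.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
uv = np.array([[320.0, 240.0],    # at the principal point
               [420.0, 240.0]])
X = backproject(uv, np.array([2.0, 2.0]), K)
print(X)  # first joint lies on the optical axis: [0, 0, 2]
```

Multiplying z by any s > 0 yields s * X, which is why a body-size prior (e.g. a known bone length) suffices to fix the absolute scale.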
Organizers: Dimitrios Tzionas
Supervised learning with deep convolutional networks is the workhorse of the majority of computer vision research today. While much progress has been made already by exploiting deep architectures with standard components, enormous datasets, and massive computational power, I will argue that it pays to scrutinize some of the components of modern deep networks. I will begin by looking at the common pooling operation and show how we can replace standard pooling layers with a perceptually motivated alternative, with consistent gains in accuracy. Next, I will show how we can leverage self-similarity, a well-known concept from the study of natural images, to derive non-local layers for various vision tasks that boost discriminative power. Finally, I will present a lightweight approach to obtaining predictive probabilities in deep networks, making it possible to judge the reliability of a prediction.
Organizers: Michael Black
This talk argues for a fine-grained perspective on human-object interactions in video sequences. I will present approaches for understanding ‘what’ objects one interacts with during daily activities, ‘when’ we should label the temporal boundaries of interactions, ‘which’ semantic labels one can use to describe such interactions, and ‘who’ performs better when comparing people carrying out the same interaction. I will detail my group’s latest work on related sub-topics: (1) assessing action ‘completion’, i.e. when an interaction is attempted but not completed [BMVC 2018]; (2) determining skill or expertise from video sequences [CVPR 2018]; and (3) finding unequivocal semantic representations for object interactions [ongoing work]. I will also introduce EPIC-KITCHENS 2018, the recently released largest dataset of object interactions in people’s homes, recorded using wearable cameras. The dataset includes 11.5M frames fully annotated with objects and actions, based on unique annotations from the participants narrating their own videos, thus reflecting true intention. Three open challenges are now available on object detection, action recognition and action anticipation [http://epic-kitchens.github.io].
Organizers: Mohamed Hassan
In this talk, I will take an autobiographical approach to explain both where we have come from in computer graphics, starting from the early days of rendering, and where we are going in this new world of smartphones and social media. We are at a point in history where the ability to express oneself through media is unparalleled. The ubiquity and power of mobile devices, coupled with new algorithmic paradigms, are opening new expressive possibilities weekly. At the same time, these new creative media (composite imagery, augmented imagery, short-form video, 3D photos) also offer unprecedented abilities to move freely between what is real and unreal. I will focus on the spaces in between images and video, and in between objective and subjective reality. Finally, I will close with some lessons learned along the way.
In this talk I will present recent work on combining ideas from deformable models with deep learning. I will start by describing DenseReg and DensePose, two recently introduced systems for establishing dense correspondences between 2D images and 3D surface models "in the wild", namely in the presence of background, occlusions, and multiple objects. For DensePose in particular, we introduce DensePose-COCO, a large-scale dataset for dense pose estimation, and DensePose-RCNN, a system which operates at multiple frames per second on a single GPU while handling multiple humans simultaneously. I will then present Deforming AutoEncoders, a method for unsupervised dense correspondence estimation. We show that we can disentangle deformations from appearance variation in an entirely unsupervised manner, and also provide promising results for a more thorough disentanglement of images into deformations, albedo and shading. Time permitting, we will discuss a parallel line of work aiming to combine grouping with deep learning, and see how both grouping and correspondence can be understood as establishing associations between neurons.
Organizers: Vassilis Choutas
The reconstruction of 3D scenes and their appearance from imagery is one of the longest-standing problems in computer vision. Originally developed to support robotics and artificial intelligence applications, it has found some of its most widespread use in support of interactive 3D scene visualization. One of the keys to this success has been the melding of 3D geometric and photometric reconstruction with a heavy re-use of the original imagery, which produces more realistic rendering than a pure 3D model-driven approach. In this talk, I give a retrospective of two decades of research in this area, touching on topics such as sparse and dense 3D reconstruction, the fundamental concepts in image-based rendering and computational photography, applications to virtual reality, as well as ongoing research in the areas of layered decompositions and 3D-enabled video stabilization.
Organizers: Mohamed Hassan
Humans act upon their environment through motion; the ability to plan their movements is therefore an essential component of their autonomy. In recent decades, motion planning has been widely studied in robotics and computer graphics. Nevertheless, robots still fail to achieve human reactivity and coordination. The need for more efficient motion planning algorithms has been present throughout my own research on "human-aware" motion planning, which aims to take surrounding humans explicitly into account. I believe imitation learning is key to this problem, as it allows learning both new motion skills and predictive models, two capabilities at the heart of "human-aware" robots, while simultaneously holding the promise of faster and more reactive motion generation. In this talk I will present my work in this direction.
Two talks for the price of one! I will present my recent work on the challenging problem of stereo matching of scenes with little or no surface texture, attacking the problem from two very different angles. First, I will discuss how surface orientation priors can be added to the popular semi-global matching (SGM) algorithm, which significantly reduces errors on slanted weakly-textured surfaces. The orientation priors serve as a soft constraint during matching and can be derived in a variety of ways, including from low-resolution matching results and from monocular analysis and Manhattan-world assumptions. Second, we will examine the pathological case of Mondrian Stereo -- synthetic scenes consisting solely of solid-colored planar regions, resembling paintings by Piet Mondrian. I will discuss assumptions that allow disambiguating such scenes, present a novel stereo algorithm employing symbolic reasoning about matched edge segments, and discuss how similar ideas could be utilized in robust real-world stereo algorithms for untextured environments.
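To make the orientation-prior idea concrete: standard SGM aggregates matching costs along scanlines with penalties that favor constant disparity (fronto-parallel surfaces). The sketch below is a single left-to-right pass where a prior slant shifts the expected disparity per pixel; it is an illustration of the soft-constraint idea only, with integer shifts and invented penalty values, not the paper's formulation:

```python
import numpy as np

def aggregate_scanline(cost, p1, p2, slant=0):
    """SGM-style cost aggregation along one scanline (single direction).

    cost:  (W, D) per-pixel matching costs.
    p1/p2: penalties for small / large disparity changes.
    slant: expected disparity change per pixel from an orientation prior;
           0 recovers standard fronto-parallel SGM. Integer shift via
           np.roll (wrap-around at the borders is ignored in this sketch).
    """
    W, D = cost.shape
    agg = cost.copy()
    for x in range(1, W):
        prev = np.roll(agg[x - 1], slant)   # center the smoothness term on the prior slant
        best = prev.min()
        for d in range(D):
            smooth = min(prev[d],
                         prev[max(d - 1, 0)] + p1,
                         prev[min(d + 1, D - 1)] + p1,
                         best + p2)
            agg[x, d] = cost[x, d] + smooth - best  # standard SGM normalization
    return agg

rng = np.random.default_rng(1)
agg = aggregate_scanline(rng.random((16, 8)), p1=0.1, p2=0.5, slant=1)
print(agg.shape)  # (16, 8)
```

With slant = 0 the inner minimization is exactly the classic SGM recursion; a nonzero slant makes disparity ramps as cheap as constant disparity, which is the intuition behind rewarding slanted weakly-textured surfaces.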
Organizers: Anurag Ranjan