Deep learning has significantly advanced the state of the art in 3D hand pose estimation, where accuracy improves with increasing amounts of labelled data. However, acquiring 3D hand pose labels can be extremely difficult. In this talk, I will present two of our recent works that leverage self-supervised learning techniques for hand pose estimation from depth maps. In both works, we incorporate a differentiable renderer into the network and formulate the training loss as a model-fitting error used to update the network parameters. In the first part of the talk, I will present our earlier work, which approximates the hand surface with a set of spheres. We then model the pose prior as a variational lower bound with a variational auto-encoder (VAE). In the second part, I will present our latest work on regressing the vertex coordinates of a hand mesh model with a 2D fully convolutional network (FCN) in a single forward pass. In the first stage, the network estimates a dense correspondence field from every pixel on the image grid to the mesh grid. In the second stage, we design a differentiable operator to map the features learned in the previous stage and regress a 3D coordinate map on the mesh grid. Finally, we sample from the mesh grid to recover the mesh vertices and fit an articulated template mesh to them in closed form. Without any human annotation, both works perform competitively with strongly supervised methods. The latter work will also be extended to be compatible with the MANO model.
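To make the idea of a model-fitting error on a sphere-based hand model more concrete, here is a minimal sketch. It is an illustration only, not the loss used in the work described above: the function name `sphere_fit_error`, the use of an unsigned nearest-surface distance, and the toy data are all assumptions. In a real network the same quantity would be written with differentiable tensor operations.

```python
import math

def sphere_fit_error(points, spheres):
    """Mean distance from each 3D point to the surface of the nearest sphere.

    points  -- iterable of (x, y, z) tuples, e.g. back-projected depth pixels
    spheres -- iterable of ((cx, cy, cz), radius) pairs approximating a surface
    (Hypothetical helper for illustration; not taken from the talk.)
    """
    total = 0.0
    count = 0
    for p in points:
        # Unsigned distance from the point to the closest sphere surface.
        nearest = min(abs(math.dist(p, c) - r) for c, r in spheres)
        total += nearest
        count += 1
    return total / count if count else 0.0

# Toy model made of two spheres.
spheres = [((0.0, 0.0, 0.0), 1.0), ((3.0, 0.0, 0.0), 0.5)]
on_surface = [(1.0, 0.0, 0.0), (3.0, 0.5, 0.0)]   # lie exactly on the spheres
off_surface = [(2.0, 0.0, 0.0)]                   # lies between the spheres

print(sphere_fit_error(on_surface, spheres))
print(sphere_fit_error(off_surface, spheres))
```

The error is zero when the observed points lie on the model surface and grows as the model drifts away from the data, which is what lets such a term act as a training signal without pose labels.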
Organizers: Dimitrios Tzionas
In this talk I will give an overview of work I have done over the years exploring physically based simulation of contact, deformation, and articulated structures, where trade-offs can be made between computational speed and physical fidelity. I will also discuss examples that mix data-driven and physically based approaches in animation and control.
Paul Kry is an associate professor in the School of Computer Science at McGill University. He has a BMath from the University of Waterloo, and an MSc and a PhD from the University of British Columbia. His research focuses on physically based simulation, motion capture, and control of character animation.
Everyone in visual psychology seems to know what Biological Motion is. Yet, it is not easy to come up with a definition that is specific enough to justify a distinct label, but is also general enough to include the many different experiments to which the term has been applied in the past. I will present a number of tasks, stimuli, and experiments, including some of my own work, to demonstrate the diversity and the appeal of the field of biological motion perception. In trying to come up with a definition of the term, I will particularly focus on a type of motion that has been considered “non-biological” in some contexts, even though it might contain -- as more recent work shows -- one of the most important visual invariants used by the visual system to distinguish animate from inanimate motion.
We present an approach to creating 3D models of objects depicted in Web images, even when each object may only be shown in a single image. Our approach uses a comparatively small collection of existing 3D models to guide the reconstruction process. These existing shapes are used to derive information about shape structure. Our guiding idea is to jointly analyze the images and the available 3D models. Joint analysis of all images along with the available shapes regularizes the formulated optimization problems, stabilizes estimation of camera parameters and construction of dense pixel-level correspondences, and leads to reasonable reproduction of object appearance in the absence of traditional multi-view cues. Joint work with Qixing Huang and Hai Wang.
Image-based rendering was introduced in the 1990s as an alternative approach to photorealistic rendering. Its key idea is to synthesize novel renderings by re-projecting pixels from nearby views. The basic approach works well for many scenes but breaks down if the scene contains “non-standard” elements such as reflective surfaces. In this talk, I will first show how we can extend image-based rendering to handle scenes with reflections. I will then discuss a novel gradient-based technique for image-based rendering that can intrinsically handle scenes with reflections.
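The basic re-projection step behind image-based rendering can be sketched as follows. This is a simplified illustration under stated assumptions: a pinhole camera with a single focal length, known per-pixel depth, and a known rigid transform between the two views. The function name `reproject_pixel` and all parameter names are hypothetical, and the sketch does not cover the reflection handling discussed in the talk.

```python
def reproject_pixel(u, v, depth, focal, cx, cy, rotation, translation):
    """Map a source-view pixel with known depth to target-view pixel coordinates.

    rotation (3x3 nested list) and translation (length-3 list) take points from
    the source camera frame to the target camera frame. Returns None if the
    point ends up behind the target camera.
    """
    # Back-project the pixel to a 3D point in the source camera frame.
    point = [(u - cx) * depth / focal, (v - cy) * depth / focal, depth]
    # Rigidly transform the point into the target camera frame.
    x, y, z = (
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    if z <= 0:
        return None
    # Project with the same pinhole intrinsics.
    return (focal * x / z + cx, focal * y / z + cy)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Same camera pose: the pixel maps back onto itself.
same_view = reproject_pixel(100.0, 50.0, 2.0, 500.0, 320.0, 240.0,
                            identity, [0.0, 0.0, 0.0])

# Target camera moved 0.1 units to the right: the pixel shifts to the left.
shifted_view = reproject_pixel(320.0, 240.0, 2.0, 500.0, 320.0, 240.0,
                               identity, [-0.1, 0.0, 0.0])
print(same_view, shifted_view)
```

The sketch assumes each pixel has one depth, which is exactly what fails for reflective surfaces: the reflected content moves as if it were at a different depth from the surface that carries it.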
Driven by the increasing demand for photorealistic computer-generated images, graphics is currently undergoing a substantial transformation to physics-based approaches which accurately reproduce the interaction of light and matter. Progress on both sides of this transformation -- physical models and simulation techniques -- has been steady but mostly independent of one another. When combined, the resulting methods are in many cases impracticably slow and require unrealistic workarounds to process even simple everyday scenes. My research lies at the interface of these two research fields; my goal is to break down the barriers between simulation techniques and the underlying physical models, and to use the resulting insights to develop realistic methods that remain efficient over a wide range of inputs.
I will cover three areas of recent work: the first involves volumetric modeling approaches to create realistic images of woven and knitted cloth. Next, I will discuss reflectance models for glitter/sparkle effects and arbitrarily layered materials that are specially designed to allow for efficient simulations. In the last part of the talk, I will give an overview of Manifold Exploration, a Markov Chain Monte Carlo technique that is able to reason about the geometric structure of light paths in high dimensional configuration spaces defined by the underlying physical models, and which uses this information to compute images more efficiently.
I will present selected research projects of the Photogrammetry and Remote Sensing Group at ETH, including (i) 3D scene flow estimation for stereo video captured from a car; (ii) extraction of road networks from aerial images; and (iii) 3D reconstruction from large, unstructured (e.g. crowd-sourced) image collections.
The growing scale of image and video datasets in vision makes labeling and annotating such datasets for training recognition models difficult and time-consuming. Further, richer models often require richer labelings of the data, which are typically even more difficult to obtain. In this talk I will focus on two models that make use of different forms of supervision for two different vision tasks.
In the first part of this talk I will focus on object detection. The appearance of an object changes profoundly with pose, camera view and interactions of the object with other objects in the scene. This makes it challenging to learn detectors based on object-level labels (e.g., “car”). We postulate that having a richer set of labelings (at different levels of granularity) for an object can significantly alleviate these problems. Such labelings include finer-grained sub-categories, consistent in appearance and view, and higher-order composites, i.e. contextual groupings of objects consistent in their spatial layout and appearance. However, obtaining such a rich set of annotations, including annotation of an exponentially growing set of object groupings, is infeasible. To this end, we propose a weakly-supervised framework for object detection where we discover subcategories and the composites automatically with only traditional object-level category labels as input.
In the second part of the talk I will focus on a framework for large-scale image set and video summarization. Starting from the intuition that the characteristics of the two media types are different but complementary, we develop a fast and easily parallelizable approach for creating not only video summaries but also novel structural summaries of events in the form of storyline graphs. The storyline graphs can illustrate various events or activities associated with the topic in the form of a branching directed network. The video summarization is achieved by diversity ranking on the similarity graphs between images and video frames, thereby treating consumer images as essentially a form of weak supervision. The reconstruction of storyline graphs, on the other hand, is formulated as inference of sparse time-varying directed graphs from a set of photo streams with the assistance of consumer videos.
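One common way to rank items for diversity on a similarity graph is a greedy selection that trades relevance against redundancy. The sketch below illustrates that general idea and is not the method from the talk: the function name `diversity_rank`, the use of mean similarity as a relevance score, and the `trade_off` parameter are all assumptions made for this example.

```python
def diversity_rank(sim, k, trade_off=0.5):
    """Greedily pick k item indices that are central yet mutually dissimilar.

    sim       -- square similarity matrix (nested lists, values in [0, 1])
    trade_off -- weight on relevance versus redundancy (hypothetical parameter)
    """
    n = len(sim)
    # Relevance of an item: its mean similarity to every item in the graph.
    relevance = [sum(row) / n for row in sim]
    selected = []
    remaining = set(range(n))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy: similarity to the closest item already selected.
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return trade_off * relevance[i] - (1 - trade_off) * redundancy
        # Sort first so that ties are broken by the smallest index.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Items 0 and 1 are near-duplicates, as are items 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
summary = diversity_rank(sim, 2)
print(summary)
```

With two slots, the selection takes one item from each near-duplicate pair instead of two frames that show almost the same content.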
Time permitting I will also talk about a few other recent project highlights.
Abstract: I will present a general framework for modelling and recovering 3D shape and pose using subdivision surfaces. To demonstrate this framework's generality, I will show how to recover both a personalized rigged hand model from a sequence of depth images and a blend shape model of dolphin pose from a collection of 2D dolphin images. The core requirement is the formulation of a generative model in which the control vertices of a smooth subdivision surface are parameterized (e.g. with joint angles or blend weights) by a differentiable deformation function. The energy function that falls out of measuring the deviation between the surface and the observed data is also differentiable and can be minimized through standard, albeit tricky, gradient-based non-linear optimization from a reasonable initial guess. The latter can often be obtained using machine learning methods when manual intervention is undesirable. Satisfyingly, the "tricks" involved in the former are elegant and widen the applicability of these methods.
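As a toy illustration of minimizing such a differentiable energy, the sketch below fits a single blend weight by plain gradient descent. It is heavily simplified and assumes things the abstract does not state: the vertices are compared directly to observed points (no subdivision surface, no correspondence search), there is only one blend shape, and the names `fit_blend_weight`, `base`, and `offset` are hypothetical.

```python
def fit_blend_weight(base, offset, observed, w0=0.0, step=0.1, iters=100):
    """Minimize E(w) = sum_i ||base_i + w * offset_i - observed_i||^2 over w.

    Uses fixed-step gradient descent starting from the initial guess w0.
    """
    w = w0
    for _ in range(iters):
        grad = 0.0
        for b, o, d in zip(base, offset, observed):
            for axis in range(len(b)):
                residual = b[axis] + w * o[axis] - d[axis]
                # dE/dw accumulates 2 * residual * d(vertex)/dw.
                grad += 2.0 * residual * o[axis]
        w -= step * grad
    return w

# Toy 2D "control vertices" and one blend-shape displacement per vertex.
base = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
offset = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Synthetic observations generated with a known weight.
true_w = 0.7
observed = [
    tuple(b_i + true_w * o_i for b_i, o_i in zip(b, o))
    for b, o in zip(base, offset)
]

estimated_w = fit_blend_weight(base, offset, observed)
print(estimated_w)
```

This energy is quadratic in the weight, so a fixed step size converges reliably. Energies built on real surfaces are non-linear, which is why the initial guess matters there.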
In order to avoid an expensive manual labeling process or to learn object classes autonomously without human intervention, object discovery techniques have been proposed that extract visually similar objects from weakly labelled videos. However, the problem of discovering small or medium-sized objects is largely unexplored. We observe that videos with activities involving human-object interactions can serve as weakly labelled data for such cases. Since neither object appearance nor motion is distinct enough to discover objects in these videos, we propose a framework that samples from a space of algorithms and their parameters to extract sequences of object proposals. Furthermore, we model similarity of objects based on appearance and functionality, which is derived from human and object motion. We show that functionality is an important cue for discovering objects from activities and demonstrate the generality of the model on three challenging RGB-D and RGB datasets.
In this talk I will discuss two related problems in 3D reconstruction: (i) recovering the 3D shape of a temporally varying non-rigid 3D surface given a single video sequence and (ii) reconstructing different instances of the same object class category given a large collection of images from that category. In both cases we extract dense 3D shape information by analysing shape variation -- in one case of the same object instance over time and in the other across different instances of objects that belong to the same class.
First I will discuss the problem of dense capture of 3D non-rigid surfaces from a monocular video sequence. We take a purely model-free approach where no strong assumptions are made about the object we are looking at or the way it deforms. We apply low rank and spatial smoothness priors to obtain dense non-rigid models using a variational approach.
Second I will describe our recent approach to populating the Pascal VOC dataset with dense, per-object 3D reconstructions, bootstrapped from class labels, ground truth figure-ground segmentations and a small set of keypoint annotations. Our proposed algorithm first estimates camera viewpoint using rigid structure-from-motion, then reconstructs object shapes by optimizing over visual hull proposals guided by loose within-class shape similarity assumptions.
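For readers unfamiliar with visual hulls, the following sketch carves a small voxel grid from three binary silhouettes. It only illustrates the basic concept and rests on strong assumptions that the approach above does not make: orthographic views aligned with the coordinate axes, noise-free silhouettes, and a cubic grid. The function name `carve_visual_hull` and its argument layout are hypothetical.

```python
def carve_visual_hull(sil_x, sil_y, sil_z):
    """Return the set of voxels (i, j, k) consistent with three silhouettes.

    sil_x[j][k] -- silhouette seen along the x axis
    sil_y[i][k] -- silhouette seen along the y axis
    sil_z[i][j] -- silhouette seen along the z axis
    All silhouettes are n x n nested lists of booleans.
    """
    n = len(sil_z)
    return {
        (i, j, k)
        for i in range(n)
        for j in range(n)
        for k in range(n)
        # Keep a voxel only if it projects inside every silhouette.
        if sil_x[j][k] and sil_y[i][k] and sil_z[i][j]
    }

n = 3
centre_only = [[(a == 1 and b == 1) for b in range(n)] for a in range(n)]
middle_slab = [[(b == 1) for b in range(n)] for _ in range(n)]
everything = [[True] * n for _ in range(n)]

# A single occupied centre voxel produces centre-only silhouettes in all views.
single_voxel = carve_visual_hull(centre_only, centre_only, centre_only)

# A flat slab at k == 1 fills the z view but is a thin line in the side views.
slab = carve_visual_hull(middle_slab, middle_slab, everything)
print(single_voxel, len(slab))
```

The hull is only an outer bound on the true shape, since concavities never show up in silhouettes. That makes additional guidance, such as shape similarity within a class, useful.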