Since Hubel and Wiesel's seminal findings in the primary visual cortex (V1) more than 50 years ago, progress in vision science has been very limited along previous frameworks and schools of thoughts on understanding vision. Have we been asking the right questions? I will show observations motivating the new path. First, a drastic information bottleneck forces the brain to process only a tiny fraction of the massive visual input information; this selection is called the attentional selection, how to select this tiny fraction is critical. Second, a large body of evidence has been accumulating to suggest that the primary visual cortex (V1) is where this selection starts, suggesting that the visual cortical areas along the visual pathway beyond V1 must be investigated in light of this selection in V1. Placing attentional selection as the center stage, a new path to understanding vision is proposed (articulated in my book "Understanding vision: theory, models, and data", Oxford University Press 2014). I will show a first example of using this new path, which aims to ask new questions and make fresh progresses. I will relate our insights to artificial vision systems to discuss issues like top-down feedbacks in hierachical processing, analysis-by-synthesis, and image understanding.
In this talk we present some recent results on human action recognition in videos. We, first, show how to use human pose for action recognition. To this end we propose a new pose-based convolutional neural network descriptor for action recognition, which aggregates motion and appearance information along tracks of human body parts. Next, we present an approach for spatio-temporal action localization in realistic videos. The approach first detects proposals at the frame-level and then tracks high-scoring proposals in the video. Our tracker relies simultaneously on instance-level and class-level detectors. Action are localized in time with a sliding window approach at the track level. Finally, we show how to extend this method to weakly supervised learning of actions, which allows to scale to large amounts of data without manual annotation.
Typical human actions such as hand-shaking and drinking last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of single frames or short video clips and fail to model actions at their full temporal scale. In this work we learn video representations using neural networks with long-term temporal convolutions. We demonstrate that CNN models with increased temporal extents improve the accuracy of action recognition despite reduced spatial resolution. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition UCF101 and HMDB51.
Proper handling of occlusions is a big challenge for model based reconstruction, e.g. for multi-view motion capture a major difficulty is the handling of occluding body parts. We propose a smooth volumetric scene representation, which implicitly converts occlusion into a smooth and differentiable phenomena (ICCV2015). Our ray tracing image formation model helps to express the objective in a single closed-form expression. This is in contrast to existing surface(mesh) representations, where occlusion is a local effect, causes non-differentiability, and is difficult to optimize. We demonstrate improvements for multi-view scene reconstruction, rigid object tracking, and motion capture. Moreover, I will show an application of motion tracking to the interactive control of virtual characters (SigAsia2015).
The core focus of my research is on robot perception. Within this broad categorization, I am mainly interested in understanding how teams of robots and sensors can cooperate and/or collaborate to improve the perception of themselves (self-localization) as well as their surroundings (target tracking, mapping, etc.). In this talk I will describe the inter-dependencies of such perception modules and present state-of-the-art methods to perform unified cooperative state estimation. The trade-off between accuracy of estimation and computational speed will be highlighted through a new optimization-based method for unified-state estimation. Furthermore, I will also describe how perception-based multirobot formation control can be achieved. Towards the end, I will present some recent results on cooperative vision-based target tracking and a few comments on our ongoing work regarding cooperative aerial mapping with human-in-the-loop.
Modeling and reconstruction of shape and motion are problems of fundamental importance in computer vision. Inverse Problem theory constitutes a powerful mathematical framework for dealing with ill-posed problems as the ones typically arising in shape and motion modeling. In this talk, I will present methods inspired by Inverse Problem theory, for dealing with four different shape and motion modeling problems. In particular, in the context of shape modeling, I will present a method for component-wise modeling of articulated objects and its application in computing 3D models of animals. Additionally, I will discuss the problem of modeling of specular surfaces via the properties of their material, and I will also present a model for confidence driven depth image fusion based on total variation regularization. Regarding motion, I will discuss a method for the recognition of human actions from motion capture data based on Nonparametric Bayesian models.
Inverse problems are ubiquitous in image processing and applied science in general. Such problems describe the challenge of computing the parameters that characterize a system from the outcomes. While this might seem easy at first for simple systems, many inverse problems share a property that makes them much more intricate: they are ill-posed. This means that either the problem does not have a unique solution or this solution does not depend continuously on the outcomes of the system. Bayesian statistics provides a framework that allows to treat such problems in a systematic way. The missing piece of information is encoded as a prior distribution on the space of possible solutions. In this talk, we will study probabilistic image models as priors for statistical inversion. In particular, we will give a probabilistic interpretation of the classical TV-prior and discuss how this interpretation can be used as a starting point for more complex models. We will see that many important auxiliary quantities such as edges and regions can be incorporated into the model in the form of latent variables. This leads to the conjecture that many image processing tasks, such as denoising and segmentation, should not be considered separately, but instead be treated together.
In general Helga Griffiths is a Multi-Sense-Artist working on the intersection of science and art. She has been working for over 20 years on the integration of various sensory stimuli into her “multi-sense” installations. Typical for her work is to produce a sensory experience to transcend conventional boundaries of perception.
Organizers: Emma-Jayne Holderness
I will describe a series of work that aims to automatically understand images of animals and plants. I will begin by describing recent work that uses Bounded Distortion matching to model pose variation in animals. Using a generic 3D model of an animal and multiple images of different individuals in various poses, we construct a model that captures the way in which the animal articulates. This is done by solving for the pose of the template that matches each image while simultaneously solving for the stiffness of each tetrahedron of the model. We minimize an L1 norm on stiffness, producing a model that bends easily at joints, but that captures the rigidity of other parts of the animal. We show that this model can determine the pose of animals such as cats in a wide range of positions. Bounded distortion forms a core part of the matching between 3D model and 2D images. I will also show that Bounded Distortion can be used for 2D matching. We use it to find corresponding features in images very robustly, optimizing an L0 distance to maximize the number of matched features, while bounding the amount of non-rigid variation between the images. We demonstrate the use of this approach in matching non-rigid objects and in wide-baseline matching of features. I will also give an overview of a method for identifying the parts of animals in images, to produce an automatic correspondence between images of animals. Building on these correspondences we develop methods for recognizing the species of a bird, or the breed of a dog. We use these recognition algorithms to construct electronic field guides. I will describe three field guides that we have published, Birdsnap, Dogsnap, and Leafsnap. Leafsnap identifies the species of trees using shape-based matching to compare images of leaves. Leafsnap has been downloaded by over 1.5 million users, and has been used in schools and in biodiversity studies. This work has been done in collaboration with many University of Maryland students and with groups at Columbia University, the Smithsonian Institution National Museum of Natural History, and the Weizmann Institute.
Organizers: Stephan Streuber
The design of tangent vector fields on discrete surfaces is a basic building block for many geometry processing applications, such as surface remeshing, parameterization and architectural geometric design. Many applications require the design of multiple vector fields (vector sets) coupled in a nontrivial way; for example, sets of more than two vectors are used for meshing of triangular, quadrilateral and hexagonal meshes. In this talk, a new, polynomial-based representation for general unordered vector sets will be presented. Using this representation we can efficiently interpolate user provided vector constraints to design vector set fields. Our interpolation scheme will require neither integer period jumps, nor explicit pairings of vectors between adjacent sets on a manifold, as is common in field design literature. Several extensions to the basic interpolation scheme are possible, which make our representation applicable in various scenarios; in this talk, we will focus on generating vector set fields particularly suited for mesh parameterization and show applications in architectural modeling.
Organizers: Gerard Pons-Moll
The recent amazing success of deep learning has been mainly in discriminative learning, that is, classification and regression. An important factor for this success has been, besides Moore's law, the availability of large labeled datasets. However, it is not clear whether in the future the amount of available labels grows as fast as the amount of unlabeled data, providing one argument to be interested in unsupervised and semi-supervised learning. Besides this there are a number of other reasons why unsupervised learning is still important, such as the fact that data in the life sciences often has many more features than instances (p>>n), the fact that probabilities over feature space are useful for planning and control problems and the fact that complex simulator models are the norm in the sciences. In this talk I will discuss deep generative models that can be jointly trained with discriminative models and that facilitate semi-supervised learning. I will discuss recent progress in learning and Bayesian inference in these "variational auto-encoders". I will then extend the deep generative models to the class of simulators for which no tractable likelihood exists and discuss new Bayesian inference procedures to fit these models to data.
Organizers: Peter Vincent Gehler