Department Talks
  • Christian Häne
  • MRC-SR

Volumetric 3D modeling has attracted a lot of attention in the past. In this talk I will explain how the standard volumetric formulation can be extended to include semantic information by using a convex multi-label formulation. One of the strengths of our formulation is that it allows us to directly account for the expected surface orientations. I will focus on two applications. First, I will introduce a method that allows for joint volumetric reconstruction and class segmentation. This is achieved by taking into account the expected orientations of object classes such as ground and building. Such a joint approach considerably improves the quality of the geometry while at the same time giving a consistent semantic segmentation. In the second application I will present a method that allows for the reconstruction of challenging objects such as glass bottles. The main difficulty in reconstructing such objects is the texture-less, transparent and reflective areas in the input images. We propose to formulate a shape prior based on the locally expected surface orientation to account for the ambiguous input data. Our multi-label approach also directly enables us to segment the object from its surroundings.

Towards Lifelong Learning for Visual Scene Understanding

IS Colloquium
  • 12 May 2014 • 11:15
  • Christoph Lampert
  • Max Planck House Lecture Hall

The goal of lifelong visual learning is to develop techniques that continuously and autonomously learn from visual data, potentially for years or decades. During this time the system should build an ever-improving base of generic visual information, and use it as background knowledge and context for solving specific computer vision tasks. In my talk, I will highlight two recent results from our group on the road towards lifelong visual scene understanding: the derivation of theoretical guarantees for lifelong learning systems and the development of practical methods for object categorization based on semantic attributes.

Organizers: Gerard Pons-Moll

  • Nikolaus Troje
  • MRC Seminar room (0.A.03)

Point-light walkers and stick figures rendered orthographically and without self-occlusion do not contain any information as to their depth. For instance, a frontoparallel projection could depict a walker from the front or from the back. Nevertheless, observers show a strong bias towards seeing the walker as facing the viewer. A related stimulus, the silhouette of a human figure, does not seem to show such a bias. We develop these observations into a tool to study the cause of the facing-the-viewer bias observed for biological motion displays.

I will give a short overview of existing theories with respect to the facing-the-viewer bias, and of a number of findings that seem hard to explain with any single one of them. I will then present the results of our studies on both stick figures and silhouettes, which gave rise to a new theory about the facing-the-viewer bias, and I will finally present an experiment that tests a hypothesis resulting from it. The studies are discussed in the context of one of the most general problems the visual system has to solve: How do we disambiguate an initially ambiguous sensory world and eventually arrive at the perception of a stable, predictable "reality"?
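The depth ambiguity described above can be illustrated with a minimal sketch. It assumes that orthographic projection simply drops the depth coordinate, and it uses made-up joint positions (the `walker` list and both function names are hypothetical, not taken from the talk). A figure and its depth-reversed counterpart then produce exactly the same image.

```python
def orthographic_projection(points):
    """Project (x, y, z) points onto the image plane by dropping the depth z."""
    return [(x, y) for x, y, _ in points]


def reverse_depth(points):
    """Mirror a figure in depth, i.e. negate z for every point."""
    return [(x, y, -z) for x, y, z in points]


# Hypothetical joint positions of a stick figure (illustrative values only).
walker = [
    (0.0, 1.7, 0.0),     # head
    (-0.2, 1.4, 0.05),   # one shoulder
    (0.2, 1.4, -0.05),   # other shoulder
    (0.1, 0.0, 0.3),     # foot swung towards the viewer
    (-0.1, 0.0, -0.3),   # foot swung away from the viewer
]

mirrored = reverse_depth(walker)

# The two 3D figures differ, yet their orthographic images are identical.
same_image = orthographic_projection(walker) == orthographic_projection(mirrored)
```

Because the projected points coincide, the image alone cannot distinguish the two depth orderings; any preference for one interpretation has to come from the observer.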

Video Segmentation

IS Colloquium
  • 05 May 2014 • 09:15:00
  • Thomas Brox
  • Max Planck House Lecture Hall

Compared to static image segmentation, video segmentation is still in its infancy. Various research groups have different tasks in mind when they talk of video segmentation. For some it is motion segmentation, some think of an over-segmentation with thousands of regions per video, and others understand video segmentation as contour tracking. I will go through what I think are reasonable video segmentation subtasks and will touch on the issue of benchmarking. I will also discuss the difference between image and video segmentation. Due to the availability of motion and the redundancy of successive frames, video segmentation should actually be easier than image segmentation. However, recent evidence indicates the opposite: at least at the level of superpixel segmentation, image segmentation methodology is more advanced than what can be found in the video segmentation literature.

Organizers: Gerard Pons-Moll

  • Cordelia Schmid
  • MRC seminar room (0.A.03)

In the first part of our talk, we present an approach for large displacement optical flow. Optical flow computation is a key component in many computer vision systems designed for tasks such as action detection or activity recognition. Inspired by the large displacement optical flow of Brox and Malik, our approach DeepFlow combines a novel matching algorithm with a variational approach. Our matching algorithm builds upon a multi-stage architecture interleaving convolutions and max-pooling. DeepFlow efficiently handles large displacements occurring in realistic videos, and shows competitive performance on optical flow benchmarks.

In the second part of our talk, we present a state-of-the-art approach for action recognition based on motion-stabilized trajectory descriptors and a Fisher vector representation. We briefly review recent trajectory-based video features and then introduce their motion-stabilized version, which combines human detection and dominant motion estimation. Fisher vectors summarize the information of a video efficiently. Results on several recent action datasets as well as the TrecVid MED dataset show that our approach outperforms the state of the art.

  • Jiri Matas
  • Max Planck House Lecture Hall

Computer vision problems often involve optimization of two quantities, one of which is time. Such problems can be formulated as time-constrained optimization or performance-constrained search for the fastest algorithm. We show that it is possible to obtain quasi-optimal time-constrained solutions to some vision problems by applying Wald's theory of sequential decision-making. Wald assumes independence of observations, which is rarely true in computer vision. We address the problem by combining Wald's sequential probability ratio test and AdaBoost. The solution, called WaldBoost, can be viewed as a principled way to build a close-to-optimal “cascade of classifiers” of the Viola-Jones type. The approach will be demonstrated on four tasks: (i) face detection, (ii) establishing reliable correspondences between images, (iii) real-time detection of interest points and (iv) model search and outlier detection using RANSAC. In the face detection problem, the objective is learning the fastest detector satisfying constraints on false positive and false negative rates. The correspondence pruning addresses the problem of fast selection with a predefined false negative rate. In the interest point problem we show how a fast implementation of known detectors can be obtained by WaldBoost. The “mimicked” detectors provide a training set of positive and negative examples of interest points, and WaldBoost learns a detector, formed as a linear combination of efficiently computable features, that is (significantly) faster than the providers of the training set. In RANSAC, we show how to exploit Wald's test in a randomised model verification procedure to obtain an algorithm significantly faster than deterministic verification yet with equivalent probabilistic guarantees of correctness.
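As a rough illustration of the sequential test mentioned above, the following sketch implements Wald's sequential probability ratio test for independent binary observations. It is not WaldBoost itself: the function name `sprt`, the Bernoulli observation model, and the parameters `p0`, `p1`, `alpha`, and `beta` are assumptions made for this example, and the independence assumption is exactly the one the abstract notes rarely holds in vision.

```python
import math


def sprt(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for Bernoulli observations.

    H0: success probability is p0; H1: success probability is p1.
    alpha is the target rate of wrongly accepting H1, beta the target rate
    of wrongly accepting H0. Returns (decision, number_of_observations_used).
    Assumes independent observations (hypothetical setting for illustration).
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this value
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this value
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "H1", n                 # stop early: enough evidence for H1
        if llr <= lower:
            return "H0", n                 # stop early: enough evidence for H0
    return "undecided", n                  # data ran out before a threshold was hit


# Example call with a clear-cut stream of observations.
decision, used = sprt([1, 1, 1, 1, 1, 1], p0=0.2, p1=0.8)
```

The point of the sketch is the early exit: the test stops as soon as the accumulated evidence crosses either threshold, so easy cases consume few observations, which is the time-saving behaviour a cascade of classifiers exploits.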

Organizers: Gerard Pons-Moll

Scalable Surface-Based Stereo Matching

  • 10 April 2014 • 14:00:00
  • Daniel Scharstein
  • MRC seminar room (0.A.03)

Stereo matching -- establishing correspondences between images taken from nearby viewpoints -- is one of the oldest problems in computer vision.  While impressive progress has been made over the last two decades, most current stereo methods do not scale to the high-resolution images taken by today's cameras since they require searching the full space of all possible disparity hypotheses over all pixels.

In this talk I will describe a new scalable stereo method that only evaluates a small portion of the search space.  The method first generates plane hypotheses from matched sparse features, which are then refined into surface hypotheses using local slanted plane sweeps over a narrow disparity range.  Finally, each pixel is assigned to one of the local surface hypotheses. The technique achieves significant speedups over previous algorithms and achieves state-of-the-art accuracy on high-resolution stereo pairs of up to 19 megapixels.

I will also present a new dataset of high-resolution stereo pairs with subpixel-accurate ground truth, and provide a brief outlook on the upcoming new version of the Middlebury stereo benchmark.
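The first step of the method described above, generating plane hypotheses from matched sparse features, can be sketched as follows. This is only an illustration under stated assumptions: a plane hypothesis is parameterized in disparity space as d = a*x + b*y + c and is determined from exactly three matched features; the function names and the sample matches are hypothetical, and the talk's actual hypothesis generation and slanted plane sweep refinement are not reproduced here.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def plane_from_matches(matches):
    """Fit d = a*x + b*y + c through three (x, y, disparity) matches.

    Solves the 3x3 linear system with Cramer's rule and returns (a, b, c).
    """
    rows = [[x, y, 1.0] for x, y, _ in matches]
    rhs = [d for _, _, d in matches]
    det = det3(rows)
    if abs(det) < 1e-12:
        raise ValueError("feature points are collinear; the plane is not unique")
    coeffs = []
    for col in range(3):
        m = [row[:] for row in rows]       # copy, then swap in the right-hand side
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / det)
    return tuple(coeffs)


def plane_disparity(plane, x, y):
    """Disparity predicted by a plane hypothesis at pixel (x, y)."""
    a, b, c = plane
    return a * x + b * y + c


# Hypothetical matched features: (x, y, disparity in pixels).
matches = [(0.0, 0.0, 20.0), (100.0, 0.0, 30.0), (0.0, 100.0, 25.0)]
plane = plane_from_matches(matches)
predicted = plane_disparity(plane, 50.0, 50.0)
```

A hypothesis of this kind predicts a disparity at every pixel it covers, so matching only needs to search a narrow range around that prediction rather than the full disparity range.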

Video-based Analysis of Humans and Their Behavior

  • 27 March 2014 • 14:00:00
  • Stan Sclaroff
  • MRC Seminar room (0.A.03)

This talk will give an overview of some of the research in the Image and Video Computing Group at Boston University related to image- and video-based analysis of humans and their behavior, including: tracking humans, localizing and classifying actions in space-time, exploiting contextual cues in action classification, estimating human pose from images, analyzing the communicative behavior of children in video, and sign language recognition and retrieval.

Collaborators in this work include (in alphabetical order): Vassilis Athitsos, Qinxun Bai, Margrit Betke, R. Gokberk Cinbis, Kun He, Nazli Ikizler-Cinbis, Hao Jiang, Liliana Lo Presti, Shugao Ma, Joan Nash, Carol Neidle, Agata Rozga, Tai-peng Tian, Ashwin Thangali, Zheng Wu, and Jianming Zhang.

Multi-View Perception of Dynamic Scenes

IS Colloquium
  • 20 March 2014 • 11:15:00 - 12:30
  • Edmond Boyer
  • Max Planck House Lecture Hall

The INRIA MORPHEO research team is working on the perception of moving shapes using multiple camera systems. Such systems make it possible to recover dense information on shapes and their motions using visual cues. This opens avenues for research on how to model, understand and animate real dynamic shapes using several videos. In this talk I will focus in particular on the team's recent work on two fundamental components of the multi-view perception of dynamic scenes: (i) the recovery of time-consistent shape models, or shape tracking, and (ii) the segmentation of objects in multiple views and over time.

Organizers: Gerard Pons-Moll

  • Prof. Yoshinari Kameda
  • MRC seminar room (0.A.03)

This talk presents our 3D video production method, by which a user can watch a real game from any free viewpoint. Players in the game are captured by 10 cameras and reproduced three-dimensionally with a billboard-based representation in real time. In producing the 3D video, we have also worked on a good user interface that enables people to move the camera intuitively. As the speaker also works on a wide variety of topics from computer vision to augmented reality, selected recent work will also be introduced briefly.

Dr. Yoshinari Kameda began his research with human pose estimation, the topic of his Ph.D. thesis, and has since expanded his interests to computer vision, human interfaces, and augmented reality.
He is now an associate professor at the University of Tsukuba.
He is also a member of the Center for Computational Science of U-Tsukuba, where some outstanding supercomputers are in operation.
He served as an area chair for the International Symposium on Mixed and Augmented Reality for four years (2007-2010).