Supervised learning with deep convolutional networks is the workhorse of the majority of computer vision research today. While much progress has been made already by exploiting deep architectures with standard components, enormous datasets, and massive computational power, I will argue that it pays to scrutinize some of the components of modern deep networks. I will begin by looking at the common pooling operation and show how we can replace standard pooling layers with a perceptually motivated alternative, yielding consistent gains in accuracy. Next, I will show how we can leverage self-similarity, a well-known concept from the study of natural images, to derive non-local layers for various vision tasks that boost their discriminative power. Finally, I will present a lightweight approach to obtaining predictive probabilities in deep networks, allowing us to judge the reliability of the predictions.
Biography: I am a professor and deputy chair of Computer Science at TU Darmstadt in Germany, where I lead the Visual Inference Lab. My research interests lie mainly in computer vision and machine learning, with a focus on statistical models for problems of visual inference. Particular interests include semantic scene understanding, image motion estimation, deep learning, probabilistic models of low-level vision, and people detection and tracking.