I lead the Holistic Vision Group (HVG) in the Department of Perceiving Systems at the Max Planck Institute for Intelligent Systems. My group is funded by the DFG through the CRC 1233 on Robust Vision.
I am interested in the intersection between computer vision and machine learning with a focus on holistic visual scene understanding. In particular, I am interested in analyzing and modeling people in complex visual scenes.
Offers: I am looking for highly motivated PhD students and PhD interns. I also have projects for bachelor's and master's theses. If you are interested, please contact me directly or send your application to email@example.com.
New! We have one paper accepted to ACCV 2018 as an oral presentation.
One paper accepted to ECCV 2018.
One paper accepted to BMVC 2018.
Our work on part-aligned bilinear representations for person re-identification is online.
Our work on human action segmentation in real time is online, and the code is available.
I will be an area chair for ACCV 2018.
I received an Early Career Research Grant to start my own research group at the Max Planck Institute for Intelligent Systems and the University of Tübingen; details coming soon. I am looking for highly motivated PhD students and PhD interns!
I successfully defended my PhD thesis "People Detection and Tracking in Crowded Scenes" on 29 September 2017 at the Max Planck Institute for Informatics. Thesis Committee: Prof. Bernt Schiele, Prof. Michael Black, Prof. Luc Van Gool.
Winner of the CVPR 2017 Multi-Object Tracking Challenge (MOT17).
Four papers accepted at CVPR 2017!
DAGM MVTec Dissertation Award, 2018
Winner of the Multi-Object Tracking Challenge at CVPR 2017
Winner of the Multi-Object Tracking Challenge at ECCV 2016
BMVC Best Paper Award, 2012
Scholarship for excellence in academic performance RWTH Aachen 2009, 2010
SS 2016: High-Level Computer Vision, Saarland University, teaching assistant
SS 2015: High-Level Computer Vision, Saarland University, teaching assistant
SS 2013: High-Level Computer Vision, Saarland University, teaching assistant
Deep learning has brought rapid progress in computer vision in recent years. However, training deep models in a supervised fashion requires big datasets with annotated ground truth. Human annotators tend to be reasonably efficient for tasks like sparse 2D joint estimation; however, annotation for other tasks like dense optical flo...
Human behavior can be described at multiple levels. At the lowest level, we observe the 3D pose of the body over time. Poses can be organized into primitives that capture coordinated activity of different body parts. These further form more complex "actions" or "behaviors". Finally, underlying all of the abo...
People are often a central element of visual scenes. It has been a long-standing goal in computer vision to develop computational models that enable machines to detect crowds of people, analyze their motion and poses, infer their actions and reason about the consequences. Our research addresses a wide range of challen...
In Proceedings of the British Machine Vision Conference (BMVC), pages: 269, BMVA Press, September 2018 (inproceedings)
Parsing continuous human motion into meaningful segments plays an essential role in various applications. In this work, we propose a hierarchical dynamic clustering framework to derive action clusters from a sequence of local features in an unsupervised bottom-up manner. We systematically investigate the modules in this framework and particularly propose diverse temporal pooling schemes, in order to realize accurate temporal action localization. We demonstrate our method on two motion parsing tasks: temporal action segmentation and abnormal behavior detection. The experimental results indicate that the proposed framework is significantly more effective than the other related state-of-the-art methods on several datasets.
In European Conference on Computer Vision (ECCV), 11218, pages: 418-437, Springer, Cham, September 2018 (inproceedings)
Comparing the appearance of corresponding body parts is essential for person re-identification. However, body parts are frequently misaligned between detected boxes, due to the detection errors and the pose/viewpoint changes. In this paper, we propose a network that learns a part-aligned representation for person re-identification. Our model consists of a two-stream network, which generates appearance and body part feature maps respectively, and a bilinear-pooling layer that fuses two feature maps to an image descriptor. We show that it results in a compact descriptor, where the inner product between two image descriptors is equivalent to an aggregation of the local appearance similarities of the corresponding body parts, and thereby significantly reduces the part misalignment problem. Our approach is advantageous over other pose-guided representations by learning part descriptors optimal for person re-identification. Training the network does not require any part annotation on the person re-identification dataset. Instead, we simply initialize the part sub-stream using a pre-trained sub-network of an existing pose estimation network and train the whole network to minimize the re-identification loss. We validate the effectiveness of our approach by demonstrating its superiority over the state-of-the-art methods on the standard benchmark datasets including Market-1501, CUHK03, CUHK01 and DukeMTMC, and standard video dataset MARS.
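The equivalence claimed in the abstract above can be checked numerically. The following is a minimal sketch, assuming the descriptor is a plain sum of outer products of per-location appearance and part features, with no normalization or learned layers; the function names and the toy dimensions are illustrative and not taken from the paper.

```python
import random


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def bilinear_descriptor(appearance, parts):
    """Sum the outer products a_x p_x^T over all locations x, flattened.

    `appearance` and `parts` are lists of per-location feature vectors
    (hypothetical stand-ins for the two feature maps in the abstract).
    """
    dim_a, dim_p = len(appearance[0]), len(parts[0])
    desc = [0.0] * (dim_a * dim_p)
    for a, p in zip(appearance, parts):
        for i in range(dim_a):
            for j in range(dim_p):
                desc[i * dim_p + j] += a[i] * p[j]
    return desc


def pairwise_similarity(app1, parts1, app2, parts2):
    """Aggregate appearance similarity over all location pairs,
    weighted by how similar the part features at those locations are."""
    return sum(
        dot(a1, a2) * dot(p1, p2)
        for a1, p1 in zip(app1, parts1)
        for a2, p2 in zip(app2, parts2)
    )


def random_maps(rng, n_locations, dim):
    """Random toy feature map: n_locations vectors of length dim."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_locations)]


rng = random.Random(0)
app1, parts1 = random_maps(rng, 6, 4), random_maps(rng, 6, 3)
app2, parts2 = random_maps(rng, 6, 4), random_maps(rng, 6, 3)

lhs = dot(bilinear_descriptor(app1, parts1), bilinear_descriptor(app2, parts2))
rhs = pairwise_similarity(app1, parts1, app2, parts2)
assert abs(lhs - rhs) < 1e-9  # the two quantities agree up to rounding
```

Under these assumptions, two locations contribute to the descriptor similarity only to the extent that their part features match, which is why a misaligned box does not have to be corrected explicitly.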
We propose a novel end-to-end trainable framework for the graph decomposition problem. The minimum cost multicut problem is first converted to an unconstrained binary cubic formulation where cycle consistency constraints are incorporated into the objective function. The new optimization problem can be viewed as a Conditional Random Field (CRF) in which the random variables are associated with the binary edge labels of the initial graph and the hard constraints are introduced in the CRF as high-order potentials. The parameters of a standard Neural Network and the fully differentiable CRF are optimized in an end-to-end manner. Furthermore, our method utilizes the cycle constraints as meta-supervisory signals during the learning of the deep feature representations by taking the dependencies between the output random variables into account. We present analyses of the end-to-end learned representations, showing the impact of the joint training, on the task of clustering images of MNIST. We also validate the effectiveness of our approach both for the feature learning and the final clustering on the challenging task of real-world multi-person pose estimation.
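To illustrate how a cycle consistency constraint can be folded into an objective as a cubic term, here is a small sketch for a single triangle. It assumes that edge label 1 means "cut" and that a labeling is inconsistent exactly when one edge of the triangle is cut while the other two are joined. The penalty below is one possible cubic encoding chosen for illustration; the paper's actual formulation may differ.

```python
from itertools import product


def triangle_penalty(x_ab, x_bc, x_ac):
    """Cubic penalty that is positive only when exactly one edge is cut.

    Each term fires when one edge is cut (label 1) and the other two
    are joined (label 0), which would contradict transitivity.
    """
    return (
        x_ab * (1 - x_bc) * (1 - x_ac)
        + x_bc * (1 - x_ab) * (1 - x_ac)
        + x_ac * (1 - x_ab) * (1 - x_bc)
    )


def is_consistent(x_ab, x_bc, x_ac):
    """A triangle labeling is a valid decomposition unless exactly one edge is cut."""
    return (x_ab + x_bc + x_ac) != 1


# Enumerate all 8 binary labelings and confirm that the penalty is zero
# exactly for the consistent ones.
for labels in product((0, 1), repeat=3):
    assert (triangle_penalty(*labels) == 0) == is_consistent(*labels)
```

Adding such a term with a large weight to the edge costs turns the hard constraint into part of an unconstrained objective, which is the general idea the abstract describes.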
Fine-grained temporal action parsing is important in many applications, such as daily activity understanding, human motion analysis, surgical robotics and others requiring subtle and precise operations over a long period. In this paper we propose a novel bilinear pooling operation, which is used in intermediate layers of a temporal convolutional encoder-decoder net. In contrast to other work, our proposed bilinear pooling is learnable and hence can capture more complex local statistics than the conventional counterpart. In addition, we introduce exact lower-dimension representations of our bilinear forms, so that the dimensionality is reduced with neither information loss nor extra computation. We perform intensive experiments to quantitatively analyze our model and show superior performance over other state-of-the-art work on various datasets.
We present an effective dynamic clustering algorithm for the task of temporal human action segmentation, which has a wide range of applications such as robotics, motion analysis, and patient monitoring. Our proposed algorithm is unsupervised, fast, generic enough to process various types of features, and applicable in both the online and offline settings. We perform extensive experiments on processing data streams, and show that our algorithm achieves state-of-the-art results for both online and offline settings.
Insafutdinov, E., Andriluka, M., Pishchulin, L., Tang, S., Levinkov, E., Andres, B., Schiele, B.
Articulated Multi-person Tracking in the Wild
In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages: 1293-1301, IEEE, July 2017, Oral (inproceedings)
Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P., Schiele, B.
In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages: 4929-4937, IEEE, June 2016 (inproceedings)
This paper considers the task of articulated human pose estimation of multiple people in real-world images. We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity of each other.
This joint formulation is in contrast to previous strategies that address the problem by first detecting people and subsequently estimating their body pose. We propose a partitioning and labeling formulation of a set of body-part hypotheses generated with CNN-based part detectors. Our formulation, an instance of an integer linear program, implicitly performs non-maximum suppression on the set of part candidates and groups them to form configurations of body parts respecting geometric and appearance constraints. Experiments on four different datasets demonstrate state-of-the-art results for both single-person and multi-person pose estimation.