I'm currently a PhD student under Michael Black in the Department for Perceiving Systems at the MPI for Intelligent Systems in Tuebingen, Germany.
My main interest is motion. How does the world move? How does this motion manifest itself in a video, and how can we estimate it? And, once we estimate it, what does it tell us about the world and its temporal coherence, and how can a system use it to better understand and act in the world?
To answer these questions, my research focuses on model-driven optical flow estimation. This approach jointly reasons about the motion itself and additional effects related to motion (such as motion blur or the 3D geometry of a scene), in order to better constrain the motion estimation problem.
In German Conference on Pattern Recognition (GCPR), LNCS 11269, pages: 567-582, Springer, Cham, October 2018 (inproceedings)
The difficulty of annotating training data is a major obstacle to using CNNs for low-level tasks in video. Synthetic data often does not generalize to real videos, while unsupervised methods require heuristic losses. Proxy tasks can overcome these issues by first training a network for a task for which annotation is easier or which can be trained unsupervised. The trained network is then fine-tuned for the original task using small amounts of ground truth data. Here, we investigate frame interpolation as a proxy task for optical flow. Using real movies, we train a CNN unsupervised for temporal interpolation. Such a network implicitly estimates motion, but cannot handle untextured regions. By fine-tuning on small amounts of ground truth flow, the network can learn to fill in homogeneous regions and compute full optical flow fields. Using this unsupervised pre-training, our network outperforms similar architectures that were trained supervised using synthetic optical flow.
We address the unsupervised learning of several interconnected problems in low-level vision: single view depth prediction, camera motion estimation, optical flow and segmentation of a video into the static scene and moving regions. Our key insight is that these four fundamental vision problems are coupled and, consequently, learning to solve them together simplifies the problem because the solutions can reinforce each other by exploiting known geometric constraints. In order to model geometric constraints, we introduce Adversarial Collaboration, a framework that facilitates competition and collaboration between neural networks. We go beyond previous work by exploiting geometry more explicitly and segmenting the scene into static and moving regions. Adversarial Collaboration works much like expectation-maximization but with neural networks that act as adversaries, competing to explain pixels that correspond to static or moving regions, and as collaborators through a moderator that assigns pixels to be either static or independently moving. Our novel method integrates all these problems in a common framework and simultaneously reasons about the segmentation of the scene into moving objects and the static background, the camera motion, depth of the static scene structure, and the optical flow of moving objects. Our model is trained without any supervision and achieves state of the art results amongst unsupervised methods.
The estimation of motion in video sequences establishes temporal correspondences between pixels and surfaces and allows reasoning about a scene using multiple frames. Despite being a focus of research for over three decades, computing motion, or optical flow, remains challenging due to a number of difficulties, including the treatment of motion discontinuities and occluded regions, and the integration of information from more than two frames. One reason for these issues is that most optical flow algorithms only reason about the motion of pixels on the image plane, while not taking the image formation pipeline or the 3D structure of the world into account. One approach to address this uses layered models, which represent the occlusion structure of a scene and provide an approximation to the geometry. The goal of this dissertation is to show ways to inject additional knowledge about the scene into layered methods, making them more robust, faster, and more accurate. First, this thesis demonstrates the modeling power of layers using the example of motion blur in videos, which is caused by fast motion relative to the exposure time of the camera. Layers segment the scene into regions that move coherently while preserving their occlusion relationships. The motion of each layer therefore directly determines its motion blur. At the same time, the layered model captures complex blur overlap effects at motion discontinuities. Using layers, we can thus formulate a generative model for blurred video sequences, and use this model to simultaneously deblur a video and compute accurate optical flow for highly dynamic scenes containing motion blur. Next, we consider the representation of the motion within layers. Since, in a layered model, important motion discontinuities are captured by the segmentation into layers, the flow within each layer varies smoothly and can be approximated using a low dimensional subspace. 
We show how this subspace can be learned from training data using principal component analysis (PCA), and that flow estimation using this subspace is computationally efficient. The combination of the layered model and the low-dimensional subspace gives the best of both worlds: sharp motion discontinuities from the layers and computational efficiency from the subspace. Lastly, we show how layered methods can be dramatically improved using simple semantics. Instead of treating all layers equally, a semantic segmentation divides the scene into its static parts and moving objects. Static parts of the scene constitute a large majority of what is shown in typical video sequences; yet, in such regions optical flow is fully constrained by the depth structure of the scene and the camera motion. After segmenting out moving objects, we consider only static regions, and explicitly reason about the structure of the scene and the camera motion, yielding much better optical flow estimates. Furthermore, computing the structure of the scene allows us to better combine information from multiple frames, resulting in high accuracy even in occluded regions. For moving regions, we compute the flow using a generic optical flow method, and combine it with the flow computed for the static regions to obtain a full optical flow field. By combining layered models of the scene with reasoning about the dynamic behavior of the real, three-dimensional world, the methods presented herein push the envelope of optical flow computation in terms of robustness, speed, and accuracy, giving state-of-the-art results on benchmarks and pointing to important future research directions for the estimation of motion in natural scenes.
In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pages: 1406-1416, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)
Existing optical flow datasets are limited in size and variability due to the difficulty of capturing dense ground truth. In this paper, we tackle this problem by tracking pixels through densely sampled space-time volumes recorded with a high-speed video camera. Our model exploits the linearity of small motions and reasons about occlusions from multiple frames. Using our technique, we are able to establish accurate reference flow fields outside the laboratory in natural environments. Besides, we show how our predictions can be used to augment the input images with realistic motion blur. We demonstrate the quality of the produced flow fields on synthetic and real-world datasets. Finally, we collect a novel challenging optical flow dataset by applying our technique on data from a high-speed camera and analyze the performance of the state-of-the-art in optical flow under various levels of motion blur.
In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pages: 6911-6920, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)
The optical flow of natural scenes is a combination of the motion of the observer and the independent motion of objects. Existing algorithms typically focus on either recovering motion and structure under the assumption of a purely static world or optical flow for general unconstrained scenes. We combine these approaches in an optical flow algorithm that estimates an explicit segmentation of moving objects from appearance and physical constraints. In static regions we take advantage of strong constraints to jointly estimate the camera motion and the 3D structure of the scene over multiple frames. This allows us to also regularize the structure instead of the motion. Our formulation uses a Plane+Parallax framework, which works even under small baselines, and reduces the motion estimation to a one-dimensional search problem, resulting in more accurate estimation. In moving regions the flow is treated as unconstrained, and computed with an existing optical flow method. The resulting Mostly-Rigid Flow (MR-Flow) method achieves state-of-the-art results on both the MPI-Sintel and KITTI-2015 benchmarks.
In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2015), pages: 120-130, June 2015 (inproceedings)
We address the elusive goal of estimating optical flow both accurately and efficiently by adopting a sparse-to-dense approach. Given a set of sparse matches, we regress to dense optical flow using a learned set of full-frame basis flow fields. We learn the principal components of natural flow fields using flow computed from four Hollywood movies. Optical flow fields are then compactly approximated as a weighted sum of the basis flow fields. Our new PCA-Flow algorithm robustly estimates these weights from sparse feature matches. The method runs in under 300ms/frame on the MPI-Sintel dataset using a single CPU and is more accurate and significantly faster than popular methods such as LDOF and Classic+NL. The results, however, are too smooth for some applications. Consequently, we develop a novel sparse layered flow method in which each layer is represented by PCA-Flow. Unlike existing layered methods, estimation is fast because it uses only sparse matches. We combine information from different layers into a dense flow field using an image-aware MRF. The resulting PCA-Layers method runs in 3.6s/frame, is significantly more accurate than PCA-Flow and achieves state-of-the-art performance in occluded regions on MPI-Sintel.
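The sparse-to-dense idea in this abstract can be illustrated with a small sketch. It is not the published PCA-Flow implementation: the function names `learn_flow_basis` and `dense_flow_from_sparse` are hypothetical, the basis is learned with a plain SVD, and a ridge-regularised least-squares fit stands in for the robust weight estimation the abstract mentions.

```python
import numpy as np

def learn_flow_basis(training_flows, num_components):
    """Learn a PCA basis from flow fields of shape (N, H, W, 2)."""
    n = training_flows.shape[0]
    data = training_flows.reshape(n, -1)          # one flattened flow field per row
    mean = data.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:num_components]

def dense_flow_from_sparse(mean, basis, shape, ys, xs, sparse_flow, ridge=1e-6):
    """Fit basis weights to M sparse matches and reconstruct a dense flow field.

    ys, xs: integer pixel coordinates of the matches, each of shape (M,).
    sparse_flow: flow vectors (u, v) at those pixels, shape (M, 2).
    """
    h, w = shape
    # Positions of the (u, v) entries of the matched pixels in the flattened field.
    idx_u = (ys * w + xs) * 2
    idx = np.stack([idx_u, idx_u + 1], axis=1).ravel()
    a = basis[:, idx].T                           # (2M, K) design matrix
    b = sparse_flow.reshape(-1) - mean[idx]       # (2M,) offset from the mean flow
    # Assumption: plain ridge-regularised L2 fit instead of a robust estimator.
    k = basis.shape[0]
    weights = np.linalg.solve(a.T @ a + ridge * np.eye(k), a.T @ b)
    dense = (mean + weights @ basis).reshape(h, w, 2)
    return dense, weights
```

Because the dense field is a weighted sum of full-frame basis fields, only as many unknowns as there are basis fields have to be estimated, which is why a modest number of sparse matches can suffice. With outlier-contaminated matches, the L2 fit used here would need to be replaced by a robust one.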
In Computer Graphics Forum (Proceedings of EGSR), 34(4):99-107, 2015 (inproceedings)
Converting unconstrained video sequences into videos that loop seamlessly is an extremely challenging problem. In this work, we take the first steps towards automating this process by focusing on an important subclass of videos containing a single dominant foreground object. Our technique makes two novel contributions over previous work: first, we propose a correspondence-based similarity metric to automatically identify a good transition point in the video where the appearance and dynamics of the foreground are most consistent. Second, we develop a technique that aligns both the foreground and background about this transition point using a combination of global camera path planning and patch-based video morphing. We demonstrate that this allows us to create natural, compelling, loopy videos from a wide range of videos collected from the internet.
In Computer Vision – ECCV 2014, 8694, pages: 236-252, Lecture Notes in Computer Science, (Editors: D. Fleet and T. Pajdla and B. Schiele and T. Tuytelaars ), Springer International Publishing, September 2014 (inproceedings)
Videos contain complex spatially-varying motion blur due to the combination of object motion, camera motion, and depth variation with finite shutter speeds. Existing methods to estimate optical flow, deblur the images, and segment the scene fail in such cases. In particular, boundaries between differently moving objects cause problems, because here the blurred images are a combination of the blurred appearances of multiple surfaces. We address this with a novel layered model of scenes in motion. From a motion-blurred video sequence, we jointly estimate the layer segmentation and each layer's appearance and motion. Since the blur is a function of the layer motion and segmentation, it is completely determined by our generative model. Given a video, we formulate the optimization problem as minimizing the pixel error between the blurred frames and images synthesized from the model, and solve it using gradient descent. We demonstrate our approach on synthetic and real sequences.
Wulff, J., Black, M. J. Modeling Blurred Video with Layers
In Computer Vision – ECCV 2014, 8694, pages: 236-252, Lecture Notes in Computer Science, (Editors: D. Fleet and T. Pajdla and B. Schiele and T. Tuytelaars ), Springer International Publishing, September 2014 (inproceedings)
In IEEE Conf. on Computer Vision and Pattern Recognition, (CVPR 2013), pages: 2451-2458, Portland, OR, June 2013 (inproceedings)
Layered models allow scene segmentation and motion estimation to be formulated together and to inform one another. Traditional layered motion methods, however, employ fairly weak models of scene structure, relying on locally connected Ising/Potts models which have limited ability to capture long-range correlations in natural scenes. To address this, we formulate a fully-connected layered model that enables global reasoning about the complicated segmentations of real objects. Optimization with fully-connected graphical models is challenging, and our inference algorithm leverages recent work on efficient mean field updates for fully-connected conditional random fields. These methods can be implemented efficiently using high-dimensional Gaussian filtering. We combine these ideas with a layered flow model, and find that the long-range connections greatly improve segmentation into figure-ground layers when compared with locally connected MRF models. Experiments on several benchmark datasets show that the method can recover fine structures and large occlusion regions, with good flow accuracy and much lower computational cost than previous locally-connected layered models.
In European Conf. on Computer Vision (ECCV), pages: 611-625, Part IV, LNCS 7577, (Editors: A. Fitzgibbon et al. (Eds.)), Springer-Verlag, October 2012 (inproceedings)
Ground truth optical flow is difficult to measure in real scenes with natural motion. As a result, optical flow data sets are restricted in terms of size, complexity, and diversity, making optical flow algorithms difficult to train and test on realistic data. We introduce a new optical flow data set derived from the open source 3D animated short film Sintel. This data set has important features not present in the popular Middlebury flow evaluation: long sequences, large motions, specular reflections, motion blur, defocus blur, and atmospheric effects. Because the graphics data that generated the movie is open source, we are able to render scenes under conditions of varying complexity to evaluate where existing flow algorithms fail. We evaluate several recent optical flow algorithms and find that current highly-ranked methods on the Middlebury evaluation have difficulty with this more complex data set suggesting further research on optical flow estimation is needed. To validate the use of synthetic data, we compare the image- and flow-statistics of Sintel to those of real films and videos and show that they are similar. The data set, metrics, and evaluation website are publicly available.
Journal of Vision, 11(11):507-507, ARVO, September 2011 (article)
Estimating another person's gaze is a crucial skill in human social interactions. The social component is most apparent in dyadic gaze situations, in which the looker seems to look into the eyes of the observer, thereby signaling interest or a turn to speak. In a triadic situation, on the other hand, the looker's gaze is averted from the observer and directed towards another, specific target. This is mostly interpreted as a cue for joint attention, creating awareness of a predator or another point of interest. In keeping with the task's social significance, humans are very proficient at gaze estimation. Our accuracy ranges from less than one degree for dyadic settings to approximately 2.5 degrees for triadic ones. Our goal in this work is to draw inspiration from human gaze estimation mechanisms in order to create an artificial system that can approach the former's accuracy levels. Since human performance is severely impaired by both image-based degradations (Ando, 2004) and a change of facial configurations (Jenkins & Langton, 2003), the underlying principles are believed to be based both on simple image cues such as contrast/brightness distribution and on more complex geometric processing to reconstruct the actual shape of the head. By incorporating both kinds of cues in our system's design, we are able to surpass the accuracy of existing eye-tracking systems, which rely exclusively on either image-based or geometry-based cues (Yamazoe et al., 2008). A side-benefit of this combined approach is that it allows for gaze estimation despite moderate view-point changes. This is important for settings where subjects, say young children or certain kinds of patients, might not be fully cooperative to allow a careful calibration. Our model and implementation of gaze estimation opens up new experimental questions about human mechanisms while also providing a useful tool for general calibration-free, non-intrusive remote eye-tracking.
Journal of Vision, 11(11):800-800, ARVO, September 2011 (article)
Even 8–10 week old infants, when presented with two dynamic faces and a speech stream, look significantly longer at the ‘correct’ talking person (Patterson & Werker, 2003). This is true even though their reduced visual acuity prevents them from utilizing high spatial frequencies. Computational analyses in the field of audio/video synchrony and automatic speaker detection (e.g. Hershey & Movellan, 2000), in contrast, usually depend on high-resolution images. Therefore, the correlation mechanisms found in these computational studies are not directly applicable to the processes through which we learn to integrate the modalities of speech and vision. In this work, we investigated the correlation between speech signals and degraded video signals. We found a high correlation persisting even with high image degradation, resembling the low visual acuity of young infants. Additionally (in a fashion similar to Graf et al., 2002) we explored which parts of the face correlate with the audio in the degraded video sequences. Perfect synchrony and small offsets in the audio were used while finding the correlation, thereby detecting visual events preceding and following audio events. In order to achieve a sufficiently high temporal resolution, high-speed video sequences (500 frames per second) of talking people were used. This is a temporal resolution unachieved in previous studies and has allowed us to capture very subtle and short visual events. We believe that the results of this study might be interesting not only to vision researchers, but, by revealing subtle effects on a very fine timescale, also to people working in computer graphics and the generation and animation of artificial faces.
Wulff, J., Lotz, T., Stehle, T., Aach, T., Chase, J. G.
In Proc. SPIE, (Editors: B. M. Dawant, D. R. Haynor), SPIE, 2011 (inproceedings)
The DIET (Digital Image Elasto Tomography) system is a novel approach to screen for breast cancer using only optical imaging information of the surface of a vibrating breast. 3D tracking of skin surface motion without the requirement of external markers is desirable. A novel approach to establish point correspondences using pure skin images is presented here. Instead of the intensity, motion is used as the primary feature, which can be extracted using optical flow algorithms. Taking sequences of multiple frames into account, this motion information alone is accurate and unambiguous enough to allow for a 3D reconstruction of the breast surface. Two approaches, direct and probabilistic, for this correspondence estimation are presented here, suitable for different levels of calibration information accuracy. Reconstructions show that the results obtained using these methods are comparable in accuracy to marker-based methods while considerably increasing resolution. The presented method has high potential in optical tissue deformation and motion sensing.
Stehle, T., Wulff, J., Behrens, A., Gross, S., Aach, T.
Fluorescence endoscopy is an emerging technique for the detection of bladder cancer. A marker substance is brought into the patient's bladder which accumulates at cancer tissue. If a suitable narrow band light source is used for illumination, a red fluorescence of the marker substance is observable. Because of the low fluorescence photon count and because of the narrow band light source, only a small amount of light is detected by the camera's CCD sensor. This, in turn, leads to strong noise in the recorded video sequence. To overcome this problem, we apply a temporal recursive filter to the video sequence. The derivation of a filter function is presented, which leads to an optimal filter in the minimum mean square error sense. The algorithm is implemented as a plug-in for the real-time capable clinical demonstrator platform RealTimeFrame and is capable of processing color videos with a resolution of 768×576 pixels at 50 frames per second.
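As a rough illustration of temporal recursive filtering, the sketch below applies a first-order recursive filter to a sequence of frames. It is an assumption-laden stand-in, not the filter derived in the paper: the blending weight `alpha` is a fixed, hypothetical constant rather than the MMSE-optimal filter function, and the function name `recursive_temporal_filter` is illustrative only.

```python
import numpy as np

def recursive_temporal_filter(frames, alpha):
    """First-order recursive filter: y[t] = alpha * x[t] + (1 - alpha) * y[t-1].

    frames: iterable of equally sized arrays (grayscale or color).
    alpha:  fixed blending weight in (0, 1]; an assumption of this sketch.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    filtered = []
    state = None
    for frame in frames:
        frame = np.asarray(frame, dtype=np.float64)
        # The first frame initialises the state; later frames are blended in.
        state = frame if state is None else alpha * frame + (1.0 - alpha) * state
        filtered.append(state)
    return filtered
```

For a static scene with independent, identically distributed noise, this recursion reduces the steady-state noise variance by a factor of alpha / (2 - alpha). Smaller values of `alpha` therefore denoise more strongly but respond more slowly to scene changes. Each output frame needs only the current input and the previous output, which suits real-time processing.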
Stehle, T., Auer, R., Gross, S., Behrens, A., Wulff, J., Aach, T., Winograd, R., Trautwein, C., Tischendorf, J.
In Medical Imaging 2009: Computer-Aided Diagnosis, 7260, (Editors: N. Karssemeijer and M. L. Giger), SPIE, February 2009 (inproceedings)
The evolution of colon cancer starts with colon polyps. There are two different types of colon polyps, namely hyperplasias and adenomas. Hyperplasias are benign polyps which are known not to evolve into cancer and, therefore, do not need to be removed. By contrast, adenomas have a strong tendency to become malignant. Therefore, they have to be removed immediately via polypectomy. For this reason, a method to reliably differentiate adenomas from hyperplasias during a preventive medical endoscopy of the colon (colonoscopy) is highly desirable. A recent study has shown that it is possible to distinguish both types of polyps visually by means of their vascularization. Adenomas exhibit a large number of blood vessel capillaries on their surface whereas hyperplasias show only a few of them. In this paper, we show the feasibility of computer-based classification of colon polyps using vascularization features. The proposed classification algorithm consists of several steps: For the critical part of vessel segmentation, we implemented and compared two segmentation algorithms. After a skeletonization of the detected blood vessel candidates, we used the results as seed points for the Fast Marching algorithm which is used to segment the whole vessel lumen. Subsequently, features are computed from this segmentation which are then used to classify the polyps. In leave-one-out tests on our polyp database (56 polyps), we achieve a correct classification rate of approximately 90%.
Sinha, P., Balas, B., Ostrovsky, Y., Wulff, J. Visual Object Discovery
In Object Categorization: Computer and Human Vision Perspectives, pages: 301-323, (Editors: S. J. Dickinson, A. Leonardis, B. Schiele, M.J. Tarr), Cambridge University Press, 2009 (inbook)
Gross, S., Kennel, M., Stehle, T., Wulff, J., Tischendorf, J., Trautwein, C., Aach, T.
Endoscopic screening of the colon (colonoscopy) is performed to prevent cancer and to support therapy. During intervention colon polyps are located, inspected and, if need be, removed by the investigator. We propose a segmentation algorithm as a part of an automatic polyp classification system for colonoscopic Narrow-Band images. Our approach includes multi-scale filtering for noise reduction, suppression of small blood vessels, and enhancement of major edges. Results of the subsequent edge detection are compared to a set of elliptic templates and evaluated. We validated our algorithm on our polyp database with images acquired during routine colonoscopic examinations. The presented results show the reliable segmentation performance of our method and its robustness to image variations.
Stehle, T., Hennes, M., Gross, S., Behrens, A., Wulff, J., Aach, T.
In Bildverarbeitung für die Medizin 2009, pages: 142-146, Springer Berlin Heidelberg, 2009 (inproceedings)
Endoscopic images are strongly affected by lens distortion caused by the use of wide angle lenses. In the case of endoscopy systems with exchangeable optics, e.g. in bladder endoscopy or sinus endoscopy, the camera sensor and the optics do not form a rigid system but can be shifted and rotated with respect to each other during an examination. This flexibility has a major impact on the location of the distortion centre, as it is moved along with the optics. In this paper, we describe an algorithm for the dynamic correction of lens distortion in cystoscopy which is based on a one-time calibration. For the compensation, we combine a conventional static method for distortion correction with an algorithm to detect the position and the orientation of the elliptic field of view. This enables us to estimate the position of the distortion centre according to the relative movement of camera and optics. With this, a distortion correction for arbitrary rotation angles and shifts becomes possible without performing static calibrations for every possible combination of shifts and angles beforehand.
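The idea of moving the distortion centre along with the detected field of view can be sketched as follows. Everything in this example is an assumption made for illustration: the abstract does not give the equations, so the sketch treats the distortion centre as rigidly attached to the field of view, and it uses a one-parameter radial model with a hypothetical coefficient `k1` as the static correction.

```python
import math

def moved_distortion_centre(calib_centre, calib_fov_centre, fov_centre, angle_rad):
    """Carry the calibrated distortion centre along with the detected field of view.

    calib_centre:     distortion centre (x, y) found in the one-time calibration.
    calib_fov_centre: field-of-view centre (x, y) at calibration time.
    fov_centre:       field-of-view centre (x, y) detected in the current frame.
    angle_rad:        detected rotation of the field of view since calibration.
    """
    # Offset of the distortion centre relative to the field of view at calibration.
    dx = calib_centre[0] - calib_fov_centre[0]
    dy = calib_centre[1] - calib_fov_centre[1]
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    # Rotate that offset and re-attach it to the currently detected field of view.
    return (fov_centre[0] + c * dx - s * dy, fov_centre[1] + s * dx + c * dy)

def undistort_point(point, centre, k1):
    """One-parameter radial model: scale the offset from the centre by (1 + k1 * r^2)."""
    ox, oy = point[0] - centre[0], point[1] - centre[1]
    scale = 1.0 + k1 * (ox * ox + oy * oy)
    return (centre[0] + scale * ox, centre[1] + scale * oy)
```

In this sketch the radial coefficient comes from the single static calibration, and only the centre is recomputed per frame from the detected shift and rotation.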
Journal of Vision, 7(9):315-315, ARVO, June 2007 (article)
The Gestalt laws (Wertheimer 1923) are widely regarded as the rules that help us parse the world into objects. However, it is unclear as to how these laws are acquired by an infant's visual system. Classically, these “laws” have been presumed to be innate (Kellman and Spelke 1983). But, more recent work in infant development, showing the protracted time-course over which these grouping principles emerge (e.g., Johnson and Aslin 1995; Craton 1996), suggests that visual experience might play a role in their genesis. Specifically, our studies of patients with late-onset vision (Project Prakash; VSS 2006) and evidence from infant development both point to an early role of common motion cues for object grouping. Here we explore the possibility that the privileged status of motion in the developmental timeline is not happenstance, but rather serves to bootstrap the learning of static Gestalt cues. Our approach involves computational analyses of real-world motion sequences to investigate whether primitive optic flow information is correlated with static figural cues that could eventually come to serve as proxies for grouping in the form of Gestalt principles.
We calculated local optic flow maps and then examined how similarity of motion across image patches co-varied with similarity of certain figural properties in static frames. Results indicate that patches with similar motion are much more likely to have similar luminance, color, and orientation as compared to patches with dissimilar motion vectors. This regularity suggests that, in principle, common motion extracted from dynamic visual experience can provide enough information to bootstrap region grouping based on luminance and color and contour continuation mechanisms in static scenes. These observations, coupled with the cited experimental studies, lend credence to the hypothesis that static Gestalt laws might be learned through a bootstrapping process based on early dynamic experience.