Since August 2017 I have been a visiting researcher at Facebook in Menlo Park, CA. My research interests are in motion estimation and video understanding. In particular, I'm interested in exploring and modeling how the semantics and the motion of a scene are related.
Before joining Facebook I was a postdoc at Perceiving Systems from February 2015 to July 2017. Before that, I got my PhD from UMass Amherst, where my advisor was Erik Learned-Miller. My thesis was on motion estimation under long displacements and large changes in the scene. During that time I was lucky to collaborate with great people at Adobe Research, where we created an application to video editing, and also here at MPI. I also interned at Apple with the Voice Over team, using computer vision to make their technology accessible for people who are visually impaired. Before that, I got my master's from Brown University thanks to a fellowship from the Caja Madrid Foundation. In 2007 I got my bachelor's in computer engineering from the University of Granada in Spain, where I'm originally from.
In addition to basic research, I have a special interest in applications of technology for the greater good.
In the summer of 2015 I organized and taught a workshop in computer vision and robotics for advanced teenagers on the island of Camiguin (Philippines). Our goal was to empower children with basic knowledge of computer science, in the hope that they can apply it in their own environment. In this spirit, we did projects on species recognition, garbage collection and machine translation from English to Bisaya. See more in this video.
During the summer of 2009 I was lucky to intern with the Voice Over team at Apple, which makes their technology accessible for people with disabilities. Back at UMass and inspired by this experience, we made this app for recognition of American banknotes.
In German Conference on Pattern Recognition (GCPR), October 2018 (inproceedings)
Most of the top performing action recognition methods use optical flow as a "black box" input. Here we take a deeper look at the combination of flow and action recognition, and investigate why optical flow is helpful, what makes a flow method good for action recognition, and how we can make it better. In particular, we investigate the impact of different flow algorithms and input transformations to better understand how these affect a state-of-the-art action recognition method. Furthermore, we fine tune two neural-network flow methods end-to-end on the most widely used action recognition dataset (UCF101). Based on these experiments, we make the following five observations: 1) optical flow is useful for action recognition because it is invariant to appearance, 2) optical flow methods are optimized to minimize end-point-error (EPE), but the EPE of current methods is not well correlated with action recognition performance, 3) for the flow methods tested, accuracy at boundaries and at small displacements is most correlated with action recognition performance, 4) training optical flow to minimize classification error instead of minimizing EPE improves recognition performance, and 5) optical flow learned for the task of action recognition differs from traditional optical flow especially inside the human body and at the boundary of the body. These observations may encourage optical flow researchers to look beyond EPE as a goal and guide action recognition researchers to seek better motion cues, leading to a tighter integration of the optical flow and action recognition communities.
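As a rough illustration of the end-point-error (EPE) metric mentioned in the abstract above, here is a minimal Python sketch. EPE is the Euclidean distance between an estimated flow vector and the ground-truth flow vector, averaged over pixels. The function name `endpoint_error` and the toy flow vectors are assumptions made for this example and are not taken from the paper.

```python
import math


def endpoint_error(flow_est, flow_gt):
    """Average end-point error between two flow fields.

    Each flow field is a list of (u, v) displacement vectors,
    one per pixel, listed in the same pixel order.
    """
    if len(flow_est) != len(flow_gt) or not flow_est:
        raise ValueError("flow fields must be non-empty and the same size")
    total = 0.0
    for (u_e, v_e), (u_g, v_g) in zip(flow_est, flow_gt):
        # Euclidean distance between estimated and true displacement.
        total += math.hypot(u_e - u_g, v_e - v_g)
    return total / len(flow_est)


# Toy data (hypothetical): three pixels with errors of 0, 2 and 5 pixels.
estimated = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
ground_truth = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
epe = endpoint_error(estimated, ground_truth)
```

Because the metric averages over all pixels equally, a single number like this says little about where the errors sit, for example at motion boundaries or at small displacements.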
In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pages: 6911-6920, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)
The optical flow of natural scenes is a combination of the motion of the observer and the independent motion of objects. Existing algorithms typically focus on either recovering motion and structure under the assumption of a purely static world or optical flow for general unconstrained scenes. We combine these approaches in an optical flow algorithm that estimates an explicit segmentation of moving objects from appearance and physical constraints. In static regions we take advantage of strong constraints to jointly estimate the camera motion and the 3D structure of the scene over multiple frames. This allows us to also regularize the structure instead of the motion. Our formulation uses a Plane+Parallax framework, which works even under small baselines, and reduces the motion estimation to a one-dimensional search problem, resulting in more accurate estimation. In moving regions the flow is treated as unconstrained, and computed with an existing optical flow method. The resulting Mostly-Rigid Flow (MR-Flow) method achieves state-of-the-art results on both the MPI-Sintel and KITTI-2015 benchmarks.
In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 3889-3898, June 2016 (inproceedings)
Existing optical flow methods make generic, spatially homogeneous assumptions about the spatial structure of the flow. In reality, optical flow varies across an image depending on object class. Simply put, different objects move differently. Here we exploit recent advances in static semantic scene segmentation to segment the image into objects of different types. We define different models of image motion in these regions depending on the type of object. For example, we model the motion on roads with homographies, vegetation with spatially smooth flow, and independently moving objects like cars and planes with affine motion plus deviations. We then pose the flow estimation problem using a novel formulation of localized layers, which addresses limitations of traditional layered models for dealing with complex scene motion. Our semantic flow method achieves the lowest error of any published monocular method in the KITTI-2015 flow benchmark and produces qualitatively better flow and segmentation than recent top methods on a wide range of natural videos.
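To make the idea of a parametric motion model concrete, the following is a small Python sketch of an affine flow field, where the displacement at pixel (x, y) is u = a0 + a1·x + a2·y and v = a3 + a4·x + a5·y. It covers only the affine part. The per-pixel deviations, the homography and smooth-flow models, and the localized-layer formulation are not shown. The function name `affine_flow` and the parameter ordering are assumptions made for this example.

```python
def affine_flow(params, width, height):
    """Return a height x width grid of (u, v) flow vectors for an affine model.

    params = (a0, a1, a2, a3, a4, a5), with
        u(x, y) = a0 + a1 * x + a2 * y
        v(x, y) = a3 + a4 * x + a5 * y
    """
    a0, a1, a2, a3, a4, a5 = params
    return [
        [(a0 + a1 * x + a2 * y, a3 + a4 * x + a5 * y) for x in range(width)]
        for y in range(height)
    ]


# Pure translation by (2, -1): every pixel moves the same way.
translation = affine_flow((2.0, 0.0, 0.0, -1.0, 0.0, 0.0), width=3, height=2)

# Expansion about the origin: displacement grows with distance from (0, 0).
zoom = affine_flow((0.0, 0.1, 0.0, 0.0, 0.0, 0.1), width=3, height=2)
```

Six parameters describe the motion of a whole region, which is what makes such a model a strong constraint compared with estimating an independent vector at every pixel.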
Long Range Motion Estimation and Applications, University of Massachusetts Amherst, February 2015 (phdthesis)
Finding correspondences between images underlies many computer vision problems, such as optical flow, tracking, stereo vision and alignment. Finding these correspondences involves formulating a matching function and optimizing it. This optimization process is often gradient descent, which avoids exhaustive search, but relies on the assumption of being in the basin of attraction of the right local minimum. This is often the case when the displacement is small, and current methods obtain very accurate results for small motions. However, when the motion is large and the matching function is bumpy this assumption is less likely to be true. One traditional way of avoiding this abruptness is to smooth the matching function spatially by blurring the images. As the displacement becomes larger, the amount of blur required to smooth the matching function also becomes larger. This averaging of pixels leads to a loss of detail in the image. Therefore, there is a trade-off between the size of the objects that can be tracked and the displacement that can be captured.
In this thesis we address the basic problem of increasing the size of the basin of attraction in a matching function. We use an image descriptor called distribution fields (DFs). By blurring the images in DF space instead of in pixel space, we increase the size of the basin of attraction with respect to traditional methods. We show competitive results using DFs both in object tracking and optical flow. Finally we demonstrate an application of capturing large motions for temporal video stitching.
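The difference between blurring in pixel space and blurring in DF space can be sketched on a one-dimensional signal. The Python example below is a simplified illustration under stated assumptions, not the thesis implementation: it uses two intensity bins, a box filter in place of a Gaussian, and no smoothing along the intensity dimension. The names `explode`, `box_blur` and `distribution_field` are chosen for this example.

```python
def explode(signal, num_bins, max_value=256):
    """Split a signal into one channel per intensity bin.

    Channel k is 1.0 where the pixel value falls in bin k and 0.0 elsewhere.
    """
    bin_width = max_value / num_bins
    channels = [[0.0] * len(signal) for _ in range(num_bins)]
    for i, value in enumerate(signal):
        k = min(int(value / bin_width), num_bins - 1)
        channels[k][i] = 1.0
    return channels


def box_blur(row, radius):
    """Average each sample with its neighbours (window clipped at the borders)."""
    out = []
    for i in range(len(row)):
        lo, hi = max(0, i - radius), min(len(row), i + radius + 1)
        out.append(sum(row[lo:hi]) / (hi - lo))
    return out


def distribution_field(signal, num_bins, radius):
    """Explode the signal into channels, then blur each channel spatially."""
    return [box_blur(channel, radius) for channel in explode(signal, num_bins)]


# A dark region next to a bright region (hypothetical toy data).
signal = [10, 10, 10, 250, 250, 250]
df = distribution_field(signal, num_bins=2, radius=1)
pixel_blur = box_blur(signal, radius=1)
```

At the edge (index 2), blurring in pixel space produces the value 90, an intensity that appears nowhere in the original signal. The DF at the same position holds 2/3 in the dark channel and 1/3 in the bright channel, so it still records which intensities are present nearby while being spatially smooth.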
In Computer Graphics Forum (Proceedings of EGSR), 34(4):99-107, 2015 (inproceedings)
Converting unconstrained video sequences into videos that loop seamlessly is an extremely challenging problem. In this work, we take the first steps towards automating this process by focusing on an important subclass of videos containing a single dominant foreground object. Our technique makes two novel contributions over previous work: first, we propose a correspondence-based similarity metric to automatically identify a good transition point in the video where the appearance and dynamics of the foreground are most consistent. Second, we develop a technique that aligns both the foreground and background about this transition point using a combination of global camera path planning and patch-based video morphing. We demonstrate that this allows us to create natural, compelling, loopy videos from a wide range of videos collected from the internet.
In Computer Vision – ECCV 2014, 8689, pages: 423-438, Lecture Notes in Computer Science, (Editors: D. Fleet and T. Pajdla and B. Schiele and T. Tuytelaars), Springer International Publishing, September 2014 (inproceedings)
Large motions remain a challenge for current optical flow algorithms. Traditionally, large motions are addressed using multi-resolution representations like Gaussian pyramids. To deal with large displacements, many pyramid levels are needed and, if an object is small, it may be invisible at the highest levels. To address this we decompose images using a channel representation (CR) and replace the standard brightness constancy assumption with a descriptor constancy assumption. CRs can be seen as an over-segmentation of the scene into layers based on some image feature. If the appearance of a foreground object differs from the background then its descriptor will be different and they will be represented in different layers. We create a pyramid by smoothing these layers, without mixing foreground and background or losing small objects. Our method estimates more accurate flow than the baseline on the MPI-Sintel benchmark, especially for fast motions and near motion boundaries.
In British Machine Vision Conference (BMVC), BMVA Press, September 2013 (inproceedings)
While region-based image alignment algorithms that use gradient descent can achieve sub-pixel accuracy when they converge, their convergence depends on the smoothness of the image intensity values. Image smoothness is often enforced through the use of multiscale approaches in which images are smoothed and downsampled. Yet, these approaches typically use fixed smoothing parameters which may be appropriate for some images but not for others. Even for a particular image, the optimal smoothing parameters may depend on the magnitude of the transformation. When the transformation is large, the image should be smoothed more than when the transformation is small. Further, with gradient-based approaches, the optimal smoothing parameters may change with each iteration as the algorithm proceeds towards convergence.
We address convergence issues related to the choice of smoothing parameters by deriving a Gauss-Newton gradient descent algorithm based on distribution fields (DFs) and proposing a method to dynamically select smoothing parameters at each iteration. DF and DF-like representations have previously been used in the context of tracking. In this work we incorporate DFs into a full affine model for region-based alignment and simultaneously search over parameterized sets of geometric and photometric transforms. We use a probabilistic interpretation of DFs to select smoothing parameters at each step in the optimization and show that this results in improved convergence rates.
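For readers unfamiliar with Gauss-Newton alignment, here is a heavily simplified Python sketch. It recovers a one-dimensional translation between a template and a shifted copy by minimizing the sum of squared intensity differences. It is not the method of the paper: it works on raw intensities rather than DFs, estimates a single shift rather than a full affine and photometric transform, and keeps the smoothness fixed instead of selecting it at each iteration. The smooth test signal `bump` and the function `estimate_shift` are assumptions made for this example.

```python
import math


def bump(x, center=20.0, sigma=5.0):
    """A smooth synthetic signal (Gaussian bump) that can be sampled anywhere."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def bump_derivative(x, center=20.0, sigma=5.0):
    """Analytic derivative of `bump` with respect to x."""
    return -(x - center) / sigma ** 2 * bump(x, center, sigma)


def estimate_shift(true_shift, iterations=10):
    """Find p such that image(x + p) matches template(x).

    The image is the template moved by `true_shift`, so p should converge
    to `true_shift`.
    """
    xs = range(41)
    template = [bump(x) for x in xs]
    p = 0.0
    for _ in range(iterations):
        # image(x) = bump(x - true_shift); warp it by the current estimate p.
        warped = [bump(x + p - true_shift) for x in xs]
        grads = [bump_derivative(x + p - true_shift) for x in xs]
        residuals = [t - w for t, w in zip(template, warped)]
        # Gauss-Newton step for one parameter: (J^T r) / (J^T J).
        numerator = sum(g * r for g, r in zip(grads, residuals))
        denominator = sum(g * g for g in grads)
        p += numerator / denominator
    return p


estimated = estimate_shift(true_shift=3.0)
```

The update relies on a local linearization of the image, so it converges here only because the signal is smooth relative to the shift. This is the dependence on smoothness that the abstract describes.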
In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, June 2012 (inproceedings)
Visual tracking of general objects often relies on the assumption that gradient descent of the alignment function will reach the global optimum. A common technique to smooth the objective function is to blur the image. However, blurring the image destroys image information, which can cause the target to be lost. To address this problem we introduce a method for building an image descriptor using distribution fields (DFs), a representation that allows smoothing the objective function without destroying information about pixel values. We present experimental evidence that the basin of attraction around the global optimum is wider for DFs than for other descriptors. DFs also allow the representation of uncertainty about the tracked object. This helps in disregarding outliers during tracking (like occlusions or small misalignments) without modeling them explicitly. Finally, this provides a convenient way to aggregate the observations of the object through time and maintain an updated model. We present a simple tracking algorithm that uses DFs and obtains state-of-the-art results on standard benchmarks.