Computer vision is often treated as a problem of pattern recognition, 3D reconstruction, or image processing. While these all play supporting roles, our view is that the goal of computer vision is to infer what is not in the picture. The goal is to recognize the unseen. This differs from Marr’s classical view that “vision is knowing what is where by looking.” We see vision as the process of inferring the causes and motivations behind the images that we observe; that is, we want to infer the story behind the picture.
The most interesting stories involve people. Consequently, our research focuses on understanding humans and their actions in the world. We aim to recover human behavior in detail, including human-human interactions and human interactions with the environment.
Humans interact with each other and manipulate the world through their bodies, faces, hands, and speech. If computers are to understand humans and our behavior, then they are going to have to understand much more about us than they currently do. For example, they need to recognize when we are picking up something heavy and might need help. They need to understand when we are distracted. They need to understand that changes in our behavior may signal medical or psychological changes.
To address this, we are developing the datasets, tools, models, and algorithms to recover human movement in unconstrained scenes at a level not previously possible. From single images or videos, we estimate full 3D body pose, including the motion of the face and the pose of the hands. We also recover the 3D structure of the world, its motion, and the objects in it so that human movement can be placed in context.
This is quite different from previous work, in which the human body is treated in isolation, removed from the world around it, and 3D scene analysis is performed on static scenes without humans. We see the interesting space as one in which people are present in, and interacting with, the 3D world. By building 3D models of people and how they move, we are able to place them in context and reason about the physics behind their behavior.
To advance this agenda, Perceiving Systems combines computer vision with machine learning and computer graphics. For example, our computer graphics models of the body enable us to generate training data for machine learning methods, which improve our computer vision algorithms. These improved algorithms give us better data with which to improve our graphics models, leading to a virtuous cycle.
This cycle is producing better and better virtual humans. We see the virtual human as more than a useful artifact. We see it as a testbed for evaluating our models of human behavior. If we can simulate a virtual human in a virtual world behaving in ways that are indistinguishable from a real human, then we assert that we have captured something about what it means to be human. This forces us to go beyond capturing human movement to modeling the causes of human movement.
We want to have an impact beyond the academic discipline of computer vision. Consequently, we develop applications in medicine and psychology in collaboration with medical colleagues. We have also spun off two companies that are using our 3D body model technology. One of these, Body Labs Inc., was acquired by Amazon in 2017. We also make code and data available as open source or for license, and our SMPL body model is now in wide use. Finally, we are responsible for, or contribute to, widely used datasets and evaluation benchmarks that help push the state of the art and provide a platform for industry to understand what works, how well, and why.