Because we are typically given images, it is easy to think that computer vision is about pixels. In Perceiving Systems, we think vision is about understanding the 3D world and its motion. Images result from light interacting with materials and surfaces, as captured by a lens. Our goal is to formulate models of the world and then relate these models to how the world appears in images, enabling detection, recognition, and rendering. These rich representations of the 3D world facilitate reasoning and provide a foundation for robotic vision systems that interact with this world.
This generative approach is at an inflection point. New sensors and methods allow the capture of 3D objects, full 3D scenes, materials, and even 4D shape (3D shape over time). Rendering engines are better, more realistic, and more open than ever. Large datasets enable learning of object and scene statistics. Deep networks give new modeling tools to capture non-linear properties of the world. The combination of generative models, data, and learning offers a path to solving hard vision problems. Our approach is highly interdisciplinary, integrating computer vision, machine learning, computer graphics, and computational neuroscience.
Our philosophy is to model the easy stuff and learn the rest. As an example, the distribution over the shapes of different cars is hard to write down analytically but can be learned from data. The projection of a car's shape into the image, the motion of the car on the road, contact and interpenetration with other objects, and the appearance under different lighting conditions are all physical processes that are relatively easy to model. This philosophy of learning and modeling runs throughout our work.
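To make the split concrete, here is a minimal sketch, assuming nothing about our actual pipelines: the distribution over car shapes is "learned" as a PCA space over example shapes (random placeholders here), while the camera projection is simply written down as known physics.

```python
import numpy as np

# "Learn the rest": a linear (PCA) shape space fit to example car shapes.
# The data is random and stands in for real registered car scans.
rng = np.random.default_rng(0)
examples = rng.normal(size=(50, 300))            # 50 cars, 100 vertices (x,y,z flattened)
mean_shape = examples.mean(axis=0)
_, _, components = np.linalg.svd(examples - mean_shape, full_matrices=False)
basis = components[:10]                          # 10-D learned shape space

def sample_car(coeffs):
    """Generate a 3D car shape from low-dimensional learned coefficients."""
    return (mean_shape + coeffs @ basis).reshape(-1, 3)

# "Model the easy stuff": perspective projection is simple, known physics.
def project(points_3d, focal=500.0, cam_z=10.0):
    """Pinhole projection of 3D points onto the image plane."""
    z = points_3d[:, 2] + cam_z
    return focal * points_3d[:, :2] / z[:, None]

pixels = project(sample_car(rng.normal(size=10)))  # 2D observations of a sampled car
```

Some examples of this philosophy in action: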
Inverse rendering: A rendering engine takes 3D models, materials, and lighting and produces images of the scene. The goal of inverse rendering is to turn this around and infer the 3D scene that generated an image. To that end, we have developed an approximate differentiable renderer that does this efficiently when the current estimate is close to the solution. We have also developed sampling methods that handle more complex scenes and represent distributions over solutions.
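The core mechanic can be sketched with a toy differentiable "renderer" (Lambertian shading under a directional light; this is not our actual renderer, and all data here is illustrative): because rendering is differentiable, image residuals can be pushed back onto the unknown scene parameters by gradient descent.

```python
import numpy as np

# Toy renderer: Lambertian shading of known surface normals under a
# directional light. Stands in for a real differentiable renderer.
rng = np.random.default_rng(1)
normals = rng.normal(size=(1000, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

def render(albedo, light):
    return albedo * np.maximum(normals @ light, 0.0)

# Synthesize an observation from hidden "true" scene parameters.
true_light = np.array([0.3, 0.5, 0.8])
true_light /= np.linalg.norm(true_light)
observed = render(0.7, true_light)

# Inverse rendering: gradient descent on the scene parameters. As in
# the text, this works best when started near the solution.
albedo, light = 1.0, np.array([0.0, 0.0, 1.0])
for _ in range(500):
    shading = np.maximum(normals @ light, 0.0)
    residual = albedo * shading - observed           # per-pixel error
    lit = shading > 0                                # subgradient of max(., 0)
    albedo -= 1e-3 * (residual @ shading)
    light -= 1e-3 * albedo * (residual[lit] @ normals[lit])
    light /= np.linalg.norm(light)                   # fix albedo/light scale ambiguity

print(albedo, light)  # approaches 0.7 and the true light direction
```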
Human body shape and motion: Humans and animals have complex 3D shapes that vary across individuals, with pose, and with motion. Using 3D and 4D scans, we learn the world's most accurate statistical models of detailed human body shape. We then use inverse rendering to estimate human shape and pose from a variety of sources, including mocap markers, RGB-D sequences, and video.
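As a hedged illustration of how a learned shape space makes such estimation tractable, the sketch below fits the coefficients of a linear body-shape model (a random orthonormal basis standing in for a model learned from registered scans) to sparse synthetic mocap markers; the fit reduces to linear least squares.

```python
import numpy as np

# A linear shape model: shape = mean + basis @ coeffs. The basis here is
# random and stands in for a statistical model learned from 3D scans.
rng = np.random.default_rng(2)
n_vertices, n_coeffs, n_markers = 2000, 10, 40
mean_shape = rng.normal(size=3 * n_vertices)
basis = np.linalg.qr(rng.normal(size=(3 * n_vertices, n_coeffs)))[0]

# Markers observe a sparse subset of model vertices (known correspondence).
marker_idx = rng.choice(n_vertices, n_markers, replace=False)
rows = (3 * marker_idx[:, None] + np.arange(3)).ravel()   # x,y,z rows per marker

true_coeffs = rng.normal(size=n_coeffs)
observed = (mean_shape + basis @ true_coeffs)[rows]
observed += 0.01 * rng.normal(size=observed.size)          # measurement noise

# Shape estimation reduces to linear least squares in the learned space.
A = basis[rows]                                            # (120, 10) design matrix
coeffs, *_ = np.linalg.lstsq(A, observed - mean_shape[rows], rcond=None)
print(np.abs(coeffs - true_coeffs).max())                  # close to the truth
```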
Scene understanding: Scenes are composed of objects with a spatial layout. We expect different objects and different spatial relations in outdoor scenes, traffic scenes, homes, and offices. Our goal is to combine semantic information about scenes with 3D information about objects to infer which objects are present, where they are located, and how they are moving.
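One minimal way to picture the combination, with entirely made-up priors and geometry, is a scene-hypothesis score that adds a semantic term (which object classes a scene type makes likely) to a geometric term (3D boxes should not interpenetrate):

```python
import numpy as np

# Illustrative class priors per scene type; not learned statistics.
CLASS_PRIOR = {"office": {"desk": 0.9, "chair": 0.8, "car": 0.01},
               "street": {"car": 0.9, "chair": 0.05, "desk": 0.01}}

def boxes_overlap(a, b):
    """Axis-aligned 3D boxes as (min_xyz, max_xyz); True if they intersect."""
    return np.all(a[0] < b[1]) and np.all(b[0] < a[1])

def scene_score(scene_type, objects):
    """objects: list of (class_name, box). Higher means more plausible."""
    semantic = sum(np.log(CLASS_PRIOR[scene_type].get(c, 1e-3))
                   for c, _ in objects)
    geometric = sum(-10.0 for i, (_, a) in enumerate(objects)
                    for _, b in objects[i + 1:] if boxes_overlap(a, b))
    return semantic + geometric

desk = (np.zeros(3), np.ones(3))
chair = (np.array([2.0, 0.0, 0.0]), np.array([3.0, 1.0, 1.0]))
print(scene_score("office", [("desk", desk), ("chair", chair)]))
```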
Stereo and optical flow: Both stereo and optical flow provide important information about the 3D structure of the scene and the location of surface boundaries. They are typically viewed as low-level problems that feed this structural information to higher-level processes. We take a different view: knowing something about the scene and its objects can make stereo and flow estimation easier. Consequently, we formulate these estimation problems jointly, describing images and sequences in terms of semantic primitives.
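A small sketch of the "semantic primitives" idea, with synthetic data throughout: instead of a free per-pixel flow field, each semantic segment gets one parametric (affine) motion, fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_affine_flow(xy, uv):
    """Least-squares affine motion: uv ~ [x, y, 1] @ A, with A (3, 2)."""
    X = np.column_stack([xy, np.ones(len(xy))])
    A, *_ = np.linalg.lstsq(X, uv, rcond=None)
    return A

xy = rng.uniform(0, 100, size=(500, 2))              # pixel coordinates
labels = (xy[:, 0] > 50).astype(int)                 # fake semantic segmentation
motions = [np.array([[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]),     # road: constant flow
           np.array([[0.05, 0.0], [0.0, 0.05], [-5.0, 3.0]])]  # car: shift + expansion
X = np.column_stack([xy, np.ones(len(xy))])
uv = np.stack([X[i] @ motions[l] for i, l in enumerate(labels)])
uv += 0.1 * rng.normal(size=uv.shape)                # measurement noise

for seg in (0, 1):
    m = labels == seg
    print(f"segment {seg} motion:\n{fit_affine_flow(xy[m], uv[m])}")
```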
Intrinsic images: Between image pixels and the 3D world lie intermediate representations that are registered with the image but relate to the physical world: depth, flow, albedo, shading, object contours, cast shadows, and so on. Extracting these intermediate representations has long been a goal of the field and is now becoming feasible. By taking an integrated approach to estimating these intrinsic properties over time, we are able to extract fundamental physical properties of scenes from video.
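A classic 1-D Retinex-style sketch shows the flavor of such a decomposition (toy data; real intrinsic-image methods are far richer): with image = albedo x shading, shading varying smoothly and albedo changing abruptly, the log-image gradient can be split by magnitude and reintegrated.

```python
import numpy as np

# Synthetic 1-D "image": smooth illumination times piecewise-constant albedo.
x = np.linspace(0, 1, 200)
shading = 0.5 + 0.4 * np.sin(2 * np.pi * x)          # smooth illumination
albedo = np.where(x < 0.5, 0.9, 0.3)                 # one material edge
image = albedo * shading

# In the log domain, image gradient = albedo gradient + shading gradient.
log_grad = np.diff(np.log(image))
threshold = 0.1                                      # large steps -> material edges
albedo_grad = np.where(np.abs(log_grad) > threshold, log_grad, 0.0)
shading_grad = log_grad - albedo_grad

# Reintegrate each gradient field (recovery is up to a constant scale).
recovered_albedo = np.exp(np.concatenate([[0.0], np.cumsum(albedo_grad)]))
recovered_shading = np.exp(np.concatenate([[0.0], np.cumsum(shading_grad)]))
```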
Our work on these topics, and the others described on this website, is basic research. That said, we always look to have an impact beyond academia, and we do so in several ways. Patents and spin-offs are one route. We also make code and data available, either open source or under license. Finally, we are responsible for, or contribute to, some of the most widely used datasets and evaluation benchmarks in vision (Middlebury Flow, KITTI, Sintel, HumanEva). These help push the state of the art and give industry a platform for understanding what works, how well, and why.