Deep learning has brought rapid progress on many computer vision problems, but current methods require large training datasets with annotated ground truth. Human annotators are reasonably efficient at tasks like sparse 2D joint labeling; however, manual annotation of dense optical flow or 3D pose is intractable.
To make progress on these tasks, we exploit our 3D body models to generate synthetic training data. In early work, we showed that synthetic data was useful for evaluating optical flow (Middlebury [ ] and Sintel [ ]). Progress in computer graphics has enabled the rendering of synthetic scenes and people; while not yet completely realistic, the trend is clear: the quality of such data will steadily improve. Synthetic rendering is appealing for creating training datasets because it scales easily and automatically generates ground truth for a wide variety of problems, such as 3D human joints, part segmentations, 3D pose, depth maps, optical flow, and body shape.
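The reason ground truth comes "for free" from rendering is that the renderer already knows everything about the scene. A minimal sketch of the idea (not our actual rendering pipeline, which uses a full graphics engine): compositing a rendered "person" sprite onto a background immediately yields a pixel-perfect segmentation mask from the alpha channel and a depth map from the depth we assigned, with no human annotation. The `composite_person` helper and the rectangular sprite are purely illustrative.

```python
import numpy as np

def composite_person(background, sprite_rgba, depth_value, top, left):
    """Paste a rendered 'person' sprite onto a background image.

    Because we control the rendering, the sprite's alpha channel
    directly gives a pixel-perfect segmentation mask, and the depth
    we assigned gives a ground-truth depth map -- no annotation needed.
    """
    h, w = sprite_rgba.shape[:2]
    image = background.astype(np.float32).copy()
    seg = np.zeros(background.shape[:2], dtype=np.uint8)
    depth = np.full(background.shape[:2], np.inf, dtype=np.float32)

    region = image[top:top + h, left:left + w]
    alpha = sprite_rgba[..., 3:4].astype(np.float32) / 255.0
    # Standard alpha compositing of the sprite over the background.
    region[:] = alpha * sprite_rgba[..., :3] + (1 - alpha) * region
    mask = alpha[..., 0] > 0.5
    seg[top:top + h, left:left + w] = mask
    depth[top:top + h, left:left + w] = np.where(mask, depth_value, np.inf)
    return image.astype(np.uint8), seg, depth

# Toy example: a 4x3 opaque "person" at depth 2.0 m on an 8x8 background.
bg = np.zeros((8, 8, 3), dtype=np.uint8)
sprite = np.zeros((4, 3, 4), dtype=np.uint8)
sprite[..., :3] = 200   # body colour
sprite[..., 3] = 255    # fully opaque
img, seg, depth = composite_person(bg, sprite, 2.0, top=2, left=3)
```

The same principle extends to every modality the renderer computes: projecting the 3D joints of the body model gives 2D/3D pose labels, and differencing vertex positions between frames gives ground-truth optical flow.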
We focus on learning from synthetic data, using data about humans that is as realistic as possible: their motion, body shapes, body textures, and backgrounds. We create the SURREAL dataset (Synthetic hUmans foR REAL tasks) and learn deep models for human depth estimation and body part segmentation [ ]. While the data is not fully realistic, we show that pre-training on it is valuable and reduces the amount of labeled real data that is needed.
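The pre-train-then-fine-tune recipe can be sketched in a toy form. This is not our actual training setup (which uses deep networks on rendered images); it is a minimal, self-contained illustration with a logistic-regression "model", where `make_humans`, the noise parameter standing in for the synthetic-to-real domain gap, and all hyperparameters are invented for the example. The point is the structure: train on plentiful synthetic labels first, then adapt with only a handful of real ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_humans(n, noise):
    """Toy stand-in for a dataset: 2-D features, label = which side of
    the line x0 + x1 = 0 a point falls on. `noise` loosely mimics the
    domain gap between synthetic and real imagery."""
    X = rng.normal(size=(n, 2)) * (1.0 + noise)
    y = (X[:, 0] + X[:, 1] > 0).astype(np.float64)
    return X, y

def logistic_loss(w, X, y):
    # Numerically stable binary cross-entropy with logits.
    z = X @ w
    return np.mean(np.maximum(z, 0) - y * z + np.log1p(np.exp(-np.abs(z))))

def train(w, X, y, steps=500, lr=0.05):
    # Plain gradient descent on the logistic loss.
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

# Plentiful synthetic labels, but only 10 labeled "real" examples.
X_syn, y_syn = make_humans(1000, noise=0.0)
X_real, y_real = make_humans(10, noise=0.5)

w = train(np.zeros(2), X_syn, y_syn)            # pre-train on synthetic
loss_before = logistic_loss(w, X_real, y_real)
w = train(w, X_real, y_real, steps=200)         # fine-tune on few real labels
loss_after = logistic_loss(w, X_real, y_real)
```

Starting fine-tuning from the synthetically pre-trained weights, rather than from scratch, is what lets the small real set suffice: the pre-trained model already encodes the task, and the real data only has to correct the domain gap.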
We further create the Human Optical-Flow dataset [ ] for learning optical flow of humans in motion. This uses motion capture sequences, processed by MoSh [ ], to produce realistic human optical flow.
Our current work extends synthetic rendering and inference to multiple people in a single image, for tasks such as optical flow and 2D/3D pose estimation. We further focus on rendering and reconstructing hand-object interactions, with realistic hand shapes and poses, object shapes and textures, and realistic hand-object grasps. We then plan to extend synthetic data generation to more complex and realistic scenes, reducing the domain gap between synthetic and real data.