This talk will highlight recent progress on two fronts. First, we will present a novel image-conditioned person model that allows for effective articulated pose estimation in realistic scenarios. Second, we will describe our work towards activity recognition and towards describing video content in natural language.
Both efforts are part of a longer-term agenda towards visual scene understanding. While visual scene understanding has long been advocated as the "holy grail" of computer vision, we believe that the progress of recent years makes it timely to address this challenge again.