Vision is a crucial sense for computational systems to interact with their environments as biological systems do. A major task is interpreting images of complex scenes, by recognizing and localizing objects, persons and actions. This involves learning a large number of visual models, ideally autonomously.
In this talk I will present two ways of reducing the amount of human supervision required by this learning process. The first way is labeling images only by the object class they contain. Learning from cluttered images is very challenging in this weakly supervised setting. In the traditional paradigm, each class is learned starting from scratch. In our work instead, knowledge generic over classes is first learned during a meta-training stage from images of diverse classes with given object locations, and is then used to support learning any new class without location annotation. Generic knowledge helps because during meta-training the system can learn about localizing objects in general. As demonstrated experimentally, this approach enables learning from more challenging images than possible before, such as the PASCAL VOC 2007, containing extensive clutter and large scale and appearance variations between object instances.
The second way is the analysis of news items consisting of images and text captions. We associate names and action verbs in the captions to the face and body pose of the persons in the images. We introduce a joint probabilistic model for simultaneously recovering image-caption correspondences and learning appearance models for the face and pose classes occurring in the corpus. As demonstrated experimentally, this joint `face and pose' model solves the correspondence problem better than earlier models covering only the face.
I will conclude with an outlook on the idea of visual culture, where new visual concepts are learned incrementally on top of all visual knowledge acquired so far. Beside generic knowledge, visual culture includes also knowledge specific to a class, knowledge of scene structures and other forms of visual knowledge. Potentially, this approach could considerably extend current visual recognition capabilities and produce an integrated body of visual knowledge.