The representation of data can be crucial for the success or failure of a Computer Vision algorithm and is central to a wide variety of downstream tasks.
Representation learning becomes especially crucial in higher-dimensional spaces and in problem domains where sampling relevant data is difficult. CNNs operating on 3D data are often limited in either depth or resolution, because dense voxel grids are memory-intensive. However, the 3D voxel representation of most objects is sparse. We developed OctNet [ ], a representation for deep learning with sparse 3D data. The main contribution of OctNet is a hierarchical partitioning of the space using octrees, which restricts memory allocation and computation to the non-sparse parts of the data. This allows deeper networks to be trained, which in turn improves accuracy.
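The core idea of octree partitioning can be sketched with a toy recursion (this is illustrative only, not OctNet's actual implementation; all names and the grid data are made up): homogeneous blocks collapse into single leaves, so subdivision and storage concentrate on the occupied parts of the volume.

```python
import numpy as np

def build_octree(occupancy, depth=0, max_depth=3):
    """Recursively partition a cubic boolean occupancy grid.

    Blocks that are entirely empty or entirely full become leaves,
    so memory is only spent subdividing the non-sparse regions.
    """
    if occupancy.all() or not occupancy.any() or depth == max_depth:
        return {"leaf": bool(occupancy.any()), "size": occupancy.shape[0]}
    h = occupancy.shape[0] // 2
    children = []
    for x in (0, h):
        for y in (0, h):
            for z in (0, h):
                children.append(build_octree(
                    occupancy[x:x + h, y:y + h, z:z + h], depth + 1, max_depth))
    return {"children": children}

def count_leaves(node):
    """Count the cells the octree actually allocates."""
    if "leaf" in node:
        return 1
    return sum(count_leaves(c) for c in node["children"])

# A sparse 8^3 grid with a single occupied voxel: the octree needs only
# 22 cells, versus 512 voxels for the dense grid.
grid = np.zeros((8, 8, 8), dtype=bool)
grid[0, 0, 0] = True
tree = build_octree(grid)
```

For sparse scenes the leaf count grows roughly with the occupied surface rather than with the full volume, which is what makes deeper networks affordable.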
For many tasks in computer vision and graphics, 3D mesh representations are heavily used. A common approach to learning a 3D representation of meshes is to learn a linear subspace, or a higher-order tensor generalization thereof. Both methods are limited by their linearity. We introduce a learnable, nonlinear representation, implemented as a Convolutional Mesh Autoencoder (CoMA) [ ]. This method is shown to perform 50\% better than PCA models, which had so far been the state of the art for modeling faces. The low-dimensional latent space learned by CoMA can be used for sampling 3D meshes of diverse facial expressions.
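The linear-subspace baseline that CoMA improves on can be made concrete with a minimal PCA sketch on synthetic vertex data (the data, shapes, and names below are purely illustrative; CoMA itself replaces this linear encode/decode with mesh convolutions):

```python
import numpy as np

# Toy stand-in for flattened mesh vertices: n meshes, v vertices in 3D.
rng = np.random.default_rng(0)
n, v, k = 100, 50, 8                 # meshes, vertices, latent dimension
meshes = rng.normal(size=(n, v * 3))

# Linear subspace model (PCA): mean shape plus k principal components.
mean = meshes.mean(axis=0)
centered = meshes - mean
_, _, components = np.linalg.svd(centered, full_matrices=False)
basis = components[:k]               # (k, v*3) orthonormal basis

# Encode a mesh into the k-dimensional latent space, then decode it.
latent = (meshes[0] - mean) @ basis.T
reconstruction = mean + latent @ basis
```

Every shape this model can produce is a linear combination of the basis vectors; expressions that lie off that subspace cannot be represented, which is the limitation the nonlinear autoencoder addresses.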
When used for semantic segmentation, standard CNNs produce segmentations at a lower resolution than the input image. Furthermore, the representation learned by standard CNNs neglects structured dependencies between (super)pixels. Both problems are addressed in [ ] by inserting bilateral filtering between multiple feature scales and between the superpixels of an image.
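The effect of bilateral filtering between superpixels can be sketched as follows (a simplified, dense Gaussian version on toy data; the guidance features and values are invented for illustration): each superpixel's features are averaged with those of superpixels that are close in a guidance space, so information propagates along structure rather than across it.

```python
import numpy as np

def bilateral_filter(features, guidance, sigma=1.0):
    """Gaussian bilateral filtering between superpixels (illustrative).

    features : (n, c) per-superpixel feature vectors to be smoothed.
    guidance : (n, d) vectors defining closeness (e.g. mean color + position).
    """
    diffs = guidance[:, None, :] - guidance[None, :, :]      # (n, n, d)
    weights = np.exp(-(diffs ** 2).sum(-1) / (2 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)            # row-normalize
    return weights @ features

# Two clusters of superpixels in guidance space: filtering smooths
# features within each cluster while barely mixing across clusters.
guidance = np.array([[0.0], [0.1], [5.0], [5.1]])
features = np.array([[1.0], [0.0], [10.0], [12.0]])
filtered = bilateral_filter(features, guidance, sigma=0.5)
```

In the referenced work such filters connect activations across feature scales inside the network; this sketch only shows the weighting principle on a single layer.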
Sometimes algorithms must be very fast, which encourages the use of very small representations, sometimes even a single filter. Different types of data and tasks require different bilateral filters, and a generic filter such as the Gaussian may limit performance. In [ ] we propose to learn bilateral filters directly from the data. This method is shown to improve results on multiple tasks, for instance image upsampling.
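The idea of fitting filter weights to data instead of fixing them a priori can be illustrated in one dimension (a toy denoising stand-in, not the actual method of the paper, which learns bilateral filters in a high-dimensional feature space; the signal and filter size are invented):

```python
import numpy as np

# Fit the filter that best maps noisy inputs to clean targets, rather
# than assuming a fixed Gaussian or box kernel.
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 4 * np.pi, 200))
noisy = clean + 0.1 * rng.normal(size=clean.size)

K = 5  # filter size
# Design matrix of sliding windows; solve min_w ||windows @ w - target||^2.
windows = np.lib.stride_tricks.sliding_window_view(noisy, K)
target = clean[K // 2 : K // 2 + windows.shape[0]]
w, *_ = np.linalg.lstsq(windows, target, rcond=None)

denoised = windows @ w
fixed = windows.mean(axis=1)  # fixed box filter, for comparison
```

Because the least-squares solution searches over all linear kernels of this size, including the box filter, the learned filter's training error can only match or beat the generic one; the same argument motivates learning task-specific bilateral filters.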